key: cord-0752062-buu7zxae
authors: Taboada, Blanca; Zárate, Selene; Iša, Pavel; Boukadida, Celia; Vazquez-Perez, Joel Armando; Muñoz-Medina, José Esteban; Ramírez-González, José Ernesto; Comas-García, Andreu; Grajales-Muñiz, Concepción; Rincón-Rubio, Alma; Matías-Florentino, Margarita; Sanchez-Flores, Alejandro; Mendieta-Condado, Edgar; Verleyen, Jerome; Barrera-Badillo, Gisela; Hernández-Rivas, Lucía; Mejía-Nepomuceno, Fidencio; Martínez-Orozco, José Arturo; Becerril-Vargas, Eduardo; López, Susana; López-Martínez, Irma; Ávila-Ríos, Santiago; Arias, Carlos F.
title: Genetic Analysis of SARS-CoV-2 Variants in Mexico during the First Year of the COVID-19 Pandemic
date: 2021-10-26
journal: Viruses
DOI: 10.3390/v13112161
sha: 3e82151b1b28ac4ef4320d467272f6d6ad60d611
doc_id: 752062
cord_uid: buu7zxae

During the first year of the SARS-CoV-2 pandemic in Mexico, more than two million people were infected. In this study, we analyzed full genome sequences from 27 February 2020 to 28 February 2021 to characterize the geographical and temporal distribution of SARS-CoV-2 lineages and identify the most common circulating lineages during this period. We defined six different geographical regions with particular dynamics of lineage circulation. The Northeast and Northwest regions were the ones that exhibited the highest lineage diversity, while the Central south and South/Southeast regions presented less diversity with predominance of a certain lineage. Additionally, by late February 2021, lineage B.1.1.519 represented more than 89% of all circulating lineages in the country.

The Mexican government reported the first case of SARS-CoV-2 in Mexico on 28 February 2020 [1] ; the first death was officially registered on 17 March 2020. One year

The genome sequences used in this study and their associated metadata were obtained from residual RNA positive samples that are part of the national response to COVID-19, collected and processed under the Mexican Official NOM-017-SSA2-2012 for epidemiological surveillance of Viral Respiratory Disease emitted and approved by the CONAVE (National Counsel of Epidemiology Surveillance) of the Ministry of Health of the Government of Mexico, so neither Institutional Review Board (IRB) approval nor informed consent from patients were required.

Clinical specimens were collected in laboratories and hospitals belonging to the Ministry of Public Health of Mexico (Red Nacional de Laboratorios Estatales de Salud Publica (RNLSP); Instituto Nacional de Enfermedades Respiratorias (INER); and Instituto Mexicano del Seguro Social (IMSS) in all 32 states in Mexico from 27 February 2020, to 28 February 2021. Oropharyngeal or nasopharyngeal swabs and tracheal aspirates of PCR-confirmed SARS-CoV-2 cases were placed in virus transport medium following official procedures for further sample processing [14] . The procedure was performed using protocols validated by Instituto de Diagnóstico y Referencia Epidemiológicos (InDRE), Secretaria de Salud, Mexico, as approved by the World Health Organization [15] . In general, for RT-PCR, 5 µL of RNA were used in a 25 µL reaction using the Superscript III one-step RT-PCR system (Invitrogen, Darmstadt, Germany). Reverse transcription was performed at 55 • C:10 min, followed by a PCR reaction of 95 • C:3 min and then 95 • C:15 s-58 • C:30 s × 45 cycles. Samples selected for sequencing had a cycle threshold (Ct) of SARS-CoV-2 quantitative real-time (RT)-qPCR value of less than 26.

Viral nucleic acid extraction was performed using a MagNa Pure LC 2-0 system (Roche, Indianapolis, OH, USA) or ExiPrep 96 Viral DNA/RNA Kit (LaunchWorks CDMO, Beverly, MA, USA). Total cDNA was synthesized by Superscript III Reverse Transcriptase (Thermo Fisher, Waltham, MA, USA) with primer 5 -GTTTCCCAGTAGGTCTCN9-3 . The second strand was generated by two rounds of synthesis with Sequenase 2.0 (Affymetrix, USB, Ohio, USA). The resulting product was used as input for an amplicon-based protocol with the POLAR method (https://www.protocols.io/view/pathogen-oriented-low-costassembly-amp-re-sequenc-bearjad6.html, accessed on 20 March 2020 and 5 December 2020). Later, the cDNA was separated into two pools for amplification using two sets of Artic v3 primers covering the whole SARS-CoV-2 genome. PCR reactions were performed on both pools as follows: 98 • C:30 s × 1 cycle; 98 • C:15 s; 65 • C:5 min × 30 cycles. PCR cleaning was performed using AmpPure (Beckman Coulter, Indianapolis, IN, USA) beads for quantification prior to Illumina library preparation. Amplicon libraries were pooled after normalization and loaded into an Illumina 300 cycles mid-output sequencing kit in the NextSeq500 sequencing platform (Illumina, San Diego, CA, USA) with a paired-end read configuration (2 × 150 bp reads).

Quality control, removal of duplicates, and off-target reads were performed as described previously [13] . These preprocessed reads were then mapped using Bowtie2 v2.3.4.3 (Johns Hopkins University, Baltimore, MD, USA) [16] against Wuhan-Hu-1 (MN908947) reference sequence, followed by consensus calling by iVar v1.3.1 (The Scripps Research Institute, La Jolla, CA, USA), with consensus minimum read depth of 20 [17] . In total, 1515 full virus genome sequences with a coverage ≥ 90% and a mean depth ≥ 1000×, were deposited in the Global Initiative on Sharing All Influenza Data (GISAID) platform [18] and GenBank (OK435170-OK436684; Supplementary Table S1).

In addition, 2400 complete Mexican genomes and their metadata, available in the GISAID platform, with collection dates between 17 February 2020 and 28 February 2021, were downloaded on 14 May 2021. The metadata of the 3915 Mexican sequences used in this work are reported in Supplementary Table S2 .

All 3915 Mexican sequences were aligned against the NCBI reference sequence from Wuhan-Hu-1 (NC_045512.2) using MAFFTv7 (option-addfragments) (Research Institute for Microbial Diseases, Osaka, Japan) [19] . An in-house python script was used to identify polymorphic sites and classify them as synonymous and nonsynonymous. Further, the number of sequences where these SNPs appeared was determined, discarding the SNPs presented in less than three sequences. A bar plot of the number of polymorphic sites in each ORF was built using R v.4.1.0 (R Developing Core Team) [20] . Additionally, a heatmap of nonsynonymous mutations through time was built for those with at least one frequency per month higher than 10%, using the package ComplexHeatmap in R.

Phylogenetic Assignment of Named Global Outbreak LINeages (PANGOLIN v2.4.2 of 4 May 2021) software suite (https://github.com/cov-lineages/pangolin, accessed on 14 May 2021) was used to classify the 3915 Mexican sequences and identify the most common lineages of SARS-CoV-2 circulating during the first year of the pandemic. PANGO lineage and collection date were used to build a stacked density plot in R v4.1.0 with the package ggplot2 [21] . Lineages that represented less than 1% of the reported sequences were grouped and reported as rare. Additionally, in order to reduce sampling bias and limited sequencing depth in some states, the data were grouped by geographical regions, as follows: Figure S1 ). For each region, a stacked density plot was built as described above.

To build a tree containing the circulating lineages in Mexico during the pandemic's first year, we carried out the following strategy. From the 3915 Mexican genomes, we selected 3030 complete genomes with high coverage, having less than 1% of Ns. An inhouse python script (available upon request) was used to subset the sequences to evenly represent lineage diversity by region of Mexico and month. Lineages for which only one sequence was available were excluded. We used the following sampling strategy to select 200 sequences per region. Given that region W had fewer than 200 sequences, the remainder was evenly distributed between the rest. For each region, the sequences were divided evenly by month; if a month did not fill its quota, the rest was distributed among the others. Then, the available spots were distributed by lineage for each month, excluding lineages for which a single sequence was present and considering the frequency of each lineage. The sequences were then randomly selected from the pool of candidate sequences. A set of global worldwide reference sequences was constructed by sampling the oldest available sequence of the 750 most common PANGO lineages. An alignment of Mexican and reference sequences was carried out using MAFFT v7 [19] , and then was manually edited to exclude the UTRs. A maximum-likelihood phylogeny was estimated in iqtree2 (IQ-TREE Development Team) [22] using the GTR + F + R2 substitution model [23] . The tree was later scaled using LSD2 and tip dates [24] . Tree visualization was performed in R using the packages ggtree and treeio [25, 26] .

To analyze the evolution of B.1.1.519 lineage, all high-coverage complete genome sequences for B.1.1.222 and B.1.1.519 lineages were downloaded from GISAID. An in-house python script (available upon request) was used to subset the sequences available from Mexico and the USA to no more than 20 per month per country for each lineage. All available sequences from other countries were also included in the analysis. The sequences were aligned using MAFFT as above, and the alignment was edited to eliminate the UTRs. The phylogeny was estimated as described above.

Statistical analyses were conducted in Rv.4.1.0 statistical environment [20] , using the Vegan package v.2.5.7 [27] . Richness, diversity (Invsimpson index), and Pielou's evenness were calculated using the frequencies of SARS-CoV-2 variants. For beta diversity, Bray-Curtis distance metric was used. Then, frequencies were also used as input for a Canonical Correlation Analysis (CCA) [28] using Bray-Curtis distances. All results of metrics described above were used to compare geographical groups with a nonparametric multivariate permutation test (PERMANOVA), using the Adonis function with 999 permu-tations, and Mann-Whitney test for measures [28] ; homogeneity variances between groups were verified in all comparisons. All statistics were considered significant if p < 0.05.

In total, 1515 sequences with a coverage higher than 90% of the SARS-CoV-2 genome were obtained in this study, with a mean depth of 2865×. These sequences were part of the 3915 sequences deposited in the GISAID database, collected in Mexico from 27 February 2020 to 28 February 2021 (Table S1 ). The temporal distribution of genomes correlated with the confirmed, positive COVID-19 cases (Spearman's rho = 0.552, p = 0.05), suggesting that the number of sequences was proportional to the number of infections in Mexico during that time ( Figure 1A ). In agreement with international recommendations, these results excluded the last month because the genomes generated and deposited in GISAID in February 2021 increased significantly as a result of the implementation of national genomic surveillance programs. Regarding sampling, genome sequences were obtained from all 32 states in the country, although it was not homogeneous. Virus sequences from Baja California and Mexico City were significantly overrepresented ( Figure 1B) , while states such as Michoacán (7 genomes), Nayarit (10 genomes), Tabasco (13 genomes), Campeche (17 genomes), and Sinaloa (19 genomes) were poorly sampled. 

Of the total samples analyzed, 42.4% were from females, 46.8% from males, and no information was available for 10.8% of the patients (Table S2 ). The median age was 44 years (IQR 32-58; minimum under 1 year of age and maximum 105 years). Regarding clinical status, 35.7% of the samples were obtained from hospitalized patients, 21.1% from ambulatory patients, and 4.5% were labeled as deceased; the clinical status for 38.7% of the patients was unknown (Figure 2A ). The proportion of males was larger in hospitalized (chi squared test, p = 0.01) and deceased patients (chi squared test, p = 4.3 × 10 −7 ), but not in ambulatory patients ( Figure 2B -D). 

The single-nucleotide polymorphisms (SNPs) in the genome of the 3915 analyzed viruses were determined using as reference the sequence of the original Wuhan-Hu-1 genome ( Figure S2 ). ORF8, ORF3a, ORF7b, N, and S exhibited the largest proportion of nonsynonymous substitutions, and an excess of these substitutions compared with synonymous substitutions. Evidence for positive selection has been reported previously for ORF8 and ORF3a [30] . On the other hand, M and ORF7a showed an excess of synonymous substitutions. ORF1a and N had a similar frequency of both types of substitutions, although the frequency was four times higher in N. No SNPs were observed for ORF6 or E.

In total, 4278 different amino acid changes were found in the various ORFs of all lineages. Interestingly, only 65 changes were identified in more than 1% of the sequences and only 17 in more than 5%, implying a possible role for positive selection at these sites, although founder effects or hitchhiking events cannot be ruled out. As expected, the most frequent mutations corresponded to lineages that circulated commonly in the country during the first year of the pandemic. The two most frequent SNPs that led to an amino acid change were P314L in nsp12 protein (99%) and D614G in S (98%). Other frequent mutations in S were T732A (50%), P681H (41%), and T478K (38%). The high frequency observed for some of these mutations reflected their presence in viruses that circulated early after the pandemic started in the country-notably lineage B.1 and other lineages derived from it. In other cases, the predominance of a particular lineage that bore those mutations, such as lineage B.1.1.519 that contains the substitution S:T478K and is responsible for its high frequency, was as described in the following section.

To understand the dynamics of these nonsynonymous mutations, we constructed a frequency heatmap. Figure 3 shows the monthly frequency of those nonsynonymous mutations that reached at least 10% in one month. Changes S:D614G and ORF1b:P314L were the most prevalent, starting in March 2020, congruent with their appearance in lineage B.1, from which many circulating lineages inherited them. Additionally, seven mutations jointly increased their frequency from November 2020, three of which were in the spike protein. These changes corresponded to lineage B.1.1.519, the most prevalent in Mexico, which was identified in November (see below). Two mutations in the N gene circulated throughout the year 2020 but increased their frequency after November. In general, Figure 3 shows temporal variations in the circulation of different substitutions and a steady increase in a group of mutations in the last months of the study. 

To determine the genetic diversity of SARS-CoV-2 that circulated in the country during the first year of the pandemic, the sequences were first phylogenetically characterized with the global PANGO lineage tool. The selected genomes were assigned to 103 different global lineages, of which more than half (54 lineages) had only one or two associated genomes (Supplementary Table S3 ). In most cases, the low distribution of some lineages was not due to a sampling bias, because they were similar to a general tendency observed worldwide, being also at low abundance. Interestingly, we found that the worldwide prevalence of these low-frequency lineages was below 0.5%, and they were only identified for a limited period and then disappeared ( Table S3 ), suggesting that the mutations they bear might reduce their ability to compete with other virus lineages; however, we cannot rule out the action of genetic drift resulting in the loss of these lineages by sampling effects independent of their fitness, given their low frequency. Regarding the variants most frequently identified in the country during the year of study, lineages B.1 (n = 530, 13.5%), B. We built a stacked density plot to visualize the proportion of the detected lineages along time ( Figure 4B ). Lineages whose frequency was less than 1% were grouped and labeled as rare lineages in the plot. An exception was made for lineages classified by the WHO as variants of concern (VOC), variants of interest (VOI), and variants under monitoring (VUM), given their importance. Although sequence sampling was neither spatially nor temporally homogeneous, some interesting trends could be seen ( Figure 4B ). In the first 3 months of the epidemic (March-May 2020), lineages B. Figure 4) . B.1.427 and B.1.429 lineages were the earliest VUM identified in Mexico, in July 2020, and by February 2021 they had been detected in 19 different states. Variant Alpha was detected at the end of December for the first time and was found in six different states by February. Of particular interest is lineage B.1.1.519, which is considered a VUM; it was first detected in Mexico in November 2020, and then started to increase its frequency dramatically to become the dominant lineage by February 2021, representing 70% of the sequences at the national level, as previously reported [31] . Moreover, B.1.2 increased slowly, passing from less than 0.5% before December 2020 to more than 2% of all sequences in February 2021. 

Given the sampling bias in some states, for further analysis we clustered the data into six geographical regions, as described in the Methods section ( Figure S1 ). To determine the phylogenetic relationships and distribution of lineages in the country, we built a maximum likelihood scaled phylogenetic tree using a sample set of 1000 sequences from Mexico and, as reference, 700 sequences from the most common worldwide lineages. Figure 5A shows the tree with the tips colored by region, with the sequences sampled outside of Mexico colored in grey. In Figure 5B , the tips are colored by lineage; lineages whose frequencies in Mexico were less than 1% are shown in grey, except for VOC and VUM.

The phylogenetic reconstruction shown in Figure 5 revealed that there was no evident clustering of lineages by region, indicating a dynamic distribution and possibly multiple introductions into the country. However, exceptions to this observation were lineages B.1.1.222 and B.1.1.519, two of the most frequent in the country, which appeared first in the CS region, followed by their presence in S/SE, and spreading widely to other regions later. To analyze the regional lineage circulation in more detail, we built a stacked density plot for each region and evaluated their diversity and richness. As shown in Figure 6 (panels A to F), lineage B.1 predominated early in the pandemic in the six regions, which also can be observed in Figure 5B , where many earlier tips correspond to B.1. Lineage B.1.609 was frequent in the CS, W, S/SE, and NW regions in the early months but sharply decreased around July 2020. This is also observed in the phylogenetic tree that shows B.1.609 as a tight cluster early in the pandemic. Additionally, B.1.1 was more abundant in CS, CN, NE, and S/SE, almost disappearing by January 2021.

As stated before, lineage B.1.1.222 dominated the CS region from July to November 2020, but it was detected in all the other regions, albeit at lower frequencies or for limited periods of time. Lineage B.1.1.519 replaced B.1.1.222 in the CS (Figures 5 and 6) , rapidly rising to be the dominant lineage and representing 90% of all sampled genomes in this region by February 2021. The expansion of B.1.1.519 to other areas was delayed, increasing its frequency in January 2021 in the S/SE, CN, NE, and W regions. In contrast, in the NW region, the presence of this lineage was sporadic until February 2021. Notably, VUM B.1.427 and B.1.429 were frequently sampled in the NW starting from December 2020, coinciding with their increase in California, USA. These results corroborated the same geographical and temporal patterns presented in the phylogenetic analysis. The VOC B.1.1.7 (Alpha) appeared predominantly in the NE region, whereas P2 was present only in the W area. The initial geographical detection of some of the variants mentioned above can be explained as a product of the international USA-Mexico border mobility of workers as well as the tourist flow (an introduction effect). Then, a local workflow between the larger cities can explain the spread of those variants to other regions, although multiple introductions cannot be ruled out.

To evaluate a possible sampling bias at the regional level over time in the interpretation of our results, we carried out Spearman correlation tests between the monthly distribution of genomes with positive cases in each region. We found that these correlations were strongly related to what is observed at the national level of cases-genomes of SARS-CoV-2; all correlation tests except those for regions NW and W had a rho larger than 0.59, with a p < 0.05. In the region W, low sampling could explain this absence of correlation, whereas in the NW area, the sampling was homogenous through time regardless of the number of cases. Additionally, a Kolmogorov-Smirnov two-sample test of lineage distribution was applied between the months preceding and following the low sampling period. We found that there was no statistically significant difference in lineage distribution in all regions, which we interpret as a low impact on sequencing bias in the lineage dynamics depicted in Figure 6 .

Overall, Figures 5 and 6 reveal a difference in lineage diversity when comparing the six areas in their number, identity, and lineage replacement dynamics. The small number of sequences available for the W region may explain the intermittent detections of some lineages. The results show that although some lineages were widespread, they were present in different regions at different times, congruent with an observed abundance of rare lineages in some regions. Thus, we statistically evaluated the changes in the frequency of the different lineages among regions (Figure 7 ). The canonical correlation analysis ( Figure 7A ) identified different patterns in population lineages between regions based on the genetic variation of the sequenced samples. Clearly, regions CS and S/SE and NE were significantly different (adonis with p < 0.5 and beta dispersion > 0.5 in all cases) from all other regions, as observed in Figure 7A . In the CS and S/SE regions, the lineage diversity was lower, as pointed out in the density plots of Figure 6 . These regions were characterized by some dominant lineages, while in NE the diversity was the highest. Moreover, to understand better the diversity within and between regions, the frequency and abundance of the variants per region were evaluated and compared using InvSimpson's diversity index ( Figure 7B ). The InvSimpson metric expresses effective diversity with less prone to biases introduced by incomplete or biased sampling. The lowest values obtained for the CS and S/SE regions ( Figure 7B ) confirmed, as shown in Figure 6A ,B, a lower and dominant number of circulating lineages.

In the case of the CS region, the relative abundance or homogeneity of variants was also lower ( Figure 7D ), indicating that after one or more new variants appeared, one of them took over, reducing the variability. Moreover, the S/SE region had the lowest number of variants identified ( Figure 7C ). However, their relative abundances were high, suggesting that the low diversity value was probably due to a low sampling bias ( Figure 1B ) that resulted in a limited variant identification.

Conversely, NE was the region with the highest diversity and number of different lineages ( Figure 7B ,C), suggesting that different variants coexisted at a comparable relative frequency, followed by the NW and CN regions. (Table 2) [5, 32] . As shown in Table 3 , lineage B.1.1.222 at its peak constituted about 14% of all sequenced genomes circulating in Mexico between July and November 2020. For the rest of the year, it was found at frequencies between 1.1% and 12%. Even though most reported sequences of B.1.1.222 are from the USA, it never accounted for more than 1% of the lineages reported in that country. [31] , with the acquisition of six additional nonsynonymous substitutions (Table 2) . Of these, two amino acid changes in S have been reported to have phenotypic effects. S:T478K was reported to increase receptor binding affinity and viral entry [10] , and S:P681H [9] was reported to increase spike cleavage efficiency. To gain better insight into the evolution of B.1.1.519, we reconstructed a phylogeny using available sequences of lineages B.1.1.222 and B.1.1.519 from samples collected through February 2021. The phylogeny in Figure 8 shows that lineage B.1.1.222 circulated in Mexico and the USA and was rarely detected in other countries. Interestingly, while this lineage was the most frequently sampled in the center and south of Mexico until December 2020, it was not detected at high frequency in the north ( Figure 6A-E) , where the lineage diversity was the largest. Lineage B.1.1.519 was first detected in Mexico in November 2020, and sequences from Argentina and the USA predate this earliest detection by 4 and 12 days, respectively. In November, B.1.1.519 accounted for 6.1% of the total reported sequences from Mexico, while it accounted for 4.1% in Argentina and only 0.02% in the USA (Table 3 ). The frequencies of B.1.1.519 rapidly increased in Mexico City, indicating a possibly undetected circulation in September and October 2020, because the number of sequences obtained in these 2 months was considerably lower (see Table 3 ). Most of the B.1.1.519 sequences from November and December 2020 reported in GISAID are from Mexico, suggesting that the origin of this lineage may be in central Mexico (see Figure 5A) , where it spread rapidly afterwards, reaching the CN and S/SE regions in January 2021. Additionally, by February 2021, a sharp increase in frequency in the NW and NE regions was observed. The spread pattern could be explained by the workflow between the largest cities between regions. By February 2021, the overall frequency of this lineage in Mexico increased up to 62.3%, whereas it reached only 2.7% in the USA and 3.5% in Argentina.

In this study, we aimed to characterize the genomic diversity of SARS-CoV-2 variants that circulated in Mexico during the first year of the pandemic. To this end, we analyzed 3915 genomes, of which 1515 were sequenced in this work. Metadata analysis revealed that a more significant burden of disease was observed in men, representing a larger proportion of hospitalized and deceased patients, as has been previously reported [35] . The median age of hospitalized patients was 50, and 58 years for the deceased, an age younger than expected in comparison to other countries [36] . The high prevalence of co-morbidities such as obesity and diabetes in the Mexican population may explain this younger age for severe disease [37] . However, it cannot be ruled out that the lower median age in Mexico (28 years) as compared to European countries (>40 years) and the USA (38 years) also impacted the observed ages for hospitalized and deceased patients.

The number of sequenced genomes throughout the year roughly corresponds with the incidence curve at the national level; however, a strong geographical bias is present, given that in some states the sampling was very limited. Therefore, we clustered sequences by geographical region to attain enough power to compare lineage diversity and homogeneity between regions. It is important to highlight that the W region had the lower number of sequences, and consequently, the results in this region should be interpreted with caution.

Interestingly, the CS region exhibited the lowest lineage diversity of SARS-CoV-2. This region includes the two states with the highest population density per km2, Mexico City, and the State of Mexico, with more than 26 million inhabitants (Supplementary Table S4 ). Additionally, Mexico City harbors the main international airport, being the busiest airport in Mexico and Latin America. However, land communication is the main route between cities and states; airports are mainly used for connections abroad and do not exceed road traffic. For example, in 2019, the number of passengers traveling by plane in Mexico was 149,178,000 and by land, 3,773,000,000 [38] . This observation suggests that initial entry points for SARS-CoV-2 lineages could be by air travelers; most likely, the dispersion between cities occurred by land. In addition, the other states in the CS region are very interconnected with Mexico City, contributing to the wide transmission of some variants and reducing the spread of novel lineages. This phenomenon completely contrasts with what has been reported in other countries, in which the regions with the highest diversity of SARS-CoV-2 lineages corresponded to the cities with the highest population, frequent local travelers, and where the main international airports are located [39] [40] [41] [42] [43] . Moreover, these reports showed a modest diversification of lineages in less densely populated areas, with less intense interconnections with other countries. In contrast, regions NE and NW, which have the lowest population density, exhibit the highest viral diversity in Mexico, probably due to low competition among the lineages, allowing the persistence of low fitness viruses. Even more critical is probably the fact that the states in the NW region have high connectivity with the USA, facilitating many viral introductions into these states. For instance, the most prominent lineages in these regions are those present in the USA border states, as is the case for lineages B.1.427 and B.1.429 that originated in California and were commonly detected in the NW of Mexico. After one year of the pandemic in Mexico, we observed distinct patterns of lineage distribution and dissemination throughout the country. In particular, the larger number of sequences obtained from the central and north regions allowed us to determine their different dynamics. [32] ; these changes in the nucleocapsid seem also to be associated with an increase in disease severity [32] . The substitution S:T732A has not been associated with changes in phenotype.

B.1.1.519 harbors six additional amino acid substitutions, four in ORF1a and two in the spike protein. Change nsp3:T4175I has been identified as evolving under pervasive and episodic positive selection [44] , but the phenotypic importance of this mutation remains to be established. The two amino acid changes in the spike protein seem to be of relevance. Substitution S:P681H, shared with VOC B.1.1.7 and B.1.351, has also been identified as evolving under strong pervasive and episodic positive selection and overlaps with a CTL epitope [44] . More importantly, this mutation has been shown to increase cleavage of the spike, but the impact on viral infectivity remains unclear [9] . Perhaps the most interesting change regarding the rapid spread of the virus is S:T478K, located in the receptor binding domain RBD. This amino acid change has been reported to have a significant increase in the binding free energy to ACE2 of around 1 kcal/mol according to a modeling prediction [45] ; interestingly, in the same work the observed binding free energy for N501Y and L423R mutants increased only by 0.55 and 0.58 kcal/mol, respectively. The substitution T478S was also identified when RBD mutations with increased affinity for ACE2 were selected by in vitro evolution [46] ; in that work the dominant mutations were E484K and N501Y. Additionally, amino acid substitution T478I showed a reduced neutralization by two monoclonal antibodies and by convalescent serum from two patients [47] .

The combination of these amino acid substitutions, including the T478K mutation, as well as the potential ability of B.1.1.519 to evade the partial immunity of the population, may have influenced the rapid dispersal of the virus, eliminating other circulating variants. Despite becoming the dominant variant in Mexico and harboring mutations of interest, B.1.1.519 was not declared a VOC or VOI. However, on 2 June 2021, it was designated by the WHO as a VUM. Interestingly, lineage B.1.1519 did not become dominant in any other country. Given the apparent advantage in the spread of this lineage, it is puzzling that it never attained a frequency higher than 3% in the USA, which was the country with the second-highest prevalence. It is important to consider that viral fitness of a lineage is relative to other lineages circulating in the region. The diversity of lineages in the USA was probably different at the time of B.1.1.519 s introduction. Moreover, stochastic transmission events, especially while lineage frequency is low, also impact lineage fate, and high fitness variants can be fortuitously lost. However, characterization of the B.1.1.519 lineage in vitro and in animal models is needed to further explore its real relevance.

The dynamics of lineage diversity in the CN, NE, and NW regions was markedly different from that in the CS, since at almost all time points a large diversity of lineages was observed in these regions. Notably, starting in December 2020, a relatively high frequency of lineages B.1.427 and B.1.429 was found in the NW. This observation is not surprising given that these lineages were first detected in California, and most of the Mexican sequences in this region were from Baja California, particularly from the border city of Tijuana. Furthermore, lineages B.1.241, B.2.243, B.1.2, and B.1.561 that circulated at different times in the NW were reported concomitantly in California. In addition, the CN, NE, and NW regions share many lineages that seem to have circulated primarily in Mexico. One interesting example is lineage B.1.243, which started circulating in the CN region in April 2020 at frequencies around 30% until October, when it started to decline. This lineage appeared in the NE in September and has been maintained at intermediate frequencies of around 15%. Moreover, it was present in the NW throughout 2020 at frequencies that oscillated between 5% to 15%.

The limitation of our analysis, as mentioned before, is that the sequenced samples represent a small proportion of all confirmed cases in Mexico during the first year of the pandemic, which might not be representative of all circulating lineages. Thus, our findings on epidemiological patterns or the spread of variants should be interpreted with caution. However, our results highlight the importance of genomic surveillance to determine the geographic dissemination of virus variants, information that can guide public health policies at the regional and local levels. In the light of the several VOC and VOI that have been recently described, including the Alpha, Beta, Gamma, and Delta variants, it is of utmost importance to continue the genomic surveillance in Mexico as well as in all other countries, and to identify viral mutations that may be associated with differences in the epidemiological or clinical features of these viruses.

Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/v13112161/s1, Figure S1 . The six regions of Mexico used in the analyses. Figure S2 . Number of synonymous and nonsynonymous single nucleotide polymorphisms (SNPs) per site by ORF. Table S1 . Description of full-length SARS-CoV-2 Mexican genomes generated and used in this study. Table S2 . Patient demographic and clinical characteristics. Table S3 . List of variants that appears one or two times throughout the year in Mexico. Table S4 . Population density in each federal state of Mexico. Table S5 . GISAID acknowledgements table. Funding: This work was partially supported by grants "Epidemiología Genómica de los Virus SARS-CoV-2 Circulantes en México" (to CFA) and "Caracterización de la diversidad viral y bacteriana presente en la enfermedad respiratoria aguda" to (JAVP) from the National Council for Science and Technology of Mexico (CONACyT), and grant 057 from the Ministry of Education, Science, Technology and Innovation (SECTEI) of Mexico City (to CFA), as well as the epidemiological surveillance system.

It was waived because all samples used in this study were unlinked from any personal identifiers prior to commencement of the study.

The generated sequences of SARS-CoV-2 used in this study have been publicly shared through the Global Initiative on Sharing All Influenza Data (GISAID) repository and have also been deposited in the Genbank NCBI database. Accession numbers are listed in Supplementary Table S1. 

Johns Hopkins Coronavirus Resource Center Mortality Analyses -Johns Hopkins Coronavirus Resource Center

A Global Database of COVID-19 Vaccinations

Evaluating the Effects of SARS-CoV-2 Spike Mutation D614G on Transmissibility and Pathogenicity

SARS-CoV-2 D614G Variant Exhibits Efficient Replication Ex Vivo and Transmission in Vivo

N501Y Mutation of Spike Protein in SARS-CoV-2 Strengthens Its Binding to Receptor ACE2

Functional Evaluation of Proteolytic Activation for the SARS-CoV-2 Variant B.1.1.7: Role of the P681H Mutation

Antibody-Mediated Neutralization of Authentic SARS-CoV-2 B.1.617 Variants Harboring L452R and T478K/E484Q. Viruses

Transmission, Infectivity, and Neutralization of a Spike L452R SARS-CoV-2 Variant

Genomic Evidence of SARS-CoV-2 Reinfection Involving E484K Spike Mutation

Genomic Analysis of Early SARS-CoV-2 Variants Introduced in Mexico

Detection of 2019 Novel Coronavirus (2019-NCoV) by Real-Time RT-PCR. Eurosurveillance 2020

Fast Gapped-Read Alignment with Bowtie 2

An Amplicon-Based Sequencing Framework for Accurately Measuring Intrahost Virus Diversity Using PrimalSeq and IVar

Global Initiative on Sharing All Influenza Data-from Vision to Reality

Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability

R Core Team R. A Language and Environment for Statistical Computing

R Foundation for Statistical Computing

Elegant Graphics for Data Analysis; ggplot2

IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era

Fast Model Selection for Accurate Phylogenetic Estimates

Fast Dating Using Least-Squares Criteria and Algorithms

Ggtree: An r Package for Visualization and Annotation of Phylogenetic Trees with Their Covariates and Other Associated Data

Treeio: An R Package for Phylogenetic Tree Input and Output with Richly Annotated and Associated Data

A New Method for Non-Parametric Multivariate Analysis of Variance

Positive Selection of ORF1ab, ORF3a, and ORF8 Genes Drives the Early Evolutionary Trends of SARS-CoV-2 During the 2020 COVID-19 Pandemic

Emergence and Spread of the Potential Variant of Interest (VOI) B.1.1.519 of SARS-CoV-2 Predominantly Present in Mexico

Nucleocapsid Mutation R203K/G204R Increases the Infectivity, Fitness and Virulence of SARS-CoV-2

Reduced Neutralization of SARS-CoV-2 B.1.617 by Vaccine and Convalescent Serum

Characterization of SARS-CoV-2 Variants B.1.617.1 (Kappa), B.1.617.2 (Delta) and B.1.618 on Cell Entry, Host Range, and Sensitivity to Convalescent Plasma and ACE2 Decoy Receptor

Gender Differences in Patients With COVID-19: Focus on Severity and Mortality. Front. Public Health

Age-Specific Mortality and Immunity Patterns of SARS-CoV-2

Impact of Comorbidities in Mexican SARS-CoV-2-Positive Patients: A Retrospective Analysis in a National Cohort

Genomic Epidemiology of SARS-CoV-2 Reveals Multiple Lineages and Early Spread of SARS-CoV-2 Infections in

Early Transmission Dynamics, Spread, and Genomic Characterization of SARS-CoV-2 in Panama

Establishment and Lineage Dynamics of the SARS-CoV-2 Epidemic in the UK

Spread of a SARS-CoV-2 Variant through Europe in the Summer of 2020

Impact of International Travel and Border Control Measures on the Global Spread of the Novel 2019 Coronavirus Outbreak

Natural Selection Analysis of Global SARS-CoV-2/COVID-19

Vaccine-Escape and Fast-Growing Mutations in the United Kingdom, the United States

SARS-CoV-2 RBD in Vitro Evolution Follows Contagious Mutation Spread, yet Generates an Able Infection Inhibitor

Identification of SARS-CoV-2 Spike Mutations That Attenuate Monoclonal and Serum Antibody Neutralization

We gratefully acknowledge authors from the originating laboratories responsible for obtaining the specimens and the submitting laboratories from which genetic sequence data were generated and shared via the GISAID Initiative (Supplementary Table S5 ). We thank David Santiago Castañeda Carreón, Juan Manuel Hurtado and Roberto Pablo Rodríguez Bahena for their computer support. We also thank the Dirección General de Cómputo y de Tecnologías de la Información (DGTIC-UNAM) for providing supercomputing resources on MIZTLI through the projects LANCAD-UNAM-DGTIC-350 and LANCAD-UNAM-DGTIC-396.

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.