key: cord-311144-tumtzad8
authors: Franco-Muñoz, Carlos; Álvarez-Díaz, Diego A.; Laiton-Donato, Katherine; Wiesner, Magdalena; Escandón, Patricia; Usme-Ciro, José A.; Franco-Sierra, Nicolás D.; Flórez-Sánchez, Astrid C.; Gómez-Rangel, Sergio; Rodríguez-Calderon, Luz D.; Barbosa-Ramirez, Juliana; Ospitia-Baez, Erika; Walteros, Diana M.; Ospina-Martinez, Martha L.; Mercado-Reyes, Marcela
title: Substitutions in Spike and Nucleocapsid proteins of SARS-CoV-2 circulating in South America
date: 2020-09-17
journal: Infect Genet Evol
DOI: 10.1016/j.meegid.2020.104557
sha: 
doc_id: 311144
cord_uid: tumtzad8

SARS-CoV-2 is a new member of the genus Betacoronavirus, responsible for the COVID-19 pandemic. The virus crossed the species barrier and established itself in the human population, taking advantage of the high affinity of the spike protein for the ACE2 receptor to infect the lower respiratory tract. The Nucleocapsid (N) and Spike (S) are highly immunogenic structural proteins, and most commercial COVID-19 diagnostic assays target these proteins. In an unpredictable epidemic, it is essential to know about their genetic variability. The objective of this study was to describe the substitution frequency of the S and N proteins of SARS-CoV-2 in South America. A total of 504 amino acid and nucleotide sequences of the S and N proteins of SARS-CoV-2 from seven South American countries (Argentina, Brazil, Chile, Ecuador, Peru, Uruguay, and Colombia), reported as of June 3, 2020, and corresponding to samples collected between March and April 2020, were compared through substitution matrices using the Muscle algorithm in MEGA X. Forty-three sequences from 13 Colombian departments were obtained in this study using the Oxford Nanopore and Illumina MiSeq technologies, following the amplicon-based ARTIC network protocol. The substitutions D614G in S and R203K/G204R in N were the most frequent in South America, observed in 83% and 34% of the sequences, respectively. Strikingly, genomes with the conserved position D614 were almost completely replaced by genomes with the G614 substitution between March and April 2020. A similar replacement pattern was observed with R203K/G204R, although it was more marked in Chile, Argentina and Brazil, suggesting similar introduction histories and/or control strategies of SARS-CoV-2 in these countries. It is necessary to continue with the genomic surveillance of S and N proteins during the SARS-CoV-2 pandemic, as this information can be useful for developing vaccines, therapeutics and diagnostic tests.


The recently emerged SARS-CoV-2, responsible for the coronavirus disease 2019 (COVID-19) pandemic, has caused a significant increase in the number of cases and deaths, with about 70,000 new cases reported globally each day (WHO, 2020a, 2020c). The first case of COVID-19 in South America was reported in Brazil on February 26, in a 61-year-old man traveling from Italy (gob.br, 2020).

In Colombia, the first case of COVID-19 was announced on March 6, in a traveler from Italy; since then, the number of patients has exceeded 43,000, with over 1,500 deaths (INS, 2020).

The SARS-CoV-2 genome consists of a single, positive-stranded RNA (ssRNA[+]), 29,903 nucleotides long. The virus has been shown to be highly infectious and easily transmitted among human populations, even infecting other vertebrate species under laboratory conditions (Shi et al., 2020). The SARS-CoV-2 genome has nine open reading frames (ORFs); the first one, subdivided into ORF1a and ORF1b by ribosomal frameshifting, encodes the polyproteins pp1a and pp1ab, which are processed into non-structural proteins involved in subgenomic/genome-length RNA synthesis and virus replication. The structural proteins Spike (S), Envelope (E), Membrane (M), and Nucleocapsid (N) are encoded in subgenomic mRNA transcripts within ORFs 2, 4, 5, and 9, respectively (SIB, 2020; Yount et al., 2005).

The Spike protein, a type I membrane glycoprotein, is the most exposed viral protein. It is recognized by the cellular receptor angiotensin-converting enzyme 2 (ACE2) during the infection of the lower respiratory tract and is considered the main inducer of neutralizing antibodies. The N protein is associated with the RNA genome to form the ribonucleocapsid and is abundantly expressed during infection. Both N and S proteins are highly immunogenic, and most commercial COVID-19 diagnostic tests (molecular and immunologic) target these proteins (Álvarez-Díaz et al., 2020; Lee et al., 2020).

Furthermore, non-synonymous mutations in the S and N proteins have been reported; their implications in the potential emergence of antigenically distinct and/or more virulent strains remain to be studied, although it was reported that mutations in the receptor-binding domain (RBD) of the S protein of SARS-CoV-related viruses disrupt the antigenic structure and the binding activity of the RBD to ACE2 (Du et al., 2009). Similarly, how non-synonymous mutations could impact the antibody response and the specificity and sensitivity of serological tests for COVID-19 diagnosis is unknown. Thus, identifying variable sites in these proteins can provide a valuable resource for choosing the target antigens for the development of SARS-CoV-2 vaccines, therapeutics, and diagnostic tests (Du et al., 2009; Jacofsky et al., 2020).
The objective of this study is to describe the frequency of substitutions in S and N proteins of SARS-CoV-2 in South America.

This work was developed according to the national law 9/1979 and decrees 786/1990 and 2323/2006, which establish that the Instituto Nacional de Salud (INS) of Colombia is the reference lab and health authority of the national network of laboratories. In cases of public health emergency, or those in which scientific research for public health purposes is required, the INS is authorized to use the biological material for research purposes without informed consent, which includes the anonymous disclosure of results. This study was performed following the ethical standards of the 1964 Declaration of Helsinki and its later amendments.

The information used for this study comes from secondary sources of data that were previously anonymized and that protect patient data.

Nasopharyngeal swab samples from patients with suspected SARS-CoV-2 infection were processed for RNA extraction using the automated MagNA Pure LC nucleic acid extraction system (Roche Diagnostics GmbH, Mannheim, Germany), and viral RNA detection was performed by real-time RT-PCR using the SuperScript III Platinum One-Step Quantitative RT-qPCR kit (Thermo Fisher Scientific, Waltham, MA, USA), following the Charité-Berlin protocol (Corman et al., 2020) for the amplification of the SARS-CoV-2 E (betacoronavirus screening assay) and RdRp (SARS-CoV-2 confirmatory assay) genes.

NGS of SARS-CoV-2 from 43 patients was performed using the amplicon-based Illumina and Nanopore sequencing approaches, following the ARTIC network protocol (Quick, 2020a). Following cDNA synthesis with SuperScript IV reverse transcriptase (Thermo Fisher Scientific, Waltham, MA, USA) and random hexamers (Thermo Fisher Scientific, Waltham, MA, USA), a set of 400-bp tiling amplicons across the whole genome of SARS-CoV-2 was generated using the primer scheme nCoV-2019/V3 (Quick, 2020a).

SARS-CoV-2-specific oligonucleotides were used for the generation of amplicons by means of a high-fidelity DNA polymerase (Q5® High-Fidelity DNA Polymerase, New England Biolabs Inc., UK, EB), in order to avoid the introduction of artificial substitutions.

Reads were mapped to the Wuhan-Hu-1 reference genome (NC_045512.2) using BWA and BBmap (brian-jgi, 2020); then, assembled sequences were submitted to GISAID. Substitution matrices of nucleotides and amino acids of S and N proteins were generated from a multiple sequence alignment of the reference genome against the 43 assembled Colombian SARS-CoV-2 genomes (Table 1) using the Muscle algorithm (Edgar, 2004) in MEGA X (Kumar et al., 2016). Subsequently, 461 SARS-CoV-2 sequences from South American countries, including Argentina, Brazil, Chile, Ecuador, Peru, Uruguay and other sequences from Colombia, available in the GISAID, NCBI, and GSA databases, were analyzed (Supplementary Table S1 and Supplementary Table S2).
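The comparison of each aligned sequence against the reference can be illustrated with a minimal sketch. This is not the MEGA X procedure used in the study; it only shows, under stated assumptions, how amino acid substitutions could be named in reference numbering (for example D614G) from a pairwise row of a multiple sequence alignment. The function name `call_substitutions`, the set of skipped symbols, and the toy peptides are hypothetical.

```python
from typing import List

# Symbols in the query that are treated as "no call" (an assumption of this sketch):
# gaps, stop codons and undetermined residues.
NO_CALL = {"X", "-", "*", "?"}


def call_substitutions(ref_aln: str, query_aln: str) -> List[str]:
    """Return substitutions such as 'D614G', numbered on the ungapped reference."""
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have equal length")
    subs = []
    ref_pos = 0  # 1-based position in the ungapped reference
    for ref_aa, qry_aa in zip(ref_aln.upper(), query_aln.upper()):
        if ref_aa == "-":
            continue  # insertion relative to the reference: no reference position
        ref_pos += 1
        if qry_aa in NO_CALL or qry_aa == ref_aa:
            continue  # undetermined, deleted or identical residue
        subs.append(f"{ref_aa}{ref_pos}{qry_aa}")
    return subs


# Toy alignment of hypothetical 9-residue peptides (not real S or N sequence).
ref = "MKT-DVLRGA"
query = "MKTADVLKRA"
print(call_substitutions(ref, query))
```

Numbering on the ungapped reference keeps substitution names stable when other sequences in the alignment carry insertions.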

Several non-synonymous substitutions were observed in the S and N proteins of the Colombian SARS-CoV-2 sequences generated in this study. Three amino acid substitutions were observed in the S protein; D614G was present in 81% (35/43) of the sequences. Furthermore, substitutions G181V and D936Y were each found at a low frequency of 2.3% (1/43) (Table 1). In the N protein, five amino acid substitutions were found, the most frequent being R203K and G204R in 13.95% (6/43) of the sequences. Amino acid substitutions R191C, R209I and G238C were found in 4.65% (2/43), 4.65% (2/43) and 6.98% (3/43) of the Colombian sequences, respectively (Table 1). Some nucleotide substitutions were synonymous.
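The percentages above follow directly from the counts reported in the text. A minimal sketch of that arithmetic is given below, assuming each genome is represented as a list of its substitutions; the function name `substitution_frequencies` is hypothetical, and the placement of the two rare S substitutions on particular genomes is an arbitrary choice for illustration, since the text does not state how they co-occur.

```python
from collections import Counter
from typing import Dict, Iterable, List


def substitution_frequencies(per_genome: Iterable[List[str]]) -> Dict[str, float]:
    """Percentage of genomes carrying each substitution."""
    genomes = list(per_genome)
    if not genomes:
        return {}
    # set() ensures a substitution is counted at most once per genome
    counts = Counter(sub for subs in genomes for sub in set(subs))
    return {sub: 100.0 * n / len(genomes) for sub, n in counts.items()}


# Rebuild the reported S-protein counts for the 43 Colombian genomes:
# 35 with D614G, 1 with G181V, 1 with D936Y.
colombian_s = [["D614G"] for _ in range(35)] + [[] for _ in range(8)]
colombian_s[0].append("G181V")  # arbitrary genome, for illustration only
colombian_s[1].append("D936Y")  # arbitrary genome, for illustration only

for sub, pct in sorted(substitution_frequencies(colombian_s).items()):
    print(f"{sub}: {pct:.2f}%")
```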

The genomic resource databases NCBI, GISAID and GSA were consulted to determine the substitutions in S and N proteins of SARS-CoV-2 from South America. A total of 504 genomes reported as of June 3, 2020, were analyzed: 126 from Colombia (including the 43 genomes reported in this study), 29 from Argentina, 145 from Brazil, 153 from Chile, 4 from Ecuador, 2 from Peru and 45 from Uruguay. Fifty sequences of S and 27 of N were excluded from the analysis because of the presence of undetermined bases that did not allow the proper identification of the S and N ORFs in the amino acid substitution matrices.
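The per-country totals and the exclusion step can be sketched as follows. The exclusion criterion is only described qualitatively in the text, so the rule used here (any non-ACGT base inside the ORF, controlled by a hypothetical `max_n_fraction` parameter) is an assumption, as are the function name `has_undetermined` and the toy sequences; ORF coordinates are passed in by the caller rather than fixed.

```python
# Genome counts per country as reported in the text.
country_counts = {
    "Colombia": 126, "Argentina": 29, "Brazil": 145, "Chile": 153,
    "Ecuador": 4, "Peru": 2, "Uruguay": 45,
}
total = sum(country_counts.values())
print(total)  # total number of genomes analyzed

# Sequences remaining after the reported exclusions (50 for S, 27 for N).
remaining_s = total - 50
remaining_n = total - 27


def has_undetermined(genome: str, start: int, end: int,
                     max_n_fraction: float = 0.0) -> bool:
    """True if the 1-based, inclusive region [start, end] has too many non-ACGT bases.

    The threshold is an assumption of this sketch, not a value from the study.
    """
    region = genome[start - 1:end].upper()
    if not region:
        return True  # region missing entirely: treat as unusable
    n_fraction = sum(base not in "ACGT" for base in region) / len(region)
    return n_fraction > max_n_fraction


# Toy 20-nt "genomes" with a hypothetical ORF at positions 5..12.
clean = "ACGT" * 5
gappy = "ACGTNNNNACGTACGTACGT"
print(has_undetermined(clean, 5, 12), has_undetermined(gappy, 5, 12))
```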

Twenty-eight and twenty-two non-synonymous substitutions were identified in the sequences of the S and N proteins, respectively, in genomes from South America (Tables S1 and S2). The most frequent in S were D614G (83%), V1176F (2.2%) and P1263L (1.5%), while the most frequent in N were R203K (34.5%), G204R (34.3%), I292T (15.8%) and S197L (3.3%). The remaining substitutions in both S and N occurred in less than 1% of the sequences. These included G181V and D936Y in S, and R191C and G238C in N, as observed in the Colombian genomes (Fig. 1).

The analysis of substitution frequencies by country shows that the D614G substitution in the S protein was frequent in Argentina, Brazil, Chile, Colombia and Peru, with 80-100% of the reported sequences (Fig. 2A). In Ecuador and Uruguay, the D614 position was predominant in March; however, by April the G614 substitution had reached 80% in Uruguay. In general, the percentage of genomes in South America with this substitution increased to nearly 100% from March to April (Fig. 2B).
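A country-by-month breakdown of this kind can be sketched as a simple grouping step. The records below are hypothetical toy values, not data from the study, and the function name `proportion_by_country_month` is an assumption; each record is a (country, collection month, carries-substitution) triple.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, bool]  # (country, "YYYY-MM", genome carries the substitution)


def proportion_by_country_month(records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Fraction of genomes carrying the substitution per (country, month)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for country, month, has_sub in records:
        key = (country, month)
        totals[key] += 1
        hits[key] += int(has_sub)
    return {key: hits[key] / totals[key] for key in totals}


# Hypothetical toy records for one substitution (for illustration only).
toy_records = [
    ("Uruguay", "2020-03", False), ("Uruguay", "2020-03", False),
    ("Uruguay", "2020-03", True),
    ("Uruguay", "2020-04", True), ("Uruguay", "2020-04", True),
    ("Uruguay", "2020-04", True), ("Uruguay", "2020-04", True),
    ("Uruguay", "2020-04", False),
]

for key, frac in sorted(proportion_by_country_month(toy_records).items()):
    print(key, f"{frac:.0%}")
```

Keeping the totals alongside the proportions matters in practice, because countries with very few genomes (such as Ecuador and Peru here) give unstable percentages.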

The non-synonymous substitutions R203K and G204R, which are the hallmarks of the B.1.1 lineage, were the most frequent in the N protein of South American sequences. Both substitutions were frequent in Argentina and Brazil, with 55% and 74% of the reported sequences, respectively (Fig. 3A). In Ecuador and Chile the frequency of these substitutions was about 20%, while in Uruguay the frequency was similar to that in Colombia. Furthermore, the proportion of genomes with this double substitution increased in Chile, Argentina and Brazil from March to April. In contrast, this proportion increased only slightly in Colombia and Uruguay, and remained below 20% (Fig. 3B).

The substitution I292T in the N protein was rare in Argentina (10.7%), Chile (4.6%) and Uruguay (2.2%), and absent in Colombia, Peru and Ecuador. In contrast, this substitution was very frequent in Brazil (56.3%) (Fig. 3C). The spatiotemporal distribution pattern of this substitution was similar to that of R203K and G204R, increasing from March to April in Chile, Argentina and Brazil, in contrast to Colombia and Uruguay, where this substitution was almost absent in genomes registered in April (Fig. 3D).


The first COVID-19 case in Colombia was confirmed on March 6, 2020, in a traveler who entered the country from Italy on February 26, 2020 (EPI_ISL_418262). By June 11, 2020, a total of 43,810 confirmed cases and 1,505 deaths had been reported (INS, 2020). This study evidenced the presence of the This lineage has been reported in samples from travelers with connection to Italy (Gupta and Mandal, 2020), and was also observed in the first confirmed case of SARS-CoV-2 in Colombia (EPI_ISL_418262) and in another patient with a travel connection to Spain (EPI_ISL_456149) (Table 1). Furthermore, multiple countries outside Italy have reported this lineage among their samples, including Belgium, Switzerland, Vietnam, India, Nigeria and Mexico, demonstrating a wide distribution worldwide (Gupta and Mandal, 2020).

RNA viruses are known to possess high substitution rates compared to DNA viruses, leading to high genetic variability and the rapid action of the evolutionary mechanisms of natural selection and genetic drift (Tang et al., 2020). However, SARS-CoV-2 and other coronaviruses have proteins with exonuclease activity, such as nsp14, with error-correcting capacity (Romano et al., 2020; Subissi et al., 2014). Although some evolutionary changes may in fact be adaptive, it is important to be careful with conclusions in the absence of an experimental model Table S3). Recombinant proteins or synthetic peptides of SARS-CoV-2 are widely explored as alternatives to be used in serological tests and therapeutics against SARS-CoV-2 and related Betacoronavirus (Du et al., 2009; Jacofsky et al., 2020), considering that S and N proteins are the major immunogenic proteins of SARS and MERS coronaviruses and the first choice for producing recombinant antigens (Yan et al., 2020).

Amino acid changes were found in the S and N proteins of SARS-CoV-2 circulating in South America, the most frequent being D614G in S, and R203K/G204R and I292T in N. It is necessary to continue with genomic surveillance of changes in these proteins during the SARS-CoV-2 pandemic, all the more so considering that these proteins are the most commonly used in serological and molecular tests.

The identification of nucleotide substitutions, amino acid changes and their frequencies in circulating viruses can be useful for public health decision-making, including vaccine design efforts and the design of SARS-CoV-2 diagnostic tests and therapeutic compounds.

The authors thank the National Laboratory Network for routine virologic

Molecular analysis of several in-house rRT-PCR protocols for SARS-CoV-2 detection in the context of genetic variability of the virus in Colombia

SARS-CoV-2 viral spike G614 mutation exhibits higher case fatality rate

Global Spread of SARS-CoV-2 Subtype with Spike Protein Mutation D614G is Shaped by Human Genomic Variations that Regulate Expression of TMPRSS2 and MX1 Genes. bioRxiv. brian-jgi

Distinct Viral Clades of SARS-CoV-2: Implications for Modeling of Viral Spread

Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR. Euro surveillance : bulletin Europeen sur les maladies transmissibles = European communicable disease bulletin 25

The spike protein of SARS-CoV-a target for vaccine and therapeutic development

MUSCLE: multiple sequence alignment with high accuracy and high throughput

Loss of Epitopes from SARS-Cov-2 Proteins for Nonsynonymous Mutations: A Potential Global Threat

Temporal dynamics in viral shedding and transmissibility of COVID-19

Coronavirus (COVID-2019) en Colombia. Instituto Nacional de Salud

Understanding Antibody Testing for COVID-19

A Novel Synonymous Mutation of SARS-CoV-2: Is This Possible to Affect Their Antigenicity and Immunogenicity? Vaccines 8

MEGA7: molecular evolutionary genetics analysis version 7.0 for bigger datasets

Serological Approaches for COVID-19: Epidemiologic Perspective on Surveillance and Control

Bayesian phylodynamic inference on the temporal evolution and global transmission of SARS-CoV-2

The global emergences of multiple SARS-CoV-2 sub-strains: Digital annotations for human behaviors may assist automated retracing of symptomatic features and origins

nCoV-2019 sequencing protocol. protocols.io

A Structural View of SARS-CoV-2 RNA Replication Machinery: RNA Synthesis, Proofreading and Final Capping

Susceptibility of ferrets, cats, dogs, and other domesticated animals to SARS-coronavirus 2

SIB, 2020. Betacoronavirus. Swiss Institute of Bioinformatics

One severe acute respiratory syndrome coronavirus protein complex integrates processive RNA polymerase and exonuclease activities

On the origin and continuing evolution of SARS-CoV-2

Phylogenetic interpretation during outbreaks requires caution

WHO Director-General's opening remarks at the media briefing on COVID-19 -11

Novel Coronavirus (2019-nCoV) technical guidance: Laboratory testing for 2019-nCoV in humans. World Health Organization

Laboratory testing of SARS-CoV, MERS-CoV, and SARS-CoV-2 (2019-nCoV): Current status, challenges, and countermeasures

Severe acute respiratory syndrome coronavirus group-specific open reading frames encode nonessential functions for replication in cell cultures and mice

The authors declare no competing interest.

This study was funded by the National Institute of Health, Bogota, Colombia