key: cord-335492-od3c25qg
authors: UGUREL, Osman Mutluhan; ATA, Oguz; TURGUT-BALIK, Dilek
title: An updated analysis of variations in SARS-CoV-2 genome
date: 2020-06-21
journal: Turk J Biol
DOI: 10.3906/biy-2005-111
sha: 
doc_id: 335492
cord_uid: od3c25qg

A novel pathogen, named SARS-CoV-2, has caused an unprecedented worldwide pandemic in the first half of 2020. As the SARS-CoV-2 genome sequences have become available, one of the important focus of scientists has become tracking variations in the viral genome. In this study, 30366 SARS-CoV-2 isolate genomes were aligned using the software developed by our group (ODOTool) and 11 variations in SARS-CoV-2 genome over 10% of whole isolates were discussed. Results indicated that, frequency rates of these 11 variations change between 3.56%–88.44 % and these rates differ greatly depending on the continents they have been reported. Despite some variations being in low frequency rate in some continents, C14408T and A23403G variations on Nsp12 and S protein, respectively, observed to be the most prominent variations all over the world, in general, and both cause missense mutations. It is also notable that most of isolates carry C14408T and A23403 variations simultaneously and also nearly all isolates carrying the G25563T variation on ORF3a, also carry C14408T and A23403 variations, although their location distributions are not similar. All these data should be considered towards development of vaccine and antiviral treatment strategies as well as tracing diversity of virus in all over the world.

is caused by a novel pathogen that is pursued closely by all over the world after the worldwide pandemic was declared by WHO on March 11th. It has become one of the most important health problems by causing nearly 5M confirmed cases and over 300K deaths from 213 countries/regions in a couple of months (WHO, 2020a) 1 . More than 17,000 papers by the query "COVID-19" indexed in PubMed/NCBI and nearly 200 of them, as of 29th May 2020, about genome of severe acute respiratory syndrome-COronaVirus-2 (SARS-CoV-2) which is the agent responsible for the disease.

SARS-CoV-2, is an enveloped, +ssRNA virus belonging to the Coronaviridae family that are classified into 4 major genera: Alphacoronavirus, Betacoronavirus, Gammacoronavirus, and Deltacoronavirus by phylogenetic studies and classified in the Betacoronavirus genus, Sarbecovirus and grouped as SARS-Like CoVs (Gorblenya et al., 2020) . CoVs can infect many of animal 1 WHO (2020a). Coronavirus disease (COVID-2019) situation reports. [online] . Website: https://www.who.int/emergencies/diseases/ novel-coronavirus-2019/situation-reports accessed 17 May 2020]. species from different genus including mammals, avians and reptiles (Gorblenya et al., 2006; Gorblenya et al., 2020) . Until December 2019, there were 6 coronavirus species known as human pathogen. Four of them (229E, OC43, NL63, and HKU1) have caused common cold and others 2 worldwide outbreaks (SARS 2003; MERS 2012) over the last 20 years (Su et al., 2016; Zhu et al., 2020) . Recently, a 7th coronavirus species has been discovered in December 2019 and first named as 2019 novel Coronavirus (2019 nCoV) and then as SARS-CoV-2 (Gorblenya et al., 2020).

An outbreak of severe acute respiratory syndrome (SARS) was reported in November 2002. Despite patients carried symptoms of a viral infection, no pathogen causing pneumonia was identified, and in a few months, it was revealed that a novel CoV had caused this syndrome (Peiris et al., 2003) . The filiation studies about SARS showed that the early cases were mostly seen among restaurant workers in Guandong Province, China and this information led researchers to suspect that transmission source of the virus might be an animal like bat or civet, that frequently consumed in that province (Zhong et al., 2003; Su et al., 2016) . This was the first-known CoV outbreak that caused 8096 cases and 774 deaths in 37 countries or areas until July 2003 (WHO, 2004) 

In the summer of 2012, 9 years after SARS outbreak was controlled, a novel coronavirus disease called Middle East Respiratory Syndrome (MERS) was reported in Saudi Arabia (WHO, 2019) 3 . A 60-year-old man with severe fever and cough, was diagnosed as pneumoniae and, as in SARS, no pathogen causing pneumonia was detected, however, fragments that amplified from some of PCR assays for the detection of coronaviruses were sequenced and analysis of the results indicated a novel coronavirus named MERS-CoV relative with HKU4 and HKU5 (Zaki et al., 2012; de Groott al., 2013; Yin and Wunderink, 2013) . Despite the transmission from animal to human has not been clearly verified, the dromedary camels have been proposed as the main reservoir host for the virus (Azhar et al., 2014; WHO, 2019) . This epidemic was mostly effective in Saudi Arabia and in the gulf countries, caused nearly 2500 confirmed cases with 35% death rate (WHO, 2019).

Transmission mechanism of both SARS-CoV and MERS-CoV from animal to human was reported to be the direct contact with host animals or consumption of raw milk, meat or urine (Yin and Wunderink, 2013; WHO, 2019) . The studies listed the 3 major causes of SARS epidemic termination: the public health strategies, scientific development in biology and medicine, and hygiene practices (Chew, 2007) .

It was already known that the Spike (S) glycoprotein plays a determining role in CoV infections (Sanchez et al., 1999) and is effective in viral entry and pathogenesis (Gallagher and Buchmeiert, 2001) . During the ongoing SARS outbreak, it has been discovered that the novel virus has developed a new viral entry mechanism by binding S protein to angiotensin-converting enzyme 2 (ACE2) receptors , as the identification mark of SARS-CoV adaptation which has been caused by the mutations on S protein residues between 318-510 that named as receptor binding protein (RBD) (Wong et al., 2004; Li et al., 2005) . After the SARS outbreak, animals such as bats, civets, and pangolins, have been seen as the main reservoir hosts of the SARS-related CoVs that has similar entry mechanism and scientists have focused on 2 WHO (2004 studying viruses hosted by these animals (Ge et al., 2013; Li, 2016; Cui et al., 2019; Liu et al., 2019) . Despite RBD of MERS-CoV-2 binds to dipeptidyl peptidase 4 (Raj et al.,2013) , it has been revealed that RBD of SARS-CoV-2 binds ACE2, like in SARS-CoV (Tai et al., 2020) . Although it has been lasted about 1 year to resolve this mechanism for the SARS-CoV , this timing was much shorter, as nearly 3 months, with the research ability that have been gained with the developing DNA sequencing technologies and over years of experience in this area (Tai et al., 2020; Ou et al., 2020) .

There are more than 7,000 complete genome entry uploaded to Nucleotide/NCBI databases from Coronaviridae family between 2002-2020, more than half of them is being SARS-CoV-2 (www.ncbi.nlm.nih.gov/ nuccore/, accessed 29th May 2020) and it is increasing in parallel to the development of sequencing technology. The coronavirus genome structure has been characterised by various studies (Kocherhans et al., 2001; Brian and Baric, 2005; Gorblenya et al., 2006; Yang and Leibowitz, 2015; Madhugiri et al., 2016) with the genome size around 30kb. CoVs are referred to have the largest known genome size among RNA viruses (Brian and Baric, 2005) . All CoV genomes contain a large gene region (named ORF) encoding nonstructural proteins (Nsp) which are responsible for mostly replication and the genes encoding spike (S) glycoprotein, envelope (E) protein, membrane (M) glycoprotein and nucleocapsid (N) protein have also been found in common (Brian and Baric, 2005; Gorblenya et al., 2006) .

The genome sequence of SARS-CoV-2 was first characterised by Wu et al. in December 2019 and appointed as reference genome of SARS-CoV-2 (NC_045512.2; 1-29870); consist of 11 gene regions; ORF1ab Although this reference genome sequence is commonly used in most studies, it is extremely important to monitor the variations in the virus genome to understand the evolution and spread of the virus and also to use this information in the development possible treatments and vaccines accordingly. For this purpose, GISAID (GISAID, 2020; Elbe and Buckland-Merrett, 2017; Shu, and McCauley, 2017) which collects and shares genome sequences and related clinical/epidemiological data for monitoring annual influenza strains, established a new database called EpiCoV and the first SARS-CoV-2 genome was shared on 10th January 2020. More than 30,000 genome sequences of SARS-CoV-2 were uploaded to EpiCoV databases between 10th January and 20th May 2020 (EpiCoV, 2020) 4 .

In this study; we have used these data to analyse the mutations on SARS-CoV-2 genome using a software based on multiple sequence alignment (Strategy Based Local Alignment Tool: ODOTool) that have been originally developed for bacterial SNP determination in our studies. Now, we targeted to analyse the mutations that have emerged in at least 10% of SARS-CoV-2 genomes in all 30366 sequences submitted in GISAID by May 20th, 2020 using the ODOTool in terms of date and location they occurred, the relationship with each other and their effect on the primary protein structure.

Despite the Strategy Based Local Alignment Tool (ODOTool) used in this study was originally developed by our group for bacterial single nucleotide polymorphism (SNP) determination, it was reasonable to test the abilities of the tool using a different dataset with the emergence of SARS-CoV-2 causing COVID-19 pandemic and this is applied in the present study to analyse variations in viral genome.

The first genome sequence of SARS-CoV-2 has been identified and submitted with the accession number of MN908947.1 to Genbank by Wu et al. in December 2020 , then curated by NCBI staff, reviewed by RefSeq (O'Leary et al., 2015) , and appointed as the reference genome of SARS-CoV-2 (NC_045512.2) (Genbank, 2020). Genomic sequences (30366) of SARS-CoV-2 isolates submitted on GISAID/ EpiCoV database by 21st May 2020, have been downloaded in FASTA file format, into our local database for alignment and analysis.

Workflow of ODOTool is given in Figure 1 . The main purpose of developing the ODOTool was analysing bacterial gene sequences for single nucleotide polymorphism (SNP) determination. It was designed to download bulk data of gene sequences from open access databases such as Genbank/NCBI and EMBL-EBI, to prepare data first for prealignment and then for alignment and analysis of the aligned sequences. Prealignment and generating universal consensus sequences module (shown in grey in Figure 1 ) is an additional step for analysis of the sequences among different genus, hence it is not used in the present study.

The ODOTool has been coded with Python programming language by Biophyton library (Cock et al., 4 EpiCoV (2020) 2009). After the genome sequence downloading process, all data were stored in our local database with annotations such as isolate name, location, collection, and submission dates etc. The data downloaded from different databases were not in a standard format. Therefore, an additional data conversion step was applied to remove any undesired features or information like FASTA comments, line feeds, blank spaces and all sequences set to upper cases etc. to standardise all data prior to further analyses. Modified Needleman-Wunsch algorithm, modified BLOSUM 62 scoring matrix and adjusted gap penalties (match = 4, mismatch = -1, reference genome open gap score = -1000, reference genome extend gap score = -1000, isolate genome open gap score = -20, isolate genome extend gap score = -4) were used for alignment step. Aligned nucleotide sequences were then converted to protein sequences to determine missense mutations using standard codon table (Peabody, 1989) . Aligned nucleotide and protein sequences of all downloaded isolate genomes were stored in local database for computational, visual, and statistical analysis.

The alignments performed by ODOTool were validated with multiple alignment program MAFFT v7 (Kuraku et al., 2013; Katoh et al., 2017) to evaluate the accuracy and performance of ODOTool. Data size suggested for accuracy by MAFFT v7 is 200 genome sequences . Therefore, 200 SARS-CoV-2 genomes were randomly selected and the same sequences were analysed by using both MAFFT v7 and ODOTool. The genomic positions and frequencies of variations were compared for validation. The genomic positions of variations were determined by homology and frequencies were calculated by Jalview v2.10.5 consensus calculation algorithm (Waterhause et al., 2009).

Alignment results obtained using the software developed in this study were verified with the Nextstrain platform (Hadfield, 2018; Nextstrain, 2020 5 ) that has been created to monitor virus evolution in real time. In this study, occupancy of sequences in alignment were considered in calculation of variation frequencies. Variations occurring over 10% frequency in clinical isolate genomes that have been downloaded into our database were evaluated and discussed specific to the genes. It must be noted that availability of these variations is also previously reported in other studies (Nextstrain, 2020; Pachetti et al., 2020; Wang et al., 2020; Phan et al., 2020) with different aspects. Despite the variations occurring over 10% frequency were discussed in the present study, it was noticed that SARS-CoV-2 isolate genomes were harbouring some of the variations over 20% frequency and causing missense mutations and so focused on these mutations in more detail and some charts were produced to show increasing EpiCoV entries, rate changes of the most frequent changes and rates of the most frequent mutations by continents.

The isolates with mistyped location and date annotations, nearly 2-3% of all downloaded entries, were ignored, while charts were created. Venn diagram presents simultaneously carried mutations and has been generated by Jvenn web tool (Bardou et al., 2014) .

The alignments of randomly selected 200 SARS-CoV-2 isolate genome sequences were evaluated for validation according to genomic positions and frequencies of variations. It is well known that different datasets produce different results. As seen on Table 1 , frequency calculation result by using both tools on the same dataset are consistent, showing validity of the results obtained by using the ODOTool. Further technical properties of ODOTool is excluded as these details are out of this Special Issue's scopes.

In this study, 30366 SARS-CoV-2 isolate genome sequences were downloaded, standardised, aligned and in silico protein translation was performed. All processed data stored in a database for analysis.

When the variations on SARS-CoV-2 genome is evaluated in general, several uneven used synonymous codons encode most of amino acids and this situation was defined as codon usage bias (CUB) that could have specific causes and consequences in different organisms (Belalov and Lukashev, 2013) . CUB was explained by 2 primary circumstances; translational selection that means choosing the most suitable codon for translation and mutational pressure being gained by distinct probability of different substitution types like GC content, deoxycytidine methylation (C-phosphate-G), or subsequent deamination (C-T substitution) (Bulmer, 1987; Sharp et al., 1993; Belalov and Lukashev, 2013) .

Cytosine deamination has been identified as an important source of synonymous mutations (Duncan and Miller, 1980) managing the GC contents of RNA viruses (Pyrc et al., 2004) . Because of cytosine deamination has been observed in all coronavirus genomes and proposed as a significant biochemical effect on coronavirus evolution , it is a predictable result that C-to T exchange has the most prominent numbers in all nucleotide change directions in SARS-CoV-2 genome.

In this study, 11 variations, with the incidence of over 10% have been detected in 30366 clinical SARS-CoV-2 isolate genomes (Table 2 ). It has been observed that 8 of these 11 variations; C1059T, G11083T, C14408T, A23403G, G25563T, G28881A, G28882A, and G28883C cause amino acid substitutions. Three variations that do not cause amino acid exchange were named as synonymous mutations in the literature (Kimura, 1977) .

When the 30366 SARS-CoV-2 genome sequences were aligned, the first encountered variation was observed to be C241T in the 5' untranslated region (5' UTR) being occurred nearly in 70% of SARS-CoV-2 isolates. Studies conducted on some viral genomes reported that the variations on UTRs may affect the activity, replication, and packaging of genomes, immune modulation and expression of genes (Silveria et al., 1995; De Lorenzo et al., 2016; Ng et al., 2017) . This variation should be evaluated towards findings of these research studies with further in vitro studies as its frequency rate is 70.99% in whole genomes from all over the world.

Rest of the 10 variations on nonstructural, structural, and accessory proteins are discussed below.

ORF1ab region of SARS-CoV-2 genome is an important polyprotein gene common in all CoVs and encodes 16 Nsps that include enzymes vital for the lifecycle of the virus, such as RNA depended polymerase, helicase and 3C-like proteinase (Brian and Baric, 2005; Gorblenya et al., 2006) . Because of the critical role of these proteins on virulence and life cycle of the virus, some of these proteins have been proposed as potential target for antiviral therapy (Kwong et al., 2005; Briguglio et al., 2011; Zhou, 2020) . Four variations over 10% were available on ORF1ab region. The first variation C3037T causes a synonymous mutation and seen in frequency of 29.3% in gene region encoding Nsp3 that is an important unit of the replication/ transcription complex in CoVs (Lei et al., 2018) . The other 3 variations, C1059T, G11083T, and C14408T causing amino acid substitutions are in gene regions of Nsp2, Nsp6, and Nsp12, respectively, in ORF1ab; with the incidence of over 10%.

C1059T variation has been caused T266I amino acid exchange on Nsp2. But it is difficult to evaluate the effect of this exchange on the protein function, as the function of Nsp2 has not been resolved yet (Graham et al., 2005; Gadlage et al., 2008; Chen et al., 2020) .

G11083T variation (reported previously by van Dorp et al., 2020) causing L36F exchange was present on Nsp6 gene region that is known with its role in inducing vesicles located around the microtubule regulation centre and ensuring membrane proliferation (Angelini et al., 2013) . This variation was suggested as a homoplastic mutation (van Dorp et al., 2020) and suggested to be evaluated with these characteristics in further studies. C14408T (first reported by Pachetti et al., 2020) and C14805T found to be present on Nsp12 gene region with a frequency of 70.42% and 10.01% of isolates, respectively. Nsp12 is a vital protein for replication and pathogenesis and potential target for antiviral candidates in CoVs (Perlman and Netland, 2009; Yin and Wunderink, 2018) . C14408T variation observed to be responsible from P323L exchange causing a missense mutation. The isolate genome sequence harbouring C14408T variation was first submitted to the GISAID from Lombardy, Italy, on February 20th, 2020 (EPI_ISL_412973), 20 days after the first COVID-19 case confirmed in Italy. Although C14805T (submitted on February 9th, 2020 England; EPI_ ISL_412116) variation is rarer, it was emerged earlier than C14408T variation. Despite C14408T variation occurred later than C14805T, the incidence of this variation has increased sharply and reach over 70%, recently ( Figure 2 ). This observation may be evaluated as a remarkable data about the effect of C14408T variation on the spread of the virus.

The increasing number of isolate genome entries in EpiCoV and rate changes of the most frequent variations by isolate collection dates are also given in Figure 2 for G25563T variation on accessory protein and A23403G variation on S protein and 3 consecutive variations (G28881A, G28882A, and G28883C) named as GGG to AAC on N protein.

G25563T variation is on the gene region of Orf3a which is a unique membrane protein with its 3-membrane structure, the largest protein in the SARS related CoVs accessory protein family and is essential for the pathogenesis of the disease (Lu et al., 2010; Issa et al., 2020) . G25560T variation also causes amino acid exchange of glutamine to histidine in residue 57 (Q57H).

There are also 4 structural proteins named as Spike (S), Envelope (E), Membrane (M), and Nucleocapsid (N) proteins and all has been encoded in all CoV genomes, including the SARS-CoV-2 (Brian and Baric, 2005; Gorblenya et al., 2020) . In the current study, 4 variations were determined to cause amino acid substitutions in regions encoding S and N proteins with an incidence of over 10%.

A23403G variation is one of the most important variations that have been reported previously (Phan et al., 2020) caused D614G substitution on S protein. As in the C14408T of Nsp12 variation, A23403G variation of S protein is also available in 70.46% frequency in SARS-CoV-2 genomes isolated all over the world (Figure 2 ). S protein has important role in viral entry into the host cells (Gallagher and Buchmeiert, 2001) . Viral entry and pathogenesis have been reported to be managed by a couple of mutations in its RBD of S protein in SARS-CoV (Wong et al., 2004; Li et al., 2005) aligned on residues between 331 and 524 for SARS-CoV-2 (Tai et al., 2020; Ou et al., 2020) . RBD is located on the outer membrane of the virus and these properties of RBD make the S protein a suitable target for new treatment approaches (Tai et al., 2020) and the most of the ongoing protein subunit vaccine studies against SARS-CoV-2 (WHO, 2020b) 6 . Therefore, any mutation on S protein of SARS-CoV-2 should be carefully tracked and evaluated, as this protein is the key target especially in the current vaccine development studies (Kiyotani et al., 2020) .

Perhaps one of the most interesting SARS-CoV-2 variations occurred between genomic positions of 28881-28883. The GGG sequence of reference SARS-CoV-2 genome at these positions are converted to AAC in nearly 26% of rest of the isolate genomes. These 3 variations (G28881A, G28882A and G28883C) were seen simultaneously in about 99% of all isolates harbouring the variations. G28881A substitutions on the N protein. N protein is an essential structural protein, playing very different roles in the regulation of infected cell metabolism and packaging of the viral genome important for both replication and transcription (Kang et al., 2020) .

When the C14408T, G25563T, A23403G, and GGG to AAC variations with over 20% frequency were analysed, dramatic observations were seen in terms of appearance of these mutations in different continents (Figure 3 ). Figure  3 clearly shows that; G25563T mutation is developed by the isolates in N. America especially, GGG to AAC by the isolates in Europe, the same mutation is rare in N. America. C14408T and A23403G is common all over the world but especially in Africa, then S. America and this is followed by Europe (Figure 3 ). All these data could be used to trace diversity of virus in all over the world in combination with filiation studies.

Our mutation analysis revealed that most of the isolates carry C14408T and A23403 variations simultaneously ( Figure 4 ) in all continents (Figure 3 ). Nearly all isolates which carry G25563T, also carry C14408T and A23403G variations although their location distributions are not similar. Nearly no isolates carried GGG to AAC variations and G25563T variation simultaneously (Figure 4) . The relation between these variations shown in Figure 4 may indicate that these mutations are coevoluating. Therefore, these results should be followed and evaluated by affiliation studies.

In conclusion; since the COVID-19 pandemic emerged globally, all the world has been searching ways to control the spread of the disease. As the virus's adaptation strategies come from its nature, scientists put a massive effort on finding and development of suitable strategies for both effective prevention and treatment and on monitoring its virulence. Despite all these efforts, the virus continues to survive and spread, and this necessitates tracking any variations that occur on its genome to conduct further studies to find a way to control the disease. SARS-CoV-2 isolate genomes (30366) were aligned using the Strategy Based Local Alignment Tool (ODOTool) developed by our group and 11 variations in SARS-CoV-2 genome observed in over 10% of whole isolates from all over the world were discussed according to the date and location they occurred, the relationship with each other and their effect on the primary protein structure in the present study. Data were obtained as a result of evaluation of massive amount of genomic sequences in this study and expected to enlighten studies towards overcoming the SARS-CoV-2 infections. 

Severe acute respiratory syndrome coronavirus nonstructural proteins 3, 4, and 6 induce double-membrane vesicles

Evidence for camel-to-human transmission of MERS coronavirus

jvenn: an interactive Venn diagram viewer

Causes and implications of codon usage bias in RNA viruses

(editor) Coronavirus Replication and Reverse Genetics

Inhibition of RNA Helicases of SsRNA + Virus Belonging to Flaviviridae, Coronaviridae and Picornaviridae Families

Coevolution of codon usage and transfer RNA abundance

Emerging coronaviruses: genome structure, replication, and pathogenesis

Severe acute respiratory syndrome coronavirus as an agent of emerging and reemerging infection

SARS: how a global epidemic was stopped

Biopython: freely available Python tools for computational molecular biology and bioinformatics

Origin and evolution of pathogenic coronaviruses

Commentary: Middle East respiratory syndrome coronavirus (MERS-CoV): announcement of the Coronavirus Study Group

An inhibitory motif on the 5'UTR of several rotavirus genome segments affects protein expression and reverse genetics strategies

Mutagenic deamination of cytosine residues in DNA

Data, disease and diplomacy: GISAID's innovative contribution to global health

An arginine-toproline mutation in a domain with undefined functions within the helicase protein (nsp13) is lethal to the coronavirus infectious bronchitis virus in cultured cells

Understanding helicases as a means of virus control

Murine coronaviruses encoding nsp2 at different genomic loci have altered replication, protein expression, and localization

Coronavirus spike proteins in viral entry and pathogenesis

Isolation and characterization of a bat SARS-like coronavirus that uses the ACE2 receptor

The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2

Nidovirales: evolving the largest RNA virus genome

The nsp2 replicase proteins of murine hepatitis virus and severe acute respiratory syndrome coronavirus are dispensable for viral replication

Isolation and characterization of viruses related to the SARS coronavirus from animals in southern China

Nextstrain: real-time tracking of pathogen evolution

SARS-CoV-2 and ORF3a: nonsynonymous mutations, functional domains, and viral pathogenesis

Isolation of Inhibitory RNA Aptamers against Severe Acute Respiratory Syndrome (SARS) Coronavirus NTPase/Helicase

Crystal structure of SARS-CoV-2 nucleocapsid protein RNA binding domain reveals potential unique drug targeting sites

MAFFT online service: multiple sequence alignment, interactive sequence choice and visualization

Preponderance of synonymous changes as evidence for the neutral theory of molecular evolution

Bioinformatic prediction of potential T cell epitopes for SARS-Cov-2

Completion of the porcine epidemic diarrhoea coronavirus (PEDV) genome sequence

Facilitates ondemand exploration of metazoan gene family trees on MAFFT sequence alignment server with enhanced interactivity

Viral and cellular RNA helicases as antiviral targets

Nsp3 of coronaviruses: Structures and functions of a large multi-domain protein

Structure, function, and evolution of coronavirus spike proteins

Angiotensin-converting enzyme 2 is a functional receptor for the SARS coronavirus

Receptor and viral determinants of SARS-coronavirus adaptation to human ACE2

Viral metagenomics revealed sendai virus and coronavirus infection of Malayan pangolins (Manis javanica)

SARS Accessory Proteins ORF3a and 9b and Their Functional Analysis

Coronavirus cis-Acting RNA elements

The 5' and 3' untranslated regions of the Flaviviral Genome

Characterization of spike glycoprotein of SARS-CoV-2 on virus entry and its immune cross-reactivity with SARS-CoV

Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant

Translation initiation at non-AUG triplets in mammalian cells

Coronavirus as a possible cause of severe acute respiratory syndrome

Coronaviruses post-SARS: update on replication and pathogenesis

Genetic diversity and evolution of SARS-CoV-2. Infection

Genome structure and transcriptional regulation of human coronavirus NL63

Dipeptidyl peptidase 4 is a functional receptor for the emerging human coronavirus-EMC

Targeted recombination demonstrates that the spike gene of transmissible gastroenteritis coronavirus is a determinant of its enteric tropism and virulence

Codon usage: mutational bias, translational selection, or both?

GISAID: Global initiative on sharing all influenza data -from vision to reality

5' UTR of hepatitis A virus RNA: mutations in the 5'-most pyrimidine-rich tract reduce its ability to direct internal initiation of translation

The Nonstructural Proteins Directing Coronavirus RNA Synthesis and Processing

Epidemiology, genetic recombination, and pathogenesis of coronaviruses

Characterization of the receptor-binding domain (RBD) of 2019 novel coronavirus: implication for development of RBD protein as a viral attachment inhibitor and vaccine

Emergence of genomic diversity and recurrent mutations in SARS-CoV-2. Infection

The establishment of reference sequence for SARS-CoV-2 and variation analysis

On the mechanisms of Bananin activity against severe acute respiratory syndrome coronavirus

Jalview Version 2-a multiple sequence alignment editor and analysis workbench

A 193-amino acid fragment of the SARS coronavirus S protein efficiently binds angiotensin-converting enzyme 2

Cytosine deamination and selection of CpG suppressed clones are the two major independent biological forces that shape codon usage bias in coronaviruses

Genome composition and divergence of the novel coronavirus (2019-nCoV) originating in China

Analysis of therapeutic targets for SARS-CoV-2 and discovery of potential drugs by computational methods

The structure and functions of coronavirus genomic 3' and 5' ends

MERS, SARS and other coronaviruses as causes of pneumonia

Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia

Epidemiology and cause of severe acute respiratory syndrome (SARS) in Guangdong, People's Republic of China

Networkbased drug repurposing for novel coronavirus 2019-nCoV/ SARS-CoV-2

This research received no specific grant from any funding agency in the public, commercial, or not for profit sectors. The authors declare that there is no conflict of interest. No ethics committee approval is necessary for this study. The genome sequences analysed in this study are available

in the GISAID (https://www.gisaid.org/) 7 and GenBank (https://www.ncbi.nlm.nih.gov/nuccore/NC_045512) 8 repositories. Osman Mutluhan Uğurel and Oğuz Ata have made equal contribution in this study. The alignment of 30366 SARS-CoV-2 isolate genome sequences produced in this study will be publicly available for scientific use in http://www.dilekbalik.com/SARS-CoV-2_ODOTool/. This present paper should be referenced when the data in this alignment is used. 7 GISAID (2020