key: cord-0954910-r8dygehu
authors: Pereson, Matías J.; Mojsiejczuk, Laura; Martínez, Alfredo P.; Flichman, Diego M.; Garcia, Gabriel H.; Di Lello, Federico A.
title: Phylogenetic Analysis Of SARS-CoV-2 In The First Months Since Its Emergence
date: 2020-08-29
journal: bioRxiv
DOI: 10.1101/2020.07.21.212860
sha: 8d47062cbf7a17df29c6707addef0ceb8cb06f56
doc_id: 954910
cord_uid: r8dygehu

During the first months of SARS-CoV-2 evolution in a new host, contrasting hypotheses have been proposed about the way the virus has evolved and diversified worldwide. The aim of this study was to perform a comprehensive evolutionary analysis to describe the human outbreak and the evolutionary rate of different genomic regions of SARS-CoV-2. The molecular evolution of nine genomic regions of SARS-CoV-2 was analyzed using three different approaches: phylogenetic signal assessment, emergence of amino acid substitutions, and Bayesian evolutionary rate estimation in eight successive fortnights since the emergence of the virus. All observed phylogenetic signals were very low, and tree topologies were in agreement with those signals. However, after four months of evolution, and despite the low phylogenetic signal, it was possible to identify regions revealing an incipient formation of viral lineages from fortnight 3 onward. Finally, the SARS-CoV-2 evolutionary rate for regions nsp3 and S, the ones presenting the greatest variability, was estimated at 1.37 × 10⁻³ and 2.19 × 10⁻³ substitutions/site/year, respectively. In conclusion, the results obtained in this work on the variable diversity of crucial viral regions and on the evolutionary rate are decisive for understanding essential features of viral emergence. In turn, these findings may allow the evolutionary rate of the S protein, which is crucial for vaccine development, to be characterized for the first time.

Coronaviruses belong to the Coronaviridae family and have a single-stranded, positive-sense RNA genome of 26 to 32 kb in length [1].
They have been identified in different avian hosts as well as in various mammals, including bats, mice, and dogs [2, 3]. Periodically, new mammalian coronaviruses are identified. In late December 2019, Chinese health authorities identified groups of patients with pneumonia of unknown cause in Wuhan, Hubei Province, China [4]. The pathogen, a new coronavirus called SARS-CoV-2 [5], was identified by local hospitals using a surveillance mechanism for "pneumonia of unknown etiology" [4, 6, 7]. The pandemic has spread rapidly and, to date, more than 22 million confirmed cases and nearly 750,000 deaths have been reported in just over six months [8]. This rapid viral spread raises interesting questions about the way its evolution is driven during the pandemic.

From the SARS-CoV-2 genome, 16 non-structural proteins (nsp1-16), 4 structural proteins [spike (S), envelope (E), membrane (M) and nucleoprotein (N)], and other proteins essential to complete the replication cycle are translated [9, 10]. The large amount of information currently available makes it possible to follow, as never before, the real-time evolutionary history of a virus since its interspecies jump [11]. Most studies published to date have characterized the viral genome and its evolution by analyzing complete genome sequences [12, 13, 14, 15]. Despite this, the viral genomic region that provides the most accurate information for characterizing SARS-CoV-2 has not yet been established. This lack of information prevents investigation of its molecular evolution and monitoring of biological features affecting the development of antivirals and vaccines. Therefore, the aim of this study was to perform a comprehensive viral evolutionary analysis in order to describe the human outbreak and the [...]

Criterion obtained with the jModelTest v2.1.10 software [18]. Bayesian inference with MrBayes v3.2.7a [19].
Each gene was analyzed independently with the same dataset used for the phylogenetic signal analysis, so that non-identical sequences were included in the analysis. Analyses were run for five million generations and sampled [...]

Finally, the obtained parameters for the real data and the randomized replicates were compared.

As regards the phylogenetic signal, several simulation studies have shown that, for a set of sequences to be considered robust, the central and lateral areas, which represent the unresolved quartets, must not exceed 40% [16]. In this regard, none of the nine [...] CoV-2 in these months. For this reason, trees generated from SARS-CoV-2 partial sequences in the first months of the pandemic are expected to be unreliable for defining clades. Therefore, they should be analyzed with great caution.

Since Bayesian analysis allows phylogenetic patterns to be inferred from tree distributions, it represents a more reliable tool for comparing different evolutionary behaviors. Bayesian analysis helps to obtain a tree topology that is closer to reality in the current conditions of [...]

This would indicate, as other authors have previously reported, the frequent circulation of polymorphisms due to significant positive selection pressure [13, 27, 31]. Additionally, since S and N are among the candidates to be used in the formulation of vaccines and antibody treatments, it will be important to monitor these substitutions in different geographic regions in order to improve treatment and vaccination efficacy [32, 33, 34]. In particular, the appearance of the D614G variant in the third week and its rapid increase until reaching a prevalence of 88% in the eighth week could reflect an improvement in viral fitness, as several studies have reported [35].

By contrast, in regions nsp1, nsp14, E, and Orf6, no substitutions were selected and persisted during the first 4 months of the pandemic.
This would suggest that these regions are constrained against change by strong negative selection pressure, as has been recently reported [13].

In the present study, the evolutionary rate for SARS-CoV-2 genes was estimated by analyzing a large number of sequences, which were carefully curated and had a good temporal and spatial structure. Additionally, the most phylogenetically informative regions of the genome (nsp3 and S) were used for the analysis, reinforcing confidence in the results. Previous studies on SARS-CoV-2 have reported similar values, ranging from 1.79 × 10⁻³ to 6.58 × 10⁻³ substitutions/site/year (s/s/y) for the complete genome [6, 36]. However, both articles used small datasets of complete genomes (N=32 and 54, respectively). As these studies were performed early in the outbreak, and given the temporal structure of their datasets, the analyses could have led to less precise estimates of the evolutionary rate [...]

Even though these results should be interpreted with caution, the date-randomization analysis indicated a robust temporal signal. In addition, the importance of separately studying the evolutionary rate in the S region arises [...]

Epidemiology, genetic recombination, and pathogenesis of
jModelTest 2: more models, new heuristics and parallel computing
MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space
Posterior summarization in Bayesian phylogenetics using Tracer 1
Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10
Creating the CIPRES Science Gateway for inference of large phylogenetic trees
Temporal signal and the phylodynamic threshold of SARS-CoV-2
Mathematical models of infectious disease transmission
tipdatingbeast: an r package to assist the implementation of phylogenetic tip-dating tests using beast
Relaxed phylogenetics and dating with confidence
On the origin and continuing evolution of SARS-CoV-2.
Genotyping coronavirus SARS-CoV-2: methods and implications
A snapshot of SARS-CoV-2 genome availability up to 30th March, 2020 and its implications
Median-joining network analysis of SARS-CoV-2 genomes is neither phylogenetic nor evolutionary
SARS-CoV-2 and ORF3a: Nonsynonymous Mutations, Functional Domains, and Viral Pathogenesis. mSystems 2020
Preliminary Identification of Potential Vaccine Targets for the COVID-19 Coronavirus (SARS-CoV-2) Based
The race for coronavirus vaccines: a graphical guide
Emergence of Drift Variants That Affect COVID-19 Vaccine Development and Antibody Treatment
The Impact of Mutations in SARS-CoV-2 Spike on Viral
The first two cases of 2019-nCoV in Italy: Where they come from
The Chinese SARS Molecular Epidemiology Consortium. Molecular Evolution of the SARS Coronavirus During the Course of the SARS Epidemic in China
Mutational dynamics of the SARS coronavirus in cell culture and human populations isolated in 2003
Moderate mutation rate in the SARS coronavirus genome and its implications
From molecular genetics to phylodynamics: evolutionary relevance of mutation rates across viruses
Human neutralizing antibodies elicited by SARS-CoV-2 infection

The total number of sequences is variable depending on the analyzed region
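
The 40% criterion for likelihood mapping cited in the discussion [16] can be illustrated with a minimal Python sketch. It assumes that the percentages of quartets falling in the seven areas of the likelihood-mapping triangle (three corners, three lateral areas, one central area) have already been obtained from a likelihood-mapping run. The `LikelihoodMap` class, its field names, and the example percentages are hypothetical and are not values from this study.

```python
from dataclasses import dataclass
from typing import Tuple

# Threshold quoted in the text: the central and lateral areas
# (unresolved quartets) must not be greater than 40 %.
UNRESOLVED_LIMIT = 40.0


@dataclass
class LikelihoodMap:
    """Percentage of quartets in each area of the triangle (hypothetical container)."""
    corners: Tuple[float, float, float]   # fully resolved quartets
    laterals: Tuple[float, float, float]  # partly resolved quartets
    center: float                         # star-like quartets

    def unresolved(self) -> float:
        """Sum of the lateral and central areas, in percent."""
        return sum(self.laterals) + self.center

    def is_robust(self, limit: float = UNRESOLVED_LIMIT) -> bool:
        """True when the unresolved fraction does not exceed the limit."""
        return self.unresolved() <= limit


# Illustrative numbers only (not taken from the paper).
example = LikelihoodMap(corners=(15.0, 15.0, 15.0),
                        laterals=(10.0, 10.0, 10.0),
                        center=25.0)
print(f"unresolved quartets: {example.unresolved():.1f}%")
print("robust signal" if example.is_robust() else "low phylogenetic signal")
```

A region whose unresolved fraction exceeds the limit, as in this made-up example, would be treated as having a low phylogenetic signal under the criterion above.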