key: cord-315760-9g8901v6 authors: Teng, Xufei; Li, Qianpeng; Li, Zhao; Zhang, Yuansheng; Niu, Guangyi; Xiao, Jingfa; Yu, Jun; Zhang, Zhang; Song, Shuhui title: Compositional Variability and Mutation Spectra of Monophyletic SARS-CoV-2 Clades date: 2020-08-30 journal: bioRxiv DOI: 10.1101/2020.08.26.267781 sha: doc_id: 315760 cord_uid: 9g8901v6 COVID-19 and its causative pathogen SARS-CoV-2 have rushed the world into a staggering pandemic in a few months and a global fight against both is still going on. Here, we describe an analysis procedure where genome composition and its variables are related, through the genetic code, to molecular mechanisms based on understanding of RNA replication and its feedback loop from mutation to viral proteome sequence fraternity including effective sites on replicase-transcriptase complex. Our analysis starts with primary sequence information and identity-based phylogeny based on 22,051 SARS-CoV-2 genome sequences and evaluation of sequence variation patterns as mutation spectrum and its 12 permutations among organized clades tailored to two key mechanisms: strand-biased and function-associated mutations. Our findings include: (1) The most dominant mutation is C-to-U permutation whose abundant second-codon-position counts alter amino acid composition toward higher molecular weight and lower hydrophobicity albeit assumed most slightly deleterious. (2) The second abundance group includes: three negative-strand mutations U-to-C, A-to-G, G-to-A and a positive-strand mutation G-to-U generated through an identical mechanism as C-to-U. (3) A clade-associated and biased mutation trend is found attributable to elevated level of the negative-sense strand synthesis. (4) Within-clade permutation variation is very informative for associating non-synonymous mutations and viral proteome changes. These findings demand a bioinformatics platform where emerging mutations are mapped on to mostly subtle but fast-adjusting viral proteomes and transcriptomes to provide biological and clinical information after logical convergence for effective pharmaceutical and diagnostic applications. Such thoughts and actions are in desperate need, especially in the middle of the War against COVID-19. The recent threats from SARS-CoV-2, SARS-CoV, and MERS-CoV are different from 51 those of earlier human coronaviruses (CoVs), including alphacoronaviruses, such as hsa-CoV-52 229E, hsa-CoV-NL63, hsa-CoV-OC43 and hsa-CoV-HKU1 [8-10], in at least two aspects. 53 First, the recent groups of betacoronaviruses appears to come more frequently in the past two 54 decades as compared to the early comers where new members may be discovered as technology 55 become more efficient and accurate [11] . The current SARS-CoV-2 is also different from both 56 4 intermediates that allow mismatch repair albeit existence of short and extremely rare double-98 stranded RNA fragments involved in interference-based immunity [20, 21] . Third, the 99 existence of Wobble basepairing for secondary structures is of essence for operational 100 functions of all RNA molecules in addition to genetic information inheritance [22] . That said, 101 we can now look at how the RNA genome of SARS-CoV-2 and related CoVs take advantage 102 lower G+C contents. GC3-associated mutations often reflect directional mutation patterns as 132 observed strongly in certain lineages of plants and warm-blooded vertebrates as negative 133 gradients from the transcription starts, and such trends are attributable to a special DNA repair 134 mechanism, transcription-coupled DNA repair [23] [24] [25] . The notion here is to remind ourselves 135 that transcription-centric mutations may be accounted for some of the mutation events in RNA RaTG13 and mja-betaCoV-P4L, which are considered to be distantly related but most closely are considered here, half of the codons are not sensitive to cp3 changes, and most of them are 166 smaller amino acids ( Figure S2B ; [16-19]) . Second, at the cp1, G and C contents are both 167 pulled apart toward extremity but not A or U, while the two pyrimidines and two purines appear 168 stretched to separate directions; these trends suggest strong selective pressure at the first codon 169 position over the entire genome. It is indeed that cp1 codons shoulder the most mutation 170 pressures since they fall into all 4 negative-sense strand permutations (known as R1-derived 171 permutations, C-to-U, G-to-A, U-to-C and A-to-G). Third, the cp2 contents are most row-172 flipping changes referenced to the genetic code organization [18] . These alterations are very 173 useful for alternating chemical characteristics between related amino acids, and in terms of 174 flexibility, cp2 codons are less stringent than cp3 but more flexible than cp1. The balancing 175 power becomes more obvious when ORFs or proteins are examined individually for their 176 composition dynamics ( Figure 1C ). Finally, it is conclusive that the more similar the CoVs in 177 composition dynamic parameters, the closer they are genetically and phylogenetically in 178 principle. However, primary parameters, such as G+C and purine contents are necessary but 179 may not be sufficient. For instance, there has been a CoV genome isolate from a wild vole 180 captured in northeastern China, whose G+C and purine contents overlap with SARS-CoV-2 181 completely (Rodent coronavirus isolate RtMruf-CoV-2/JL2014; 0.38, 0.496; [27]) but its 182 genome sequence is different (sharing 61.87% identity with SARS-CoV-2). Therefore, we have 183 yet to find a within-population immediate animal host of SARS-CoV-2 albeit best similarity 184 of composition dynamics seen among them. 185 Our subsequent study is focused on composition dynamics within CoV genomes. It is 186 interesting to see uniformity among all codon position contents of all CoV genomes, increased 187 G+C content from antigenomes to subgenomes. However, this trend is an illusion where the 188 real trend is the lower G+C content of antigenomes but higher G+C contents of subgenomes 189 due to stronger selection over structural proteins. This observation becomes clearer when all 190 ORFs and proteins are scrutinized one by one ( Figure 1C and Figures S1). SARS-CoV-2 has 191 an exceptionally short subgenome 9 (sg9) which only contains ORF10, but we have no 192 evidence that it is either functional or non-functional. These results collectively remind us that 193 SARS-CoV-2 and its two most-closely-related CoVs, unlike in the case of many other known 194 by stratifying the data into structured and non-structured clades; the former can be analyzed 244 first and the rest await further ideas. The next is even more troublesome. Assuming that we 245 have 5 or more genome sequences per CoV isolate and variations identified among them are 246 still a miniscule fraction of the total virions produced in a patient body (medians and means of 247 variations per CoV isolate among C01 to C09, see Table S3 ), since the viral load per patient 248 sample, such as sputum [28, 29] , is equivalent to a 5-person or more sampling of the entire 249 human population on earth, 1 out of 10 9 . Even so, we have still been able to find shared 250 variations among patient samples occasionally and even more lucky to have some clade 251 structures, by and large due to the relatedness of the patients in the transmission network. 252 Finally, we have to admit that many assumptions have to be made about these samples and 253 their genome sequences above sequence and assembly errors for phylogeny and genetic studies. 254 Nevertheless, we have constructed a somewhat stable phylogenetic tree-and-branch 255 structure for further analysis ( Figure 4A ). It is composed of 8 monophyletic clades and 1 non-256 monophyletic clade based on both orders of sample collection date and highly-shared mutations. 257 Among the clades, C02 shares two landmark mutations, C8782U in ORF1ab and U28144C in 258 ORF8, and earlier date (2019/12/30). C04 shares three more mutations (C17747U, A17858G, 259 and C18060U in ORF1ab) than what C02 have, and a late collection date (2020/02/20). Clades 260 C03, C05, and C07 are also distinguishable by some major mutations, so are C06, C08, and 261 C09; the latter clades are clustered together based on four shared and other clade-associated 262 mutations. The leftover large number of isolates that lack all landmark mutations are grouped 263 into C01, which have the earliest collection date on 2019/12/24. According to the literature and 264 our discussion, we have further grouped the clades into three clusters, S (C02 and C04), G 265 (C06, C08 and C09), and L (all the rest) since phylogeny shows clear divergence among them. between-clade analysis of high major allele frequency (MAF) variations reveals that some 268 clade-associated signature mutations are also shared among clades. For instance, C14805U in 269 ORF1ab and A24034G in Cluster S have recurred in other clades of different clusters, which 270 are excellent landmarks for subclade definition. Another notion is that higher MAF within-271 clade mutations (such as MAF>0.2) are mostly non-synonymous mutations, indicating 272 selection at work ( Figure S4 ). Our neighbor-joining tree based on distances from 9 clades 273 suggests that SARS-CoV-2 appears originated from multiple zoonotic reservoirs instead of a 274 single direct ancestor ( Figure S4 ). In addition, our classification rationales are largely in unrooted phylogenetic tree is shown in Figure 4B . 279 To look for clade-associated compositional and functional features, we have first built a 280 consensus sequence for each clade and subsequently calculated frequencies for each within-281 clade permutation (Table S2 ; Figure 5A and 5B). A key assumption behind this is that certain 282 functional mutations may have clade-specific effects on mutation spectrum, to close a loop 283 where sequence mutations through genetic coding principles alter the viral proteome function. 284 Our observations are of importance in establishing logics about compositional dynamics 285 between nucleic acids and proteins. First, permutations among clades are indeed variable 286 according to their proportions calculated from genome variants, and aside from 5 high-287 proportion permutations, 4 R1 and 1 R2 permutations, two other R2 and one R12 permutations 288 appear also joining in, which are U-to-G and A-to-C, as well as A-to-U, respectively. Second, 289 the variable permutations, where some may represent effect of mutation pressure and others 290 may exaggerate selection pressure, are unique to clades and clade clusters. For instance, clade 291 cluster S has the lowest G-to-U fraction as compared to those of L and G; in addition, among 292 the S clades, C04 has the lowest value of G-to-U. Similarly, C03, C05, C06, C08, and C09 293 have relatively higher G-to-U permutations. Third, based on the disparity of permutations or 294 simply mutation spectra, we have taken a rather radical step to assume RTC statuses in favor 295 of either tight or loose statuses for binding to purines and pyrimidines (see Figure S5 ). Since 296 purines are larger than pyrimidines in size, the purine-or R-tight must be different from 297 pyrimidine-or Y-tight. The results are strikingly predictable in that the R-tight status suggests 298 have discrete definitions for these so-called tight statuses but the less trendy R-loose and Y-302 loose statuses also support a similar idea. 303 We have further examined the compositional subtleties among the clades and clusters with 304 a focus on G+C and purine content variability as both contents appear drifting toward optima 305 in SARS-CoV-2 and its relatives ( Figure 5C and 5D). Different clades exhibit distinct 306 compositional features and such dynamics are very indicative for the existence of feedback 307 loops connecting RNA variables to protein variables. Two directions have to be advised for 308 understanding these features albeit in absence of between-clade statistics. The first direction is 309 driven by strong mutations, perhaps coupled to tight-loose switches in the catalytic pocket of 310 RdRPs in RTCs. It is clear that except C01, the G or C06-C08-C09 cluster has the lowest G+C 311 (0.37929, based on a C08 CoV sampled in Australia) and the lowest purine contents (0.49527, 312 based on a C08 CoV collected in Bangladesh and a C09 CoV collected in England). Both lower 313 G+C and purine contents are indicative of mutation pressure and signal this fast-evolving 314 cluster of CoVs. Since this cluster has the largest collection of CoVs, it is also not surprising 315 to see a more complex median diversification within clades ( Figure 5C and 5D). The second 316 direction is the drive from selection or both selection and mutation in balance or imbalance, as 317 well as in modes of fine-tuning or quick-escaping. Some results from our analyses are shown 318 here for briefing purposes ( Figure 5E to 5G). For instance, G+C and purine contents at cp3 are 319 informative for mutation drives and other measures are less clear cut ( Figure 5F ), given the 320 evidence that even MAFs among clades are not stably distributed among clades as lower MAF 321 variations are rather sporadic and hard to analyze even binned into groups (data not shown). 322 Based on our clade and clade cluster analysis, it is tempting for us to speculate that there 323 are plenty of rooms for further investigations into mutation spectra among large clades and 324 even smaller clades or closely related individual CoV genomes for several reasons. First, all 325 high-frequency MAFs should be identified and classified and these variations are candidates 326 for highly selected mutations. Second, all within-clade minor but not rare alleles (less than 327 1/10,000), such as those of MAFs in a range of 0.01% to 10% should also be identified; they 328 provide basis of within-clade sequence analysis. Third, all non-structured CoV genomes must 329 be also classified based on shared variations, as they are not only valuable for within-clade but 330 also for clade-cluster analyses since there is a large background of genome variations not yet 331 brought into the databases. 332 Within-clade compositional dynamics can also be very informative, especially for covering 335 and predicting future functional changes, such as identifying mutated and diversified forms of 336 CoVs for drug and vaccine designs. It is also of essence for nucleic acid-based diagnostics, 337 such as clade-specific identifications. We are in a process of developing an interactive database 338 and mutation-function predicting algorithms based on our results to interpret novel sequence 339 variations in real time. Within-population variations are identified based on clade consensus 340 sequence after alignment and extracted from datasets that have hundreds and thousands of 341 genome sequences. The analysis of within-population variations relies on structured phylogeny 342 and proportion change of permutations. The changes, based on functional relevance, can be 343 classified into either copy number-related or RTC-specificity related, or sometimes both. 344 We have taken two steps to extract information in order to distinguish the underlining 345 mechanisms (Figure 6 ). In the first approach, we identify key mutations based on MAF of 346 mutations with a consideration of relatively even distribution among subclades and name the 347 subclades in a sequential order based on the absence of a subset ( Figure 6A ). In the second step, 348 we plot out permutations to track changes among subclades ( Figure 6B ). For instance, clade 349 C02 can be divided into 8 subclades and its variable permutation fractions are clearly 350 recognizable. An immediate discovery is the trends of descending C-to-U, ascending A-to-G, 351 and wavy G-to-U that initially goes up with A-to-G but rides down with C-to-U afterward. 352 Taking the two smaller clades, such as C03 and C05, as examples ( Figure 6D and 6E), we 353 first find that their trends of permutation variables show opposite directions, where the 354 increasing C-to-U accompanies with the decreasing G-to-U. A closer examination reveals that 355 the increasing C-to-U in C03 is also accompanied by descending U-to-C. The only permutation 356 showing an increasing trend in C03 is G-to-A. The take-home message from these trends is 357 that RNA synthesis of this subclade is biased toward producing more negative-sense strands 358 or its mutation spectrum exhibits increasing mutations generated during the negative-sense Table S3 . CoV-related (cdr-betaCoV-B73), SARS-CoV-2 related (raf-betaCoV-RaTG13), and NL63-455 related (taf-alphaCoV-NL63; both species and CoV genera were labelled for clarity). Third, 456 we also added more informative CoV genome sequences to enrich lineage-associated 457 information, which are a pangolin coronavirus genome (mja-betaCoV-P4L) reported to be 458 closed to SARS-CoV-2 and 3 non-beta-coronaviruses that infect animals (e.g., ave-gamaCoV 459 from gamma-coronavirus genus and smu-alphaCoV-WS from alpha-coronavirus genus). 3' end of the genome, which is rather a result of, in terms of mechanism, the increased U 743 content and C-to-U permutation. The negative gradient of U is also obvious from the 5' end to 744 the 3' end. 745 We use a 300-bp sliding window with a 21-bp step to show dynamic changes of genome G+C, 812 purine, GC1 to GC3, and GC skew (G-C/G+C) contents. The complete genome sequences and 813 data sources are listed in Table S1 . 814 (colored and uncolored backgrounds) and cp1 and cp2 relative to their permutations sensitivity 821 and changes are indicated with half parentheses with color-coding: C-to-U|U-to-C and A-to-822 G|G-to-A, red; G-to-U|U-to-G and A-to-C and C-to-U, blue; and A-to-U|U-to-A and G-to-C|C-823 to-G, green. Note that cp1 and cp2 are sensitive to column and row codon swaps, respectively. 824 Cp3 is in a unique position where only half of the codons are sensitive to its changes, and the 825 other half is so organized that some codons are more permissive than others. Coronaviruses lacking 600 exoribonuclease activity are susceptible to lethal mutagenesis: evidence for proofreading and 601 potential therapeutics Rampant C-->U Hypermutation in the Genomes of SARS-CoV-2 and Other 603 Coronaviruses: Causes and Consequences for Their Short-and Long-Term Evolutionary Evidence for host-606 dependent RNA editing in the transcriptome of SARS-CoV-2 A scenario on the stepwise evolution of the genetic code A content-centric organization of the genetic code On the organizational dynamics of the genetic code The pendulum model for genome compositional dynamics: from the four 614 nucleotides to the twenty amino acids An accessory to 616 the 'Trinity': SR-As are essential pathogen sensors of extracellular dsRNA, mediating entry and 617 leading to subsequent type I IFN responses SARS coronavirus pathogenesis: host innate immune responses and 619 viral antagonism of interferon Codon--anticodon pairing: the wobble hypothesis The transcript-centric mutations in human genomes Distinct contributions of replication 624 and transcription to mutation rate variation of human genomes Compositional gradients 627 in Gramineae genes Asymmetric substitution patterns in the two DNA strands of bacteria Comparative analysis of rodent and small 631 mammal viromes to better understand the wildlife origin of emerging infectious diseases Viral load of SARS-CoV-2 in clinical 634 samples Detection of SARS-CoV-2 in Different 636 Types of Clinical Specimens A dynamic 638 nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology On the origin and continuing evolution 641 of SARS-CoV-2 disease and diplomacy: GISAID's innovative 643 contribution to global health A Novel Bat Coronavirus Closely 645 Related to SARS-CoV-2 Contains Natural Insertions at the S1/S2 Cleavage Site of the Spike 646 A pneumonia outbreak 648 associated with a new coronavirus of probable bat origin SARS-CoV-2 viral spike G614 mutation exhibits higher 650 case fatality rate The D614G mutation in SARS-CoV-2 Spike increases 652 transduction of multiple human cell types Spike 654 mutation pipeline reveals the emergence of a more transmissible form of SARS-CoV-2 The D614G mutation 657 in the SARS-CoV-2 spike protein reduces S1 shedding and increases infectivity Nanopore target sequencing for accurate 660 and comprehensive detection of SARS-CoV-2 and other respiratory viruses Transmission and evolution of the Middle East respiratory syndrome coronavirus in Saudi 669 Arabia: a descriptive genomic study Genomic 671 surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak Reconstructing the initial global spread of a human 674 influenza pandemic: A Bayesian spatial-temporal model for the global spread of H1N1pdm Origins and 677 evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic Decoding the evolution and transmissions of 680 the novel pneumonia coronavirus (SARS-CoV-2 / HCoV-19) using whole genomic data Comparative analysis of coronavirus genomic RNA structure reveals conservation in SARS-684 like coronaviruses The China National GeneBank 688 horizontal line owned by all, completed by all and shared by all World data centre for 693 microorganisms: an information infrastructure to explore and utilize preserved microbial 694 strains worldwide MUSCLE: multiple sequence alignment with high accuracy and high FastTree 2--approximately maximum-likelihood trees for 704 large alignments Interactive Tree Of Life (iTOL) v4: recent updates and new 706 developments phangorn: phylogenetic analysis in R Using ggtree to Visualize Data on Tree-Like Structures Full names of the human CoVs are listed in the legend of Figure 2