key: cord-1039199-qt1bhnh1
authors: Yu, Jun
title: From Mutation Signature to Molecular Mechanism in the RNA World: A Case of SARS-CoV-2
date: 2020-07-30
journal: Genomics Proteomics Bioinformatics
DOI: 10.1016/j.gpb.2020.07.003
sha: fb6ef6991f70441f6c768cc29163f77c1a71e141
doc_id: 1039199
cord_uid: qt1bhnh1

of genome compositional dynamics to protein structural dynamics via sequence mutations in the context of permutations. The two-division and four-division models are in complete agreement with our previous genetic code and its codon arrangement models [11-14].

The two categories represent the compositional (or informational) and structural (or operational) variables as well as their interplays that are interconnected through the genetic code (Figure 3C) [13,14]. The essence of this relationship is best manifested by CoVs, especially through their transmission and host-jumping scenarios. From the SARS-CoV-2 dataset, we observe very little selection but a strong mutational tendency toward a decrease in G+C content, as seen in a much lower A-to-G, higher G-to-U, and even less selection on U-to-C. As mentioned before, within-population and between-population variations are rather distinct, even though genetic distances are sometimes hard to measure for fast-mutating genomes as small as a few tens of kilobases. The two closely related bat CoVs (RaTG13 and RmYN02) and the pangolin CoVs, in contrast, show a rather balanced trend where all four dominant G+C content-altering permutations have similar values, which is typical of a between-population comparison (mutations are mapped to the SARS-CoV-2 reference genome). In addition, R12 permutations also appear to contribute significantly to the G+C content balance.
For a more distant comparison, SARS-CoV and its civet counterpart (sequence referenced to SARS-CoV-2) show almost identical mutation spectra, which differ from those of the SARS-CoV-2 cluster that are truly based on within-population variations (Figure 2B). Note that the R1 permutations A-to-G and G-to-A of these two datasets do not deviate as much as C-to-U and U-to-C, and the same is seen in the within-population permutations of the SARS-CoV-2 dataset. From a structural point of view, molecular weight differences (Table 1)

The two essential, apparently simple but extremely informative genomic sequence parameters, G+C and purine contents, exhibit an overall compositional homeostasis of genome sequences. The

Therefore, the genomic G+C content of CoVs has to drift below or above these particular boundaries. In reality, they appear to have G+C contents deviating around a mammalian genomic average and their purine content is also in a narrower range, perhaps due to a limitation of the

In summary, once we place a viral genome in a three-dimensional space, several pillars drive its compositional and structural parameters to fit the cellular niche of its best host. Compositional

mismatches with the non-canonical purine A, and it is this particular untidy action by the CoV RTCs (reasons for this type of action will be discussed in detail later in this section)

Since the full-length negative-sense strand is always a minor population in the viral cellular life cycle and only full-length positive-sense strands are assembled into viral particles or virions, the third issue concerns copy-number sensitivity, where the number of positive-sense strands is expected to be 50-100 fold higher than that of negative-sense strands [8].
The third set of permutations (R12; A-to-U, G-to-C, U-to-A, and C-to-G) is distinct from the two sets mentioned above, in that they have to go through the first set of permutations (R1 permutations) in order to reach their final destinies. For instance, to achieve a C-to-G alteration, the first permutation has to be C-to-U when the first negative-sense strand is synthesized (realized R1 permutation)

There is more to be discussed on the definition of a mutation spectrum

R12 permutations) and there would be, in theory, more Tv permutations than Ts permutations if every mutation occurred with equal chance. In reality, this ratio is determined by the order of synthesis and by specificity that is governed by structural or conformational variables of the viral RTCs. Second, there is a hidden mechanism where the predominant mutations should have mostly gone through the Ts mutation intermediates, C-by-U or G-by-A replacement and the reverse (Figure 1B). For instance, an R1-derived C-to-U mutation is a G-by-A replacement on the negative-sense strand and its offspring, the positive-sense viral genome

U becomes the dominant permutation in a viral genome, the permutation G-to-U must lead to the permutation U-to-G if selection (often referring to changes classified into synonymous and non-synonymous; the latter by and large indicates amino acid alteration and thus functional alteration) is not strong enough to override this effect. However, in the case of R12-derived permutations

U routes but go through a U-by-C or A-by-G and a G-by-A or C-by-U double replacements, respectively. Therefore, the mechanistic Ts/Tv ratio is both strand specific and order sensitive. Apparently, other qualitative and even quantitative (more likely statistical) parameters have to be introduced in order to solve this puzzle completely.
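The statement that equal-chance mutation would produce more Tv than Ts permutations follows from enumerating the 12 possible single-nucleotide changes. The sketch below is illustrative only: the function names are hypothetical, and the rule used to assign permutations to R1, R2, and R12 (transitions, non-complementary transversions, and complementary transversions, respectively) is an assumption inferred from the permutation lists given in the text, not the author's own code.

```python
from itertools import permutations

BASES = "ACGU"
PURINES = {"A", "G"}
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def is_transition(ref, alt):
    """True for purine-to-purine or pyrimidine-to-pyrimidine changes (Ts)."""
    return (ref in PURINES) == (alt in PURINES)


def permutation_set(ref, alt):
    """Assign a permutation to R1, R2, or R12 (assumed grouping rule)."""
    if is_transition(ref, alt):
        return "R1"
    if COMPLEMENT[ref] == alt:
        return "R12"
    return "R2"


# All 12 ordered pairs of distinct bases.
all_permutations = list(permutations(BASES, 2))
ts = [p for p in all_permutations if is_transition(*p)]
tv = [p for p in all_permutations if not is_transition(*p)]

groups = {}
for ref, alt in all_permutations:
    groups.setdefault(permutation_set(ref, alt), []).append(f"{ref}-to-{alt}")

print(f"Ts: {len(ts)}, Tv: {len(tv)}")
for name in ("R1", "R2", "R12"):
    print(name, sorted(groups[name]))
```

Under the equal-chance assumption the expected Ts/Tv ratio is therefore 4/8; any observed excess of Ts permutations reflects the strand and order effects discussed above.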
Obviously, mathematical models and related algorithms, which theorize such permutation dynamics, are essential for computer-based simulation studies. Third, in order to predict mechanistic principles, where the variability of permutations in a given mutation spectrum fits certain empirical rules, these three sets of permutations and their fractions must be mapped

RTCs and other related dynamic constituents. Nevertheless, the rationales are two-fold: one is related to mutation specificity and the other to strand specificity that includes the order of mutation

If all relevant mutations keep accumulating, as in the case of SARS-CoV-2, we will be able to associate precisely most varied amino acid sequences with enzymatic functions and even virus-centric symptoms of infected patients. The negative side of the story has to do with how mutations are mapped to structure and conformation related to enzymatic function, and certainly, wet-bench efforts are required to validate proposals, conjectures, and assumptions

CoV genomes from both humans and the true intermediate hosts in a single outbreak) and between-population variations (variations of CoV genomes from multiple outbreaks of a minimal or the same lineage, such as within the lineage of betacoronaviruses). We calculate within-population permutations based on sequence alignment of all SARS-CoV-2 genomes; genomes that were referenced to the SARS-CoV-2 reference genome but not isolated from COVID-19 patients (from other mammals, such as bats and pangolins) are classified as between-population variations. Mutation spectra of SARS-CoV-2, containing a snapshot total and its non-synonymous mutation fraction, show typical patterns of their permutations.
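The calculation of permutations from sequences aligned to a reference genome can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function name `mutation_spectrum` and the toy sequences are hypothetical, every mismatch in every genome is counted once (the study may instead count unique variable sites), and gaps or ambiguous bases are simply skipped.

```python
from collections import Counter


def mutation_spectrum(reference, aligned_genomes):
    """Count ref-to-alt permutations across genomes aligned to one reference."""
    counts = Counter()
    valid = set("ACGU")
    for genome in aligned_genomes:
        if len(genome) != len(reference):
            raise ValueError("sequences must be aligned to the same length")
        for ref, alt in zip(reference.upper(), genome.upper()):
            # Treat DNA-style T as U so cDNA sequences can be used directly.
            ref, alt = ref.replace("T", "U"), alt.replace("T", "U")
            # Skip gaps, ambiguous bases, and unchanged positions.
            if ref in valid and alt in valid and ref != alt:
                counts[f"{ref}-to-{alt}"] += 1
    return counts


# Toy data (hypothetical): one reference and three aligned genomes.
reference = "ACGUACGU"
genomes = ["AUGUACGU", "ACGUAUGU", "ACAUACGG"]
spectrum = mutation_spectrum(reference, genomes)
for name, count in sorted(spectrum.items()):
    print(name, count)
```

Running the same function on genomes from COVID-19 patients and on genomes from other mammals mapped to the same reference would give the within-population and between-population spectra, respectively.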
Clearly, all R1 permutations are dominant with a trend where stronger C-to-U (presented as CU in figures for simplicity; other changes are simplified in the same way) and weaker A-to-G exceed the reverse pair, U-to-C and G-to-A, respectively. Among R2 permutations, G-to-U and C-to-A are both dominant over the opposing pairs owing to a mechanism similar to that of the R1 permutations C-to-U and A-to-G, except that they happen during positive-sense strand synthesis. This C-to-U dominance appears rather universal to all mammalian CoVs in terms of within-population permutations but not among between-population permutations; we believe this is determined by RdRPs of the highly conserved mammal-infecting CoVs. Similarly, of the R12 permutations, the A-to-U and G-to-C pair occurs more frequently than the other pair, U-to-A and C-to-G. These trends of permutation variability as compositional signatures are very much preserved for the non-synonymous mutations since the newly generated within-population mutations have not yet been subjected to strong or long-term selection. Data from the two previous CoV invasions are also very informative (Figure 2B), where within-population variations (dominated by R1 and R2 permutations) and between-population variations

We summarize this within-population SARS-CoV-2 mutation spectrum into a table (Figure

Assigning all 12 permutations to the different features, such as nucleobase-specific size, hydrogen-bonding, as well as nucleotide composition dynamics, we divide the permutations into two categories, composition-centric and structure-centric. In the composition-centric category, for instance, neither C-to-U and A-to-G nor U-to-C and G-to-A alter A+G (purine) content but only G+C content. In other words, if an RNA virus, such as CoV, needs to have a higher (ideally balanced at 0.50 purine content, the permutations for change are these four (see discussion in the next section).
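The composition-centric scoring can be checked by computing, for each of the 12 permutations, whether it raises, lowers, or leaves unchanged the G+C and purine (A+G) contents. The sketch below is illustrative: the function names are hypothetical, and the H/L/NC labels follow the scoring convention described in the figure legend (H for higher, L for lower, NC for no change).

```python
PURINES = {"A", "G"}
GC = {"G", "C"}


def composition_effect(ref, alt):
    """Return (change in G+C count, change in purine count) for one change."""
    d_gc = (alt in GC) - (ref in GC)
    d_purine = (alt in PURINES) - (ref in PURINES)
    return d_gc, d_purine


def label(delta):
    """Score a change as H (higher), L (lower), or NC (no change)."""
    return {1: "H", -1: "L", 0: "NC"}[delta]


# Map every permutation to its (G+C label, purine label) pair.
effects = {
    f"{ref}-to-{alt}": tuple(label(d) for d in composition_effect(ref, alt))
    for ref in "ACGU"
    for alt in "ACGU"
    if ref != alt
}

for name, (gc, purine) in sorted(effects.items()):
    print(f"{name}: G+C {gc}, purine {purine}")
```

The four transitions come out as purine-neutral, the four complementary transversions (A-to-U, U-to-A, G-to-C, and C-to-G) as G+C-neutral, and the remaining four change both parameters at once.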
Similarly, the best permutations for constant G+C

In the structure-centric category, all 12 permutations are evaluated based on spatial parameters

Another is a four-parameter model where two binary choices have to be made and purines and pyrimidines are treated differently. Obviously, the latter is more realistic but the first is easier to understand and a useful approximation. In the SARS-CoV-2 dataset, the absolutely dominant C-to-U permutation, as a major benchmark, is rather obvious. The underlying principle is the proposed specificity principle, the favored G-by-A replacement, which is a 16-Dalton difference in molecular weight change dictated by the CoV RTC. A similar principle applies to the A-to-G permutation, where a U-by-C replacement represents a 17-Dalton difference. We have also realized that A-to-G and G-to-A are not as biased as the C-to-U and U-to-C pair among SARS-CoV-2 within-population variations, and the indication here is that molecular weight difference is not the sole measure to describe the relevant, perhaps mostly conformational and less so structural, characteristics of the catalytic pocket and its milieus. The other pair of permutations

In this study, we sequenced (139 isolates) and analyzed (189 isolates) HP H5N1 genomes, and discovered several important facts. The first observation suggests that there had been two groups of highly pathogenic avian influenza virus (HPAIV) H5N1; one is termed the Old group and the other the New. It took a 23-year period (1983-2006) for the New group to slowly replace the Old group and to become prevalent in China (Figure 5B). Mechanisms of this slow takeover are multifold.
The first is reassortment of the segmented viral genomes

This process appeared so vivid that the strongest 1997-1998 El Niño had shown its mark in this, as seen in the delayed timing of the increasing AIVs of the New group

El Niño and La Niña are two opposing global climate patterns with distinction among events based on oceanic surface temperature changes, which are natural parts of the climate system and have strong impact on wildlife and ecosystems worldwide, especially the unusual warming and cooling of surface waters in the eastern Pacific Ocean

For instance, the New group of HPAIV H5N1 started to emerge after the first event, the rise of the virus was delayed by the second event, and the third event may be linked to other AIVs

Second, the reasons why the New group had replaced the Old are its potency of infection rather than specificity to any particular hosts [48-54] and multiple environmental factors that encourage the change, such as distinct yet understood migration networks and flyways

concerted effort to understand all major zoonotic and human viruses, as well as their hosts, in a broader scope and larger landscape, which must include biodiversity [53], ecology, geography, genetics, cell biology

What lies behind these observations is an assumption that there was a distant dynamic source pool for the two viral genomes, and it was the slow taking-over process, the Old by the New, which had been mirrored via the seasonal migrating birds afar over time. In other words, what we had sampled in China was a mirror image of the HPAIV H5N1 Old-by-New takeover in the source genome pool, not the real propagation in China. We did at the time start vaccine development [54,55], together with other biological and cellular studies, but called it quits because of uncertainty about other deterministic factors that may delay the next outbreak.
We did not anticipate that any El Niño peaks would come in such a frequency

COVID-19 came right at its recovery phase 4-5 years after this peak

some longitudinal studies on bat and suspected mammal populations (such as pangolins and rodents) are most urgent. We certainly need to compare notes on AIV and CoV studies, since they may be deeply related in terms of shared habitats, seasonal outbreaks, as well as similarity in RNA

to be able to freely jump over any compositional and structural hurdles, as particularly focused in this discussion. They may now be ready to invade many mammalian species constantly in addition to bats and humans. A full-spectrum CoV defense plan is of importance to all nations, including scientific and medical communities

so that data not only can be shared freely by all experts and laymen but also digested in correct and professional ways [56]. Second, we need to understand and associate mutations (in terms of synonymous/nonsynonymous mutations, permutations, mutation spectra, etc.) with genes and protein structures, as well as clinical parameters and data (such as pathology and symptoms), by developing mathematical models and bioinformatic algorithms. Of course, large-scale genomics data (such as studies on genomes of related wild animals) and datasets (high-quality for in-depth analysis) should be collected and housed by other databases/knowledgebases for multi-disciplinary research activities. Third, we should make a full list of projects on viral biology, especially on the removal of host-associated species barriers, including both wild and domestic animals as research subjects. Finally, cellular and animal studies

From the current collection of genomes and mutations, we have yet to paint a portrait of the single genome and what it gives rise to, the offspring clades.
They may not come from a single virus, as it seems at this point of time, but a population that we have sampled in a long period of time that could be months

SARS-CoV-2 will reemerge, and we may not have to wait another 17 years for sure. Both bats and migrating birds are to be targeted for the surveys and a special focus should be the broader territories of Southeast Asia. A new international organizational supporting model may be needed

and Meng Zhang for helpful discussion and critical reading of this manuscript. This work is supported by the National Natural Science Foundation of China (Grant No. 31671350) and the Key Research Program of Frontier Sciences

Where, when, and how: context-dependent functions of RNA methylation writers, readers, and erasers
Free energy calculation of modified base-pair formation in explicit solvent: a predictive model
Nidovirus RNA polymerases: complex enzymes handling exceptional RNA genomes
Rates of spontaneous mutation
The curious case of the nidovirus exoribonuclease: its role in RNA synthesis and replication fidelity
A novel twelve class fluctuation test reveals higher than expected mutation rates for influenza A viruses
DNA mismatch repair: functions and mechanisms
A contemporary view of coronavirus transcription
A structure-function diversity survey of the RNA-dependent RNA polymerases from the positive-strand RNA viruses
Structure of the SARS-CoV nsp12 polymerase bound to nsp7 and nsp8 co-factors
A content-centric organization of the genetic code
A scenario on the stepwise evolution of the genetic code
On the organizational dynamics of the genetic code
The pendulum model for genome compositional dynamics: from the four nucleotides to the twenty amino acids
GC content variability of eubacteria is governed by the Pol III alpha subunit
Compositional dynamics of guanine and cytosine content in prokaryotic genomes
On the molecular mechanism of GC content
variation among eubacterial genomes
The quest for a unified view of bacterial land colonization
The bat genome: GC-biased small chromosomes associated with reduction in genome size
On the length, weight and GC content of the human genome
A pneumonia outbreak associated with a new coronavirus of probable bat origin
A novel Bat coronavirus closely related to SARS-CoV-2 contains natural insertions at the S1/S2 cleavage site of the spike protein
Isolation of SARS-CoV-2-related coronavirus from Malayan pangolins
Comparative analysis of rodent and small mammal viromes to better understand the wildlife origin of emerging infectious diseases
Molecular evolution of human coronavirus genomes
Plasmodium parasites of birds have the most AT-rich genes of eukaryotes
Tail wags the Dog? Functional gene classes driving genome-wide GC content in Plasmodium spp
Extreme mutation bias and high AT content in Plasmodium falciparum
Divergent pattern of genomic variation in Plasmodium falciparum and P.
vivax
The evolution of genomic GC content undergoes a rapid reversal within the genus Plasmodium
Dissimilation of synonymous codon usage bias in virus-host coevolution due to translational selection
SARS and MERS: recent insights into emerging coronaviruses
Fatal swine acute diarrhoea syndrome caused by an HKU2-related coronavirus of bat origin
Host and viral traits predict zoonotic spillover from mammals
Large-scale sequencing and analysis of avian influenza viruses
Fatal infection with influenza A (H5N1) virus in China
Evolution and variation of the SARS-CoV genome
Complete genome sequences of the SARS-CoV: the BJ Group (Isolates BJ01-BJ04)
Tracing SARS-coronavirus variant with large genomic deletion
Molecular evolution of the SARS coronavirus during the course of the SARS epidemic in China
Cross-host evolution of severe acute respiratory syndrome coronavirus in palm civet and human
Discovery of a 382-nt deletion during the early evolution of SARS-CoV-2
the climate event of the century
Changes and disturbance in tropical rainforest in South-East Asia
United Nations Office for the Coordination of Humanitarian Affairs (OCHA), Regional Integrated Multi-Hazard Early Warning System for Africa and Asia (RIMES)
Global disease outbreaks associated with the
Co-circulation of multiple reassortant H6 subtype avian influenza viruses in wild birds in eastern China
Diversity of influenza A(H5N1) viruses in infected humans
Cross-species pathogen spillover across ecosystem boundaries: mechanisms and theory
Predicting virus emergence amid evolutionary noise
Avian influenza H5N1 viral and bird migration networks in Asia
Global patterns of influenza A virus in wild birds
Does the impact of biodiversity differ between emerging and endemic pathogens?
The need to separate the concepts of hazard and risk
Inactivated SARS-CoV vaccine prepared from whole virus induces a high level of neutralizing antibodies in BALB/c mice
Immunogenicity and protective efficacy in monkeys of purified inactivated Vero-cell SARS vaccine
The elements of data sharing
Origin and evolution of pathogenic coronaviruses

and G-to-U, due to mismatch between A and G or between U and C; all are transitional. These mutations are carried over to the next synthesis (R2) without further sequence change, so that they are dominant permutations in a CoV-specific mutation spectrum: U-to-C, C-to-U, A-to-G, and G-to-A. B. A summary of all possible permutations in a CoV mutation spectrum. This mutation spectrum is true for all RNA genomes and some may start with a negative-sense genome, such as in the case of influenza viruses

Permutations leading to lower G+C content are underlined and labeled as L. Permutations leading to higher G+C content are labeled as H. Four permutations that do not alter G+C content are labeled as NC. Purine (A+G) content-sensitive permutations are also scored in a similar way to the G+C content row. The molecular weight altering consequences are scored as SL and LS for increase and decrease, respectively. The intermediates are also indicated in the parentheses. For instance, a G-to-U mutation as a permutation of the mutation spectrum happens when the positive-sense strand is synthesized (R2) where G is supposed to pair with C but altered as U, so that the mechanism becomes a C-by-U replacement.
Another more complicated instance is G-to-C mutation, which has to go through a first R1 mutation G-to-A (a negative-sense strand mutation) and then a U-to-C mutation (a positive-sense strand mutation and thus labeled as

SL, small-to-large

LS, large-to-small

Mutation spectra of SARS-CoV and its civet (Paguma larvata) intermediate host (plaCoV

Manis javanica) (mjaCoV-P4L) and two from bats (Rhinolophus affinis and Rhinolophus malayanus) (rafCoV-RaTG13 and rmaCoV-RmYN02), as well as SARS-CoV-2. The data for SARS-CoV-2 are a collection from public databases with 12,642 full-length high-quality genome sequences. Note that all SARS-CoV-2 data show clear C-to-U dominance in R1 permutations (red columns) and G-to-U dominance in R2 permutations (blue columns)

Permutations based on synonymous mutations are indicated in dark red, blue and green for R1, R2 and R12, respectively. The corresponding permutations based on nonsynonymous mutations are indicated in light colors. B. The within-population mutation spectra of SARS-CoV, MERS-CoV, and their mammalian co-hosts. The permutations are calculated based on public collections with a limited number of individual sequences. MERS-CoV data have 248 and 182 genomes from humans and camels

Permutations based on genome sequences from mammalian co-hosts are indicated in dark red, blue and green for R1, R2 and R12, respectively. The corresponding permutations based on genome sequences from humans are indicated in light colors

A table listing all permutations and their impact on G+C and purine contents and possible influences on RTCs. R1, first replication that synthesizes the negative-sense strand; R2, the second replication that synthesizes the positive-sense strand

This two-division model suggests that the tight (G-by-A and U-by-C replacements) status leads to excessive U and the loose status (A-by-G and C-by-U) leads to A surplus.
The second model (right panel) describes a four-division model where purine (R)-centric and pyrimidine (Y)-centric tight and loose model dictates a more intricate permutation variability. Arrow-headed dashed lines connect R1 permutations to R12 permutations. Note that the cross-column relationship is rather striking, which re-routes some structural principles, navigating mutation forces on one hand and leaving room for selection to work on

Figure 4 Compositional variations of human CoVs and their closely-related CoVs

Human CoVs (red solid circles) vs. non-human CoVs (blue solid circles) and (solid circles) vs

dashed line) are SARS-CoV and MERS-CoV (together with their representative within-population zoonotic counterparts, a civet and a camel, respectively; blue + red = purple); one near the 0.500 purine content line (the vertical dashed line) is the SARS-CoV-2 (0.380, 0.496; overlapping with a vole CoV; a red solid circle surrounded by a purple open circle). The complete list of the samples and compositional parameters is given below. These include (1) alphacoronavirus bat-CoV/P

(36) shrew coronavirus isolate Shrew-CoV/Tibet2014: 0.366, 0.515; (37) thrush CoV

Figure 5 The very strong El Niño events and viral outbreaks of CoVs and AIVs

A diagram displaying the recent and historic CoV outbreaks over time based on assumptions that these CoV outbreaks are not isolated events. The three recently recorded El Niño events (blue triangles

They were carried to China through the flyway by migratory birds seen as sampling the local populations (shared

by the strong El Niño event in 1997-1998 (red dashed rectangles). The process started again after 2001 and reached its halfway point in 2003. The complete takeover of the Old group's territory happened after 2005. This takeover had struggled in a time period of ~10 years, with matching point around 2003 (blue dashed rectangles).
An obvious slow-down point is around 1997, a showdown by nature, the yet-strongest El Niño event in the recent history

Note: The median numbers of mutations per genome are 6-7, which are slightly different among clades. Data are downloaded from the National Genomics Data Center

The data are not an up-to-date collection so that it provides only a snapshot of the reality in passing