key: cord-104162-fe51v2pt authors: Zhang, Chiyu; Forsdyke, Donald R. title: Potential Achilles heels of SARS-CoV-2 displayed by the base order-dependent component of RNA folding energy date: 2020-11-02 journal: bioRxiv DOI: 10.1101/2020.10.22.343673 sha: doc_id: 104162 cord_uid: fe51v2pt Base order, not composition, best reflects local evolutionary pressure for folding of single-stranded nucleic acids. The base order-dependent component of folding energy has revealed a highly conserved region in HIV-1 genomes that associates with RNA structure. This corresponds to a packaging signal that is recognized by the nucleocapsid domain of the Gag polyprotein. Long viewed as a potential HIV-1 “Achilles heel,” the signal can be targeted by a recently described antiviral compound (NSC 260594) or by synthetic oligonucleotides. Thus, a conserved base-order-rich region of HIV-1 may facilitate therapeutic attack. Although SARS-CoV-2 differs in many respects from HIV-1, the same technology displays regions with a high base order-dependent folding energy component, which are also highly conserved. This indicates structural invariance (SI) sustained by natural selection. While the regions are often also protein-encoding (e.g. NSP3, ORF3a), we suggest that their nucleic acid level functions – such as the ribosomal frameshifting element (FSE) that facilitates differential expression of 1a and 1ab polyproteins – can be considered potential “Achilles heels” for SARS-CoV-2, perhaps susceptible to therapies like those envisaged for AIDS. The region of the FSE scored well, but higher SI scores were obtained in other regions, including those encoding NSP13 and the nucleocapsid (N) protein. composition tends to reflect genome-wide evolutionary pressures. Just as a local arrangement of words conveys specific meaning to a text, so base order better reflects local evolutionary pressures. Base order is most likely to be conserved when encoding a function critical for survival. Assays of the base order-dependent component of the folding energy have shown that a highly conserved region, in otherwise rapidly mutating HIV-1 genomes, associates with an RNA structure corresponding, not to a protein-encoding function, but to an RNA packaging signal. The latter is specifically recognized by the nucleocapsid domain of the Gag polyprotein [8] and is now seen as a potential "Achilles heel" of HIV-1 that can be targeted by a recently described antiviral compound (NSC 260594) or by synthetic oligonucleotides [4] . We here report similar highly conserved structural regions of the SARS-CoV-2 genome, one or more of which should be susceptible to targeting [9, 10] . We identify certain open reading frames (ORFs) that, because of their conservation, have so far attracted therapeutic interest mainly related to their functions at the protein level [11, 12] , rather than at the level of the corresponding, yet highly structured, regions of the genome. The ribosomal frameshift element (FSE) that is among our results, is attracting attention [10, 13, 14] . Yet our analysis, consistent with recent reports [15, 16] , suggests there may be more suitable targets in other regions. These were obtained from the NCBI (Bethesda) and GISAID EpiCoV (Munich) databases. The Wuhan-Hu1 sequence (GenBank NC_045512.2), deemed taxonomically prototypic [17, 18] , was compared (regarding base substitutions and folding potential) with 381 Chinese isolates, 430 Italian isolates, and 932 isolates from New York, USA. Our "window" starting point was base 1 of the 29,903 base prototype sequence. We refer to windows by their centers. The center of the first 200 base window would be 100. In previous HIV-1 studies the base differences between just two individual sequences sufficed for the tabulation of a statistically significant set of base substitution frequencies [2] . 4 The lower mutation rates of SARS-CoV-2 strains [11] The energetics of the folding of a single-stranded nucleic acid into a stem-loop structure depend on both the composition and order of its bases. Base composition is a distinctive characteristic of a genome or large genome sector. A localized sequence (e.g. a 200 base window), which is rich in the strongly-pairing bases G and C, will tend to have a stable structure simply by virtue of its base composition, rather than of its unique base order. This high GC% value can obscure the contribution of the base order-dependent component of the folding energy, which provides a sensitive indicator of local intraspecies pressures for the conservation of function within a population (i.e. a mutated organism is eliminated by natural selection so no longer can be assayed for function in the population). In contrast, interspecies mutations tend to influence the genome-wide oligonucleotide (k-mer) pressure, of which base composition (GC%) is an indicator. This pressure can act to generate and/or sustain members of emerging species by preventing recombination with parental forms [3, 7, [19] [20] [21] . Elimination of this base composition-dependent component facilitates focus on local folding. 5 Early studies of RNA virus structure by Le and Maizel [22] were primarily concerned with the statistical significance of RNA folding, rather than with distinguishing the relative contributions of base composition and base order. However, with a pipeline between the various programs that were offered by the Wisconsin Genetics Computer Group, the base composition and base order-dependent components were separated and individually assessed ("folding of randomized sequence difference" analysis; FORS-D analysis). Departing from Le and Maizel, FORS-D values (see below) were not divided to yield Z-scores, but were simply plotted with statistical error bars [2, 3, 7, 23, 24] . The limits of the latter were generally close to the corresponding FORS-D values and, for clarity, are omitted here. A window of 200 bases is moved along a natural sequence in 20 or 50 base steps. A folding program [25] is applied to the sequence in each window to obtain "folding of natural sequence" (FONS) values for each window, to which both base composition and base order will have contributed. The four bases in each sequence window are then shuffled to destroy their order while retaining base composition, and folding energy is again determined. This shuffle-and-fold "Monte Carlo" procedure is repeated ten times and the average (mean) folding value is taken as The approach was employed by others who, rather than shuffling the four bases, favored retaining some base order information. Accordingly, they shuffled groups of bases (e.g. the sixteen dinucleotides). Following disparagement of the conceptual basis of four base shuffling, which was duly clarified [26] , the validity of single base level shuffling is now generally 6 accepted and is being applied routinely to various viral genomes [15, 16, 27] . The Monte Carlo procedure can also be simplified to decrease FORS-M computational time [28] using support vector machine-based technology [29] . Our original software ("Bodslp"), written by Professor Jian-sheng Wu [30] , retains the Monte Carlo approach and was further developed by Professor Shungao Xu as "Random Fold-Scan" for Windows-based systems [31] . In addition to assisting the study of infectious viruses and protozoa [32] , FORS-D analysis proved fruitful when applied to topics such as speciation [3, 19, 33] , the origin of introns [23, 24, 34] , relating structure to recombination breakpoints and deletions [35, 36] , and relying on a single sequence (rather than alignments) for the determination of positive Darwinian selection [37] . However, for a given sequence window, output can follow only from the base order in that window. Lost are higher order structures that might occur naturally through long-range interactions [14] . Furthermore, if the artificial demarcation of a window happens to cut between the two limbs of a natural stem-loop structure, then lost are what might have been contributed to the folding energetics had a larger window, or a different section point, been chosen. Variations in step size will generate differences in windows and hence variations in results. Such variations might be less when window margins correspond to natural section points, such as those demarcating an RNA transcript. There is also a kinetic aspect, particularly apparent with transcripts, due to the probability that the pattern of early 5' folding will influence later folding. High negative SI values indicate structural conservation among a set of genomes. The validity of this approach is supported by prior work with HIV-1. Fig. 1 shows application of the index to previously reported data on HIV-1 [2] . Here a high negative SI index value corresponds 7 to regions recognized as likely to offer Achilles heel-like vulnerability [7] . The region around the window centered on 500 nucleotide bases is the focus of recent work [4] [5] [6] . subtype (HXB2) [2, 3] . The region recently recognized as a potential Achilles heel (bases 380-640) [4] [5] [6] , has a high negative FORS-D value (black triangles) similar to those of the RRE (Rev response element) and the 3' untranslated region (UTR). The SI index (continuous red line) indicates highest sequence conservation in the regions of the RNA packaging signal and the 3'UTR. 2 . 1 . V a r i ation d u e t o b The degree of conservation in sequential windows of members of a set of sequences from China was evaluated as the number of substitutable base positions (polymorphism), relative to the prototype sequence (Fig. 3) . This was compared with corresponding FORS-D values, the profile of which differed a little from that of Fig. 2 due to different step values (see Materials and Methods). As with some previous studies [23] , there tended to be a reciprocal relationship between the For a more focused view of the reciprocal separation of high negative folding values and corresponding numbers of substitutions (ranging from zero to high positive values) the two were added to provide the "structural invariance" (SI) index (Fig. 4) . Here, despite having some substitutions, ORFs NSP13 and NSP3 were preeminent (scoring - a region with the highest number of substitutions (see Fig. 3 ) centered on base 28900 (scoring +7.6 SI units). (For plots of analogous HIV-1 data see Fig. 1 ). Windows 13400 and 13450 in the FSE region scored -12.5 and -13.7 units, respectively. However, there were many more windows with higher negative scores. The SI profile for China (Fig. 4) was, in broad outline, confirmed with corresponding data later downloaded from Italy and New York, USA (Fig. 5) . The high negative SI indices found in the NSP3, NSP13, ORF3a and N regions, were evident with sequences for all three locations. Other regions, notably the S region, were also corroborated. The high positive SI values, indicating regions likely to have poorly conserved structures, are also corroborated at some locations. A high intraspecies mutation rate for the N protein ORF (Fig. 3) is also seen when interspecies comparisons are made with other coronavirus species, with implications for early speciation mechanisms [38] . Fig. 4 (purple) , with SI indices for Italy (430 isolates; green) and New York, USA (932 isolates; red). Mutation rates of microbial pathogens are generally higher than those of their hosts. While a microbe spreading from host-to-host can "anticipate" that it will face a succession of broadly similar challenges, in the short-term those hosts cannot likewise "anticipate" that new microbial invaders will remain as they were in previous hosts. Thus, host immune defenses may be overwhelmed. (In the long-term there is a different scenario related to innate immunity; see below). Therapeutic challenges are, first, to locate a conserved, less-variable, part of a pathogen's genome that it will have inherited sequentially from a multiplicity of past generations, and so is likely to carry through to a multiplicity of future generations. Second, is to identify the corresponding primary function, be it at the genome, RNA transcript, or protein level. Third, from this knowledge (that may be incomplete; i.e. function not fully clarified), devise effective pathogen inhibition without imposing deleterious side-effects on the host. Viral vulnerability is often assumed to associate with protein-level functions [11, 12] . However, studies of the AIDS virus have identified genome structure itself as both functional and conserved, so signifying vulnerability [2, 3] (see Fig. 1 ). A genomic packaging signal for HIV-1, which is specifically recognized by the nucleocapsid domain of its Gag polyprotein, has long been recognized as a potential "Achilles heel" [2, 3, 7] so inviting therapeutic exploration [4] [5] [6] . Through targeting of specific RNA conformations, Gag not only influences the assembly of HIV-1 genomic RNA into virus particles, but also regulates HIV-1 mRNA translation [39] . permit easier switching between regulatory options. Indeed, small changes in target RNA structure can impede this [6] . Thus, unlike most other regions of the HIV-1 genome, mutations here would likely lead to negative Darwinian selection at an early stage -hence the high conservation. Application of the same bioinformatic technology to the SARS-CoV-2 virus genome has now revealed similar "Achilles heels." 1 4 Lacking the chronicity of HIV-1 infection, the genome of SARS-CoV-2 should have been shaped less by adaptations to counter long-term host immune defenses. It cannot hide within its host genome in latent DNA form. Yet, the larger SARS-CoV-2 genome contains many more genes than HIV, which require differential expression according to the stage of infection. Even more complex regulatory controls can be envisaged, likely requiring conserved genome conformations at appropriate locations. Be they synonymous or non-synonymous, mutations in these structured regions could result in negative selection of the viruses in which they occurredhence high conservation. The ribosome FSE located close to base 13468 (Figs. 2) would seem to exemplify this [13] , and a potential targeting agent is now available [10]. However, we have here identified other structurally important regions with more base order-dependence and higher degrees of conservation (Figs 3-5), that might, either singly or collectively, be better candidates for targeting. When determining folding energy, our approach depends on eliminating contributions of base composition which, as noted, plays an unusual role in the case of the FSE. More usually, base composition is a distinctive characteristic of entire genomes or large genome sectors that reflects their underlying oligomer ("k-mer") content. The slow genome-wide accumulation of mutations in oligomer composition (easiest documented as changes in base composition; GC%) can serve to initiate divergence into new species. By preventing that accumulation, potentially diverging organisms can stay within the confines of their species [19] [20] [21] . The presence or absence of synonymous mutations [16, 40] , which affect structure rather than amino acid composition, can have an important role in this process. The primary role of constancy in the base composition-related character is to prevent recombination with allied species (interspecies recombination) while facilitating the intraspecies recombination that can correct mutations, so retaining species individuality [3, 7] . Such recombination is initiated by "kissing" interactions between complementary unpaired bases at the tips of stem-loop structures [19] . Thus, we are here concerned with localized intraspecies mutations that affect fitness, so making members of a species carrying those mutations liable to natural selection. The mutations 1 5 facilitate within-species evolution rather than divergence into new species. And when that evolution has run its course, some of the polymorphic bases will have become less mutable, so will be deemed "conserved." Indeed, mutations of ORF NSP3 are high when the sequences of different coronavirus species are compared [41] , yet when, from intraspecies comparisons, mutations (in the form of base substitutions) are scored, they are very low (Figs. 3, 4) . Our technology (see Materials and Methods) removes the base composition-dependent component of mutational changes (that relates more to interspecies evolution) and focuses on the base orderdependent component (that relates more to intraspecies evolution). It best reflects localized functions, be they encoding protein or determining the potentiality for folding into higher order structure, of linear, single-stranded, nucleic acid sequences. Is conservation necessarily a good indicator of likely therapeutic success? We sought regions that were both high in stem-loop potential and bereft of mutations, following the premise that conserved functions would be best targeted therapeutically, assuming the availability of pathogen-specific therapeutic agents that would not cross-react with hosts. Interference with structural nucleic acid level functions would seem less likely to produce unforeseen host sideeffects than with protein-level functions. Yet, that possibility remains. Indeed, a conserved function in a pathogen could owe that conservation to the pathogen strategy of, whenever possible, mutating to resemble its host. This would make it less vulnerable to host immune defenses. To prevent autoimmunity, the generation of immune cell repertoires involves the negative selection of self-reacting cells so creating "holes" in the repertoire that pathogens can exploit by progressive mutation towards host-self, testing mutational effectiveness a step at a time. This would make it advantageous for the host, during the process of repertoire generation, not only to negatively select immune cells of specificity towards "self," but also to positively select immune cells of specificity towards "near-self." A high level of anti-near-self immune cell clones would constitute a barrier limiting the extent of pathogen mutation towards self. The existence of such positive selection is now generally accepted, with the implication that some pathogen functions might have approached so close to host-self that targeting them therapeutically would result in cross-reactivity [42] . 1 6 This unlikely caveat aside, we deem conservation a reliable indicator that a certain pathogen function is likely to be a suitable target for therapy. Attacking a short segment of a pathogen nucleic acid sequence is unlikely to ensnare a similar segment of its host's nucleic acid. In any case, to militate against this, the pathogen specificity of a potential therapeutic agent can be screened against the prototypic human genomic sequence (assuming it is unlikely that patient genomes will significantly depart from this). While prospects for the development of prophylactic vaccines against infection with SARS-CoV-2 are promising, methods to boost post-infection host immune defenses and to directly target SARS-CoV-2 are urgently needed. These require a better understanding both of viral interactions with host innate and acquired immune systems [42] , and of viral vulnerabilities. The latter enquiry proceeds in three steps: Find specific "Achilles heels." Design therapies to exploit them. Prove their clinical effectiveness. We have here been concerned with the first step. Although the bioinformatic technology related to this has long been available [2] , its claim to reveal viral "Achilles heels," as promulgated in successive textbook editions [7] , has only recently gained support [4] [5] [6] . This may be because the importance of removing redundant information and analyzing solely the contribution of base order to folding energy, was not fully appreciated. Thus, we have here repeated and expanded on past clarifications [26, 30, 31] of the conceptual basis of a technology that has contributed to the understanding of a many biological problems other than viral infections [7] . Meanwhile, it is pleasing to note that, with minimal evidence on targeting, progress is being made with the second step [9, 10] . It is hoped that by targeting one or more of the conserved regions in the SARS-CoV-2 genome here identified, rapid cures will be achieved. Authors declare no conflict of interest. Low genetic diversity may be an Achilles heel of SARS-CoV-2 Reciprocal relationship between stem-loop potential and substitution density in retroviral quasispecies under positive Darwinian selection Implications of HIV RNA structure for recombination, speciation, and the neutralism-selectionism controversy An RNA-binding compound that stabilizes the HIV-1 gRNA packaging signal structure and specifically blocks HIV-1 RNA encapsidation The heart of the HIV RNA packaging signal? Identification of the initial nucleocapsid recognition element in the HIV-1 RNA packaging signal Evolutionary bioinformatics HIV-1 Gag protein with or without p6 specifically dimerizes on the viral RNA packaging signal A small interfering RNA (siRNA) database for SARS-CoV-2 Targeting the SARS-CoV 2 RNA genome with small molecule binders and ribonuclease targeting chimera (RIBOTAC) degraders. ACS Cent Sci Example study of a highly conserved sequence motif in Nsp3 of SARS-CoV-2 as a therapeutic target SARS-CoV-2 and ORF3a: Nonsynonymous mutations, functional domains, and viral pathogenesis Structural and functional conservation of the programmed −1 ribosomal frameshift signal of SARS coronavirus 2 (SARS-CoV-2) The short-and long-range RNA-RNA interactome of SARS-CoV-2 An in silico map of the SARS-CoV-2 RNA structurome Pervasive RNA secondary structure in the genomes of SARS-CoV-2 and other coronaviruses An evolutionary portrait of the progenitor SARS CoV 2 and its dominant offshoots in COVID 19 pandemic Assessing uncertainty in the rooting of the SARS-CoV-2 phylogeny Different biological species "broadcast" their DNAs at different (G+C)% "wavelengths Success of alignment-free oligonucleotide (k-mer) analysis confirms relative importance of genomes not genes in speciation Hybrid sterility can only be primary when acting as a reproductive barrier for sympatric speciation A method for assessing the statistical significance of RNA folding Conservation of stem-loop potential in introns of snake venom phospholipase A 2 genes: an application of FORS-D analysis Computer prediction of RNA secondary structure Calculation of folding energies of single-stranded nucleic acid sequences: conceptual issues ScanFold: an approach for genome-wide discovery of local RNA structural elements -applications to Zika virus and HIV A computational procedure for assessing the significance of RNA secondary structure Fast and reliable prediction of noncoding RNAs Evaluation of FORS-D analysis: a comparison with the statistically significant stem-loop potential A FORS-D analysis software "Random_fold_scan" and the influence of different shuffle approaches on FORS-D analysis Low complexity segments in Plasmodium falciparum proteins are primarily nucleic acid level adaptations Microsatellites that violate Chargaff's second parity rule have base order-dependent asymmetries in the folding energies of complementary DNA strands and may not drive speciation A stem-loop "kissing" model for the initiation of recombination and the origin of introns Local base order influences the origin of ccr5 deletions mediated by DNA slip replication The key role for local base order in the generation of multiple forms of China HIV-1 B'/C intersubtype recombinants Positive Darwinian selection. Does the comparative method rule? Codon usage and phenotypic divergences of SARS-CoV-2 genes Human immunodeficiency virus type 1 Gag polyprotein modulates its own translation Synonymous mutations and the molecular evolution of SARS-Cov-2 origins A putative role of de-mono-ADP-ribosylation of STAT1 by the SARS-CoV-2 Nsp3 protein in the cytokine storm syndrome of COVID-19 Two signal half century: from negative selection of self-reactivity to positive selection of near-self reactivity We thank Prof. Shungao Xu at Jiangsu University for software, and Ms. Le Cao and Ms.Yingying Ma at Shanghai Public Health Clinical Center, Fudan University, for their technical support. Queen's University hosts Forsdyke's webpages. The bioRxiv server hosts preprints.