key: cord-268339-jxm69ndw authors: Karamitros, Timokratis; Papadopoulou, Gethsimani; Bousali, Maria; Mexias, Anastasios; Tsiodras, Sotiris; Mentis, Andreas title: SARS-CoV-2 exhibits intra-host genomic plasticity and low-frequency polymorphic quasispecies date: 2020-03-28 journal: bioRxiv DOI: 10.1101/2020.03.27.009480 sha: doc_id: 268339 cord_uid: jxm69ndw In December 2019, an outbreak of atypical pneumonia (Coronavirus disease 2019 - COVID-19) associated with a novel coronavirus (SARS-CoV-2) was reported in Wuhan city, Hubei province, China. The outbreak was traced to a seafood wholesale market and human to human transmission was confirmed. The rapid spread and the death toll of the new epidemic warrants immediate intervention. The intra-host genomic variability of SARS-CoV-2 plays a pivotal role in the development of effective antiviral agents and vaccines, but also in the design of accurate diagnostics. We analyzed NGS data derived from clinical samples of three Chinese patients infected with SARS-CoV-2, in order to identify small- and large-scale intra-host variations in the viral genome. We identified tens of low- or higher-frequency single nucleotide variations (SNVs) with variable density across the viral genome, affecting 7 out of 10 protein-coding viral genes. The majority of these SNVs corresponded to missense changes. The annotation of the identified SNVs but also of all currently circulating strain variations revealed colocalization of intra-host but also strain specific SNVs with primers and probes currently used in molecular diagnostics assays. Moreover, we de-novo assembled the viral genome, in order to isolate and validate intra-host structural variations and recombination breakpoints. The bioinformatics analysis disclosed genomic rearrangements over poly-A / poly-U regions located in ORF1ab and spike (S) gene, including a potential recombination hot-spot within S gene. Our results highlight the intra-host genomic diversity and plasticity of SARS-CoV-2, pointing out genomic regions that are prone to alterations. The isolated SNVs and genomic rearrangements, reflect the intra-patient capacity of the polymorphic quasispecies, which may arise rapidly during the outbreak, allowing immunological escape of the virus, offering resistance to anti-viral drugs and affecting the sensitivity of the molecular diagnostics assays. SARS-CoV-2, COVID-19, intra-host variability, qiasispecies, genomic recombination, Wuhan seafood market epidemic Coronaviruses (CoVs), considered to be the largest group of viruses, belong to the Nidovirales order, Coronaviridae family and Coronavirinae subfamily, which is further subdivided into four genera, the alpha-and betacoronaviruses, which infect mammalian species and gamma-and deltacoronaviruses infecting mainly birds [1] , [2] . Small mammals (mice, dogs, cats) serve as reservoirs for HCoVs, with significant diversity seen in bats, which are considered to be primordial hosts of HCoVs [3] . On the contrary, peridomestic animals are usually intermediate hosts, who enable long-term establishment of endemicity of the viruses, facilitating mutations and recombination events [1] , [4] . Until 2002, minor consideration was given to HCoVs, as they were associated with mild-to-severe disease phenotypes in immunocompetent people [3] - [5] . In 2002, the beginning of severe acute respiratory syndrome (SARS) outbreak took place [6] . In 2005, after the discovery of SARS-CoV-related viruses in horseshoe bats (Rhinolophus), palm civets were suggested as intermediate hosts, and bats as primordial hosts of the virus [6] , [7] . In The size of the ssRNA genome of SARS-CoV-2 is 29,891 nucleotides, it encodes 9860 amino acids and is characterized by nucleotide identity of ~ 89% with bat SARS-related-CoV SL-ZXC21 and ~ 82% with human SARS-CoVs BJ01 2003 and Tor2 [9] . CoVs are enveloped positive-sense RNA viruses, which are characterized by a very large non-segmented RNA genome (26 to 32kb length), ready to be translated [2] , [5] . The genes arrangement on the SARS-CoV-2 genome is: [9] . The main difference between SARS-CoV-2 and SARS-CoV is in ORF3b, ORF8 and Spike. Intra host variability of pathogenic viruses and bacteria represents a significant barrier in the control of infectious diseases. In viral infections, this variation emerges from genomic phenomena taking place during error-prone replication, ending up to multiple circulating quasispecies of low or higher frequency [10] , [11] . These variants, in combination with the genetic profile of the host, can potentially influence the natural history of the infection, the viral phenotype, but also the sensitivity of molecular and serological diagnostics assays [12] , [13] . Importantly, intra-host genomic variability leads to antigenic variability, which is of higher importance, especially for pathogens that fail to elicit long-lasting immunity in their hosts, and remains a major contributor to the complexity of vaccine design [14] , [15] . To date, there are no clinically approved vaccines available for protection of general population from SARS-and MERS-CoV infections as there is no effective vaccine to induce robust cell mediated and humoral immune responses [16] , [17] . Here, we explore intra-host genomic variants and low-frequency polymorphic quasispecies in Next Generation Sequencing (NGS) data derived from patients infected by SARS-CoV-2. Our analyses provide insights into the intra-patient pool of viral genomes, identify the frequency levels of rare variants and highlight variable genomic regions and a potential recombination hot-spot within S gene. Intra-host genomic variability is critical for the development of novel drugs and vaccines, which are of urgent necessity, towards the containment of this newly emerging epidemic. In this study we analysed NGS data derived from clinical specimens (oral swabs) from three Chinese patients infected by SARS-CoV-2 (SRA projects PRJNA601736 and PRJNA603194). We aligned the raw read data on reference strain MN975262.1 using bowtie2 [18] , after quality check with FastQC (www.bioinformatics.bbsrc.ac.uk/projects/fastqc). The resulting alignments were visualized with the Integrated Genomics Viewer (IGV) [19] . After removing PCR duplicates, SNVs were called with a Bonferroni-corrected P-value threshold of 0.05 using samtools [20] and LoFreq [21] . Variants supported by absolute read concordance (>98%) were filtered-out from intra-host variant frequency calculations. We annotated the variations to the reference strain using snpEff [22] , SNVs effects were further filtered with snpSift [23] and we estimated the average mutation rate per gene across the viral genome using R scripts. We compared the localization of the intra-host SNVs with all available SNVs observed at population level up to February 18 th 2020 (retrieved from www.GISAID.org). We also compared all intra-host and population level SNPs with all primers and probes coordinates to investigate for potential interferences with currently available molecular diagnostic assays [24] (www.who.int/docs/default-source/coronaviruse/peiris-protocol-16-1-20.pdf). To investigate intra-host genomic rearrangements, we performed de novo assembly of the SARS-CoV-2 genomes using Spades [25] , and the resulting contigs were analyzed with BLAST [26] and confirmed by remapping of the raw reads. Smaller contigs (<200 bp) were elongated where possible, after pair-wise realignment of the corresponding mapped reads. Basic computations and visualizations we implemented in R programming language R version 3.6.2, using in-house scripts. The secondary structures of the genomic regions surrounding the recombination breakpoints was predicted using RNAfold [27] . The mapping assembly of the viral genome was almost complete for all samples. The genome coverage and the average read depth across the genome was 99.99% and 133.5x for sample SRR10903401, 99.99% and 522.5x for sample SRR10903402, and 99.94%, and 598.2x for sample SRR10971381, respectively. The alignment statistics for all samples are summarized in Suppl. Table 1 . In all cases we isolated the same 5 SNVs with 98-100% read concordance, thus in total divergence with the reference strain (MN975262.1), which were excluded from the downstream analysis. For sample SRR10903401 we isolated 34 lower frequency SNVs in total. Off these, 33 were present with frequencies ranking between 2 and 15%, while only one was present in 40% of the intra-host viral population. The sequencing depth, which is also evaluated during the SNV calling by the LoFreq algorithm, ranked between 39x and 290x at the corresponding SNV positions. The sequencing depth of sample SRR10903402 at the polymorphic positions was substantially higher (103x -1137x), allowing the isolation of 55 SNVs with frequencies distributed between 0.9% and 14%. The depth over the polymorphic positions of sample SRR10971381 was between 159x -1872x, allowing the isolation of 10 intra-host SNVs, with frequencies 1.1% -6.8% (Figure 1 .A, Suppl. Table 2 ). Intra-host variants were distributed across 7 out of the 10 protein-coding genes of the viral genome, namely ORF1ab, S, ORF3a, ORF6, ORF7a, ORF8 and N. After normalising for the gene length (variants / kb-gene-length -"v/kbgl"), the higher density was observed in the small ORF6 (16.21 v/kbgl), followed by ORF8 (8.21 v/kbgl), N (4.76 v/kbgl), S (4.18 v/kbgl), ORF1ab (3.47 v/kbgl), ORF7a (2.73 v/kbgl) and ORF3a (1.21 v/kbgl). Interestingly, the majority of the SNPs corresponded to missense changes (leading to amino-acid change) compared to synonymous changes (72 vs. 29 respectively, ratio 2.48:1) ( Table 1) . The average intra-host variant frequency did not differ substantially either between missense and synonymous polymorphisms (Figure 1.C) , neither between their hosting genes (Figure 1.D) . We did not detect any small-scale insertions or deletions in the samples (Suppl. Table 2 ). The comparison of all SNVs (intra-host and population level) with the genomic targets of the molecular diagnostics assays, revealed colocalizations of three intra-host SNVs and 2 isolate-specific SNVs with primers and probes currently in use. In detail, intra-host SNVs colocalized with the probe of RdRP_SARSr reaction (15, The de novo assembly of the viral genomes revealed intra-host genomic rearrangements. For samples SRR10903401 and SRR10903402, these large-scale structural events were systematically observed over poly-A / poly-U-rich genomic regions, located in ORF1ab and S genes. In all cases, similar or identical strings of nucleotides in close proximity appear to have served as seeds for homologous recombination events. All rearrangements were validated by remapping of the raw reads on the corresponding de novo assembled contigs, setting a threshold of at least 5 supporting reads of high mapping quality (>40) in each case. For sample SRR10903401 we isolated three inversions/misassemblies in ORF1ab (Suppl. Figure 2) and one inversion/misassembly in S gene (Figure 3-A) . Notably, we were able to validate the same inversion in S gene for sample SRR10903402 as well (Figure 3-B) . Apart from 2 inversions in ORF1ab supported by only 2 reads each (not passing the validation threshold), there were no further large-scale intra-host events observed for sample SRR10903402. Similarly, we identified one inversion/misassembly in sample SRR10971381 that was supported by only one read. The alignment coordinates of all rearrangementsupporting contigs with respect to the reference strain are presented in (Table 2) . The rapid spread and the death toll of the new SARS-CoV-2 epidemic warrants the immediate identification / development of effective antiviral agents and vaccines, but also the design of accurate diagnostics. The intra-and inter-patient variability of the viral genome plays a pivotal role in all the abovementioned efforts, since it affects the compatibility of molecular diagnostics but also impairs the effectiveness of the vaccines and the serological assays by altering the antigenicity of the virus. Intra-host low-frequency variants are also the main source of resistance to anti-viral drugs. Bioinformatics analysis of NGS data allows the generation of the consensus sequence of a viral genome from the of majority nucleotides at each position but also the identification of non-consensus nucleotides, enabling the exploration of intra-host variability but also its consequences on intra-host viral evolution [28] - [30] . All samples analysed in this study were probably infected by the same viral strain since they shared the same set of consensus SNVs. However, apart from 3 intra-host SNVs that were common between SRR10903401 and SRR10903402, there was no other overlap observed between the low frequency variants of each sample (Figure 1-B) . This indicates that these variations have been occurred in a rather random fashion and are not subject of selective pressures, which is also supported by the fact that the missense mutations were systematically more, compared to the synonymous mutations. On the other hand, missense substitutions are more common in loci involving pathogen resistance, indicating positive selection [31] . The analysed viral RNA might have been originated from functional/packed virions, but also from unpacked viral genomes, which are unable to replicate and infect other host cells. Even if a viral genome is unable to replicate independently, its abundant presence in the pool of viral quasispecies implies some functionality regarding the intra-host evolution and adaptation. For example, defective viral genomes might affect infection dynamics such as viral persistence but also the natural history of an infection [32] - [34] . At the same time, these variants may arise rapidly during an outbreak and can be used for tracking the transmission chains and the spaciotemporal characteristics of the epidemic [35] - [37] . Studies involving large number of samples and in-vitro experiments on SARS-CoV-2 viral isolates are needed, in order to conclude whether these variations are advantageous or come with a fitness cost for the virus. SNVs and quasispecies that are observed at low frequency could represent viral variations of low impact on the functionality of the genome. However, their abundance is largely affected by the population size and the epidemic characteristics. For example, a neutral substitution in a region that represents a primer target for a molecular diagnostic assay can drift to fixation rather quickly in a rapidly spreading virus, jeopardizing the sensitivity of the assay [38] , [39] . Here, we highlight three intra-host but also two fixed variants that colocalized with primers or probes of real-time PCR diagnostics assays that are currently in use ( Figure 2 ). Since the alignment of these oligos with their genomic targets is directly linked to the performance of the corresponding diagnostic assays, the community should pay extra attention in the evaluation of these potentially emerging variations and be alerted, in case redesigning of these oligos is needed. As it is well documented, recombination events lead to substantial changes in genetic diversity of RNA viruses [40] , [41] . In CoVs, discontinuous RNA synthesis is commonly observed, resulting in high frequencies of homologous recombination [42] , which can be up to 25% across the entire CoV genome [43] . For pathogenic HCoVs genomic rearrangements are frequently reported during the course of epidemic outbreaks, such as HCoV-OC43 [44] , and HCoV-NL63 [45] , SARS-CoV [46] [44] and MERS-CoV [47] . We have isolated intra-host genomic rearrangements, located in poly-A and poly-U enriched palindrome regions across the SARS-CoV-2 genome (Figure 3, Suppl. Figure 1) . We have validated the majority of these events by visual inspection of the alignments. We conclude that these rearrangements do not represent artifacts derived from the NGS library preparation (e.g. PCR crosstalk artifacts), especially since all the supporting reads were not duplicated and, in some cases, differed in polymorphic positions (Suppl. Figure 2) . Recombination processes involving S gene particularly, have been reported for SARSand SARS-like CoV but also for HCoV-OC43. In the case of sister species HCoV-NL63 and HCoV-229E, recombination breakpoints are located near 3'-and 5'-end of the gene [1] [47] . S is a trimeric protein, which is cleaved into two subunits, the globular N-terminal S1 and the Cterminal S2 [48] . The S1 subunit consists of a signal peptide and the NT and receptor binding (RB) domains, with the latter sharing only 40% amino acid identity with other SARS-related CoVs. Our analysis revealed that similarly to other genomic regions, the S1 subunit hosts many low-frequency SNVs, characterized by higher density compared to the rest of the S gene sequence (Figure 1-E) . The S2 subunit is highly conserved, with 99% identity compared to human SARS-CoV and two bat SARS-like CoVs [9] . The S2 subunit consists of two fusion peptides (FP, IFP), followed by two heptad repeats (HR 1 and 2) , the pretransmembrane domain (PTM), the transmembrane and the cytoplasmic domain (TM, CP) [48] . In S gene, the same rearrangement event has taken place in two samples analyzed in this study. This observation highlights a potential recombination hot-spot in S gene. The rearrangement that was common between the two samples of this study is located in nt24,000 of the 2019-nCoV genome, which corresponds to the ~200nt linking region between the fusion peptides FP and IFP (aa 812-813). Examining closely the secondary structure of the RNA genome around the breakpoints, we suggest a model where the palindromes 5'-UGGUUUU-3' and 5'-AAAACCAA-3', have served as donor-acceptor sequences during the recombination event, since they are both exposed in the single-stranded internal loops formed in a highly structured RNA pseudoknot (Figure 3-C) . The RB domain of the S protein has been tested as a potential immunogen as it contains neutralization epitopes which appear to have a role in the induction of neutralizing antibodies [16] , [49] . It should be mentioned though that S protein of SARS-CoV is the most divergent in all strains infecting humans [50] , [51] , as in both C and N-terminal domains variations arise rapidly, allowing immunological escape [52] . Our findings support that apart from these variations, the N-terminal region also hosts a recombination hot-spot, which together with the rest of the observed rearrangements, indicates the genomic instability of SARS-CoV-2 over poly-A and poly-U regions. Prediction of the secondary structure of the genomic region spanning the rearrangement breakpoint (100 bases upstream and 100 bases downstream). The corresponding donor-acceptor sequences, exposed in internal loops, are indicated in green bars. I II I II I II 5' 3' Base-pair probabilities 0 1 MFE secondary structure G A A G U U U U U G C A C A A G U C A A A C A A A U U U A C A A A A C A C C A C C A A U U A A A G A U U U U G G U G G U U U U A A U UUU U C A C A A A U A U U A C C A G A U C C A U C A A A A C C A A G C A A G A G G U C A U U U A U U G A A G A U C U A C U U U U C A A C A A A G U G A C A C U U G C A G A U G C U G G C U U C A U C A A A C A A U A U G G U G A U U G C C U U G G U G A U A U U G C U G C C Hosts and Sources of Endemic Human Coronaviruses Evolutionary Insights into the Ecology of Coronaviruses Coronavirus Infections -More Than Just the Common Cold Structural biology: Structure of SARS coronavirus spike receptor-binding domain complexed with receptor Origin and evolution of pathogenic coronaviruses Bats Are Natural Reservoirs of SARS-Like Coronaviruses Novel coronavirus 2019-nCoV: early estimation of epidemiological parameters and epidemic predictions Genomic characterization of the 2019 novel human-pathogenic coronavirus isolated from a patient with atypical pneumonia after visiting Wuhan Rapid Reversion of Sequence Polymorphisms Dominates Early Human Immunodeficiency Virus Type 1 Evolution Assessment of phylogenetic sensitivity for reconstructing HIV-1 epidemiological relationships Quasispecies diversity determines pathogenesis through cooperative interactions in a viral population The interferon receptor-1 promoter polymorphisms affect the outcome of Caucasians with HB eAg-negative chronic HBV infection CD4+ T cells mediate antibody-independent acquired immunity to pneumococcal colonization Patterns of antigenic diversity and the mechanisms that maintain them A decade after SARS: Strategies to control emerging coronaviruses Fast gapped-read alignment with Bowtie 2 Variant review with the integrative genomics viewer The Sequence Alignment/Map format and SAMtools LoFreq: A sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3 Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing Basic local alignment search tool The Vienna RNA websuite Full-genome deep sequencing and phylogenetic analysis of novel human betacoronavirus The Application of Genomics to Emerging Zoonotic Viral Diseases Unravelling the history of hepatitis B virus genotypes A and D infection using a full-genome phylogenetic and phylogeographic approach Nature encyclopedia of the human genome Long-term transmission of defective RNA viruses in humans and Aedes mosquitoes The outcome of acute hepatitis C predicted by the evolution of the viral quasispecies HCV defective genomes promote persistent infection by modulating the viral life cycle HIV-1 epidemic in Russia: an evolutionary epidemiology analysis Molecular investigation of HIV-1 cross-group transmissions during an outbreak among people who inject drugs An innovative study design to assess the community effect of interventions to mitigate HIV epidemics using transmission-chain phylodynamics Simultaneous detection of severe acute respiratory syndrome, Middle East respiratory syndrome, and related bat coronaviruses by real-time reverse transcription PCR A contaminant-free assessment of Endogenous Retroviral RNA in human plasma The evolution and emergence of RNA viruses Spatiotemporal characteristics of the HIV-1 CRF02_AG/CRF63_02A1 epidemic in Russia and Central Asia RNA recombination in animal and plant viruses Establishing a genetic recombination map for murine coronavirus strain A59 complementation groups Molecular Epidemiology of Human Coronavirus OC43 Reveals Evolution of Different Genotypes over Time and Recent Emergence of a Novel Genotype due to Natural Recombination Mosaic Structure of Human Coronavirus NL63, One Thousand Years of Evolution Evidence of the Recombinant Origin of a Bat Severe Acute Respiratory Syndrome (SARS)-Like Coronavirus and Its Implications on the Direct Ancestor of SARS Coronavirus Co-circulation of three camel coronavirus species and recombination of MERS-CoVs in Saudi Arabia Bat-to-human: Spike features determining 'host jump' of coronaviruses SARS-CoV, MERS-CoV, and beyond Evaluation of serologic and antigenic relationships between middle eastern respiratory syndrome coronavirus and other coronaviruses to develop vaccine platforms for the rapid response to emerging coronaviruses Severe acute respiratory syndrome coronavirus spike protein expressed by attenuated vaccinia virus protectively immunizes mice Vaccines to prevent severe acute respiratory syndrome coronavirus-induced disease The recombinant N-terminal domain of spike proteins is a potential vaccine against Middle East respiratory syndrome coronavirus (MERS-CoV) infection Prediction of the secondary structure of the genomic region spanning the rearrangement breakpoint (100 bases upstream and 100 bases downstream). The corresponding donoracceptor sequences