key: cord-0841988-9785qudj
authors: Cresswell-Clay, Evan; Periwal, Vipul
title: Genome-Wide Covariation in SARS-CoV-2
date: 2021-03-10
journal: bioRxiv
DOI: 10.1101/2021.03.08.434363
sha: 641e1aa978ff267615438c1bee73c3c772547b06
doc_id: 841988
cord_uid: 9785qudj

The SARS-CoV-2 virus causing the global pandemic is a coronavirus with a genome of about 30 kilobases [Song et al., 2019]. The design of vaccines and the choice of therapies depend on the structure and mutational stability of the proteins encoded in the open reading frames (ORFs) of this genome. In this study, we used Expectation Reflection (ER) to compute the genome-wide covariation of the SARS-CoV-2 genome, based on an alignment of ≈130,000 complete SARS-CoV-2 genome sequences obtained from GISAID [Shu & McCauley, 2017]. We used this covariation to compute the Direct Information between pairs of positions across the whole genome, investigating potentially important relationships both within each encoded protein and between encoded proteins. We then computed the covariation within each clade of the virus. The covariation detected recapitulates all clade determinants, and each clade exhibits distinct covarying pairs.

As we decrease the threshold for conserved columns, the condition for variation at a given position becomes more stringent and the number of columns retained for analysis decreases. As in the example above, going from 100% (full genome, all columns allowed) to 95% conservation removes a significant number of genome positions. Because inference with ER relies on mutations at a given position, we must consider the resulting number of columns, or incidence, after such curation. In addition, region-specific incidence may also underline a region's importance to the efficacy of the virus, because higher incidence represents more variation and mutation.
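The notion of incidence described here can be made concrete with a small sketch. It assumes that a column's conservation is the frequency of its most common symbol, and that a column is retained when this frequency does not exceed the threshold (so a 100% threshold allows every column). The function names `column_conservation` and `incidence` and the toy alignment are illustrative assumptions, not the authors' code.

```python
from collections import Counter


def column_conservation(sequences):
    """Return, for each alignment column, the frequency of its most common symbol."""
    n_seqs = len(sequences)
    n_cols = len(sequences[0])
    conservation = []
    for k in range(n_cols):
        counts = Counter(seq[k] for seq in sequences)
        # most_common(1) gives [(symbol, count)] for the dominant symbol
        conservation.append(counts.most_common(1)[0][1] / n_seqs)
    return conservation


def incidence(sequences, threshold):
    """Count columns whose conservation is at most `threshold` (assumed inclusive)."""
    return sum(1 for c in column_conservation(sequences) if c <= threshold)


# Hypothetical toy alignment: columns 0 and 2 are fully conserved,
# columns 1 and 3 each have a 75% dominant symbol.
toy_alignment = ["ACGT", "ACGA", "ATGA", "ACGA"]

for threshold in (1.0, 0.95, 0.5):
    print(threshold, incidence(toy_alignment, threshold))
```

Lowering the threshold can only keep the incidence the same or reduce it, which is the trade-off between stringency and the number of positions available for inference.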
We also consider the clade-specific incidence of the full genome, since we consider genome interactions in different clades in later sections.

In Figure 1 we plot the incidence of the ORF1ab and S regions for different thresholds of allowed conservation. These incidence curves are given for the different clade data sets. Figure 1 shows that region incidence varies between clades and that the incidence of different encoding regions is affected differently for a given clade. For example, compare the full genome data set with the S clade set in Figure 1. The full genome data set (blue) has one of the highest ORF1ab incidence curves, but the same level of incidence is not necessarily expressed in the S region. In contrast, the S clade (purple) shows middling incidence for ORF1ab and one of the strongest incidence curves for the S encoding region.

For the remainder of the paper we set the conservation threshold to 95% in order to retain a significant number of positions for the subsequent analysis. We must also consider that the size of a given clade, or genome data set, affects the incidence and variability. Specifically, as the number of sequences considered changes, the level of variability enforced by the conservation threshold changes as well. Therefore, when considering smaller data sets, such as the S and V clades, we must keep in mind that the incidence is affected by the cardinality of the set itself.

We begin by inferring interactions between nucleotide positions across the entire genome.

Figure 3 shows that the region also has the largest number of non-conserved nucleotide positions, or position incidence (as described in Section 2.1), of the major encoding regions. This increased incidence is expressed in the cardinality of the interaction map.

Table 2). We continue with the same analysis for the S encoding region.
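The interactions inferred between positions are scored by Direct Information (DI). As a hedged illustration, the sketch below uses the DI definition common in direct coupling analysis: for a pair of positions with coupling matrix W and single-site frequencies f_i and f_j, it fits a two-site distribution P(a,b) ∝ exp(W[a][b] + h_i[a] + h_j[b]) whose marginals match f_i and f_j, then returns the sum of P(a,b) ln(P(a,b) / (f_i[a] f_j[b])). The function names, the iterative proportional fitting loop, and the toy inputs are assumptions for illustration; the ER-based pipeline used in the paper may differ in its details.

```python
import math


def direct_pair_distribution(W, f_i, f_j, n_iter=200):
    """Fit P(a,b) proportional to exp(W[a][b]) * u[a] * v[b] with marginals f_i, f_j.

    Uses iterative proportional fitting; assumes all frequencies are positive
    (for example after adding a pseudocount).
    """
    q_i, q_j = len(f_i), len(f_j)
    K = [[math.exp(W[a][b]) for b in range(q_j)] for a in range(q_i)]
    u = [1.0] * q_i  # plays the role of exp(h_i)
    v = [1.0] * q_j  # plays the role of exp(h_j)
    for _ in range(n_iter):
        # Rescale rows to match f_i, then columns to match f_j.
        u = [f_i[a] / sum(K[a][b] * v[b] for b in range(q_j)) for a in range(q_i)]
        v = [f_j[b] / sum(K[a][b] * u[a] for a in range(q_i)) for b in range(q_j)]
    return [[u[a] * K[a][b] * v[b] for b in range(q_j)] for a in range(q_i)]


def direct_information(W, f_i, f_j, n_iter=200):
    """Direct Information of one position pair under the two-site model above."""
    P = direct_pair_distribution(W, f_i, f_j, n_iter)
    di = 0.0
    for a, row in enumerate(P):
        for b, p in enumerate(row):
            if p > 0.0:
                di += p * math.log(p / (f_i[a] * f_j[b]))
    return di


# Hypothetical two-state example: no coupling versus a diagonal coupling.
uniform = [0.5, 0.5]
print(direct_information([[0.0, 0.0], [0.0, 0.0]], uniform, uniform))
print(direct_information([[2.0, 0.0], [0.0, 2.0]], uniform, uniform))
```

With zero coupling the fitted pair distribution factorizes into the single-site frequencies, so the DI vanishes; a nonzero coupling yields a positive DI even though the single-site frequencies are unchanged.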
The ORF3a gene region encodes a unique membrane protein with a 3-membrane structure

Tables 8 and 9. In Table 9 we see a new case in the

It is important to note that there was little significant coevolution between positions in ORF7ab and other regions of the genome. For this region we present only the AA single site frequencies, in Table 11, because of the single AA dominance at each position considered. In addition to this strong dominance (≥ 90%) of the primary AA at each position, the secondary AA at all positions was undefined (X).

When investigating genetic variance, it is useful to stratify available data to understand and

We begin our clade analysis with the largest existing clade. The G clade is stratified by

showing sufficient variation. This change in incidence is expressed in Figure 9 as a decrease in cardinality of the interaction map from the full genome set to the G clade genome set.

clade. This is not due to an overall loss in information, as the DI magnitude remains in

we forgo further analysis as the inference would be unreliable.

First, as the database of SARS-CoV-2 genomes grows, the incidence and overall variability will increase, yielding further insights into genome interactions. Additionally, the availability of data over longer time periods will allow for chronological compartmentalization of genome data sets, so that interaction maps can be compared across the temporal evolution of the virus. Second, this analysis can also be applied to diseases for which more data are available, as the importance of genome interactions is not specific to SARS-CoV-2.

894,X:.106
22384  T:.896, X:.104
23401  Q:.998, X:.001
7540   T:.997, X:.003
18555  D:.997, X:.003
22497  I:.893, X:.107
22495  G:.893, X:.107
22353  A:.891, X:.109
22355
1163   I:.938, X:.057
16647  T:.989, X:.011
22334  W:.937

Single Site Freqs.
27801  F:.934, X:.065
27803  F:.934, X:.065
27792  F:.929, X:.071
27688  P:.904, X:.096
27698  L:.904, X:.1
27700  I:.903, X:.1
27579  I:.902, X:.097
27581  F:.901, X:.099
27805  L:.941

References

Identification of direct residue contacts in protein-protein interaction by message passing
Coronavirus disease (COVID-2019) situation reports
Genome composition and divergence of the novel coronavirus (2019-nCoV) originating in China
Network-based drug repurposing for novel coronavirus 2019-nCoV/SARS-CoV-2
A novel coronavirus from patients with pneumonia in China