key: cord-0901150-oot6ioi6
authors: O’Donoghue, Seán I.; Schafferhans, Andrea; Sikta, Neblina; Stolte, Christian; Kaur, Sandeep; Ho, Bosco K.; Anderson, Stuart; Procter, James; Dallago, Christian; Bordin, Nicola; Adcock, Matt; Rost, Burkhard
title: SARS-CoV-2 structural coverage map reveals state changes that disrupt host immunity
date: 2020-09-09
journal: bioRxiv
DOI: 10.1101/2020.07.16.207308
sha: ccdba4642defd7df93ce741b8215334d9b2caf3f
doc_id: 901150
cord_uid: oot6ioi6

In response to the COVID-19 pandemic, many life scientists are focused on SARS-CoV-2. To help them use available structural data, we systematically modeled all viral proteins using all related 3D structures, generating 872 models that provide detail not available elsewhere. To organise these models, we created a structure coverage map: a novel, one-stop visualization summarizing what is — and is not — known about the 3D structure of the viral proteome. The map highlights structural evidence for viral protein interactions, mimicry, and hijacking; it also helps researchers find 3D models of interest, which can then be mapped with UniProt, PredictProtein, or CATH features. The resulting Aquaria-COVID resource (https://aquaria.ws/covid) helps scientists understand molecular mechanisms underlying coronavirus infection. Based on insights gained using our resource, we propose mechanisms by which the virus may enter immune cells, sense the cell type, then switch focus from viral reproduction to disrupting host immune responses.

Due to the COVID-19 pandemic, many life scientists have recently switched focus towards SARS-CoV-2 (Severe Acute Respiratory Syndrome Coronavirus 2). This includes structural biologists, who have so far determined ~300 Protein Data Bank entries (PDB; Berman et al., 2000) for the 27 viral proteins.

These structures are, in turn, driving molecular modeling studies, most focused on the spike glycoprotein (Jaimes et al., 2020; Piplani et al., 2020; Rodrigues et al., 2020). Other modelling studies focus on breadth of coverage, predicting 3D structures for the entire SARS-CoV-2 proteome; this has been done using AlphaFold (Senior et al., 2020), C-I-TASSER, Rosetta (Rohl et al., 2004), and SwissModel (Waterhouse et al., 2018). Unfortunately, these predictions vary greatly (Heo and Feig, 2020), raising quality issues explored in the current CASP activity (Kryshtafovych et al., 2019). Additionally, these studies derive few structural states for each viral protein, thereby producing moderate numbers of models (e.g., 24 for AlphaFold and 116 for SwissModel).

We aim to address these limitations via a depth-based strategy that models multiple states for each viral protein, using all related 3D structures in the PDB. This leverages structures determined for other coronaviruses, such as SARS-CoV, BtCoV-HKU4 (bat coronavirus HKU4), FCoV (feline coronavirus), IBV (avian infectious bronchitis virus), and MHV-A59 (mouse hepatitis virus A59), as well as distant viruses, such as CHIKV (Chikungunya virus) and FMDV (foot-and-mouth disease virus).

Combining breadth and depth of coverage requires modeling methods with low computational cost; here, we use only sequence profile comparisons (Steinegger et al., 2019) to align viral sequences onto experimentally derived 3D structures. This generates what we call minimal models, in which 3D coordinates are not modified, but are mapped onto SARS-CoV-2 sequences using coloring that indicates model quality.
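The idea of a minimal model can be illustrated with a short sketch. It assumes a pairwise alignment is already available as two gapped strings; the function names, the three labels, and the toy sequences are hypothetical and are not taken from Aquaria. The template coordinates are never modified: each query residue is simply paired with a template residue number and given a label that a viewer could translate into a color.

```python
from typing import List, Optional, Tuple

# (query residue number, template residue number or None, label)
Mapped = Tuple[int, Optional[int], str]


def map_alignment(query_aln: str, template_aln: str,
                  query_start: int = 1, template_start: int = 1) -> List[Mapped]:
    """Pair each query residue with a template residue from a gapped alignment."""
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned strings must have equal length")
    mapping: List[Mapped] = []
    q_pos, t_pos = query_start, template_start
    for q, t in zip(query_aln, template_aln):
        if q != "-" and t != "-":
            label = "identical" if q == t else "substituted"
            mapping.append((q_pos, t_pos, label))
        elif q != "-":
            # Query residue with no template coordinates (alignment gap).
            mapping.append((q_pos, None, "gap"))
        if q != "-":
            q_pos += 1
        if t != "-":
            t_pos += 1
    return mapping


def percent_identity(mapping: List[Mapped]) -> float:
    """Identity over aligned positions only (gaps are excluded)."""
    aligned = [m for m in mapping if m[1] is not None]
    if not aligned:
        return 0.0
    identical = sum(1 for m in aligned if m[2] == "identical")
    return 100.0 * identical / len(aligned)


# Toy alignment, invented for illustration only.
toy = map_alignment("MESLV-PGF", "MEKLVNP-F")
print(toy)
print(round(percent_identity(toy), 1))
```

Because the mapping is a plain list, it is easy to inspect how each colored residue was derived, which is the main appeal of this kind of model.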

Minimal models have benefits: it is easy to understand how they were derived, helping assess the validity of insights gained. Thus, minimal models are broadly useful, even for researchers who are not modelling experts. Conversely, models generated by more sophisticated methods (e.g., Senior et al., 2020) can be more accurate (Kryshtafovych et al., 2019), but require more time and expertise to assess their accuracy (Heo and Feig, 2020) and the validity of insights gained; this often limits their usefulness.

Large numbers of models can be generated by such minimal strategies, raising a new problem: how to visually organize such complex datasets to be usable. We consider this one instance of a critical issue impeding not just COVID-19 research, but all life sciences. Thus, we introduce a novel concept: a one-stop visualization summarizing what is known, and not known, about the 3D structure of the viral proteome. This tailored visualization, called the SARS-CoV-2 structural coverage map, helps researchers find structural models related to specific research questions. Once a structural model of interest is found, it can be used to explore the spatial arrangement of sequence features, i.e., residue-based annotations, such as nonsynonymous mutations or post-translational modifications. Here, we integrated SARS-CoV-2 models into Aquaria, a molecular graphics system designed to simplify feature mapping and make minimal models broadly accessible to researchers who are not modelling experts. Previously, Aquaria could only map features from UniProt (The UniProt Consortium, 2019); for this work, we added >32,000 SARS-CoV-2 features from additional sources.

The resulting Aquaria-COVID resource (https://aquaria.ws/covid) comprises a large set of SARS-CoV-2 structural information not readily available elsewhere. The resource further identifies structurally dark regions of the proteome, i.e., regions with no significant sequence similarity to any protein region observed by experimental structure determination. Dark proteins perform important biological functions (Schafferhans et al., 2018); identifying them helps direct future research that might reveal currently unknown infection mechanisms.
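As a rough illustration of how dark regions follow from matched regions, the sketch below subtracts matched residue intervals from a protein's residue range. The function name is hypothetical, and the example ranges are those quoted for NSP4, NSP6, and NSP2 in the Results of this paper. Short dark regions caused by gaps inside an alignment are not modeled here.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive residue range


def dark_regions(start: int, end: int, matched: List[Interval]) -> List[Interval]:
    """Return the parts of [start, end] not covered by any matched interval."""
    dark: List[Interval] = []
    cursor = start
    for m_start, m_end in sorted(matched):
        # Clip each match to the protein boundaries.
        m_start, m_end = max(m_start, start), min(m_end, end)
        if m_start > m_end:
            continue  # match lies entirely outside the protein
        if m_start > cursor:
            dark.append((cursor, m_start - 1))
        cursor = max(cursor, m_end + 1)
    if cursor <= end:
        dark.append((cursor, end))
    return dark


# NSP4 spans PP1a 2764-3263; only the C-terminal domain (3173-3263) is matched.
print(dark_regions(2764, 3263, [(3173, 3263)]))
# NSP6 spans PP1a 3570-3859 and has no matching structures.
print(dark_regions(3570, 3859, []))
# NSP2 spans PP1a 181-818; its single match covers PP1a 104-562.
print(dark_regions(181, 818, [(104, 562)]))
```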

Below, we describe the resource and systematically review insights it provides across the proteome. In the discussion, we highlight insights gained into how SARS-CoV-2 proteins self-assemble, and how they may mimic (Elde and Malik, 2009; Murphy, 1993) host proteins and hijack (Davey et al., 2011) host processes, including innate and adaptive immunity, nonsense-mediated decay, ubiquitination, and regulation of chromatin and telomeres.

During the COVID-19 pandemic, the Aquaria-COVID resource aims to fulfil a vital role by helping scientists more rapidly understand molecular mechanisms of coronavirus infection that are currently known, and to keep abreast of emerging knowledge, as new 3D structures and sequence features become available.

The Aquaria-COVID resource uses 14 UniProt sequences (The UniProt Consortium, 2019) that comprise the SARS-CoV-2 proteome. We matched these (using HHblits; Steinegger et al., 2019) against sequences of all available 3D structures in the PDB (Berman et al., 2000), resulting in 872 minimal models (Tables S1-S3) that we incorporated into Aquaria, where they can be mapped with >32,000 features from UniProt, CATH (Dawson et al., 2017), SNAP2 (Hecht et al., 2015), and PredictProtein (Yachdav et al., 2014). These features include residue-based prediction scores for conservation, disorder, domains, flexibility, mutational propensity, subcellular location, and transmembrane helices (see Methods for details). To help other researchers use these models and features, we created a matrix layout showing models for all 14 UniProt sequences (https://aquaria.ws/covid#matrix). We also created a structural coverage map (Figures 1 & 2; https://aquaria.ws/covid), a novel visual layout based on the viral genome organization. The coverage map summarizes key results obtained, including evidence for viral mimicry (Figure 3) or hijacking of host proteins (Figure 4), as well as viral protein interactions (Figure 5). The map also helps scientists find 3D models related to specific research questions. Below, we introduce each element of the structural coverage map as we use it to describe structural models found in three viral genome regions.

Polyprotein 1a (a.k.a. PP1a) derives from polyprotein 1ab (a.k.a. PP1ab), and is cleaved into 10 proteins (NSP1-NSP10) that modify viral proteins, disable host defenses, and support viral replication.

NSP1 derives from residues 1-180 of PP1a and is thought to interact with the ribosome, suppressing translation of host mRNAs and promoting their degradation (Kamitani et al., 2009). We found two structures matching most of the NSP1 sequence (Figure 1), both determined for NSP1 from SARS-CoV, which had 85% pairwise sequence identity to the matched region of NSP1 from SARS-CoV-2. These sequence-to-structure alignments were generated using HHblits sequence profiles (Steinegger et al., 2019), and had a significance score (E-value) of 10^-20, giving ≥ 97% likelihood that NSP1 from SARS-CoV and SARS-CoV-2 adopt similar structures. Both matching structures were from the same researchers, so the more recent structure (P0DTC1/2gdt) was selected for the representative image, which was colored to convey alignment quality (Figure 1). Unfortunately, neither structure showed interactions with other proteins or RNA and, unusually, they provided few functional insights, partly because they had a unique fold with no matches in CATH (Dawson et al., 2017). NSP1 also had two short dark regions resulting from alignment gaps (Figure 1); the N-terminal dark region may be accounted for by high flexibility (average predicted B-value = 60; see Methods).

NSP2 (PP1a 181-818) may disrupt intracellular signaling by interaction with host proteins (Cornillez-Ty et al., 2009). Unfortunately, these interactions are not captured in the single matching structure (P0DTC1/3ld1; Figure 1) derived for NSP2 from IBV, a remote homolog with 11% pairwise sequence identity to the matched region of PP1a (residues 104-562). However, the HHblits E-value was 10^-32, giving ≥ 98.5% likelihood that SARS-CoV-2 NSP2 adopts a similar structure. This is a stronger prediction than for NSP1 (above), largely because the matched region was much longer. Interestingly, the matched region overlapped NSP1 (PP1a 104-180). NSP2 also had multiple short dark regions, resulting from alignment gaps, plus a long C-terminal dark region (Figure 1). The Aquaria webpage for this matching structure showed that, very unusually, there is no scientific publication associated with this 10-year-old dataset. Thus, while it may be of some use for researchers focused on NSP2, for most researchers this structure should probably be excluded, making NSP2 an entirely dark protein.

NSP3 (PP1a 819-2763) is a large, multidomain protein thought to perform many functions, including anchoring the viral replication complex to double-membrane vesicles derived from the endoplasmic reticulum (Lei et al., 2018). NSP3 had 188 matching structures clustered in nine distinct sequence regions.

NSP3 region 1 (a.k.a. Ubl1; PP1a 819-929) was the least conserved NSP3 region (average ConSurf score = 3.7; see Methods), suggesting it adapts to host-specific defenses. Ubl1 is thought to bind single-stranded RNA and the viral nucleocapsid protein (Lei et al., 2018). Unfortunately, these interactions are not shown in the three matching structures found, which all adopt a ubiquitin-like topology (CATH 3.10.20.350). Although it has distinct structural differences, Ubl1 may mimic host ubiquitin (Lei et al., 2018); however, we found no matches to structures of human ubiquitin, undermining the mimicry hypothesis. For the representative image (Figure 2), we used the top-ranked matching structure provided by Aquaria: P0DTC1/2gri, derived for NSP3 from SARS-CoV, which aligned onto SARS-CoV-2 NSP3 with 77% identity and with a significance of E = 10^-23.

NSP3 region 2 (PP1a 930-1026) had no matches in CATH and no matching structures. This was the NSP3 region with lowest predicted sensitivity to mutation (median sensitivity 0%; see Methods), highest predicted flexibility (average B-factor = 55), highest fraction of disordered residues (47%), and highest fraction of residues predicted to be solvent-accessible (99%). We speculate that this region acts as a flexible linker and may contain post-translational modification sites hijacking host signalling, as are often found in viral disordered regions (Davey et al., 2011).

NSP3 region 3 (PP1a 1027-1193) has a macro domain (a.k.a. Mac1 or X domain; CATH 3.40.220.10) that may counteract innate immunity by interfering with ADP-ribose (ADPr) modification (Lei et al., 2018; Rack et al., 2016). This was the second least conserved NSP3 region (ConSurf = 3.9) and had the highest fraction of mutationally sensitive residues (29%), suggesting it is well adapted to specific hosts. This region had 144 matching structures; as a representative image we used P0DTC1/6woj, derived from SARS-CoV-2 (Figure 2). Of the matching structures, 30 were of human proteins (Figures 2 & 3A), suggesting these host proteins may be mimicked by the virus. The potentially mimicked human proteins were: GDAP2 (P0DTC1/4uml, 19% identity to NSP3 macro domain, E = 10^-22), MACROD1 (P0DTC1/2x47, 25% identity, E = 10^-22), MACROD2 (P0DTC1/4iqy, 24% identity, E = 10^-22), MACROH2A1 (P0DTC1/1zr5, 19% identity, E = 10^-18), MACROH2A2 (P0DTC1/2xd7, 19% identity, E = 10^-21), OARD1 (P0DTC1/2eee, 13% identity, E = 10^-12), PARP9 (P0DTC1/5ail, 22% identity, E = 10^-17), PARP14 (P0DTC1/3q6z, 25% identity, E = 10^-19), and PARP15 (P0DTC1/3v2b, 18% identity, E = 10^-16). An additional 73 structures matched to viral proteins, two in complex with RNA (P0DTD1/4tu0, 24% identity, E = 10^-17 and P0DTD1/3gpq, 24% identity, E = 10^-17; both from CHIKV). For brevity here and in Figure 2, we have omitted the remaining 41 matching structures derived from other host organisms (Table S1).
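To compare the nine human macro-domain matches at a glance, the values quoted above can be collected into a small table and sorted. This is only a sketch: the `Match` type and `rank_matches` function are hypothetical, and ordering by E-value exponent with ties broken by percent identity is an assumption made for illustration, not necessarily the ranking used by Aquaria.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    protein: str
    pdb_id: str
    identity: int    # percent identity to the NSP3 macro domain
    evalue_exp: int  # E-value is 10 ** evalue_exp


# Values as quoted in the text for NSP3 region 3.
HUMAN_MACRO_MATCHES: List[Match] = [
    Match("GDAP2", "4uml", 19, -22),
    Match("MACROD1", "2x47", 25, -22),
    Match("MACROD2", "4iqy", 24, -22),
    Match("MACROH2A1", "1zr5", 19, -18),
    Match("MACROH2A2", "2xd7", 19, -21),
    Match("OARD1", "2eee", 13, -12),
    Match("PARP9", "5ail", 22, -17),
    Match("PARP14", "3q6z", 25, -19),
    Match("PARP15", "3v2b", 18, -16),
]


def rank_matches(matches: List[Match]) -> List[Match]:
    """Sort by E-value (most significant first), then by identity (highest first)."""
    return sorted(matches, key=lambda m: (m.evalue_exp, -m.identity))


for m in rank_matches(HUMAN_MACRO_MATCHES):
    print(f"{m.protein:10s} {m.pdb_id}  {m.identity:2d}%  E = 10^{m.evalue_exp}")
```

Under this assumed ordering, three matches share the most significant E-value, so the tie-breaking rule determines which appears first.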

NSP3 region 4 (PP1a 1194-1368) comprised a disordered region plus a macro-like domain called SUD-N (a.k.a. Mac2; CATH 3.40.220.30) that may bind RNA (Lei et al., 2018). This region had no matching structures.

NSP3 region 5 (PP1a 1389-1493) comprised another macro-like domain called SUD-M (a.k.a. Mac3; CATH 3.40.220.20) that may bind both RNA and host proteins, and take part in viral replication (Lei et al., 2018). However, no interactions are shown in the seven matching structures found (P0DTC1/2jzd, 81% identity, E = 10^-34), all determined using SARS-CoV NSP3. Comparing these with structures matching NSP3 region 3, we see considerable differences, and no evidence of mimicry of host macro domains (Figure 2).

NSP3 region 6 (PP1a 1494-1562) comprised a domain called SUD-C (CATH 2.30.30.590) and had two matching structures: one from SARS-CoV (P0DTC1/2kqw, 76% identity, E = 10^-41) and one matching a distant virus (P0DTC1/4ypt from MHV-A59, 28% identity, E = 10^-62), which also spanned NSP3 region 7. Using these structures, Lei et al. speculated that this region may bind metal ions and induce oxidative stress (Lei et al., 2018).

NSP3 region 7 (PP1a 1563-1878) comprises a papain-like protease (a.k.a. PL-Pro), made up of three domains (CATH 3.10.20.540, CATH 1.10.8.1190, and CATH 3.90.70.90) thought to cleave NSP1-NSP3 from the polyprotein and to cleave ubiquitin-like modifications from host proteins (Figure 4A), thereby undermining interferon-induced antiviral activity (Dzimianski et al., 2019; Lei et al., 2018). PL-Pro had 38 matching structures (P0DTC1/6wrh from SARS-CoV-2, 100% identity), of which 10 show binding to human ubiquitin-like proteins (Figures 2 & 4A): four showed binding to ISG15 (P0DTC1/5tl6 from SARS-CoV, 82% identity, E = 10^-70); two showed binding to UBA52 (P0DTC1/4rf0 from MERS-CoV, 29% identity, E = 10^-65); two showed binding to UBB (P0DTC1/5wfi from MHV-A59, 31% identity, E = 10^-61); one showed binding to UBC (P0DTC1/4mm3 from SARS-CoV, 83% identity, E = 10^-62); and one showed binding to both UBB and UBC (P0DTC1/5e6j from SARS-CoV, 83% identity, E = 10^-62). Note that these identity scores and E-values indicate similarity between SARS-CoV-2 NSP3 and the viral proteins in each matching structure, not to the human proteins. Two additional matching structures showed SARS-CoV-2 NSP3 in complex with inhibitory peptides (P0DTC1/6wuu, 100% identity; P0DTC1/6wx4, 100% identity).

NSP3 region 8 (PP1a 1879-2020) comprises a nucleic-acid binding (a.k.a. NAB) domain thought to bind single-stranded RNA and to unwind double-stranded DNA (Lei et al., 2018). NAB had only one matching structure (P0DTC1/2k87 from SARS-CoV, 82% identity, E = 10^-24) with a fold not seen in any other PDB structure (CATH 3.40.50.11020).

NSP3 region 9 (PP1a 2021-2763) may anchor NSP3 to double-membrane vesicles (Lei et al., 2018). This was the most conserved NSP3 region (ConSurf = 4.9), suggesting it is less adapted to specific hosts. This region had no CATH matches and no matching structures.

NSP4 (PP1a 2764-3263) may act with NSP3 and NSP6 to create the double-membrane vesicles required for viral replication (Angelini et al., 2014). NSP4 mostly comprised a dark region (PP1a 2764-3172) with no CATH matches, no disorder, and multiple transmembrane helices. The C-terminal region (PP1a 3173-3263) comprised a domain called NSP4C (CATH 1.10.150.420) that had three matching structures (P0DTC1/3vcb from MHV-A59, 59% identity, E = 10^-35); all were homodimers, although NSP4C is thought to act as a monomer (Xu et al., 2009).

NSP5 (a.k.a. 3C-like protease or 3CL-Pro; PP1a 3264-3569) is a two-domain protein (CATH 2.40.10.10 and CATH 1.10.1840.10) thought to cleave the viral polyprotein at 11 sites, resulting in NSP4-NSP16. NSP5 had 256 matching structures (P0DTC1/5rfa from SARS-CoV-2, 100% identity), more than any other viral protein; of these, 15 showed binding to inhibitory peptides (P0DTC1/7bqy from SARS-CoV-2, 100% identity).

NSP6 (PP1a 3570-3859) is a transmembrane protein thought to act with NSP3 and NSP4 to create double-membrane vesicles (Angelini et al., 2014). Of the 15 PP1ab proteins, NSP6 was the only completely dark protein and had the lowest conservation (ConSurf = 2.9). NSP6 also had no CATH matches and no disordered regions.

In two structures, NSP7 occurs as a monomer (P0DTC1/2kys, 98% identity, E = 10^-32 and P0DTC1/1ysy, 98% identity, E = 10^-32; both from SARS-CoV), while the remaining 13 structures showed NSP7 bound to NSP8 (P0DTC1/3ub0 from FCoV, 43% identity, E = 10^-79). Six of these structures also showed binding to NSP12 (P0DTC1/7bv1 from SARS-CoV-2, 99% identity); two of these six structures also showed binding to RNA (P0DTC1/7bv2, 99% identity and P0DTC1/6yyt, 97% identity, E = 10^-32; both from SARS-CoV-2). NSP7 comprises an antiparallel helical bundle (CATH 1.10.8.370) that adopts distinct substates, depending on its interaction partners (Figure 5A).

NSP8 (PP1a 3943-4140) is also thought to support viral genome replication. It features a highly conserved (ConSurf = 7.3) 'tail' segment (PP1a 3943-4041), predominantly helical with some disordered residues and no CATH matches, followed by a less conserved (ConSurf = 5.7) 'head' domain (PP1a 4042-4140), with a beta barrel fold (CATH 2.40.10.290). NSP8 had 14 matching structures, all showing interactions with other viral proteins (Figures 2 & 5A). One structure showed binding to NSP12 only (P0DTC1/6nus from SARS-CoV, 97% identity, E = 10^-78), while the remaining 13 structures all showed binding to NSP7 (P0DTC1/3ub0). Six of these 13 structures also showed binding to NSP12, and two of these six showed binding to RNA as well.

NSP9 (PP1a 4141-4253) may bind single-stranded RNA and take part in genome replication (Miknis et al., 2009). NSP9 had 14 matching structures (P0DTC1/6wxd from SARS-CoV-2, 98% identity, E = 10^-44), all with beta barrel architecture (CATH 2.40.10.250) and mostly homodimers, thought to be the functional state (Miknis et al., 2009).

NSP10 (PP1a 4254-4392) is thought to act with NSP14 and NSP16 to cap and proofread RNA during genome replication (Bouvet et al., 2012). NSP10 had no CATH matches, yet had 33 matching structures, some showing interactions with other viral proteins (Figures 2 & 5B). In one matching structure, NSP10 was monomeric (P0DTC1/2fyg from SARS-CoV, 97% identity, E = 10^-57), while in two structures, NSP10 was a dodecamer (Figure 1; P0DTC1/2ga6 from SARS-CoV, 96% identity, E = 10^-72 and P0DTC1/2g9t from SARS-CoV, 96% identity, E = 10^-72), forming a hollow sphere with 12 C-terminal zinc finger motifs on the outer surface, and another 12 on the inner surface (Su et al., 2006). Four matching structures showed binding to NSP14 (P0DTC1/5c8u from SARS-CoV, 97% identity, E = 10^-68), while the remaining 26 structures showed binding to NSP16 (P0DTC1/6w61 from SARS-CoV-2, 100% identity).
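The PP1a residue ranges quoted in this section can be collected into a simple lookup that reports which mature protein contains a given polyprotein position. This is a minimal sketch with hypothetical names; it includes only ranges stated explicitly above, so positions in the one range not quoted here (between NSP6 and NSP8) return `None`.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# (first residue, last residue, protein), PP1a numbering as quoted in the text.
PP1A_PROTEINS: List[Tuple[int, int, str]] = [
    (1, 180, "NSP1"),
    (181, 818, "NSP2"),
    (819, 2763, "NSP3"),
    (2764, 3263, "NSP4"),
    (3264, 3569, "NSP5"),
    (3570, 3859, "NSP6"),
    (3943, 4140, "NSP8"),
    (4141, 4253, "NSP9"),
    (4254, 4392, "NSP10"),
]
_STARTS = [start for start, _, _ in PP1A_PROTEINS]


def protein_at(residue: int) -> Optional[str]:
    """Return the protein containing a PP1a residue, or None if not listed."""
    i = bisect_right(_STARTS, residue) - 1
    if i < 0:
        return None
    _, end, name = PP1A_PROTEINS[i]
    return name if residue <= end else None


# The NSP2 match (PP1a 104-562) starts inside NSP1, as noted in the text.
print(protein_at(104), protein_at(562))
```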

Polyprotein 1b (a.k.a. PP1b) is cleaved by 3CL-Pro into five proteins (NSP12-NSP16) that drive replication of viral RNA. These proteins were predicted to have no disordered regions, no transmembrane helices, very few dark regions, and to be highly conserved (ConSurf = 5.3-6.6), compared with PP1a proteins (ConSurf = 2.9-5.6).

NSP12 (PP1ab 4393-5324) is an RNA-directed RNA polymerase (a.k.a. RdRp) thought to drive viral genome replication (Yin et al., 2020). NSP12 was the second most conserved PP1ab protein (ConSurf = 6.5), and had 47 matching structures, many showing interactions with other viral proteins (Figures 2 & 5A). One structure showed binding to NSP8 (P0DTC1/6nus from SARS-CoV, 97% identity, E = 10^-77), while four structures showed binding to both NSP8 and NSP7 (P0DTD1/7bv1 from SARS-CoV-2, 99% identity, E = 10^-230); two of these four also showed binding to RNA (P0DTD1/6yyt from SARS-CoV-2, 100% identity). An additional 15 structures showed binding to RNA only (P0DTD1/3kmq from FMDV, 18% identity, E = 10^-11), while one structure showed binding to both RNA and DNA (P0DTD1/4k4v from poliovirus, 17% identity, E = 10^-11).

NSP13 (PP1ab 5325-5925) is a multi-functional helicase thought to unwind double-stranded RNA and DNA (Adedeji et al., 2012) and to be a core component of the viral replication complex (Subissi et al., 2014). The N-terminal half of NSP13 (PP1ab 5325-5577) had no matches in CATH, while the C-terminal half contained two Rossmann fold domains (CATH 3.40.50.300). NSP13 had 64 matching structures, of which 23 showed potential mimicry of four human helicase proteins (Figures 2 & 3B). Two of these structures showed matching between PIF1 (P0DTD1/6hpt, 20% identity, E = 10^-12) and the first Rossmann fold domain (PP1ab 5597-5733). A further 11 structures showed matching between AQR (P0DTD1/4pj3, 19% identity, E = 10^-30) and the second Rossmann fold domain plus part of the first (PP1ab 5689-5909); of these structures, 10 showed AQR bound to the spliceosome (P0DTD1/6id0, 20% identity, E = 10^-30). The remaining 10 structures matched to both Rossmann fold domains: eight of these structures showed mimicry of UPF1 (P0DTD1/2xzo, 24% identity, E = 10^-32), of which two structures also showed binding to UPF2 (P0DTD1/2wjy, 23% identity, E = 10^-34); finally, two structures showed mimicry of IGHMBP2 (P0DTD1/4b3f, 23% identity, E = 10^-35), of which one structure also showed binding with RNA (P0DTD1/4b3g, 25% identity, E = 10^-33). A further 41 structures showed matches to viral proteins (P0DTD1/6jyt from SARS-CoV, 100% identity), of which four structures also showed binding to DNA (P0DTD1/4n0o, 22% identity, E = 10^-19).
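The structure counts reported for NSP13 can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts quoted in the paragraph above; the variable and function names are hypothetical.

```python
from typing import Dict

# Structures matching human helicases, as quoted in the text.
NSP13_HUMAN_MATCHES: Dict[str, int] = {
    "PIF1": 2,
    "AQR": 11,
    "UPF1": 8,
    "IGHMBP2": 2,
}
NSP13_VIRAL_MATCHES = 41  # structures matching viral proteins


def tally(groups: Dict[str, int]) -> int:
    """Sum the per-group structure counts."""
    return sum(groups.values())


human_total = tally(NSP13_HUMAN_MATCHES)
grand_total = human_total + NSP13_VIRAL_MATCHES
print(human_total, grand_total)
```

The two sums agree with the 23 potential-mimicry structures and the 64 matching structures reported for NSP13.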

NSP14 (PP1ab 5926-6452) is a proofreading exoribonuclease (a.k.a. ExoN) thought to remove 3'-terminal nucleotides from RNA, thereby reducing mutations during viral genome replication (Minskaia et al., 2006). NSP14 had no matches in CATH, but had four matching structures (P0DTD1/5nfy, 100% identity), all in complex with NSP10 (Figures 2 & 5B).

NSP15 (PP1ab 6453-6798) is a uridylate-specific endoribonuclease (a.k.a. NendoU) thought to support viral genome replication (Ricagno et al., 2006). The N-terminal region of NSP15 had two domains (CATH 2.20.25.360 and CATH 3.40.50.11580), while the C-terminal region (PP1ab 6642-6798) had none. NSP15 had 19 matching structures (P0DTD1/6wxc, 99% identity), none showing matching to or interaction with human proteins, or with other viral proteins (Figure 2).

NSP16 (PP1ab 6799-7096) may methylate viral mRNA caps, following replication, which is thought to be important for evading host immune defenses (Bouvet et al., 2010). This was also the most conserved (ConSurf = 6.6) PP1ab protein. NSP16

The 3' end of the genome encodes 12 proteins, many involved in virion assembly. Remarkably, our analysis found no interactions between these proteins.

Spike glycoprotein (a.k.a. S protein) binds host receptors, thereby initiating membrane fusion and viral entry. This protein has four UniProt domains, of which one (S1-CTD; residues 334-527) binds to host receptors (Figure 4B). Aquaria found 136 matching structures (P0DTC2/6vxx from SARS-CoV-2, 99% identity) clustered in two regions (Figure 2). One was a heptad repeat region (HR2), with 15 matching structures (P0DTC2/2fxp from SARS-CoV, 98% identity, E = 10^-14), of which four structures showed binding to antibodies (P0DTC2/6pxh from MERS-CoV, 22% identity, E = 10^-11). The second region had 121 matching structures, of which 68 showed binding to antibodies (P0DTC2/6w41 from SARS-CoV-2, 100% identity) and two structures showed binding to inhibitory peptides (P0DTC2/5zvm from SARS-CoV, 88% identity, E = 10^-32; P0DTC2/5zvk from SARS-CoV, 55% identity, E = 10^-32). Of the remaining structures, 18 showed interaction with human proteins (Figures 2 & 4B): 15 showed binding to ACE2 (P0DTC2/6acg from SARS-CoV, 77% identity, E = 10^-321), of which one structure also showed binding to SLC6A19 (P0DTC2/6m17 from SARS-CoV-2, 100% identity); finally, three structures showed binding to DPP4 (P0DTC2/4qzv from BtCoV-HKU4, 21% identity, E = 10^-45).

ORF3a may act as a homotetramer, forming an ion channel in host cell membranes that helps with virion release (Lu et al., 2006). ORF3a had no matching structures.

The envelope protein (a.k.a. E protein) matched two structures, both from SARS-CoV. One was a monomer (P0DTC4/2mm4, 91% identity, E = 10^-26) while the other was a pentamer (P0DTC4/5x29, 86% identity, E = 10^-27), forming an ion channel thought to span the viral envelope (Surya et al., 2018; Vennema et al., 1996).

The matrix glycoprotein (a.k.a. M protein) is also part of the viral envelope (Vennema et al., 1996). The matrix glycoprotein had no matching structures.

ORF6 may block expression of interferon stimulated genes (e.g., ISG15) that have antiviral activity (Frieman et al., 2007). ORF6 had no matching structures.

ORF7a may interfere with the host cell surface protein BST2, preventing it from tethering virions (Taylor et al., 2015). ORF7a had three matching structures: one from SARS-CoV-2 (P0DTC7/6w37, 100% identity) and two from SARS-CoV (P0DTC7/1yo4, 90% identity, E = 10^-59; P0DTC7/1xak, 88% identity, E = 10^-59).

ORF7b is an integral membrane protein thought to localize to the Golgi compartment and the virion envelope (Schaecher et al., 2007). ORF7b had no matching structures.

ORF8 is thought to inhibit type 1 interferon signaling (J.-Y. ; it is also very different to proteins from other coronaviruses. ORF8 had no matching structures.

The nucleocapsid protein (a.k.a. N or NP protein) is thought to package the viral genome during virion assembly through interaction with the matrix glycoprotein, and also to become ADP-ribosylated (Grunewald et al., 2018). Depending on its phosphorylation state, this protein may also switch function, translocating to the nucleus and interacting with the host genome (Surjit et al., 2005). This protein had 35 matching structures clustered in two regions: the N-terminal region had 24 matching structures (P0DTC9/6yi3 from SARS-CoV-2, 100% identity), of which one was a tetramer, five were dimers, and the rest were monomers; the C-terminal region had 13 matching structures (P0DTC9/6wji from SARS-CoV-2, 98% identity), all dimers.

ORF9b is a lipid-binding protein thought to interact with mitochondrial proteins, thereby suppressing interferon-driven innate immunity responses (Shi et al., 2014). ORF9b matched two structures (P0DTD2/2cme from SARS-CoV, 78% identity, E = 10^-48) that showed binding to a lipid analog (Meier et al., 2006).

ORF9c (a.k.a. ORF14) is currently uncharacterized experimentally; it is predicted to have a single-pass transmembrane helix. ORF9c had no matching structures.

ORF10 is a predicted protein that currently has limited evidence of translation (Gordon et al., 2020), has no reported similarity to other coronavirus proteins, and has no matching structures.

The 872 models derived in this study capture essentially all structural states of SARS-CoV-2 proteins with supporting experimental evidence. We used these states to create a structural coverage map (Figure 2): a concise yet comprehensive visual summary of what is known about the 3D structure of the viral proteome. Remarkably, we found so few states showing viral self-assembly (Figure 5), mimicry (Figure 3), or hijacking (Figure 4) that, excluding non-human host proteins, all states could be included in the coverage map via several simple graphs. This may indicate that host interactions are rarely used in COVID-19 infection, consistent with the notion that viral activity is largely shielded from the host. However, other experimental techniques have found many more interactions between viral proteins (Pan et al., 2008) and with host proteins (Gordon et al., 2020). Thus, the small number of interactions found in this work likely indicates limitations in current structural data.

Using the current structural data (Figure 2), we can divide the 27 SARS-CoV-2 proteins into four categories: mimics, hijackers, teams, and suspects. Below, we highlight insights derived in this work for each of these categories.

We found structural evidence for mimicry of human proteins for only two SARS-CoV-2 proteins: NSP3 and NSP13 (Figures 2 & 3; Table S4).

NSP3 may mimic host proteins containing macro domains, interfering with ADP-ribose (ADPr) modification and thereby suppressing host innate immunity (Lei et al., 2018; Rack et al., 2016). We found nine potentially mimicked proteins (Figure 3A); the top ranked matches (MACROD2 and MACROD1) remove ADPr from proteins (Hottiger, 2015; O'Sullivan et al., 2019), reversing the effect of ADPr writers (e.g., PARP9, PARP14, and PARP15 in lymphoid tissues), and affecting ADPr readers (e.g., the core histone proteins MACROH2A1 and MACROH2A2, found in most cells). Thus we speculate that, in infected cells, ADPr erasure by NSP3 may influence epigenetic regulation of chromatin state (Schäfer and Baric, 2017), potentially contributing to the high variation in COVID-19 patient outcomes (Callaway et al., 2020). Furthermore, in infected macrophages, activation by PARP9 and PARP14 may be undermined by NSP3's erasure of ADPr, resulting in vascular disorders (Iwata et al., 2016), as seen in COVID-19 (Varga et al., 2020).

NSP13 (a.k.a. viral helicase) may mimic four human helicases, based on stronger alignment evidence than for mimicry by NSP3 (Figure 3). Three of these helicases are associated with DNA repair and recombination; thus, we speculate that mimicry of these proteins could either recruit host proteins to assist in viral recombination (Graham and Baric, 2010), or alternatively, could dysregulate recombination of host DNA. However, we found no evidence for mimicry of the ~100 other human helicases (Umate et al., 2011), many also involved in recombination, suggesting that mimicry may hijack more specific functions performed by the four helicases. The top ranked helicase was IGHMBP2 (a.k.a. SMBP2), which interacts with single-stranded DNA in the class switching region of the genome (Yu et al., 2011), close to IGHM (the gene coding the constant region of immunoglobulin heavy chains); we speculate that mimicry of IGHMBP2 may explain the dysregulation of immunoglobulin-class switching observed clinically (Bauer, 2020). The second ranked helicase was UPF1 (a.k.a. up-frameshift suppressor 1, RENT1, or regulator of nonsense transcripts 1), which acts in the cytoplasm as part of the nonsense-mediated mRNA decay pathway known to counteract coronavirus infection (Wada et al., 2018); we speculate that mimicry of UPF1 may hijack this pathway, thus impeding host defenses. UPF1 also acts in the nucleus, interacting with telomeres, as does the helicase PIF1; we speculate that mimicry of UPF1 or PIF1 may be implicated in the connection seen between COVID-19 severity, age, and telomere length (Aviv, 2020). In summary, we propose that (as mentioned above for nucleocapsid protein) NSP13 may sometimes switch from its key role in viral replication to a state focused on host genome interactions that undermine host immunity.

We found direct structural evidence for hijacking of human proteins for only two SARS-CoV-2 proteins: NSP3 and spike glycoprotein (Figures 2 & 4; Table S4).

NSP3 is believed to cleave ubiquitin from host proteins (Barretto et al., 2005), thereby suppressing innate immune response and disrupting proteasome-mediated degradation (Lei et al., 2018). We found six structures showing NSP3 PL-Pro bound to ubiquitin domains from UBB, UBC, or UBA52 (a.k.a. ubiquitin-60S ribosomal protein L40). Four additional structures showed binding to the ubiquitin-like domains of ISG15 (a.k.a. interferon stimulated gene 15), which is attached to most newly synthesized proteins; ISGylation does not induce degradation but likely disturbs virion assembly (Dzimianski et al., 2019; Held et al., 2020; Lei et al., 2018). Together, these 10 structures show in detail how NSP3 reverses either ubiquitination or ISGylation (Figure 4A); a further two structures show how NSP3 can be blocked with inhibitors (Figure 2).

Spike glycoprotein is known to bind receptors on the host cell membrane, thereby initiating membrane fusion and viral entry. We found 16 matching structures showing hijacking of ACE2, a carboxypeptidase that cleaves vasoactive peptides. One of these structures also shows binding to SLC6A19 (a.k.a. B⁰AT1), an amino acid transporter reported to act with ACE2 (Camargo et al., 2009). Most structures show only short fragments of spike; however, by consolidating several structures and sequence features, we created an integrated view showing the cellular context of the pre-fusion event, when spike first binds the ACE2 + SLC6A19 complex (Figure 4B). Three further matching structures showed interactions between spike (from MERS-CoV and BtCoV-HKU4) and host membrane receptor DPP4 (a.k.a. CD26), an aminopeptidase that SARS-CoV-2 may use (Y. to enter host immune cells (Radzikowska et al., 2020). If so, DPP4 and ACE2 may interact with spike at distinct but overlapping binding sites (Figure 4B), making this region a potential therapeutic target.
Once inside immune cells, the virus may switch to a non-reproductive state aimed at host cell disruption, as reported for MERS and SARS (Chu et al., 2016; Gu et al., 2005), thereby contributing to lymphopenia in severe COVID-19. We speculate that this cell-dependent state switching may be driven by NSP3 modifications of either ADP ribosylation, ubiquitination, or ISGylation, and may involve the state switches noted above for NSP13 and nucleocapsid protein.

We found structural evidence for interaction between only six SARS-CoV-2 proteins (Figures 2 & 5); they divided into two distinct teams, each with three proteins.

Team 1 comprised NSP7, NSP8, and NSP12, all members of the viral RNA replication complex (Figure 5A). NSP12 alone can replicate RNA, as can NSP7 + NSP8 acting together. However, replication is greatly stimulated by cooperative interactions between these proteins. Most combinations of these proteins occur in the 60 matching structures found (Figure 5A), although models for NSP12 (alone or with RNA) all derived from very remote viruses (e.g., enterovirus). The matrix of combinations (Figure 5A) gives insight into how the complex may assemble, and shows that several proteins that may be involved in genome replication (Subissi et al., 2014) are missing (NSP3, NSP9, NSP10, NSP13, NSP14, and NSP16). These outcomes again demonstrate the value of modelling all available structural states, compared with strategies that select only one or a few structural states per protein.

Team 2 comprised NSP10, NSP14, and NSP16 (Figure 5B). All matching structures found for NSP14 and NSP16 showed binding with NSP10, consistent with the belief that NSP10 is required for NSP16 RNA-cap methyltransferase activity (Decroly et al., 2011), and also for NSP14 methyltransferase and exoribonuclease activities (Ma et al., 2015). In addition, NSP10 was found to form a homododecamer; however, these three states of NSP10 share a common binding region (Figure 5B; Bouvet et al., 2014). This suggests NSP10, NSP14, and NSP16 interact competitively, in contrast to the cooperative interactions seen in team 1. We speculate that NSP10 is produced at higher abundance than NSP14 or NSP16, or may be rate limiting for viral replication.

Finally, it is noteworthy that no interactions were found between the 12 virion or accessory proteins (Figure 1, bottom third) . This, again, highlights limitations in currently available structural data.

This leaves 18 of the 27 viral proteins in a final category we call suspects: these are proteins thought to play key roles in infection, but having no structural evidence of interaction with RNA, DNA, or other proteins (viral or human). We divided the suspects into two groups, based on matching structures.

Group 1 proteins were those with at least one matching structure: NSP1, NSP2, NSP4, NSP9, NSP15, E, ORF7a, and ORF9a. Some of these have been well studied (e.g., NSP5 had 256 matching structures). Yet none of these proteins had significant similarity to any experimentally determined 3D structure involving human proteins, or to any structure showing interactions between viral proteins, based on the methods used in this work.

Group 2 proteins were those with no matching structures: NSP6, ORF3a, matrix glycoprotein (a.k.a. M protein), ORF6, ORF8, ORF9c, and ORF10. As noted previously, NSP2 could, arguably, be included in this group. These are structurally dark proteins, meaning not only that their structure is unknown, but also that they have no significant sequence similarity to any experimentally determined 3D structure, based on the methods used in this work. These proteins are ripe candidates for advanced modelling strategies, e.g., using predicted residue contacts combined with deep learning (e.g., Senior et al. 2020).

We have assembled a wealth of structural data, not available elsewhere, about the viral proteome. We used these data to construct a structural coverage map summarizing what is known, and not known, about the 3D structure of SARS-CoV-2 proteins. The coverage map also summarizes structural evidence for viral protein interaction teams and for mimicry and hijacking of host proteins, and identifies suspect viral proteins, with no evidence of interactions with other macromolecules. Additionally, the map helps researchers find 3D models of interest, which can then be mapped with a large set of sequence features to provide insights into protein function. The resulting Aquaria-COVID resource (https://aquaria.ws/covid) aims to fulfil a vital role during the COVID-19 pandemic, helping scientists use emerging structural data to understand the molecular mechanisms underlying coronavirus infection. Based on insights gained using our resource, we propose mechanisms by which the virus may enter immune cells (via spike hijacking of DPP4), sense the cell type (via posttranslational modifications by NSP3), then switch focus from viral reproduction to disrupting host immune responses (via hijacking by NSP13 and nucleocapsid protein). Our resource lets researchers easily explore and assess the evidence for these mechanisms, or for others they may propose themselves, thereby helping direct future research.

This study was based on the 14 protein sequences provided in UniProtKB/Swiss-Prot version 2020_03 (released April 22, 2020; https://www.uniprot.org/statistics/) as comprising the SARS-CoV-2 proteome. Swiss-Prot provides polyproteins 1a and 1ab (a.k.a. PP1a and PP1ab) as two separate entries, both identical for the first 4401 residues; PP1a then has four additional residues ('GFAV') not in PP1ab, which has 2695 additional residues not in PP1a. Swiss-Prot also indicates residue positions at which the polyproteins become cleaved in the cell, resulting in 16 protein fragments, named NSP1 through NSP16. The NSP11 fragment comprises the last 13 residues of PP1a (4393-4405). The first nine residues of NSP12 are identical to the first nine of NSP11, but the rest of that 919-residue protein continues with a different sequence due to a functionally important frameshift between ORF1a and ORF1b (Nakagawa et al., 2016). Thus, following cleavage, the proteome comprises a final total of 27 separate proteins.
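The residue arithmetic in this paragraph can be checked with a short calculation. The sketch below uses only numbers stated in this paper; the variable names are illustrative, and the final tally assumes the counts given elsewhere in the text (15 proteins derived from PP1ab, plus 12 virion or accessory proteins).

```python
# Residue counts taken from the text; variable names are illustrative.
SHARED_PREFIX = 4401   # residues identical in PP1a and PP1ab
PP1A_EXTRA = 4         # 'GFAV', present only in PP1a
PP1AB_EXTRA = 2695     # residues present only in PP1ab

pp1a_length = SHARED_PREFIX + PP1A_EXTRA
pp1ab_length = SHARED_PREFIX + PP1AB_EXTRA

# NSP11 is described as PP1a residues 4393-4405 (inclusive numbering).
nsp11_start, nsp11_end = 4393, 4405
nsp11_length = nsp11_end - nsp11_start + 1

# Assumption: counts stated elsewhere in the paper
# (15 PP1ab-derived proteins, 12 virion or accessory proteins).
total_proteins = 15 + 12

print(pp1a_length)     # 4405
print(pp1ab_length)    # 7096
print(nsp11_length)    # 13
print(total_proteins)  # 27
```

The NSP11 end position (4405) coincides with the computed PP1a length, consistent with NSP11 being the last fragment of PP1a.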

The 14 SARS-CoV-2 sequences were then systematically compared with sequences derived from all known 3D structures from all organisms, based on PDB released on May 30, 2020. These comparisons used the latest version of HHblits (Steinegger et al., 2019), an alignment method employing iterative comparisons of hidden Markov models (HMMs), together with the processing pipeline defined previously, accepting all sequence-to-structure alignments with a significance threshold of E ≤ 10⁻¹⁰. The result is a database of protein sequence-to-structure homologies called PSSH2, which is used in the Aquaria web resource.
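As a minimal sketch of the acceptance step described above, the following Python snippet keeps only alignments at or below the stated E-value threshold. The `Alignment` record, its field names, and the example hits are hypothetical placeholders; they do not reflect the actual PSSH2 schema or data.

```python
from dataclasses import dataclass

E_VALUE_THRESHOLD = 1e-10  # significance threshold stated in the text


@dataclass
class Alignment:
    """Hypothetical record for one sequence-to-structure alignment."""
    query: str      # query sequence accession
    structure: str  # placeholder label for the matched structure
    e_value: float  # HHblits E-value for the alignment


def accept_alignments(alignments, threshold=E_VALUE_THRESHOLD):
    """Return the alignments whose E-value is at or below the threshold."""
    return [a for a in alignments if a.e_value <= threshold]


# Made-up example hits, for illustration only.
hits = [
    Alignment("P0DTC2", "hit_a", 1e-50),
    Alignment("P0DTC2", "hit_b", 1e-10),
    Alignment("P0DTC2", "hit_c", 1e-5),
]
accepted = accept_alignments(hits)
print([a.structure for a in accepted])  # ['hit_a', 'hit_b']
```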

HHblits has been substantially updated (Mirdita et al., 2017) since we last assessed the specificity and sensitivity of PSSH2; therefore, we repeated our assessment of resulting alignments. We used CATH (Dawson et al., 2017) to replace the discontinued COPS database (Suhrer et al., 2009) used previously. Our test data set comprised 23,028 sequences from the CATH nr40 data set. We built individual sequence profiles against UniClust30 and used these profiles to search against "PDB_full", a database of HMMs for all PDB sequences. We then evaluated how many false positives were retrieved at an E-value lower than 10⁻¹⁰, where a false positive was taken to be a structure with a different CATH code at the level of Homologous superfamily (H) or Topology (T). We compared the ratio of false positives received with HH-suite3 and UniClust30 with a similar analysis for data produced in 2017 with HH-suite2 and UniProt20, and found that in both cases the false positive rate was 2.5% at the homology level (H) and 1.9% at the topology level (T). The recovery rate, i.e., the ratio of proteins from the CATH nr40 data (with less than 40% sequence identity) found by our method that have the same CATH code, was slightly higher with HH-suite3 (20.8% vs. 19.4%).
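The false positive definition above can be illustrated with a short sketch. It assumes CATH codes written as four dot-separated fields (Class.Architecture.Topology.Homologous superfamily), so that the T level compares the first three fields and the H level compares all four. The text does not specify whether rates were computed per hit or per query; this sketch counts per hit, and the function names and example pairs are illustrative only.

```python
LEVEL_DEPTH = {"T": 3, "H": 4}  # number of CATH fields compared at each level


def same_at_level(code_a, code_b, level):
    """True if two CATH codes agree down to the given level ('T' or 'H')."""
    depth = LEVEL_DEPTH[level]
    return code_a.split(".")[:depth] == code_b.split(".")[:depth]


def false_positive_rate(pairs, level):
    """Fraction of (query_code, hit_code) pairs that differ at the level."""
    if not pairs:
        return 0.0
    false_positives = sum(
        1 for query, hit in pairs if not same_at_level(query, hit, level)
    )
    return false_positives / len(pairs)


# Illustrative pairs only; not results from the assessment in the text.
pairs = [
    ("3.40.50.300", "3.40.50.300"),  # same superfamily
    ("3.40.50.300", "3.40.50.720"),  # same topology, different superfamily
    ("1.10.8.10", "2.60.40.10"),     # different at every level
    ("1.10.8.10", "1.10.8.10"),      # same superfamily
]
print(false_positive_rate(pairs, "H"))  # 0.5
print(false_positive_rate(pairs, "T"))  # 0.25
```

Because a mismatch at the T level implies a mismatch at the H level, the T-level rate can never exceed the H-level rate, matching the ordering of the 1.9% and 2.5% figures reported above.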

For each sequence-to-structure alignment, the Aquaria interface gives a pairwise sequence identity score, thus providing an intuitive indication of how closely related the given region of SARS-CoV-2 is to the sequence of the matched structure. However, to more accurately assess the quality of the match, Aquaria also gives an E-value, calculated by comparing two HMMs, one generated for each of these two sequences.

To facilitate analysis of SARS-CoV-2 sequences, we enhanced the Aquaria resource to include PredictProtein features (Yachdav et al., 2014), thus providing a very rich set of predicted features for all Swiss-Prot protein sequences. In Aquaria, the PredictProtein feature collection is fetched directly by the browser via:

https://api.predictprotein.org/v1/results/molart/:uniprot_id

The PredictProtein feature sets used in the analysis presented in this work are described below.

Conservation. This feature set is generated by ConSurf (Ashkenazy et al., 2010; Celniker et al., 2013) and gives, for each residue, a score between 1 and 9, corresponding to very low and very high conservation, respectively. These scores estimate the evolutionary rate in protein families, based on evolutionary relatedness between the query protein and its homologues from UniProt using empirical Bayesian methods (Mayrose et al., 2004). The strength of these methods is that they rely on the phylogeny of the sequences and thus can accurately distinguish between conservation due to short evolutionary time and conservation resulting from importance for maintaining protein foldability and function.

Disordered Regions. This feature set gives consensus predictions generated by Meta-Disorder (Schlessinger and Rost, 2005), which combines outputs of several structure-based disorder predictors (Schlessinger et al., 2009) to classify each residue as either disordered or not disordered.

Relative B-values. This feature set predicts, for each residue, normalized B-factor values (Masmaliyeva and Murshudov, 2019) expected to be observed in an X-ray-derived structure. The predictions were generated by PROFbval (Schlessinger et al., 2006), which gives, for each residue, a score between 0 and 100; residues with a score of 50 are predicted to have average flexibility, while those with a score ≥ 71 are considered to be highly flexible.

Topology. This feature set is generated by TMSEG (Bernhofer et al., 2016), a machine learning model that uses evolutionary-derived information to predict regions of a protein that traverse membranes, as well as the subcellular locations of the complementary (non-transmembrane) regions.
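To make the Relative B-values thresholds described above concrete, the sketch below flags highly flexible residues from a list of per-residue scores. Only the 0 to 100 range and the ≥ 71 cut-off come from the text; the function names and example scores are illustrative and are not part of PROFbval or Aquaria.

```python
HIGHLY_FLEXIBLE_CUTOFF = 71  # scores >= 71 are considered highly flexible
AVERAGE_FLEXIBILITY = 50     # score predicted to indicate average flexibility


def is_highly_flexible(score):
    """Apply the >= 71 cut-off to one per-residue score in the range 0-100."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range 0-100: {score}")
    return score >= HIGHLY_FLEXIBLE_CUTOFF


def highly_flexible_positions(scores):
    """Return 1-based residue positions whose score meets the cut-off."""
    return [
        position
        for position, score in enumerate(scores, start=1)
        if is_highly_flexible(score)
    ]


# Made-up scores for a five-residue stretch, for illustration only.
example_scores = [50, 71, 30, 95, 70]
print(highly_flexible_positions(example_scores))  # [2, 4]
```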

We further enhanced Aquaria to include SNAP2 features, which provide details on the mutational propensities for each residue position (Hecht et al., 2015). In Aquaria, the SNAP2 feature collection for each protein sequence is fetched directly by the browser via:

https://rostlab.org/services/aquaria/snap4aquaria/json.php?uniprotAcc=:uniprot_id

Two SNAP2 feature sets were used in this work:

Mutational Sensitivity. For each residue position, this feature set provides 20 scores indicating the predicted functional consequences of the position being occupied by each of the 20 standard amino acids. Large, positive scores (up to 100) indicate substitutions likely to have deleterious changes, while negative scores (down to -100) indicate no likely functional change. From these 20 values a single summary score is calculated based on the total fraction of substitutions predicted to have a deleterious effect on function, taken to be those with a score > 40. The summary scores are used to generate a red to blue color map, indicating residues with highest to least functional importance, respectively.

Mutational Score. This feature set is based on the same 20 scores above, but calculates the single summary score for each residue as the average of the individual scores for each of the 20 standard amino acids.
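The two summary scores described above can be sketched as follows. The cut-off of 40 and the use of 20 per-residue scores come from the text; the function names and the example score vector are illustrative, and this is not the SNAP2 or Aquaria implementation.

```python
DELETERIOUS_CUTOFF = 40  # substitutions scoring above this are taken as deleterious
N_AMINO_ACIDS = 20       # one score per standard amino acid


def _check(scores):
    if len(scores) != N_AMINO_ACIDS:
        raise ValueError(f"expected {N_AMINO_ACIDS} scores, got {len(scores)}")


def mutational_sensitivity(scores):
    """Fraction of the 20 substitutions predicted to be deleterious (score > 40)."""
    _check(scores)
    return sum(1 for s in scores if s > DELETERIOUS_CUTOFF) / N_AMINO_ACIDS


def mutational_score(scores):
    """Average of the 20 individual substitution scores."""
    _check(scores)
    return sum(scores) / N_AMINO_ACIDS


# Made-up scores for one residue position: 5 deleterious, 15 neutral.
example = [50] * 5 + [-20] * 15
print(mutational_sensitivity(example))  # 0.25
print(mutational_score(example))        # -2.5
```

The example shows why the two summaries can diverge: a position with a few strongly deleterious substitutions may have a noticeable sensitivity while its average score remains negative.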

UniProt features are curated annotations, and therefore largely complement the automatically generated PredictProtein features. In Aquaria, for each protein sequence, the UniProt feature collection is fetched directly by the browser via:

https://www.uniprot.org/uniprot/:uniprot_id.xml
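The URL patterns listed in this section share a `:uniprot_id` placeholder. As a minimal sketch, the snippet below substitutes a UniProt accession into each pattern without making any network request. The dictionary keys and the helper name are illustrative, and the endpoints are reproduced as given in the text; they may have changed since this work was carried out.

```python
# URL patterns as given in the text; ':uniprot_id' is the placeholder.
FEATURE_URL_TEMPLATES = {
    "PredictProtein": "https://api.predictprotein.org/v1/results/molart/:uniprot_id",
    "SNAP2": "https://rostlab.org/services/aquaria/snap4aquaria/json.php?uniprotAcc=:uniprot_id",
    "UniProt": "https://www.uniprot.org/uniprot/:uniprot_id.xml",
}


def feature_url(source, uniprot_id):
    """Build the feature URL for one source by filling in the accession."""
    return FEATURE_URL_TEMPLATES[source].replace(":uniprot_id", uniprot_id)


# P0DTC2 is the spike glycoprotein accession used elsewhere in this paper.
for source in FEATURE_URL_TEMPLATES:
    print(source, feature_url(source, "P0DTC2"))
```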

For this work, we further enhanced Aquaria to include CATH domain annotations (Dawson et al., 2017) . For most protein sequences, Aquaria fetches these annotations directly from the browser via APIs given at:

For SARS-CoV-2 proteins, however, CATH annotations are not yet fully available via the above APIs. In this work, we used a pre-release version of these annotations, derived by scanning the UniProt pre-release sequences against the CATH-Gene3D v4.3 FunFams HMM library (Dawson et al., 2017; Lewis et al., 2018) using HMMsearch with inclusion threshold cut-offs (Mistry et al., 2013). Domain assignments were obtained using cath-resolve-hits and curated manually (Lewis et al., 2019). For the SARS-CoV-2 sequences, these data are fetched directly from the browser via:

https://aquaria.ws/covid19cath/P0DTC2

Two CATH feature sets were used in this work:

Superfamilies. These identify regions of protein sequences across a wide variety of organisms that are expected to have very similar 3D structures and to have general biological functions in common.

Functional Families. Also known as FunFams, these domains partition each superfamily into subsets expected to have more specific biological functions in common (Dawson et al., 2017; Lewis et al., 2018).

When examining a specific superfamily or functional family domain in Aquaria, the browser uses additional CATH API endpoints (see link above) to create compact, interactive data visualizations that give access to detailed information on the biological function and phylogenetic distribution of proteins containing this domain.

In the results, we include average scores for some of the above feature sets, calculated using the APIs defined above. The average scores given for NSP3 regions were calculated using PP1a features. The whole-protein average scores given for each of the 15 PP1ab proteins were calculated using PP1ab features.

We determined the set of residues comprising intermolecular contacts by selecting all residues for one protein, then using the 'Neighbours' function in jolecule. This highlights all residues in which any atom is within 5 Å of the selected protein.
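The distance criterion described above can be sketched in a few lines. This is not jolecule's implementation; it is a simple all-pairs illustration that assumes atoms are given as (chain, residue number, coordinates) tuples, treats "within 5 Å" as an inclusive cut-off, and uses made-up coordinates.

```python
import math

CONTACT_CUTOFF = 5.0  # Angstrom; distance criterion stated in the text


def neighbour_residues(atoms, selected_chain, cutoff=CONTACT_CUTOFF):
    """Return residues outside the selected chain that have any atom
    within `cutoff` of any atom in the selected chain.

    `atoms` is a list of (chain, residue_number, (x, y, z)) tuples;
    this layout is an assumption made for illustration.
    """
    selected = [xyz for chain, _, xyz in atoms if chain == selected_chain]
    contacts = set()
    for chain, residue_number, xyz in atoms:
        if chain == selected_chain:
            continue
        if any(math.dist(xyz, other) <= cutoff for other in selected):
            contacts.add((chain, residue_number))
    return contacts


# Made-up coordinates: chain A is the selected protein.
atoms = [
    ("A", 1, (0.0, 0.0, 0.0)),
    ("B", 10, (3.0, 0.0, 0.0)),   # 3 A away: contact
    ("B", 11, (10.0, 0.0, 0.0)),  # 10 A away: no contact
    ("B", 12, (8.0, 0.0, 0.0)),   # first atom of residue 12: too far
    ("B", 12, (4.0, 0.0, 0.0)),   # second atom of residue 12: contact
]
print(sorted(neighbour_residues(atoms, "A")))  # [('B', 10), ('B', 12)]
```

For full-size structures, a spatial index would be preferable to this all-pairs loop; the simple version is used here only to keep the criterion explicit.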

We created a web page featuring a matrix of structures that summarizes the total number of matches found for each viral protein sequence, and allows navigation to the corresponding Aquaria page for each protein.

We created an additional web page with a layout derived from the organization of the viral genome. For each region of the viral proteome with matching structures, we selected a single representative structure (Figure 2), primarily based on identity to the SARS-CoV-2 sequence. However, in some cases, representatives were chosen that best illustrated the consensus biological assembly seen across all related matching structures or showed the simplest assembly. Under the name of each viral protein, the total number of matching structures found in PSSH2 is indicated. Below each structure, a tree graph is drawn to indicate cases of mimicry (where the viral sequence aligns onto human proteins), hijacking (where the matching structures show binding between viral protein and human proteins, DNA, RNA, antibodies, or inhibitory factors), or interaction with other viral proteins. When these tree graphs are missing, none of the matching structures meet these criteria.

The ruler (top) indicates sequence residue numbering. Dark sequence regions indicate either alignment gaps, or regions that do not match any experimentally-determined 3D structure. Regions with matching structure are indicated in green, and with a representative 3D structure position below, with conserved residues colored by secondary structure, with amino acid substitutions in dark gray, and with insertions in light gray. This coloring makes it clear that the NSP1 model is similar in sequence to SARS-CoV-2, while the NSP2 model is remote. However, a better measure of model significance is the HHblits E-value, calculated by aligning sequence profiles for SARS-CoV-2 onto profiles for each 3D structure. The NSP2 model is more significant as it has a longer alignment.

Made using Aquaria and edited with Keynote.

Figure 2. SARS-CoV-2 structure coverage map. Integrated visual summary of 3D structural evidence about viral protein states. Proteins are shown as arrows, scaled by sequence length, ordered by genomic location, and divided into three groups: (1) polyprotein 1a proteins (top); (2) polyprotein 1b proteins (middle); and (3) virion and accessory proteins (bottom). Dark coloring indicates sequence regions that do not match any experimentally-determined 3D structure. Regions with matching structure are indicated in green, and with representative 3D structures positioned below (Figure 1), all at about the same scale. NSP3 macro domain and NSP13 may mimic the human proteins shown as orange nodes connected by solid lines (Figure 3). NSP3 PL-Pro domain and spike glycoprotein may hijack the human proteins shown as gray nodes connected by dotted lines (Figure 4). Interactions between viral proteins are shown using green colored nodes connected by dotted lines; these interactions occur in two disjoint teams: (1) NSP7, NSP8, and NSP12; and (2) NSP10, NSP14, and NSP16. For the remaining 18 viral proteins, there was no structural evidence for interactions with other viral proteins, or for mimicry or hijacking of human proteins. Seven of the 18 are structurally dark proteins: NSP6, ORF3a, matrix glycoprotein, ORF6, ORF8, ORF9c, and ORF10.

Made using Aquaria and edited with Keynote.

Figure 3. Viral mimicry of human proteins. (A) Lists domain topology for nine human proteins potentially mimicked by NSP3 macro domain (ranked by alignment significance) along with functions that may be affected by mimicry. Each macro domain is numbered to indicate its CATH functional family. The top ranked proteins (MACROD2 and MACROD1) remove ADPr from proteins, reversing the effect of ADPr writers (PARP14, PARP15, and PARP9), and affecting ADPr readers (GDAP2, MACROH2A2, and MACROH2A1).

(B) Lists four human helicase proteins potentially mimicked by NSP13 (ranked by alignment significance) along with functions that may be affected by mimicry. Based on alignment significance, there is stronger evidence for mimicry by NSP13 than by NSP3. For the two top ranked proteins, we show 3D structures of the human proteins, colored to indicate alignment (Figure 1)

region) has three domains (yellow, green, and blue) and two binding sites for ubiquitin-like (Ubl) domains (gray). In most matching structures, a single Ubl domain occupies the primary binding site (dark gray), with its C-terminal LRLRGG motif (red) positioned between the PL-Pro thumb and palm domains and ending at the proteolytic cleavage site (red). The LRLRGG motif can attach to other proteins, including other copies of ubiquitin. PL-Pro cleaves this attachment, reversing mono- or poly-ubiquitination. One matching structure showed binding of two ubiquitin domains (UBB and UBC), with UBB occupying the secondary site (light gray). Other structures showed binding to the N- and C-terminal Ubl domains of ISG15 (interferon stimulated gene 15) at the secondary and primary sites, respectively.

(B) The spike trimer structure (top left; P0DTC2/6acg) shows the pre-fusion state after cleavage at the S1/S2 boundary, and with one S1-CTD domain (green) bound to the extracellular region of the carboxypeptidase ACE2 (purple). Below that, we have manually aligned a second structure (P0DTC2/6m17) showing S1-CTD in complex with ACE2 and the amino acid transporter SLC6A19. By integrating sequence features (top and bottom) with 3D structures, the resulting figure shows the cellular context of the pre-fusion event, when spike first binds the ACE2 + SLC6A19 complex. The insert highlights S1-CTD residues that contact ACE2 (magenta). These are compared to residues that contact the N-terminal dipeptide peptidase DPP4 in a structure from BtCoV-HKU4 (P0DTC2/4qzv). DPP4 may also be targeted by SARS-CoV-2 (Y. ; if so, our analysis shows that the S1-CTD + DPP4 binding site may partly overlap the S1-CTD + ACE2 binding site.

Made using features and structures from Aquaria, and edited in Photoshop and Keynote.

(A) Summarizes all structural states observed for NSP7 (red), NSP8 (blue), and NSP12 (yellow). NSP12 alone (top row, left) can replicate RNA (top row, right); however, all matching structures were from distantly related viruses (pairwise identity ≤ 18%), such as EV71 (enterovirus). NSP8 binds NSP12 at two sites: (1) at the NSP12 core (2nd row, left); and (2) , 2012) . Six of the matching structures showed NSP7 + NSP8 as a dimer; however, one structure showed a tetramer and one other showed a hexadecamer ( P0DTD1/3ub0 and P0DTD1/2ahm , respectively).

(B) Summarizes all structural states observed for NSP10, NSP14, and NSP16 (Team 2). One matching structure showed an NSP10 monomer (top), while two structures showed NSP10 as a dodecamer (bottom, left). All structures matching NSP14 showed a heterodimer with NSP10 (bottom, middle). All structures matching NSP16 showed a heterodimer with NSP10 (bottom, right). Nine NSP10 residues (shown in red on the monomer) were common intermolecular contacts in all three oligomers. Each oligomer also had specific NSP10 contacts (blue); but most NSP10 contacts were shared with at least one other oligomer (shown in red on each oligomer). This suggests that NSP10, NSP14, and NSP16 interact competitively.

Made using Aquaria and edited with Keynote. 

Mechanism of Nucleic Acid Unwinding by SARS-CoV Helicase

Novel β-Barrel Fold in the Nuclear Magnetic Resonance Structure of the Replicase Nonstructural Protein 1 from the Severe Acute Respiratory Syndrome Coronavirus

Untangling Membrane Rearrangement in the Nidovirales

ConSurf 2010: calculating evolutionary conservation in sequence and structure of proteins and nucleic acids

Telomeres and COVID-19

Crystal Structure of the Middle East Respiratory Syndrome Coronavirus (MERS-CoV) Papain-like Protease Bound to Ubiquitin Facilitates Targeted Disruption of Deubiquitinating Activity to Demonstrate Its Role in Innate Immune Suppression

The papain-like protease of severe acute respiratory syndrome coronavirus has deubiquitinating activity

The variability of the serological response to SARS-corona virus-2: Potential resolution of ambiguity through determination of avidity (functional affinity)

Recognition of Lys48-Linked Di-ubiquitin and Deubiquitinating Activities of the SARS Coronavirus Papain-like Protease

The Protein Data Bank

TMSEG: Novel prediction of transmembrane helices

In Vitro Reconstitution of SARS-Coronavirus mRNA Cap Methylation

RNA 3'-end mismatch excision by the severe acute respiratory syndrome coronavirus nonstructural protein nsp10/nsp14 exoribonuclease complex

Coronavirus Nsp10, a Critical Co-factor for Activation of Multiple Replicative Enzymes

Six months of coronavirus: the mysteries scientists are still racing to solve

Tissue-Specific Amino Acid Transporter Partners ACE2 and Collectrin Differentially Interact With Hartnup Mutations

ConSurf: Using Evolutionary Data to Raise Testable Hypotheses about Protein Function

Molecular Mechanisms for the RNA-Dependent ATPase Activity of Upf1 and Its Regulation by Upf2

Nuclear Magnetic Resonance Structure Shows that the Severe Acute Respiratory Syndrome Coronavirus-Unique Domain Contains a Macrodomain Fold

Identification of Macrodomain Proteins as Novel O-Acetyl-ADP-ribose Deacetylases

X-ray Structural and Functional Studies of the Three Tandemly Linked Domains of Non-structural Protein 3 (nsp3) from Murine Hepatitis Virus Reveal Conserved Functions

Middle East Respiratory Syndrome Coronavirus Efficiently Infects Human Primary T Lymphocytes and Activates the Extrinsic and Intrinsic Apoptosis Pathways

Unusual bipartite mode of interaction between the nonsense-mediated decay factors, UPF1 and UPF2

Severe Acute Respiratory Syndrome Coronavirus Nonstructural Protein 2 Interacts with a Host Protein Complex Involved in Mitochondrial Biogenesis and Intracellular Signaling

Structural Insights into the Interaction of Coronavirus Papain-Like Proteases and Interferon-Stimulated Gene Product 15 from Different Species

How viruses hijack cell regulation

CATH: an expanded resource to predict protein function through structure and sequence

The RNA helicase Aquarius exhibits structural adaptations mediating its recruitment to spliceosomes

Crystal Structure and Functional Analysis of the SARS-Coronavirus RNA Cap 2′-O-Methyltransferase nsp10/nsp16 Complex

Structural and functional analysis of the nucleotide and DNA binding activities of the human PIF1 helicase

Structural basis for the regulatory function of a complex zinc-binding domain in a replicative arterivirus helicase resembling a nonsense-mediated mRNA decay helicase

ISG15: It's Complicated

The evolutionary conundrum of pathogen mimicry

Structure of Foot-and-Mouth Disease Virus Mutant Polymerases with Reduced Sensitivity to Ribavirin

Structural and molecular basis of mismatch correction and ribavirin excision from coronavirus RNA

Recognition of Mono-ADP-Ribosylated ARTD10 Substrates by ARTD8

Severe Acute Respiratory Syndrome Coronavirus ORF6 Antagonizes STAT1 Function by Sequestering Nuclear Import Factors on the Rough Endoplasmic Reticulum/Golgi Membrane

Structures of Coxsackievirus, Rhinovirus, and Poliovirus Polymerase Elongation Complexes Solved by Engineering RNA Mediated Crystal Contacts

Recombination, Reservoirs, and the Modular Spike: Mechanisms of Coronavirus Cross-Species Transmission

The coronavirus nucleocapsid protein is ADP-ribosylated

Multiple organ infection and the pathogenesis of SARS

Solution Structure of the Severe Acute Respiratory Syndrome-Coronavirus Heptad Repeat 2 Domain in the Prefusion State

Solution structure of the X4 protein coded by the SARS related coronavirus reveals an immunoglobulin like fold and suggests a binding activity to integrin I domains

Better prediction of functional effects for sequence variants

Evaluating the Effectiveness of Color to Convey Alignment Quality in Macromolecular Structures. Presented at the Symposium on Big Data Visual Analytics

Evidence for an involvement of the ubiquitin-like modifier ISG15 in MHC class I antigen presentation

Modeling of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) Proteins by Machine Learning and Physics-Based Refinement

Structure of replicating SARS-CoV-2 polymerase


Table S1 | Matching structures for SARS-CoV-2 polyprotein 1a

NSP1: P0DTC1/2gdt, P0DTC1/2hsx
NSP3 (Ubl1): P0DTC1/2gri, P0DTC1/2idy
P0DTC1/2jzd, P0DTC1/2jze, P0DTC1/2kqv, P0DTC1/2jzf, P0DTC1/2rnk, P0DTC1/2w2g, P0DTC1/2wct
NSP3: P0DTC1/2k87
NSP4: P0DTC1/3vcb, P0DTC1/3vc8, P0DTC1/3gzf
NSP5

Table S2 | Matching structures for SARS-CoV-2 polyprotein 1b

NSP12

Table S3 | Matching structures for SARS-CoV-2 Virion and Accessory Proteins

Spike glycoprotein
ORF3a: No matching structures
Envelope protein: P0DTC4/2mm4
Matrix glycoprotein: No matching structures
ORF6: No matching structures
ORF7a: P0DTC7/6w37, P0DTC7/1xak, P0DTC7/1yo4
ORF8: No matching structures
ORF7b: No matching structures
N protein
ORF9b: P0DTD2/2cme, P0DTD2/2cme
ORF10: No matching structures
ORF14: No matching structures

Thanks to Tim Mercer and Giulia Wang (Garvan Institute, Australia), Phil Austin (University of Sydney, Australia) and Lucy van Dorp (UCL Genetics Institute, UK) for helpful feedback and discussions; to Ian Sillitoe (UCL, UK) for helpful advice regarding the CATH API; and to Michael Bernhofer and Maria Littmann (TUM) for advice regarding the PredictProtein server. We are very grateful to Max Ott (CSIRO, Australia) for detailed advice on improving the performance and reliability of the Aquaria web application, and to Tim Karl (TU Munich, Germany) for contributing towards this same goal, even during parental leave.