key: cord-302146-51hof7it authors: Cross, Thomas J.; Takahashi, Gemma R.; Diessner, Elizabeth M.; Crosby, Marquise G.; Farahmand, Vesta; Zhuang, Shannon; Butts, Carter T.; Martin, Rachel W. title: Sequence characterization and molecular modeling of clinically relevant variants of the SARS-CoV-2 main protease date: 2020-05-15 journal: bioRxiv DOI: 10.1101/2020.05.15.097493 sha: doc_id: 302146 cord_uid: 51hof7it The SARS-CoV-2 main protease (Mpro) is essential to viral replication and cleaves highly specific substrate sequences, making it an obvious target for inhibitor design. However, as for any virus, SARS-CoV-2 is subject to constant selection pressure, with new Mpro mutations arising over time. Identification and structural characterization of Mpro variants is thus critical for robust inhibitor design. Here we report sequence analysis, structure predictions, and molecular modeling for seventy-nine Mpro variants, constituting all clinically observed mutations in this protein as of April 29, 2020. Residue substitution is widely distributed, with some tendency toward larger and more hydrophobic residues. Modeling and protein structure network analysis suggest differences in cohesion and active site flexibility, revealing patterns in viral evolution that have relevance for drug discovery. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) emerged in late 2019 (1) and rapidly spread worldwide, causing an ongoing pandemic. Although the sequence of its RNA genome is highly similar to that of SARS-CoV-1, SARS-CoV-2 is believed to have arisen independently from a bat coronavirus (2) , to which it shares 96% similarity (3) . The emerging SARS-CoV-2 subsequently gained a modified spike protein due to recombination in an intermediate host, the pangolin (4, 5) , followed by purifying selection for binding to the human ACE2 protein (6) . No therapeutic agents able to reduce SARS-CoV-2 mortality in clinical settings are yet known, although extensive efforts are underway to discover new drugs or repurpose existing ones to inhibit key viral proteins. Here we focus on the main protease (M pro ), which plays a critical role in viral replication. Like other betacoronaviruses, SARS-CoV-2 is a positive-sense RNA virus that expresses all of its proteins as a single polypeptide chain, which is cleaved by M pro to yield the mature proteins (7) . Inhibiting this key enzyme would prevent viral replication, reducing viral load and thus symptom intensity. A similar approach was instrumental in making HIV a manageable disease (8) (9) (10) . However, the proteins in question differ markedly, rendering HIV protease inhibitors ineffective against SARS-CoV-2; indeed, a standard HIV protease inhibitor combination did not prove effective against COVID-19 in a recent clinical trial (11) . Specifically, HIV protease is an aspartic protease (and functional only as a dimer, as the active site comprises one residue from each monomer), whereas M pro is a 3CL cysteine protease that is likewise most active in the dimeric state, although each monomer has its own catalytic dyad (12) . The 3CL cysteine proteases are characterized by a chymotrypsin-like fold and a cysteine-histidine catalytic dyad 2 in the active site, implying both different structures and distinct chemical mechanisms. While the general strategy of seeking protease inhibitors is hence viable for both SARS-CoV-2 and HIV, drug development for the former depends on characterizing this novel enzyme. Molecular modeling is an important tool for guiding inhibitor discovery, making it possible to evaluate large numbers of candidate drugs in silico to select experimental targets; however, standard approaches screen against only one version of the protein, typically the reference or wild-type (WT) sequence. In a host population, mutations accumulate with each viral passage, generating a mutational landscape rather than a single protein. The design of robust inhibitors that can protect against the multiple strains encountered in clinical settings requires characterization of this sequence space and the populations of conformations it engenders. Furthermore, effective and rapid response to future emerging coronavirus diseases requires both in silico screening and experimental testing of antiviral agents and a validated library of relatively general inhibitors that can be used as a basis for the development of specialized therapeutics. Central to the success of that effort will be developing an understanding of structural and functional variation in SARS-CoV-2 proteins, particularly as mutations accumulate and new strains emerge. Here we characterize all 79 known variants of M pro as of 29 April, 2020, and analyze trends in amino acid substitutions and the resulting structural changes using network analysis and molecular modeling. To our knowledge this is the first detailed analysis of clinically relevant mutations in M pro . Our analysis shows a trend toward substitution for larger and more hydrophobic residues versus the WT protein. Analysis of active site networks (ASN) from M pro variants suggests differences in active site flexibility and cohesion that may serve to guide the design of robust, mutation-resistant inhibitors. Mutations in M pro are geographically distributed and occur throughout the protein From the GISAID (https://www.gisaid.org/) (13) EpiCoV database (through 29 Apr, 2020), 78 unique non-synonymous mutations to M pro were found in addition to the WT sequence, including 73 single point variants and 5 double variants. For genome sequences containing these M pro variants, full genome alignments were performed using MUSCLE (14) , and neighbor-joining trees were generated using MEGA X (15) . Overall, the variation in SARS-CoV-2 sequences observed so far is relatively low, with mutation hotspots not evenly distributed throughout the genome, but localized to specific sequence regions (16) . Because M pro is critical for viral replication, mutations that have a large deleterious effect on virus replication are unlikely to be observed in clinical isolates; all M pro variants investigated here are therefore assumed to be enzymatically competent. In general, codon usage and amino acid frequency in viruses of eukaryotes are essentially identical to those of their eukaryotic hosts, reflecting the viruses' use of the host translation machinery (17) . M pro sequences found in sequences isolated from human hosts will therefore likely reflect bias toward human codon usage, somewhat limiting the scope of the observed mutation space. The known mutations in M pro are summarized in Figure 1 . The tree was generated based on overall genome similarity; however, only sequences containing at least one mutation in M pro were included in the analysis, along with the WT human sequence and two non-human reference sequences. The accession numbers and geographical sources are listed in Supplementary Table S2 . The solid arcs around the outside of the diagram indicate M pro mutations; color coding corresponds to the geographical source. Several mutations appear to have arisen more than once in the virus's evolutionary history so far. Notably, K90R variants appear in multiple distantly related subtrees; five of these unique evolutionary events can be verified in Nextstrain's SARS-CoV-2 phylogenetic tree (18) . Further, L89F, P108S, and N274D arise at least twice in both trees. These phylogenetic comparisons appear to support a multiple event hypothesis, but are subject to errors resulting from the sparsity of testing. The repeated occurrence of the same mutation in seemingly unrelated subtrees may be due to missing data that would show their evolutionary connectedness. The average branch length of Figure 1 , which shows only topology, is 1.432161x10 −4 base substitutions per site (including those from the bat (3) and pangolin (19) ); 32.2% of the 1028 branches have, to ten significant figures, 0 base substitutions per site. For a genome of roughly 30,000 base pairs, this amounts to an average of only 4 substitutions per branch. All of these unique mutants therefore effectively belong to the same strain, making them difficult to place in an evolutionary context. For more diverged mutants, unfortunately-placed ambiguous nucleotides (20) could push them from one subtree to another. With the exception of five double variants, a majority of the sequences in Figure 1 arise from single point mutations. Whether and how M pro mutations have affected viral fitness is not yet known, but at least three mutants have remained in the population long enough to accumulate another mutation: L220F to A191V/L220F, G15S to G15S/D48E and G15S/V35L, and K90R to V77A/K90R. It is worth noting that although a single variant A191V exists, the A191V/L220F double variant likely stemmed from an L220F ancestor due to its shared lineage with L220F single variants. A fifth double variant, A193T/R279C, was found but did not stem from any single mutation in our dataset; its origins remain unclear. While a mutation's prevalence and evolution in a population may be interpreted as a sign of stable viral function, the opposite does not necessarily indicate reduced virulence. Testing rates, social behavior, and time of first infection in each region are all factors that contribute to the spread of the disease and the availability of sequencing data. For instance, a large number of K90R mutants were collected in Iceland, where the number of tests per 1,000 people is 5 nearly twice as many as the next leading country's and more than seven times as many as the United States' (Iceland: 141.75, USA: 18.21, as of 29 April, 2020) (21) . Consequently, further investigation is needed to determine whether M pro mutations affect viral fitness on a global scale. As such, without greater divergence and more sequences, it is difficult to tell if the presence of an M pro mutation in unrelated subtrees is evidence of multiple evolutionary events, or an artifact of sparse testing. Because only sequences harboring M pro mutations were retained for analysis, certain geographical areas appear to be underrepresented. It is likely that the strains that had spread to underrepresented regions prior to our data collection simply did not have M pro mutations. Different regions tend to be dominated by different mutants, a feature that might be explained by the timing at which these mutations arose or arrived. For instance, 83 of the 100 M pro mutants from Iceland were K90R, and most stemmed from a single shared ancestor (Supplementary Table S2). Further, it is likely that heterogeneity in sequencing rates have resulted in a less-thancomplete dataset. As of April 29th, the only North American, South American, and African M pro mutants reported in the GISAID database that passed our filtering parameters were from Costa Rica and the USA, Argentina and Brazil, and the DRC respectively. This does not necessarily indicate a lack of M pro mutations in other subregions, and may instead reflect differences in sequencing rates. In the structural analyses that follow, we focus on the differences in protein properties relative to WT, of the clinically observed M pro variants. M pro mutations to date suggest selection for larger, more massive, and more hydrophobic residues to substitute for another ( Figure 2 ). The weights of the edges indicate the frequency of the mutation across known M pro variants, while node color reflects residue hydrophobicity on the scale of (23) (larger numbers indicate greater hydrophobicity.) The most obvious trend observed in the pattern of mutation so far is the preferential substitution of larger, more hydrophobic amino acids in place of smaller, less hydrophobic ones. Overall, the pattern is consistent with increased incidence of amino acids that are more likely to be present in folded domains, rather than in linker regions (24) . In particular, it is notable that alanine has very few incoming ties and a large number of outgoing ties, mostly to valine, which has a larger and more hydrophobic side chain. Alanine is at the same time one of the most common amino acids and one of those with the most variable prevalence in the human genome (25) . Similarly, observed ties to isoleucine are mostly incoming from smaller residues, and leucine, which is also large and hydrophobic, likewise has more incoming than outgoing ties overall, with the bulk of its outgoing ties going to phenylalanine. However, aromatic residues per se do not appear to be selected at a higher rate than can be explained by their hydrophobicity. Also notable is the selection away from the secondary structure-breakers proline and glycine, both of which have only outgoing ties, and the propensity for lysine to be replaced by arginine even though both side chains are positively charged. Arginine is both larger and capable of making more and stronger hydrogen bonds, as well as cation-π interactions not available to lysine, leading to its known overrepresentation in inter-domain and inter-monomer interfaces (26) (27) (28) (29) . The mean differences in sidechain properties for observed M pro mutations are summarized in Table 1 . As observed in the network representation ( Figure 2 ), mutated residues are, on average, larger and more hydrophobic than those they replace. Although substituted residues are on average larger and more massive, we do not see strong evidence favoring bulky over compact residues net of mass: residue bulk (measured as volume/mass) for substituted residues did not differ significantly from WT (mean difference=0.02Å 3 /Da, t = 1.87, p = 0.0650 codes: * p < 0.05, ** p < 0.01, *** p < 0.001 Table 1 : Mean differences in side chain properties for substituted residues, versus WT (N = 83; substitutions from double mutants considered separately). On average, substituted residues are significantly more hydrophobic, massive, and larger than those they replace (all p-values for two-tailed t-tests versus no difference). For WT and each M pro variant, a molecular model was constructed using MODELLER 9.23 (30) , based on the A chain monomer of the PDB structure 6Y2E (31), followed by annealing, correction of protonation states, and all-atom molecular dynamics simulation in explicit solvent (see Methods). Examples of representative models are shown in Supplementary Figure S1 , with the positions of all mutated residues shown mapped onto the WT structure in Supplementary Figure S2 . We do not observe gross differences in structure or dynamics across variants, as expected given that all variants were found in clinical isolates and are therefore necessarily functional; mutations leading to radically altered or misfolded structures would likely be strongly selected against. However, analysis of MD trajectories does suggest more subtle differences across variants, providing insight into function-preserving changes. To assess the overall degree to which local structure is conserved across M pro variants, we compute the cross-variant variance in average φ, ψ backbone torsion angles by residue. In order to control for overall flexibility, we normalize this by the estimated variance in torsion angles within each trajectory. For arbitrary angle α i at residue i, this leads to the local variation index where α ij is the vector of angles of type α i over the trajectory of variant j with corresponding angular mean α ij , α i is the vector of such means across variants, Var B is the "between variant" angular variance in mean angles, and Var W is the "within variant" angular variance in α ij . Intuitively, high values of v(α i ) indicate relatively large between-variant variation in α i relative to angular variation seen within the trajectories themselves. For v(φ i ) and v(ψ i ), such values correspond to systematic changes in local conformation associated with M pro mutations. By turns, low values of v(φ i ) and v(ψ i ) indicate residues whose local structure does not vary meaningfully across variants. It should be noted that such regions can be either flexible or rigid. whereas the latter is of obvious relevance to substrate processing and specificity. This motivates a more detailed examination of variation in the active site, to which we return below. were significantly above it; similarly, when directly compared to WT, 12 variants were observed to be significantly less constrained, while 17 were significantly more constrained (i.e., bootstrap t-scores less than -2 or greater than 2, respectively). 43 out of 78 variants (55%) showed nominally higher levels of mean constraint than WT (discounting significance), suggesting a lack of uniform selection pressure for active sites that are more or less constrained than wild type (the fraction greater does not differ significantly from random deviation, p = 0.16, exact binomial were used for molecular modeling. Initial conditions for the WT trajectory used here are based on the A monomer of PDB structure 6Y2E (39), representing a mature (i.e., cleaved pro-sequence) protein. Initial variant protein structures were predicted using MODELLER 9.23 (30) , using the 6Y2E structure as a template; three rounds of annealing and MD refinement were performed using the "slow" optimization level for each. Initial structures were then processed to correct protonation states to reflect their predicted cellular environment (with protonation states predicted using PROPKA 3.1 (40) ). Each corrected model structure was then minimized and equilibrated in explicit solvent; simulations were performed using NAMD (41) with the CHARMM36 forcefield (42) in TIP3P water (43) at 310 K under periodic boundary conditions (with a 10Å margin water box). Solvated protein models were energy-minimized for 10,000 iterations before being simulated for 0.5ns to adjust water box size, after which a 10ns trajectory was simulated with conformations being sampled every 20ps; an N pT ensemble was used, with temperature controlled via Langevin dynamics with a damping coefficient of 1/ps and Nosé-Hoover Langevin piston pressure control set to 1 atm (44, 45). • Molecular models of representative variants (S1); • Sites of known M pro mutations, mapped on the wild-type structure (S2); • Additional information on the fibril model reference measure (S3); • Local variation index values for M pro backbone torsion angle, by residue number (S4); • Additional detail on the PSNs used in this work (S5) Tables: • Mean cohesion scores (k-core number) and autocorrelation-corrected bootstrap standard errors by variant (S1); • Accession numbers, locations, and dates of collection of all M pro variants referred to in this work (S2) Additional files available for download: • Uncompressed version of the tree depicted in Figure 1 (muscle gisaid 20200429.txt); • Full acknowledgments for all sequences used in this work (514 acknowledgements.pdf) Cys 145 Figure S2 : The positions of each mutated residue are shown plotted on the wild-type protein (PDB ID: (31) . Panels A-C show different views of the M pro dimer (left) and monomer (right.) One chain of the dimer is shown in black; on this monomer, only the active site residues His 41 (magenta) and Cy 145 (yellow) are shown in space-filling representations. On the section monomer (gray) side-chains of the mutated residues are also shown, using the following color coding to indicate the properties of the substituted residue: light gray -polar to nonpolar; teal -polar to polar; sky blue -nonpolar to polar; light green -polar to nonpolar; and salmonmultiple mutations (i.e. two or more independent substitutions with different properties.) Table S2 : Accession numbers, locations, and dates of collection of variants as they appear in the uncompressed version of the tree depicted in Figure 1 , which is available for download as a .txt file. Variants in bold are shown as individual branches in Figure 1 . Those without accession numbers or dates represent subtrees that were compressed; their constituent variants reside underneath. A new coronavirus associated with human respiratory disease in China The proximal origin of SARS-CoV-2 A pneumonia outbreak associated with a new coronavirus of probable bat origin Viral metagenomics revealed sendai virus and coronavirus infection of Malayan pangolins (Manis javanica) Probable pangolin origin of SARS-CoV-2 associated with the COVID-19 outbreak Emergence of SARS-CoV-2 through recombination and strong purifying selection From SARS to MERS, thrusting coronaviruses into the spotlight HIV-1 protease inhibitors: A review for clinicians ABT-378, a highly potent inhibitor of the human immunodeficiency virus protease Lopinavir/ritonavir in the treatment of HIV-1 infection: a review A trial of lopinavir-ritonavir in adults hospitalized with severe Covid-19 Only one protomer is active in the dimer of SARS 3C-like proteinase Data, disease and diplomacy: GISAID's innovative contribution to global health MUSCLE: multiple sequence alignment with high accuracy and high throughput MEGA X: Molecular Evolutionary Genetics Analysis across Computing Platforms The establishment of reference sequence for SARS-CoV-2 and variation analysis Trend of amino acid composition of proteins of different taxa Nextstrain: real-time tracking of pathogen evolution Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984 Coronavirus Disease (COVID-19) A new coronavirus associated with human respiratory disease in China A simple method for displaying the hydropathic character of a protein Proteome-wide comparison between the amino acid composition of domains and linkers Comprehensive analysis of amino acid and nucleotide composition in eukaryotic genomes, comparing genes and pseudogenes Hydrogen bonding motifs of protein side chains: descriptions of binding of arginine and amide groups Cation-π interactions in structural biology Cation-π interactions in protein-protein interfaces The molecular origin of like-charge arginine-arginine pairing in water Comparative protein structure modeling using Modeller αketoamides as broad-spectrum inhibitors of coronavirus and enterovirus replication: Structure-based design, synthesis, and activity assessment A chemical group graph representation for efficient highthroughput analysis of atomistic protein simulations Structure prediction and network analysis of chitinases from the Cape sundew, Drosera capensis Network analysis provides insight into active site flexibility in esterase/lipases from the carnivorous plant Drosera capensis Centrum voor Wiskunde en Informatica Amsterdam The neighbor-joining method: a new method for reconstructing phylogenetic trees Confidence limits on phylogenies: An approach using the bootstrap Crystal structure of SARS-CoV-2 main protease provides a basis for design of improved α-ketoamide inhibitors PROPKA3: Consistent treatment of internal and surface residues in empirical pKa predictions Scalable molecular dynamics with NAMD Optimization of the additive CHARMM all-atom protein force field targeting improved sampling of the backbone φ, ψ and side-chain χ(1) and χ(2) dihedral angles Comparison of simple potential functions for simulating liquid water Constant pressure molecular dynamics algorithms Constant pressure molecular dynamics simulation: The Langevin piston method statnet: Software tools for the representation, visualization, analysis and simulation of network data network: a package for managing relational data in R Social network analysis with sna Rpdb: Read, Write, Visualize and Manipulate PDB Files Bio3d: An r package for the comparative analysis of protein structures R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing A cartography of the van der Waals territory Network structure and minimum degree Social network analysis: methods and applications An Introduction to the Bootstrap