key: cord-0734023-wm3fohtz
authors: Sztain, Terra; Amaro, Rommie; McCammon, J. Andrew
title: Elucidation of Cryptic and Allosteric Pockets within the SARS-CoV-2 Main Protease
date: 2021-05-03
journal: J Chem Inf Model
DOI: 10.1021/acs.jcim.1c00140
sha: c793080df40c99e5859fe3f52de61190ea32ac68
doc_id: 734023
cord_uid: wm3fohtz

[Image: see text] The SARS-CoV-2 pandemic has rapidly spread across the globe, posing an urgent health concern. Many quests to computationally identify treatments against the virus rely on in silico small molecule docking to experimentally determined structures of viral proteins. One limit to these approaches is that protein dynamics are often unaccounted for, leading to overlooking transient, druggable conformational states. Using Gaussian accelerated molecular dynamics to enhance sampling of conformational space, we identified cryptic pockets within the SARS-CoV-2 main protease, including some within regions far from the active site. These simulations sampled comparable dynamics and pocket volumes to conventional brute force simulations carried out on two orders of magnitude greater timescales.

In December 2019, the World Health Organization learned of a novel coronavirus that has since rapidly spread, leading to a global pandemic. The SARS-CoV-2 virus, causing the disease named COVID-19, has exceeded 100 million cases worldwide, taking over 2 million lives as of January 30, 2021. 1 The genetic and structural similarity of the SARS-CoV-2 virus to the agents of the severe acute respiratory syndrome coronavirus (SARS-CoV) epidemic in 2003 and Middle East respiratory syndrome coronavirus (MERS-CoV) epidemic in 2012 provide a basis of information for understanding and ultimately treating or preventing this disease; however, no highly effective antiviral drug exists for any human-infecting coronavirus.

Proteases are responsible for activating viral proteins for particle assembly and have proved successful targets for antiviral agents; most notable are the protease inhibitors used to treat HIV and Hepatitis C. 2, 3 The main protease of SARS-CoV-2, called M pro or 3CL pro encoded by the nsp5 gene, 4 was the first SARS-CoV-2 protein deposited to the protein databank (PDB) on January 26th 2020. 5 This structure was crystalized with a covalent inhibitor (N3) identified from computer-aided drug design and validated biochemically. Currently, over one hundred M pro structures exist in the PDB and massive efforts to discover a successful inhibitor are underway. 5−8 M pro is highly similar to 3CL proteases from other coronaviruses in sequence, structure, and function. 9, 10 First, M pro autocatalytically cleaves itself from the SARS-CoV-2 polypeptide and then forms a homodimer to subsequently cleave at 11 distinct sites, while the papain-like protease, PL pro , cleaves at three sites, allowing the viral proteins to fold and perform their functions. 11−13 Proteolytic cleavage is catalyzed by residues H41 and C145 of the catalytic cysteine dyad. These active site residues are located at the N-terminal globular domain of M pro , which contains an anti-parallel beta barrel reminiscent of trypsin-like serine proteases, while the Cterminal domain consists of five alpha helices. 13 Upon dimerization, M pro resembles the shape of a heart symbol ( Figure 1A ). Many efforts to identify inhibitors involve high throughput virtual screening, which exploits the power of computational docking to screen millions of molecules in silico to narrow down a few "hits" for lead optimization. Incorporation of molecular dynamics (MD) has significantly improved the ability to identify promising protein inhibitors. 14 Docking to protein ensembles obtained from MD simulations is often employed to consider multiple target states that remain elusive in static crystal structures. 15 Conventional MD simulations are however limited in the amount of conformational space that can be sampled due to the amount of time required to traverse energy barriers between stable conformational states. 16 In the present study, we used the enhanced sampling technique, Gaussian accelerated MD (GaMD), 17 to overcome such barriers. GaMD adds a harmonic boost to the potential energy up to a threshold, effectively "filling in the wells" creating a smoother potential energy surface. 17, 18 This method allowed us to extensively sample conformations of the SARS-CoV-2 main protease at minimal computational expense and provide detailed characterization of several potentially druggable pockets, which can serve as a basis for identifying an M pro inhibitor for COVID-19 treatment.

GaMD simulations were performed on the first published M pro structure PDB 6LU7. 5 Six systems were simulated, including monomer and dimer simulation in the presence and absence of the co-crystalized N3 inhibitor either covalently or noncovalently bound ( Figure S1 ). Five independent simulations of 200 ns for an aggregate of 1 μs were carried out for each system. Three regions were defined to investigate potentially druggable pockets: the active site, which lies in the N-terminal top lobe of the heart-shaped protease; the C-terminal region, as a potential allosteric site; and the dimer interface region ( Figure 1A) .

Characterization of M pro Pockets. A variety of pocket volumes were sampled between 16 and 277 Å 3 for the active site pocket, starting from 170 Å 3 in the crystal structure. The interface spanned from 36 to 429 Å 3 starting from 209 Å 3 and the distal site from 0 to 78 Å 3 starting from 18 Å 3 ( Figure 1B − D). Simulations with the N3 ligand present sampled greater mean volumes for the active site and distal region than the apo simulations without N3; however, the average dimer interface region was greater in the apo simulation ( Figure S2 ).

Covalently bound N3 sampled greater mean volumes for the active site and dimer interface than non-covalent, which sampled greater mean volumes for the distal site ( Figure S3 ). In a few simulations, the non-covalent ligand escaped the active site pocket ( Figure S4 ). In all simulations, the ligand rearranges compared to the crystal structure. The non-covalent ligands quickly decrease in total number of native contacts and increase in non-native contacts. The covalent ligands slightly decrease in native contacts but have a similar increase in nonnative contacts. (Figures S5 and S6 ). These simulations revealed significant loop dynamics ( Several metrics were used to assess the properties of these pockets. The PockDrug 19 webserver predicts druggability probability based on a variety of factors including geometry, hydrophobicity, and aromaticity, among others. The FTMap 20 computational solvent mapping webserver helps determine "hot spots" and can aid in fragment-based drug design. For each site, the sampled pocket volumes were clustered into 10 populations, and a frame corresponding to that volume was analyzed. The most populated volumes for the active site, dimer interface, and distal site were 114, 220, and 10 Å 3 , respectively. The 10 active site pockets gave predicted druggabilities between 0.18 and 0.86 and solvents mapped to seven of the pockets ( Figure S8 ). The dimer interface pockets gave higher druggability prediction values, between 0.5 and 0.93, with solvent mapping to all 10 structures ( Figure S9 ). The distal site pockets gave the highest predicted druggability, between 0.37 and 1.00 with six out of ten pockets giving predicted druggability values greater than 0.90. Solvents mapped to seven out of the ten pockets ( Figure S10) . Notably, none of the solvent mapping to any of the sites showed overlapping results, indicating that each structure may provide distinct opportunities for designing inhibitors. The pocket volume, predicted druggability, and solvent mapping were not directly correlated, as expected, since each calculation considers different metrics. Overall, these analyses indicate that the simulations reveal a diverse set of conformations that are likely to serve as viable targets for in silico drug development.

Dimer association is necessary for catalytic activity of the protease; 5 therefore, it is reasonable to assume that binding of a molecule, which prevents this association, would inactivate Journal of Chemical Information and Modeling pubs.acs.org/jcim Article the protease. The significance of the distal regions is less well understood, though computational investigations support a potential allosteric role. 21, 22 To study this further, we examined the correlated motions of each residue throughout the simulations. In the monomer simulation, the active site region was positively correlated to the distal site region and negatively correlated to the region that connects the two, whereas the inverse was observed in the dimer simulation ( Figure 3 and Figures S11 and S12). These correlations resulted in primary motions, detected by principal component analysis, comprising hinge-like in the monomer versus twisting in the dimer ( Figure  S13 ). These motions were similar in both the apo and N3 bound simulations. In the dimer simulations, the active site of chain A was slightly correlated to the active site of chain B, and the distal regions were highly anticorrelated, twisting in opposite directions. (Figure S13 ). Regardless of positive or negative correlation values, the distal and active sites show a dynamic interdependence on each other in both the apo and dimer forms. GaMD Enhancement of M pro Conformational Space. Previous benchmarks have demonstrated that GaMD can enhance sampling by multiple orders of magnitude. 23 To test the sampling enhancement compared to conventional MD simulations, we compared the 1 μs aggregate GaMD simulation from five replicates of 200 ns to the 10 μs simulations carried out at Riken 24 and 100 μs simulations D.E. Shaw Research (DESRES). 25 We additionally performed and compared those to 1 μs of GaMD continuous simulation and compared 200 ns cMD to 200 ns GaMD. The ranges of pocket volumes for each site were highly similar, as detailed in Figure   S14 . Even the short conventional simulations sampled comparable pocket volumes, indicating that this metric may not be the most effective measure of conformational sampling. Principal component analysis (PCA) is frequently used as a metric for conformational diversity as it decreases the dimensionality to maximize the dataset variance. 26, 27 PCA was indeed more consistent with expectations where increasing simulation time corresponded to larger regions of space covered by plotting the first two principal components ( Figure  4) .

While the GaMD simulations run continuously for 1 μs only covered 54% of the conformational space covered by the 100 μs DESRES, the 1 μs aggregate GaMD from five simulations run in parallel covered 80% of the DESRES space. The 10 μs Riken simulations covered 94% of that space (Figure 4 ). It is worth noting that each of the simulations with the most extensive sampling (DESRES, Riken, GaMD 5 × 200 ns) reached areas in the PC1-PC2 subspace that neither of the other two reached. This just confirms that none of these simulations achieve comprehensive sampling, which, in the ultimate case, would include extensively unfolded conformations. Each of these simulations reveals cryptic binding sites not seen in shorter, conventional MD simulations, let alone in the experimental structures, and some sites are seen in one but not the other simulations.

Though perhaps not intuitive at first glance, it is reasonable that many simulations in parallel would have a greater chance of sampling diverse space than one long simulation, which may settle into a favorable low energy conformation, even given the boost added by the GaMD, which is restricted by a threshold. 17 Representative structures from dense populations of the PCA plots were further analyzed structurally and through computational solvent mapping ( Figure S15 ). Additionally, RMSD and RMSF comparison revealed that the majority of the conformational deviations occur within the helices and loops lining the active site pocket (Figures S16,S17, and S21−S23), which was also visualized in the representative structures from populations determined from PCA ( Figure  S15 ). Taken together, the range of pocket volumes, structural conformations, and dynamics were similar between short GaMD simulations and brute force conventional simulations. The plethora of available structures from all of these studies is expected to be useful for in silico inhibitor development for targeting M pro .

GaMD simulations of the SARS-CoV-2 main protease, M pro , revealed cryptic pockets not detectable from the crystal structure. In addition to characterizing the vast conformational landscape of the active site and dimer interface, a distal region was explored as a potential allosteric site. GaMD sampled conformational space more efficiently than brute force simulations, with 1 μs simulation data providing comparable results to 10 μs simulations carried out by Riken 24 and 100 μs simulations carried out by DE Shaw Research, 25 offering a more widely accessible simulation strategy. These structures, along with the millisecond simulations from Folding@home, 28 can serve as a basis for in silico docking to identify M pro inhibitors for COVID-19 treatment.

■ METHODS Simulation Preparation. The first crystal structure of SARS-CoV-2 main protease, PDB ID 6LU7, 5 was used as a starting point for all simulations. Protonation states of titratable residues were determined using the H++ webserver 29 with 0.15 M salinity, 10 internal dielectric constant, 80 external dielectric constant, and pH 7.4. Histidines 64, 80, and 164 were thus named HID to indicate delta nitrogen protonation before protonation of entire protein using tleap from AmberTools 18. 30 Six systems were simulated as shown in Figure S1 . For apo simulations, the inhibitor was deleted from the PDB file. For dimer simulations, the PyMOL 31 symexp command was used to generate initial coordinates for the second chain based on symmetry. The ff14SB 32 forcefield was used for proteogenic residues. For simulations with the N3 bound, the antechamber package of AmberTools 18 was used to generate GAFF 33 forcefield parameters for N3. Gaussian 09 was used to calculate partial charges of atoms according to the RESP 34 method with the HF/6-31G* level of theory. For covalent simulations, the cysteine 145 was treated as an unnatural amino acid and capped with N-methyl and C-acetyl groups for charge calculations. A TIP3P isometric water box was added with at least 10 Å buffer between solute and edges of the box. Enough Na+ was added to neutralize the system, and then Na+ and Cl− ions were added to a final concentration of 150 mM. Minimization was carried out in two steps. First, the solute was restrained with a 500 kcal/mol restraint force to minimize the solvent for 10,000 cycles followed by unrestrained minimization for 300,000 cycles. Next, the system was heated to 310 K over 350 ps using an isothermal-isovolumetric (NVT) ensemble followed by isothermal-isobaric (NPT) equilibration for 1 ns.

Gaussian Accelerated Molecular Dynamics. Gaussian accelerated molecular dynamics (GaMD) 17 pmemd.cuda implementation of Amber 18 was used to generate five independent trajectories of 200 ns, an aggregate of 1 μs for each system. The dual boost method was employed, adding a biasing force to both the total and dihedral potential energy. The threshold energy was set to the upper bound. GaMD production of equilibrated systems was carried out in five stages. First, 200 ps of preparatory conventional MD simulation was carried out, without any statistics collected. Second, 1 ns of conventional MD was carried out to collect potential statistics V max , V min , V avg , and σV. Next, 800 ps of GaMD was carried out with a boost potential applied with fixed parameters. Then, 50 ns of GaMD was carried out with updated boost parameters, and finally, 150 ns of GaMD was carried out with fixed boost parameters. All production simulations were carried out using NVT, with periodic boundary conditions and 2 fs timesteps. The SHAKE 35 algorithm was used to restrain nonpolar hydrogen bonds and TIP3P water molecules. The Particle Mesh Ewald 36 method was used for electrostatic interactions with a 10 Å cutoff for nonbonded interactions. A Langevin thermostat was used for temperature regulation with a collision frequency of 5 ps −1 .

Pocket Analysis. For each system, the aggregated simulation was clustered based on the RMSD from the first frame using the cpptraj 37 hierarchical algorithm to obtain 100 distinct structures for pocket calculation. Three regions were defined for individual calculations: the active site pocket, distal site pocket, and dimer interface. Pocket volume was calculated using POVME 2 38 with a 1 Å grid spacing. The point inclusion sphere was determined based on the average center of mass of residues 7−198 with a 12 Å radius for the active site pocket, the average center of mass of residues 198−306 with a 10 Å radius for the distal site pocket, and the average center of mass between residues within 3.5 Å of the other dimer with a 10 Å radius for the dimer interface. The pocket volumes were chosen to be underestimated as opposed to overestimated and were calculated with a minimal radius to avoid extensively calculating known solvent exposed regions. As shown in the FTMap solvent screening (Figures S10 and S15), fragments bound to various regions around the distal site crevices. Increasing the radius of the pocket calculation space would incorporate solvent exposed regions from many directions that would not be useful and would involve other regions of the protein. Due to this volume dependence on chosen radius, these volumes are intended as an initial comparative metrics for identifying structures to analyze further, rather than concrete final values. For analysis of water occupancy, the GIST (Grid Inhomogeneous Solvation Theory Method) 39 function of cpptraj was used for the active site region, and the density of oxygen centers map was used to visualize results in VMD. 40 To quantify druggability, we have used the PockDrug 19 webserver, which predicts druggability probability based on a variety of factors including geometry, hydrophobicity, and aromaticity, among others. Additionally, we have used the FTMap 20 computational solvent mapping webserver to determine druggable "hot spots." Pocket volumes calculated by POVME 38 from each of the three regions were clustered into 10 populations, and representative frames for each were selected for analysis. The fpocket estimation method was used for PockDrug, and the druggability probability of the pocket Journal of Chemical Information and Modeling pubs.acs.org/jcim Article that aligned best to the POVME 38 calculation sphere was reported. Principal Component Analysis. Principal motions were calculated using cpptraj, as described previously, 41 from projection of displacement vectors of each of the backbone atoms onto a diagonalized mass-weighted covariance matrix after rms fitting of every atom except protons to the first frame. Residue correlation matrices were also generated this way. For the graphical abstract, reweighting of the potential energy was carried out on one 200 ns simulation of the one N3 noncovalently bound dimer using the PyReweighting toolkit. 42 Reweighting was carried out using the Maclaurin series expansion to the 10th order with a bin size of 6 and maximum energy of 100 based on the first two principal components. One conventional 200 ns MD simulation was performed for comparison and was outlined in the figure. For the conformational sampling comparison shown in Figure 4 , the dimer−apo system was used for consistency with Riken and DESRES simulations. The colors of the histograms are normalized so the same number of datapoints (frames) is considered. Therefore, 40,000 frames were used for all of the in house simulations, and 50,000 frames were used for the Riken and DESRES simulations. The color bars were set to maximum values of 0.0008 and 0.0010, respectively. This results in densely localized (yellow color) in the short simulations versus less density (blue color) and more spread out in the longer and enhanced simulations. Visualization of the first normal mode was carried out using the Normal Mode Wizard plugin of VMD.

In accordance with the community principles around open sharing of COVID19 simulation data, 43 all simulation input files and data are available at https://covid19.molssi.org through the NSF MolSSI COVID19 Molecular Structure and Therapeutics Hub.

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.1c00140.

Values for RMSD of 275 of M pro crystal structures compared to PDB 6LU7 (XLSX) Active site pocket dynamics sampled with GaMD (MP4) Supplementary Figures S1−S26; legend for Supplementary Movie S1, and legend for Supplementary Data S1 (PDF)

COVID-19 Map. Johns Hopkins Coronavirus Resource Center

Protease Inhibitors as Antiviral Agents

Approved Antiviral Drugs over the Past 50

Coronaviridae Study Group of the International Committee on Taxonomy of Viruses. The Species Severe Acute Respiratory Syndrome-Related Coronavirus : Classifying 2019-NCoV and Naming It SARS-CoV-2

Structure of M pro from SARS-CoV-2 and Discovery of Its Inhibitors

High Throughput Virtual Screening Reveals SARS-CoV-2 Multi-Target Binding Natural Compounds to Lead Instant Therapy for COVID-19 Treatment

Potential Inhibitors for Novel Coronavirus Protease Identified by Virtual Screening of 606 Million Compounds

Structure-Based Design of Antiviral Drug Candidates Targeting the SARS-CoV-2 Main Protease

Structure of Coronavirus Main Proteinase Reveals Combination of a Chymotrypsin Fold with an Extra α-Helical Domain

Structure of Main Protease from Human Coronavirus NL63: Insights for Wide Spectrum Anti-Coronavirus Drug Design

Genome Structure, Replication, and Pathogenesis

Analysis of Therapeutic Targets for SARS-CoV-2 and Discovery of Potential Drugs by Computational Methods

The SARS-CoV-2 Main Protease as Drug Target

Developing a Dynamic Pharmacophore Model for HIV-1 Integrase

Ensemble Docking in Drug Discovery

Reaching Biological Timescales with All-Atom Molecular Dynamics Simulations

Accelerated Molecular Dynamics: Unconstrained Enhanced Sampling and Free Energy Calculation

Gaussian Accelerated Molecular Dynamics in NAMD

PockDrug: A Model for Predicting Pocket Druggability That Overcomes Pocket Estimation Uncertainties

The FTMap Family of Web Servers for Determining and Characterizing Ligand Binding Hot Spots of Proteins

Computational Analysis of Dynamic Allostery and Control in the SARS-CoV-2 Main Protease

Drug Repurposing for Candidate SARS-CoV-2 Main Protease Inhibitors by a Novel In Silico Method

Chapter Six -Gaussian Accelerated Molecular Dynamics: Theory, Implementation, and Applications

COVID-19 related trajectory data of 10 microseconds all atom molecular dynamics simulation of SARS-CoV-2 dimeric main protease

Research Molecular Dynamics Simulations Related to SARS-CoV-2. D. E. Shaw Research Technical Data

Essential Dynamics of Proteins

High-Resolution Mining of the SARS-CoV-2 Main Protease Conformational Space: Supercomputer-Driven Unsupervised Adaptive Sampling

SARS-CoV-2 Simulations Go Exascale to Capture Spike Opening and Reveal Cryptic Pockets Across the Proteome

H++: A Server for Estimating P K a s and Adding Missing Hydrogens to Macromolecules

The Amber Biomolecular Simulation Programs

The PyMOL Molecular Graphics System

Ff14SB: Improving the Accuracy of Protein Side Chain and Backbone Parameters from Ff99SB

Development and Testing of a General Amber Force Field

A Well-Behaved Electrostatic Potential Based Method Using Charge Restraints for Deriving Atomic Charges: The RESP Model

Numerical Integration of the Cartesian Equations of Motion of a System with Constraints: Molecular Dynamics of n-Alkanes

A Smooth Particle Mesh Ewald Method

Software for Processing and Analysis of Molecular Dynamics Trajectory Data

POVME 2.0: An Enhanced Tool for Determining Pocket Shape and Volume Characteristics

Solvation Thermodynamic Mapping of Molecular Surfaces in AmberTools: GIST

VMD: Visual Molecular Dynamics

All-Atom Simulations Disentangle the Functional Dynamics Underlying Gene Maturation in the Intron Lariat Spliceosome

Improved Reweighting of Accelerated Molecular Dynamics Simulations for Free Energy Calculation

A Community Letter Regarding Sharing Biomolecular Simulation Data for COVID-19

The authors declare no competing financial interest.

We thank Prof. Michael D. Burkart for advisement, and the NSF GRFP for supporting T.S. under grant number DGE-1650112. Simulations were carried out using GPU nodes of the Triton Shared Computing Cluster (TSCC) at the San Diego Supercomputer Center (SDSC). This work was supported by NIH GM031749.