key: cord-298242-iuskpoug authors: Yu, Alvin; Pak, Alexander J.; He, Peng; Monje-Galvan, Viviana; Casalino, Lorenzo; Gaieb, Zied; Dommer, Abigail C.; Amaro, Rommie E.; Voth, Gregory A. title: A Multiscale Coarse-grained Model of the SARS-CoV-2 Virion date: 2020-10-02 journal: bioRxiv DOI: 10.1101/2020.10.02.323915 sha: doc_id: 298242 cord_uid: iuskpoug The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the causative agent of the COVID-19 pandemic. Computer simulations of complete viral particles can provide theoretical insights into large-scale viral processes including assembly, budding, egress, entry, and fusion. Detailed atomistic simulations, however, are constrained to shorter timescales and require billion-atom simulations for these processes. Here, we report the current status and on-going development of a largely “bottom-up” coarse-grained (CG) model of the SARS-CoV-2 virion. Structural data from a combination of cryo-electron microscopy (cryo-EM), x-ray crystallography, and computational predictions were used to build molecular models of structural SARS-CoV-2 proteins, which were then assembled into a complete virion model. We describe how CG molecular interactions can be derived from all-atom simulations, how viral behavior difficult to capture in atomistic simulations can be incorporated into the CG models, and how the CG models can be iteratively improved as new data becomes publicly available. Our initial CG model and the detailed methods presented are intended to serve as a resource for researchers working on COVID-19 who are interested in performing multiscale simulations of the SARS-CoV-2 virion. Significance Statement This study reports the construction of a molecular model for the SARS-CoV-2 virion and details our multiscale approach towards model refinement. The resulting model and methods can be applied to and enable the simulation of SARS-CoV-2 virions. The onset of the global COVID-19 pandemic has brought intense investigation into the molecular components of SARS-CoV-2 encoded by the virus's 30 kilobase (kb) genome. Structural biology efforts using cryo-electron microscopy (cryo-EM) and x-ray crystallographic techniques are currently reporting new structures of viral proteins every week (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) , and computational structure prediction efforts are targeting unresolved sections of the genome using a variety of protein folding algorithms. Computational and experimental studies are underway to find new molecular therapeutics that can inhibit viral activity or further elucidate the mechanisms of action of SARS-CoV-2 proteins (13) (14) (15) (16) . The computer simulation of large-scale SARS-CoV-2 processes, such as virion assembly, budding, entry, and fusion, will remain intrinsically challenging to investigate using all-atom (AA) molecular dynamics (MD), owing to the computational cost of meaningfully simulating the hundred of millions to billions of atoms involved. A holistic model of the SARS-CoV-2 virion can provide insight into the mechanisms of large-scale viral processes and the collective behavior of macromolecules involved in viral replication and infectivity. SARS-CoV-2 virions contain four main structural proteins: the spike (S), membrane (M), nucleocapsid (N), and envelope (E) proteins (17) . S proteins are glycosylated trimers that mediate fusion and entry, in part by attaching enclosed fusion peptide sequences into the membranes of host cells (18) . M proteins appear as dimeric complexes embedded within the virion envelope, and are believed to anchor ribonucleoprotein complexes to the envelope (19, 20) . N proteins associate with and organize RNA into ribonucleoprotein structures found in the interior of virions (21, 22) . Lastly, E proteins are thought to form pentameric ion channels that are found at the lipid bilayers of virion membranes and contribute to viral budding (23) . In this paper, we construct a largely "bottom-up" coarse-grained (CG) model of the SARS-CoV-2 virion from the currently available structural and atomistic simulation data on SARS-CoV-2 proteins. In general, this model serves as a resource for researchers working on COVID- 19 , and as a platform to incorporate computational and experimental data. This model also enables the multiscale study of SARS-CoV-2 processes for the development of treatment and prevention strategies against COVID-19. Atomistic trajectory and experimental structural data deposited in the NSF Molecular Sciences Software Institute (MolSSI) will be incorporated, as they become publicly available (24) . In this work, we detail several of our CG methods used to iteratively develop a CG model for the full SARS-CoV-2 virion, in which molecular interactions between CG particles are derived using a combination of phenomenological, experimental, and atomistic simulation approaches. Building models from structural data We first constructed atomic models of the structural proteins of the SARS-CoV-2 virion, ( Figure 1 ). AA models of the open and closed state of the S protein were built based on the cryo-EM structures of the spike ectodomain (PDB ID: 6VYB, 6VXX) (5), respectively, and atomic models of the N protein were constructed based on the x-ray crystallographic structure of the nucleocapsid NTD (PDB ID: 6M3M) (25) . Glycosylation sites were modeled using Glycan Reader & Modeler in CHARMM-GUI (26) and the site-specific glycoprofile derived from mass spectrometry and cryo-EM analysis (27, 28) . Homology models for the S protein stalk, including the HR2 and TM domains were assembled as α-helical trimeric bundles using MODELLER (29) on the basis of secondary structure assignments in JPred4 (30) . Homology models for the SARS-CoV-2 N protein CTD were created from the x-ray crystallographic structure of the SARS-CoV N protein CTD (PDB ID: 2CJR) (31) . Missing amino acid backbones in loop regions were built in MODELLER, and side chain rotamers were built using SCWRL4 (32) . For the M protein dimer and the pentameric E ion channel, we used atomic models derived in the recent Critical Assessment of Structure Prediction (CASP) Commons competition (33, 34) . AA protein models (see discussion below) were subsequently simulated and coarsegrained to generate the CG models ( Figure 2 and see sections below). A previously developed CG model for lipids was used, consisting of three CG beads per lipid and distinct bead types for lipid head groups and hydrophobic tails (35) . A single component CG lipid bilayer was generated in a spherical configuration and equilibrated using CG molecular dynamics simulations under constant NVT conditions in LAMMPS (36) . We note that in the future more complex CG lipid models (37) can be added. Transmembrane segments of component membrane proteins were visually identified and assigned based on secondary structure motifs. Individual lipids on the outer leaflet of the spherical bilayer were randomly selected and used as initial positions for embedding spike, membrane, and envelope proteins. For each initial position, the center-of-mass of the transmembrane domain was aligned with the center of the lipid bilayer, and the principal axis of the protein was aligned with the vector normal to the lipid bilayer. Transmembrane regions were then substituted for the overlapping CG lipids to embed the proteins. The procedure was iterated until a spike, membrane, and envelope protein density on the virion surface was achieved that was approximately consistent with current available experimental estimates of ~25, 1000, and 20 per virion, respectively, from cryo-EM and biochemical data (38) (39) (40) . The diameter of the membrane envelope is approximately 100 nm and 120 nm including the S proteins on the virion surface. As higher-resolution experimental data are released, the overall structure of this model can be refined. Two glycosylated models of the open and closed spike were inserted into a symmetric 225 Å × 225 Å lipid bilayer mimicking the composition of the endoplasmic reticulum-Golgi intermediate compartment (41, 42) . The lipid patch was built using CHARMM-GUI. The complete proteinmembrane system was solvated with using the TIP3P water model (43) , and neutralized with chloride and sodium ions to maintain a 150 mM concentration. Each system contained ~1.7 million atoms. Minimization and equilibration was performed using the CHARMM36 forcefield (44, 45 ) and NAMD 2.14 (46) . Production runs were performed in the NPT ensemble using a Langevin thermostat at 310K and Nosé-Hoover Langevin barostat at 1 atm. All production runs used a 2-fs timestep and the SHAKE algorithm. Multiple replicas of AA MD simulations of the open (3x) and closed (3x) systems were performed on NSF Frontera at the Texas Advanced Computing Center (TACC), achieving an aggregate sampling of 3.0 and 1.8 µs, respectively. The CG model of the glycosylated S protein (Figure 2 A) was parameterized from the AA MD simulations described below (47) . Reference statistics used conformations sampled equally from both open and closed states with AA trajectories spanning the 3.0 and 1.8 µs, respectively. First, the protein was mapped to CG beads using essential dynamics coarse-graining (EDCG) (48) . We used 60 and 50 CG beads for the S 1 and S 2 domains, respectively, and the 22 N-linked glycans were each mapped to a single bead. Intra-protein interactions were represented as a hetero- ns prior to a 4 µs production run on Anton2. All simulations were run in the constant NPT ensemble at 310K and 1 atm using the CHARMM36m forcefield. A CG model containing approximately 5 residues per CG bead was mapped from the reference statistics of the AA MD simulations using the EDCG ( Figure 2C ), and hENM approaches. A CG model for the E protein was developed by linearly mapping the amino acid sequence to particles at a resolution of 1 CG bead per 5 amino acids ( Figure 2B ). Several studies suggest that the C-terminal domain (CTD) of the N protein assembles into a helix that contains two RNA binding grooves (21, 51) . Based on these studies, we constructed atomic and CG models of the viral ribonucleoprotein complex (vRNP) by iterating between CG and AA simulations. We first constructed an atomic model of the N protein CTD helix with two RNA binding grooves by stacking 3 copies of the CTD octamer structure (PDB: 2CJR), which is composed of 4 CTD dimers and homology modeled from the X-ray crystallographic structure of the SARS-CoV N protein CTD (31) . The CTD helix was simulated in the CHARMM36m force field for 400 ns. We then constructed the CTD helix model using EDCG combined with hENM followed by manually placing CG RNA beads into the groove of the helix ( Figure 2D ). The positions of the CG beads were used as restraints to build an atomic model of the vRNP complex. Finally, the vRNP model was relaxed and simulated in the CHARMM36m force field for 400 ns. It is important to note that recent cryo-EM studies have found granule-like densities within the virion for the vRNP complex (22) . Structural detail into how CTD oligomers (including the previously proposed helical model) and RNA fit into these densities will likely require higher-resolution images. Several computational approaches have been developed to build or refine CG models using data from AA or fine-grained simulations. Our approach to coarse-graining the SARS-CoV-2 virion is to couple several CG methods in a hierarchical fashion. CG sites or "beads" are mapped from atomic structures using EDCG, a method designed to preserve the principal modes of motion sampled during atomic-level simulations (48) . In EDCG, a given CG mapping operator, where, Similarly, in the REM approach, the objective function for minimization is the Kullback-Liebler divergence, which provides a metric for the differences between the atomistic and CG probability in the canonical ensemble, and ܼ is the configurational partition function. Furthermore, the relative entropy can be expressed as a difference between the potential energy and free energy of the atomistic and CG ensembles: Minimization of the relative entropy is performed using iterative Newton-Raphson techniques. It is important to note, however, that the quality and fidelity of such CG models is determined by the molecular behavior sampled in the underlying AA simulations. Macromolecular complexes, such as virions, undergo a wide range of behavior including physical and chemical transitions that will be difficult to capture through AA simulations alone, as well as experimental techniques. This is especially true for processes that involve large conformational changes that are not sampled effectively in AA simulations, whether due to the long timescales required, free energy barriers, or inherent limitations of the simulation force field. For instance, the S protein of SARS-CoV-2 exhibits two proteolytic cleavage sites (at the S1/S2 and S2' locations) as well as binding to the host cell receptor, angiotensin-converting enzyme 2 (ACE-2). Cleavage and binding events trigger dramatic conformational changes in the spike that result in the insertion of the fusion peptide into the host cell membrane. High-resolution structural studies of the S-ACE-2 complex have made protein binding simulations amenable to enhanced sampling techniques at the AA level (4, 56) . The proteolytic cleavage of the spike and large-scale conformational shift towards fusion peptide insertion, however, are more difficult to sample in atomistic simulations. To address these issues, one can use CG molecular simulation techniques that allow CG particles to adaptively switch discrete "states" and interactions, such as "Ultra-Coarse-Graining" (UCG) (57) (58) (59) Experiments can probe longer timescales than are available from AA MD simulations. In recent cryo-EM images of SARS-CoV-2 particles, the S1 domain of the S protein was found to transiently open and close in order to bind the ACE-2 receptor (3, 5) , which are subtle conformational changes that are difficult to sample in atomistic simulations. For these conformational changes -in the case that they cannot be treated as discrete state switchesplastic network models (60) or multi-configuration coarse graining (MCCG) methods (61) can be used to construct a CG model that continuously transitions from one state to the next. For plastic network models, two known experimental configurations of the protein are used to build a multibasin ENM that represents deviations away from each of the individual conformational minima. A phenomenological interaction Hamiltonian is constructed that couples and mixes the ENMs between the two structural endpoints. In MCCG, the primary difference is that the coupling terms in the Hamiltonian are constructed from a two-state mixing approach, derived on the basis of a mapped potential of mean force which is explicitly computed from AA simulation data along collective variables that distinguish between the two (or more) conformational states at a CG level. An alternative (and sometimes necessary "top-down") route to deriving CG models is to construct a model Hamiltonian and then analyze the model's resulting behavior in the context of the assumed interactions. Typically, parameterization of such models is designed to fit or reproduce particular observables measured in experimental data and perhaps particular sets of AA simulation data. These can be performed based on variational optimization of some systemspecific functional that depends on the experimental observable. Model Hamiltonian approaches have the advantage that physical intuition is clearer, but are not systematic, since each new problem requires a different treatment for the set of interactions involved. Furthermore, these approaches often require orthogonal experiments to validate the underlying model. Such coarsegraining methods are, however, especially useful in cases where atomistic simulation and/or experimental data is difficult or infeasible to obtain on the system or if the bottom-up methods described above are not expected to yield converged results for the CG effective potential, owing to limited atomistic sampling. Here we present results from the first CG simulations of the SARS-CoV-2 virion (Figure 3 ). It should be noted that these are early results and we can thus expect additional simulations to become available from this model as more experimental data and AA simulations become available for the various virion components. In addition, the overall CG methodology and modeling of the virion will continue to evolve, and are works in progress. A CGMD simulation was performed on the complete CG virion model using LAMMPS for (Figure 4 A) . In general, the CG model captured the positions and peaks in the pair correlation functions; however, error in the fine structure of the peaks was also present, indicating that refinement involving the addition of more expressive basis CG potentials (e.g., splines) may be necessary. We performed principal component analysis (PCA) on a subset of the CG particles to examine collective modes of motion of the virion ( Figure 4B ). The Cartesian coordinates of 1 particle for every 15 CG lipids, 1 for every M and E protein, and 1 for every 3 S particles were extracted from the trajectory data and used to compute the covariance matrix, the S1/S2 during CGMD. In general, there was a high degree of variance in the S protein, and these correlated modes of motion are likely informative of how the virion collectively utilizes spike proteins to explore and detect receptors. Longer CG simulations with more expressive CG models will likely be required to uncover additional modes of motion in the virion, including modes that involve the structural M, N, and E proteins. This work provides an initial CG molecular model of the SARS-CoV-2 virion and details a bottomup CG approach capable of further refining the model as new atomistic and experimental data becomes available. Currently, the lipid envelope is described using a particle-based The Protein Data Bank Announcing the worldwide Protein Data Bank Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation Structural basis for the recognition of SARS-CoV-2 by full-length human ACE2 Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein Structural basis of receptor recognition by SARS-CoV-2 Structure of replicating SARS-CoV-2 polymerase Cryo-EM structure of the SARS-CoV-2 3a ion channel in lipid nanodiscs Structural model of the SARS coronavirus E channel in LMPG micelles Molecular Basis for ADP-Ribose Binding to the Mac1 Domain of SARS-CoV-2 nsp3 Activity profiling and structures of inhibitor-bound SARS-CoV-2-PLpro protease provides a framework for anti-COVID-19 drug design Crystal structure of SARS-CoV-2 main protease provides a basis for design of improved α -ketoamide inhibitors Elucidation of cryptic and allosteric pockets within the SARS-CoV-2 protease 2020. Targeting SARS-CoV-2 with AI-and HPC-enabled Lead Generation: A First Data Release Developing a Fully Glycosylated Full-Length SARS-CoV-2 Spike Protein Model in a Viral Membrane The Molecular Biology of Coronaviruses Distinct conformational states of SARS-CoV-2 spike protein Supramolecular Architecture of Severe Acute Respiratory Syndrome Coronavirus Revealed by Electron Cryomicroscopy The M, E, and N structural proteins of the severe acute respiratory syndrome coronavirus are required for efficient assembly, trafficking, and release of virus-like particles The SARS coronavirus nucleocapsid protein--forms and functions Molecular architecture of the SARS-CoV-2 virus. Cell. S0092867420311594 Coronavirus envelope protein: current knowledge A Community Letter Regarding Sharing Biomolecular Simulation Data for COVID-19 Crystal structure of SARS-CoV-2 nucleocapsid protein RNA binding domain reveals potential unique drug targeting sites CHARMM-GUI: A web-based graphical user interface for CHARMM Site-specific glycan analysis of the SARS-CoV-2 spike Deducing the N-and Oglycosylation profile of the spike protein of novel coronavirus SARS-CoV-2 Comparative Protein Structure Modeling Using MODELLER JPred4: a protein secondary structure prediction server Structure of the SARS Coronavirus Nucleocapsid Protein RNA-binding Dimerization Domain Suggests a Mechanism for Helical Packaging of Viral RNA Improved prediction of protein side-chain conformations with SCWRL4 A large-scale experiment to assess protein structure prediction methods Modeling of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) Proteins by Machine Learning and Physics-Based Refinement Efficient Simulation of Tunable Lipid Assemblies Across Scales and Resolutions Fast Parallel Algorithms for Short-Range Molecular Dynamics Systematic Coarse-Grained Lipid Force Fields with Semiexplicit Solvation via Virtual Sites Structures and distributions of SARS-CoV-2 spike proteins on intact virions In situ structural analysis of SARS-CoV-2 spike reveals flexibility mediated by three hinges SARS-CoV-2 (COVID-19) by the numbers. eLife Membrane lipids: where they are and how they behave Membrane Lipid Composition: Effect on Membrane and Organelle Structure, Function and Compartmentalization and Therapeutic Avenues Comparison of simple potential functions for simulating liquid water Optimization of the Additive CHARMM All-Atom Protein Force Field Targeting Improved Sampling of the Backbone Scalable molecular dynamics with NAMD Beyond Shielding: The Roles of Glycans in the SARS-CoV-2 Spike Protein A Systematic Methodology for Defining Coarse-Grained Sites in Large Biomolecules Systematic Multiscale Parameterization of Heterogeneous Elastic Network Models of Proteins On the Dielectric "Constant" of Proteins: Smooth Dielectric Function for Macromolecular Modeling and Its Implementation in DelPhi SARS-CoV-2 structure and replication characterized by in situ cryo-electron tomography A Multiscale Coarse-Graining Method for Biomolecular Systems The multiscale coarse-graining method. I. A rigorous bridge between atomistic and coarse-grained models The relative entropy is fundamental to multiscale and inverse thermodynamic problems Coarse-graining errors and numerical optimization using a relative entropy framework Structure of the SARS-CoV-2 spike receptor-binding domain bound to the ACE2 receptor The Theory of Ultra-Coarse-Graining. 1. General Principles The Theory of Ultra-Coarse-Graining. 2. Numerical Implementation Insights into the Cooperative Nature of ATP Hydrolysis in Actin Filaments Large amplitude conformational change in proteins explored with a plastic network model: adenylate kinase Multiconfigurational Coarse-Grained Molecular Dynamics Coarse-grained simulation reveals key features of HIV-1 capsid selfassembly 2020. TRIM5α self-assembly and compartmentalization of the HIV-1 viral capsid Off-Pathway Assembly: A Broad-Spectrum Mechanism of Action for Drugs That Undermine Controlled HIV-1 Viral Capsid Formation Discovery of a Novel Binding Trench in HIV Integrase Structural basis for modulation of a G-protein-coupled receptor by allosteric drugs Glutamate and Glycine Binding to the NMDA Receptor Energetics of Glutamate Binding to an Ionotropic Glutamate Receptor Neurotransmitter Funneling Optimizes Glutamate Receptor Kinetics Molecular lock regulates binding of glycine to a primitive NMDA receptor Atomic-scale characterization of mature HIV-1 capsid stabilization by inositol hexakisphosphate (IP6) Research.