key: cord-0617265-5wjik52o authors: Tian, Hao; Tao, Peng title: Deciphering the Protein Motion of S1 Subunit in SARS-CoV-2 Spike Glycoprotein Through Integrated Computational Methods date: 2020-04-10 journal: nan DOI: nan sha: d35087c6f0465735e38cd46ee0da0b6c18a88ffb doc_id: 617265 cord_uid: 5wjik52o The novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a major worldwide public health emergency that has infected over $1.5$ million people. The partially open state of S1 subunit in spike glycoprotein is considered vital for its infection with host cell and is represented as a key target for neutralizing antibodies. However, the mechanism elucidating the transition from the closed state to the partially open state still remains unclear. Here, we applied a combination of Markov state model, transition path theory and random forest to analyze the S1 motion. Our results explored a promising complete conformational movement of receptor-binding domain, from buried, partially open, to detached states. We also numerically confirmed the transition probability between those states. Based on the asymmetry in both the dynamics behavior and backbone C$alpha$ importance, we further suggested a relation between chains in the trimer spike protein, which may help in the vaccine design and antibody neutralization. result of MD simulation, Markov state model (MSM) is applied to catch long-time kinetic information given time-limited simulation trajectories. 16, 17 Another advantage of MSM is that it can divide a large number of protein structures from simulation into subspaces based on the extracted kinetic information, and differences among those spaces can be calculated and used for comparison. Machine learning techniques have achieved great accomplishments in chemistry and biology including material discovery, 18 structure representation 19 and computation acceleration 20 . The great contributions from machine learning mainly come from its ability to deal with large scale data and its accurate and explainable models, 21 The root-mean-square deviation (RMSD) is used to measure the overall conformational change of each frame with regard to a reference structure in a MD simulation trajectory. For a molecular structure represented by Cartesian coordinate vector r i (i = 1 to N ) of N atoms, the RMSD is calculated as following: r 0 i represents the Cartesian coordinate vector of the i th atom in the reference structure. The transformation matrix U is defined as the best fit alignment between the protein structures along trajectories with respect to the reference structure. The root-mean-square fluctuation (RMSF) is used to measure the fluctuation of each atom in each frame with regard to a reference structure in a MD simulation trajectory. Where T is the total frames and r i is the average position of atom i in the given trajectory. Distances between pairs of backbone Cα were chosen as representative features for one struc- Random forest is a machine learning technique that can be used for classification. 25 In a random forest model, a quantitative evaluation of the importance for each feature used for training is calculated through training process. This feature importance is calculated using Gini impurity: where f i is the frequency of a label at a node of being picked to split the data set and C is the total number of unique labels. A random forest model is a collection of several decision tree models. The importance of node i in decision tree is calculated as: where w j is the weighted number of samples reading node i, C i is the impurity value of node i and m is the number of child nodes. The feature importance of feature i is calculated as f i = s l n j k∈all nodes n k (6) where s is the times of node j split on feature i. Feature importance within a decision tree is further normalized by: The feature importance in random forest is the averaged importance of feature i in all decision tree models: where norm f i is the normalized feature importance of one single decision tree and N is the number of decision trees. 29 Feature importance of all pairwise Cα distances were calculated using the above methods. The feature importance of an amino acid is the summation of importances of features that are related with that amino acid. The relative accumulated feature importance of each amino acid shows the distinguishability and contribution of that amino acid among all amino acids in differentiating states. Transition path theory (TPT) 34 where T ik is the transition probability from state i to state k. Sequentially, the effective flux is calculated as: where π i is the equilibrium probability of normalized transition matrix T π and q − i is the backward-committor probability calculated as q − i = 1 − q + i . However, backward flux f ji should also be considered and subtracted when calculating net flux. Therefore, the net flux Total flux can then be calculated as: The flux from initial state to target state can be decomposed to individual pathways p i . Dijkstra algorithm is implemented in MSMBuilder for pathway decomposition. A set of pathways p i can be generated along f i , which provides a relative probability by: 3 Results Simulation trajectory analysis shows dynamical activity The plot suggested that the close state is relatively stable while the partially open state is dynamically active and undergoes significant conformational changes after 1 microsecond. However, the simulation of open state after 6 microseconds suggested a convergence in the RMSD value and a relatively stable structure, which corresponds to the detached S1 subunit from S2 fusion machinery. RMSF results were plotted in Figure 1 PCCA was applied to map microstates onto macrostates based on the eigenfunction structure of transition probability matrix. The resulting macrostates with transition probability are shown in Figure 3 Figure 3B to show a series of transition processes. To better understand the shift between close and open states in S1 subunit, the pair wised Cα distances of S1 were extracted as features representing the character of protein configurations. There are 540 residues on each chain, residue ID from 27 to 676, and total of 1620 * 1619/2 = Supervised machine learning model was applied to extract the key residues that are vital during allosteric process and study the structural differences among macrostates. For each simulation trajectory, frames were saved for every 1.2 nanoseconds (ns), resulting in 8, 334 frames. Therefore, 16, 668 samples with 1, 311, 390 features were extracted from the trajectories. Each sample was labeled based on the above macrostate result. Full dataset was further split into training set(70%) and testing set(30%). After the preparation of data, random forest model was applied to distinguish the intrinsic conformational differences among macrostates. Training scores and testing scores were plotted in Figure 4A . 7 was chosen for the depth with corresponding testing accuracy of 92.18%, which indicated that the random forest model was able to catch the conformational characteristics of each macrostate only using pair-wised Cα distances. To further investigate the relationship between chain A and two other chains, the original Cα distances related with chain A was excluded as reduced features and applied to another random forest model. Training and testing results Figure 4B . The top 500 features accounted for 74.8% percent of the overall feature importance, shown in Figure S2 . The testing accuracy with reduced features reached 88.04% at depth 8. The top five important Cα distances were listed in Table 2 . In order to identify key residues along the transition from the closed to the partially open state, the feature importance of each Cα distance was added and accumulated to the two related residues. S1 subunit structure was plotted in Figure 5A as reference. Top 20 important residues on chain B and C, with corresponding structure and accumulated structure importance under each figure, were plotted in Figure 5 (B-C). Full results of residue importance on chain B and C are shown in S3. The significance of the partially open state of receptor-binding domain (RBD) in SARS-CoV-2 for interacting with the host cell receptor has been extensively studied. 36, 37 The opening of S1 subunit, thus exposing RBD, is necessary for engaging with ACE2 and following cleavage of S 2 site. 38 The spike protein is essential for SARS-CoV-2 as it destabilizes the trimer structure, causing the detachment of S1 subunit and exposing the RBD domain to host cell membrane. In this study, we used publicly available simulation trajectories of S protein and studied the asymmetric dynamics nature of the trimer structure. Markov state model was applied to divide the conformational space into 8 macrostates. The representative structures along the most probable channel in transition path theory are shown to present a clear route from the close state to the partially open state. Transition matrix was calculated to determine the probability of close-open transitions, with maximum of 9.9% from the close macrostate implied a relationship between chain A and two other chains. Yet, whether this relation originates from the mutual influence among chains or the intrinsic asymmetry in biological functions needs further investigation. Overall, our study quantitatively analyzed the motion of S1 subunit in S protein with important Cα distances and residues. Possible prevention and cure searching progress could be facilitated combining with these vital findings, which provide new opportunities for potential drug research and vaccine development. Figure S2 : Accumulated feature importance of top 500 Cα distances. A new coronavirus associated with human respiratory disease in China Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding The coronaviridae Bat-to-human: spike features determining 'host jump' of coronaviruses SARS-CoV, MERS-CoV, and beyond MERS-CoV spike protein: Targets for vaccines and therapeutics Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein Structure, function, and evolution of coronavirus spike proteins. Annual review of virology Receptor recognition mechanisms of coronaviruses: a decade of structural studies Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation The coronavirus spike protein is a class I virus fusion protein: structural and functional characterization of the fusion core complex Cryo-electron microscopy structures of the SARS-CoV spike glycoprotein reveal a prerequisite conformational state for receptor binding Immunogenicity and structures of a rationally designed prefusion MERS-CoV spike antigen Cryo-electron microscopy structure of porcine deltacoronavirus spike protein in the prefusion state Markov models of molecular kinetics: Generation and validation Molecular Dynamics Simulations Related to SARS-CoV-2 Accurate estimation of protein folding and unfolding times: beyond Markov state models Stochastic steps in secondary active sugar transport Machine-learning-assisted materials discovery using failed experiments Crystal structure representations for machine learning models of formation energies Adaptive machine learning framework to accelerate ab initio molecular dynamics Supervised machine learning: A review of classification techniques. Emerging artificial intelligence applications in computer engineering Deep learning for drug design: an artificial intelligence paradigm for drug discovery in the big data era Dimensionality reduction methods for molecular simulations Rectified linear units improve restricted boltzmann machines Classification and regression by randomForest Machine Learning Classification Model for Functional Binding Modes of TEM-1 β-Lactamase. Frontiers in molecular biosciences Incremental induction of decision trees Scikit-learn: Machine learning in Python Random forests Dynamical Behavior of β-Lactamases and Penicillin-Binding Proteins in Different Functional States and Its Potential Role in Evolution Robust Perron cluster analysis in conformation dynamics. Linear algebra and its applications Using generalized ensemble simulations and Markov state models to identify conformational states statistical models for biomolecular dynamics Constructing the equilibrium ensemble of folding pathways from short off-equilibrium simulations. Proceedings of the National Academy of Sciences Transition path theory for Markov jump processes Unexpected receptor functional mimicry elucidates activation of coronavirus fusion Cryo-EM structures of MERS-CoV and SARS-CoV spike glycoproteins reveal the dynamic receptor binding domains Stabilized coronavirus spikes are resistant to conformational changes induced by receptor recognition or proteolysis Pre-fusion structure of a human coronavirus spike protein Figure S3 : Accumulated residue importance in chain B (blue) and C (orange), respectively.