key: cord-0603317-2kbneknq authors: Russo, John D.; Zuckerman, Daniel M. title: Simple synthetic molecular dynamics for efficient trajectory generation date: 2022-04-09 journal: nan DOI: nan sha: d09f8cd520789fbd378a7bd90dcafc2544b6fcfc doc_id: 603317 cord_uid: 2kbneknq Synthetic molecular dynamics (synMD) trajectories from learned generative models have been proposed as a useful addition to the biomolecular simulation toolbox. The computational expense of explicitly integrating the equations of motion in molecular dynamics currently is a severe limit on the number and length of trajectories which can be generated for complex systems. Approximate, but more computationally efficient, generative models can be used in place of explicit integration of the equations of motion, and can produce meaningful trajectories at greatly reduced computational cost. Here, we demonstrate a very simple synMD approach using a fine-grained Markov state model (MSM) with states mapped to specific atomistic configurations, which provides an exactly solvable reference. We anticipate this simple approach will enable rapid, effective testing of enhanced sampling algorithms in highly non-trivial models for both equilibrium and non-equilibrium problems. We demonstrate the use of a MSM to generate atomistic synMD trajectories for the fast-folding miniprotein Trp-cage, at a rate of over 200 milliseconds per day on a standard workstation. We employ a non-standard clustering for MSM generation that appears to better preserve kinetic properties at shorter lag times than a conventional MSM. We also show a parallelizable workflow that backmaps discrete synMD trajectories to full-coordinate representations at dynamic resolution for efficient analysis. The overall goals of molecular dynamics (MD) simulation are to generate sufficiently well-sampled and accurate trajectories, but these are hindered by notable challenges. On the one hand, inadequate sampling of complex systems prevents complete characterization of force-field accuracy, impeding the improvement of force field models. On the other, poor sampling also complicates the development of new sampling methods, because it is effectively impossible to gauge the success of a new method without reference simulation data. Well-sampled simulation data (rather than experimental data) on complex Molecular dynamics can simulate highly complex systems, at the cost of great computational expense. Simpler toy potentials simulated under, for example, Langevin dynamics, can be highly efficient but lack the complexity present in real systems. * zuckermd@ohsu.edu systems is required as a reference for methods development because even perfectly sampled models are not expected to agree with experiments, again because of model inaccuracy [1] . Synthetic molecular dynamics (synMD), i.e., the generation of approximate but arbitrarily long trajectories of highly complex models [2] [3] [4] [5] [6] [7] [8] , can directly aid methods development for sampling and hence indirectly contribute to force field development. In the long term, increasingly accurate synMD models may ultimately provide a partial replacement for standard MD. The limitations of conventional MD simulation for biomolecules are well known. Record millisecondtimescale simulations are only achievable for relatively small and simple proteins, even with substantial computational resources [1, 9] . In contrast, more complex processes of biological interest in larger systems span timescales up to to seconds and beyond [10] [11] [12] , which are inaccessible by conventional MD [13, 14] . MD limitations have motivated the development of numerous alternative strategies. Coarse-graining atoms using a force field such as MARTINI [15] , or representing the solvent with an implicit model [16] are strategies for accelerating simulation speed by reducing the number of atoms being simulated, as are statistical mechanicsbased coarse-graining strategies like force-matching [17] [18] [19] . Enhanced equilibrium sampling methods such as replica exchange, metadynamics, or umbrella sampling with weighted histogram analysis employ modified energy landscapes, and are popular alternatives to conventional MD for atomistic systems [20] [21] [22] [23] [24] . Path-sampling methods, including weighted ensemble and forward-flux sampling among others, aim to improve simulation efficiency by focusing computational resources on regions of interest and can provide unbiased non-equilibrium observables. [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] . Finally, Markov state models (MSMs) are a useful tool for connecting data from independent simulations which sample different but overlapping regions of phase space [35, 36] . Despite this significant progress, a persistent challenge in methods development for biomolecular simulationwhich remains ongoing for essentially all of the strategies noted above -is the lack of validation data, i.e., extremely well-sampled MD data for systems of interest. As illustrated in Fig. 1 , well-sampled MD runs are typically slow for complex systems. This makes them infeasible for use as a step in methods development pipelines because sufficiently complex systems generally cannot be sampled well enough to provide reference values for comparison. Simpler and faster systems such as low-dimensional potentials likely will not capture sufficient complexity to challenge the methods being tested. Here, we describe a simple synthetic MD workflow based on MSMs, in which a generative model is trained using a set of initial, standard molecular dynamics data. MSMs [35] [36] [37] [38] , with states mapped to specific atomistic configurations, are perhaps the simplest type of generative model. MSM variants such as history-augmented Markov state models (haMSMs) and other MSM alternatives can also be used [26, [39] [40] [41] [42] [43] [44] [45] [46] again by mapping discrete states to specific configurations. A special class of coordinate-generative MSMs can also be used to probabilistically generate new, out-of-sample structures [4] . Our work is distinguished from notable previous synMD efforts [2] [3] [4] [5] [6] [7] [8] by its simplicity and the availability of exact solutions. In this preliminary work, we build a detailed generative MSM from folding trajectories of the Trp-cage miniprotein [1] . We employ a simple stratification strategy to augment the usual MSM clustering that appears to preserve kinetic characteristics at smaller lag times than might otherwise be necessary for validation [47] . We generate synMD trajectories at a rate of ∼250 ms/day on a MacBook computer, compared to ∼100 µs/day on the Anton supercomputer for the original trajectories. We confirm that the synMD trajectories reproduce observables of the MD training data consistent with known capabilities and limitations of MSMs [47] , and that the synMD trajectories replicate exactly calculable equilibrium and kinetic properties of the MSM as expected. We also demonstrate dynamic resolution analysis of the synMD trajectories, where full-coordinate structures are only backmapped within time intervals and at a timeresolution of interest, rather than to each generated point, enabling more efficient analysis. We present two main workflows for producing synthetic MD trajectories. First, we describe a generic strategy for efficiently generating trajectories with full atomic coor- dinates. Second, we outline a strategy to efficiently generate extremely long atomistic trajectories at a coarse temporal resolution, followed by enhancement of the resolution in post-processing for time intervals of interest. In the standard synthetic MD workflow employing discrete states (Fig. 2 ), a generative model employing a discretization of configuration space, such as an MSM, is first built from an initial set of traditional MD trajectories [1] . A specific full-coordinate atomistic configuration is associated with each discrete state. The generative model is then used to simulate trajectories, which will be time-ordered lists of discrete configurational states, stored as integers. Discrete trajectory generation typically will be an extremely rapid process. These trajectories are then back-mapped to the saved atomic coordinate structures. Because the discrete trajectories are generated before assigning full-coordinate structures, the back-mapping is highly parallelizable. Finally, the fullcoordinate trajectory is written to disk in a standard MD format, enabling processing by standard tools. Synthetic MD also enables a dynamic resolution workflow (Fig. 4) , where very long trajectories can be efficiently generated, and enhanced temporal resolution added to regions of interest in post-processing. In this workflow, the generative model is used to build a very long discrete trajectory. However, only a temporally subsampled set of points from the discrete trajectory are back-mapped to full-coordinate atomistic configurations, rather than the full trajectory. This enables "telescoping" detailed analysis of long trajectories that would be infeasible at full temporal resolution because of the large number of snapshots generated in synMD. To demonstrate the synMD approach, we employed a nearly standard MSM as a generative model, built with pyEMMA [48] . The clustering described below is slightly different than for typical MSMs. The original MD trajectory from a 208 µs simulation of the protein Trp-cage [1] was first featurized with residue-residue minimum RMSD, excluding nearest neighbors. Next, tICA dimensionality reduction was performed at a 10ns lag time with 10 tICs, using commute maps for eigenvector scaling. The dimensionality-reduced trajectories were clustered using a stratified k-means approach, which differs somewhat from typical MSM workflows. A coordinate of interest is first stratified into bins, and then k-means clustering is independently performed in each bin. Stratification guarantees an even distribution of states along coordinates of interest. In this case, we stratified along tIC 0, which sharply distinguishes the folded and unfolded states, guaranteeing reasonable coverage of transition regions in this coordinate. With 20 k-means centers for each stratified bin, there were a total of 1020 clusters which form the discrete states of the generative model. The discretized trajectories were used to build a MSM at a 10ns lag time, chosen to balance time resolution with reasonable kinetic fidelity [47] . The MSM was symmetrized to ensure satisfaction of detailed balance by adding the count matrix to its transpose. For each discrete state, a single representative structure was randomly chosen from all structures assigned to that state. Some of the choices made in constructing this MSM may decrease model fidelity to the MD training data, but we emphasize our initial goal is to construct a generative model with protein-like complexity to enable downstream analysis and testing. Indeed, MSMs have fundamental limitations that have been discussed in detail [47] . For reference, we note this MSM produced mean firstpassage times (MFPTs) of 12.7µs for folding and 2.8µs for unfolding as calculated from the transition matrix using pyEMMA [48] . Five 208 µs trajectories were produced at 10 ns resolution by propagating randomly chosen initial states using the Trp-cage generative model. This took 5 minutes 41 seconds in total for all five trajectories using a MacBook Pro with a single 2.8GHz Intel i7 processor. One such trajectory is shown in Fig. 3 , along with the original MD trajectory. Analysis of these trajectories' equilibrium distributions is consistent with the original MD trajectory data, as well as the underlying MSM, as shown in Fig. 5 . Likewise, the MFPT values estimated from the synMD trajectories were 4.1±1.5 µs for unfolding and 18.3±9.6 µs for folding, consistent with the reference values of 2.8 µs and 12.7 µs computed directly from the MSM transition matrix. We have explored a very simple approach to generating synthetic molecular dynamics (synMD) trajectories based on Markov state models (MSMs), with the motivation of rapidly generating trajectories in highly nontrivial systems that can be solved exactly, in turn providing ideal test beds for methods development. Previous work has employed a range of deep-learning techniques [2] [3] [4] [5] [6] [7] [8] . We show that MSM-based synMD trajectories are generated at multiple orders of magnitude speedup over conventional MD, and confirm that the trajectories reproduce exactly-solvable equilibrium and kinetic properties of the generative model. Our generative MSM was able to employ a shorter lag time -providing higher mech-anistic resolution [47] -because of an apparently novel stratified approach to state clustering. Rapid generation of synMD trajectories should be very useful in testing new methods because it provides arbitrary amounts of data in highly complex, but exactly solvable models. Such a framework could be particularly valuable for path sampling, enabling careful estimation of variance based on different choices of hyperparameters. SynMD can also advance methods development for trajectory analysis tools [41, 49] based on controlled amounts of synMD data, mimicking the lowdata regime typical for MD trajectory sets. Even MSM analysis protocols can be tested using synMD based on a fine-grained MSM, so long as the MSM used for analysis is blinded to the fine-grained MSM used to generate trajectories. SynMD may also be useful for generating an arbitrary number of stochastic mechanistic pathways encoded by the generative model, which may be compared to experimental or higher-quality simulated data to further refine the generative model [50] . It is feasible to construct significantly improved generative models within the MSM framework. For example, much finer-grained states can be employed, and established adaptive approaches for selecting key regions for further simulation (of MD training data) are available [51] [52] [53] [54] . Training data from polarizable or hybrid quantum/classical force fields could be used to refine a conventional MSM as needed. Numerous MSM-like discretestate models have been developed incorporating more dynamical information -i.e., trajectory history -than conventional MSMs [26, [39] [40] [41] [42] [43] [44] [45] [46] . For example, haMSMs are unbiased for kinetics at any lag time and were shown to significantly outperform conventional MSMs in characterizing mechanistic details of protein folding [41, 47] . Deep generative MSMs can be used to stochastically generate new out-of-sample structures [4] . More modern machine learning strategies will undoubtedly continue to play a large role in synMD. Frameworks such as variational autoencoders and recurrent neural networks including long short-term memory neural networks [3, 7, 8] have led to models with an improved ability to generate MD-like discrete-state trajectories; note that current MSMs and variants have not been optimized for this task, which is critical to synMD. Mixture density network autoencoders [5] and latent space simulators [6] generate trajectories in a lower-dimensional continuous space, and provide a mapping to full-coordinate representations. We appreciate valuable input from Jeremy Copperman, and are grateful for support from the NIH via Grant GM115805 and the NSF via Grant MCB 2119837. We thank Pratyush Tiwary and Andrew Ferguson for useful guidance on the current literature. How Fast-Folding Proteins Fold Simulate Time-integrated Coarse-grained Molecular Dynamics with Geometric Machine Learning Recurrent Neural Network-based Model for Accelerated Trajectory Analysis in AIMD Simulations, arXiv Deep generative markov state models Accelerated Simulations of Molecular Systems through Learning of Effective Dynamics Molecular latent space simulators Learning molecular dynamics with simple language model built upon long short-term memory neural network Path sampling of recurrent neural networks by incorporating known physics Raising the Bar for Performance and Programmability in a Special-Purpose Molecular Dynamics Supercomputer A glycan gate controls opening of the SARS-CoV-2 spike protein Reaching biological timescales with all-atom molecular dynamics simulations Reaching biological timescales with all-atom molecular dynamics simulations Molecular Dynamics Simulation for All Martini 3: a general purpose force field for coarse-grained molecular dynamics Implicit Solvent Models in Molecular Dynamics Simulations: A Brief Overview Andersen, Dynamic force matching: A method for constructing dynamical coarse-grained models with realistic time dependence Machine Learning of Coarse-Grained Molecular Dynamics Force Fields Machine Learning for Molecular Simulation Metadynamics Replica-exchange molecular dynamics method for protein folding Enhanced Sampling in Molecular Dynamics Using Metadynamics, Replica-Exchange, and Temperature-Acceleration The weighted histogram analysis method for free-energy calculations on biomolecules Nonphysical sampling distributions in Monte Carlo free-energy estimation: Umbrella sampling Computing time scales from reaction coordinates by milestoning Markovian milestoning with Voronoi tessellations WESTPA 2.0: High-Performance Upgrades for Weighted Ensemble Simulations and Analysis of Longer-Timescale Applications Forward flux sampling for rare event simulations Umbrella sampling for nonequilibrium processes Nonequilibrium umbrella sampling in spaces of many order parameters Enhanced Sampling of Nonequilibrium Steady States Elaborating transition interface sampling methods Investigating rare events by transition interface sampling Weighted-ensemble Brownian dynamics simulations for protein association reactions Markov state models of biomolecular conformational dynamics Long-Time Protein Folding Dynamics from Short-Time Molecular Dynamics Simulations Independent Markov decomposition: Toward modeling kinetics of biomolecular complexes Constructing the equilibrium ensemble of folding pathways from short off-equilibrium simulations Computational Estimation of Microsecond to Second Atomistic Folding Times Accurate Estimation of Protein Folding and Unfolding Times: Beyond Markov State Models Unbiased estimation of equilibrium, rates, and committors from Markov state model analysis Long-Time-Scale Predictions from Short-Trajectory Data: A Benchmark Analysis of the Trp-Cage Miniprotein On the advantages of exploiting memory in Markov state models for biomolecular dynamics Correlation functions, mean first passage times, and the Kemeny constant Arbitrarily accurate representation of atomistic dynamics via Markov Renewal Processes, arXiv (2020) Projected metastable Markov processes and their estimation with observable operator models What Markov State Models Can and Cannot Do: Correlation versus Path-Based Observables in Protein-Folding Models PyEMMA 2: A Software Package for Estimation, Validation, and Analysis of Markov Models Nonparametric Analysis of Nonequilibrium Simulations Refining markov state models for conformational dynamics using ensemble-averaged data and time-series trajectories Extensible and Scalable Adaptive Sampling on Supercomputers Quantitative comparison of adaptive sampling methods for protein dynamics Everything you wanted to know about Markov State Models but were afraid to ask Enhanced Modeling via Network Theory: Adaptive Sampling of Markov State Models