key: cord-147565-mtdhdkc1 authors: Harmalkar, Ameya; Gray, Jeffrey J. title: Advances to tackle backbone flexibility in protein docking date: 2020-10-15 journal: nan DOI: nan sha: doc_id: 147565 cord_uid: mtdhdkc1

Computational docking methods can provide structural models of protein-protein complexes, but protein backbone flexibility upon association often thwarts accurate predictions. In recent blind challenges, medium or high accuracy models were submitted in less than 20% of the "difficult" targets (with significant backbone change or uncertainty). Here, we describe recent developments in protein-protein docking and highlight advances that tackle backbone flexibility. In molecular dynamics and Monte Carlo approaches, enhanced sampling techniques have reduced time-scale limitations. Internal coordinate formulations can now capture realistic motions of monomers and complexes using harmonic dynamics. And machine learning approaches adaptively guide docking trajectories or generate novel binding site predictions from deep neural networks trained on protein interfaces. These tools poise the field to break through the longstanding challenge of correctly predicting complex structures with significant conformational change.

Protein-protein interactions are involved in nearly all of the biological processes in human health and disease. Understanding the dynamics of binding and the structure of protein complexes at the molecular level can be instrumental in delineating biological mechanisms and developing intervention strategies. Computational protein-protein docking provides a route to predict the three-dimensional structures of protein assemblies or complexes from known structures of individual monomeric proteins [1].
Docking methods are tested in the blind prediction challenge known as the Critical Assessment of PRediction of Interactions (CAPRI) [2], which in recent rounds pushed the field by including a wide array of target types such as transport proteins, higher order assemblies and host-virus interactions [3, 4]. Out of the 28 protein-protein targets evaluated in CAPRI over the past four years [4, 5], predictors achieved high quality structures for 11 "easy" targets, defined as those with little backbone motion (unbound to bound Cα root mean square deviation (RMSD_BU) of less than 1.2 Å [6, 7]; Figure 1). The remaining 17 targets were categorized as "difficult" (RMSD_BU over 2.2 Å and/or poor monomer template availability). For these targets, predictors only achieved acceptable quality in 8 of 17 targets (47%) and high quality in only 2 (12%) [4, 5]. Thus, the intrinsic flexibility of biomolecules still confounds the protein docking community at large. In this review, we focus on the central docking challenge of capturing larger binding-induced conformational changes.

Figure 1: Performance of protein docking approaches on blind targets in CAPRI Rounds 38-46 [4, 5]. Distribution of DockQ scores for the best model submitted by each predictor group (points) for each individual target (x-axis). DockQ measures a combination of intermolecular residue-residue contacts, interface RMSD, and ligand RMSD on a scale of 0 (incorrect) to 1 (matching the experimental structure) [5]. Targets are labelled by their CAPRI target number and, when needed, interface number (after the decimal). The targets are classified into rigid (easy) targets (high-homology monomer templates and under 1.2 Å unbound-bound backbone motion) and flexible targets (poor template availability and/or over 1.2 Å RMSD_BU). DockQ scores are color-coded by CAPRI model quality ranking: blue, high; green, medium; yellow, acceptable; gray, incorrect. Data graciously provided by Marc Lensink [4, 5].
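To make the DockQ scale used in Figure 1 more concrete, the sketch below shows one way the three ingredients named in the caption can be folded into a single number. It follows the commonly published DockQ definition, which this review does not spell out: each RMSD is mapped to the interval (0, 1] by 1/(1 + (RMSD/d)^2), with d = 1.5 Å for the interface RMSD and d = 8.5 Å for the ligand RMSD, and the two scaled terms are averaged with the fraction of native contacts. The function names are hypothetical, and the class cutoffs (0.23, 0.49, 0.80) are the usual DockQ thresholds rather than values taken from this text.

```python
def scaled_rmsd(rmsd, d):
    """Map an RMSD in Angstrom to (0, 1]; 1 means a perfect match."""
    return 1.0 / (1.0 + (rmsd / d) ** 2)


def dockq(fnat, irmsd, lrmsd, d_interface=1.5, d_ligand=8.5):
    """Average of native-contact fraction and the two scaled RMSD terms.

    fnat  : fraction of native residue-residue contacts recovered (0..1)
    irmsd : interface RMSD in Angstrom
    lrmsd : ligand RMSD in Angstrom
    The scaling constants are assumptions taken from the published
    DockQ definition, not from this review.
    """
    return (fnat
            + scaled_rmsd(irmsd, d_interface)
            + scaled_rmsd(lrmsd, d_ligand)) / 3.0


def quality_class(score):
    """Translate a DockQ score into a CAPRI-style quality label."""
    if score >= 0.80:
        return "high"
    if score >= 0.49:
        return "medium"
    if score >= 0.23:
        return "acceptable"
    return "incorrect"
```

Under these assumptions, a model that recovers no native contacts but sits exactly at the two scaling distances (1.5 Å interface RMSD, 8.5 Å ligand RMSD) scores 1/3, which falls in the acceptable band.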
We summarize progress by recent algorithms and frameworks, additionally augmented by growth in databases and computational power (CPU- and GPU-based). These new methods have achieved greater accuracy on more challenging targets and additionally yielded insight into binding mechanisms. We first present progress in binding site identification and then docking methods including molecular dynamics (MD) and Monte Carlo (MC) approaches, normal modes, and machine learning. Together, these techniques have helped better explore broader regions of conformational space and more thoroughly evaluate the energy landscape to improve protein-protein docking.

To reduce the complexity of the immense conformational space of flexible proteins, coarse-grained models are frequently used to reduce the degrees of freedom (Figure 2). In the extreme, global docking approaches typically first treat protein partners as rigid bodies by restricting to six degrees of freedom (three rotational and three translational). A prime method to exhaustively sample the global 6D space is enumerating and scoring different rigid-body orientations on a dense grid. Approaches such as ClusPro [15, 16], ZDOCK [17, 18], PIPER [19] and HexServer [20] [21]. Relative to traditional FFT-based docking, FMFT accelerates calculations ten-fold [13]*. Another shape-based approach is geometric hashing, which indexes point sets or curves to match geometric features under arbitrary transformations like translations, rotations or even scaling [22]. Local 3D Zernike descriptor-based docking (LZerD), one of the top methods in CAPRI, projects 3D surfaces onto spheres to efficiently capture complementarity of protein surfaces [23]. Some rigid-body approaches exploit data from chemical cross-linking experiments [24] or small-angle X-ray scattering (SAXS) [25] to further improve discrimination of generated structures.
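The exhaustive grid search mentioned above can be illustrated with a toy calculation. The sketch below, written in Python with NumPy, scores every cyclic translation of one voxelized shape against another with a single FFT-based cross-correlation. It is deliberately simplified: it assumes one fixed rotation and a bare overlap-count score, whereas FFT docking programs repeat the scan over many rotations and use richer scoring terms. The function name `fft_translation_scan` and the grids are hypothetical and do not correspond to any of the programs cited here.

```python
import numpy as np


def fft_translation_scan(receptor, ligand):
    """Score all cyclic translations of `ligand` against `receptor`.

    Both inputs are 3D arrays on the same grid. The returned array holds
    score[t] = sum_x receptor[x] * ligand[x - t], i.e. the overlap after
    shifting the ligand by t voxels (with wrap-around), computed for all
    t at once via the correlation theorem.
    """
    r_hat = np.fft.fftn(receptor)
    l_hat = np.fft.fftn(ligand)
    scores = np.fft.ifftn(r_hat * np.conj(l_hat)).real
    best = np.unravel_index(np.argmax(scores), scores.shape)
    return scores, tuple(int(i) for i in best)


# Toy usage: the "receptor" is the ligand shape moved by a known shift,
# so the best-scoring translation should recover that shift.
ligand_grid = np.zeros((16, 16, 16))
ligand_grid[2:5, 2:4, 2:6] = 1.0   # a small box of occupied voxels
ligand_grid[5, 2, 2] = 1.0         # one extra voxel to break symmetry
receptor_grid = np.roll(ligand_grid, (3, 5, 2), axis=(0, 1, 2))
scores, best_shift = fft_translation_scan(receptor_grid, ligand_grid)
```

The design point is that one forward and one inverse transform replace an explicit loop over all grid translations, which is what makes dense translational scans affordable at each sampled rotation.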
These rigid-body approaches provide fast, global exploration of the energy landscape. In recent CAPRI rounds [4, 5], many predictors used them as a first step to identify putative binding patches and then applied other refinement tools to capture backbone flexibility.

Molecular dynamics (MD) is one strategy that is often used for refinement after grid-search or template-based approaches (Figure 3) [26, 27]. Unbiased, all-atom MD simulations can provide a high-resolution, time-resolved microscopic model of protein-protein interactions. MD calculates Newtonian trajectories using physics-based energy functions to simulate protein association and dissociation events. The use of MD for protein docking has been limited because non-native local minima trap proteins and dissociation is too slow [28]. Over the past decade, two modifications have emerged to capture conformational changes: steered molecular dynamics (SMD) [29], which applies external force constraints, and Markov sampling, which breaks a long MD simulation into multiple short trajectories [30]. To accelerate dissociation of protein partners at sub-optimal binding regions, Ostermeir et al. developed a Hamiltonian replica exchange MD protocol (H-REMD) for protein docking [31]*. In H-REMD, biasing potentials are based on the shortest distance between protein partner atoms (defined as "ambiguity restraints"). As the biasing potential and associated ambiguity restraints vary across replicas, protein partners that are associated in one replica are forced to dissociate in another. Pan et al. simulated long timescales in a global search space for a benchmark set of five targets on the special-purpose machine Anton [32, 33].
Their "tempered binding" protocol updates energy function parameters throughout the simulation: a soft-core van der Waals intermolecular potential is scaled so that long-lived states are dissociated more frequently, improving the

In contrast to MD approaches that target flexibility with Newtonian dynamics, Monte Carlo (MC) methods sample by random moves, often followed by minimization (MCM) [40, 41]. MC allows a wide variety of conformational move types to sample diverse conformations. MC algorithms have emulated the kinetic binding models, namely key-lock, conformer selection (CS) and induced-fit (IF) mechanisms [42, 43, 44]. The CS model chooses protein backbones from a pre-generated ensemble, so this approach has the advantage of docking one partner's conformations at a time. However, CS docking can fail if the ensemble is devoid of native-like backbone conformations [45].

docked from an ensemble of 10 structures). To diversify backbone conformations, the protocol generates monomer structures by three methods: (1) normal modes [46], (2) backrub motions [47], and (3) all-atom backbone refinement [48]. Further, to discriminate between near-native and non-native structures, they developed a more accurate coarse-grained energy function with 6-dimensional residue-pair data obtained

Since intrinsic fluctuations in proteins contribute to conformational change, some docking approaches utilize harmonic dynamics to capture protein backbone motions [49, 50, 51]. Normal modes of vibration represent internal motions of a protein based on a Hookean potential between close residues. Normal mode analysis (NMA) is incorporated in docking approaches such as ATTRACT [52], FiberDock [53], SwarmDock [54] and EigenHex [55]. To mimic induced fit, Schindler et al. developed iATTRACT [56] by moving interface residues in Cartesian coordinate space subject to NMA-generated harmonic potentials.
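The Hookean picture of normal modes can be made concrete with a small elastic network calculation. The sketch below builds an anisotropic-network-style Hessian from a set of Cα-like coordinates, connects every pair of points within a cutoff by an identical spring, diagonalizes the Hessian, and displaces the structure along a chosen mode. It is a generic textbook construction rather than the implementation used in ATTRACT, FiberDock, SwarmDock, EigenHex or iATTRACT; the 15 Å cutoff, the unit spring constant and all function names are assumptions made for illustration.

```python
import numpy as np


def anm_hessian(coords, cutoff=15.0, gamma=1.0):
    """Hessian of a harmonic network with springs between close points.

    coords : (n, 3) array of positions in Angstrom (e.g. C-alpha atoms)
    cutoff : pairs farther apart than this are not connected (assumed)
    gamma  : uniform spring constant (assumed)
    """
    n = len(coords)
    hess = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            d = coords[j] - coords[i]
            r2 = float(d @ d)
            if r2 > cutoff ** 2:
                continue
            block = -gamma * np.outer(d, d) / r2   # off-diagonal 3x3 block
            hess[3*i:3*i+3, 3*j:3*j+3] = block
            hess[3*j:3*j+3, 3*i:3*i+3] = block
            hess[3*i:3*i+3, 3*i:3*i+3] -= block    # diagonal accumulates
            hess[3*j:3*j+3, 3*j:3*j+3] -= block
    return hess


def normal_modes(coords, cutoff=15.0, gamma=1.0):
    """Eigenvalues (ascending) and eigenvectors (columns) of the Hessian.

    For a well-connected network the first six eigenvalues are ~0 and
    correspond to rigid-body translation and rotation.
    """
    return np.linalg.eigh(anm_hessian(coords, cutoff, gamma))


def displace_along_mode(coords, mode, amplitude):
    """Move every point along one mode vector by the given amplitude."""
    return coords + amplitude * mode.reshape(-1, 3)
```

In this picture the low-frequency modes, starting at index 6 after the rigid-body modes, describe the collective motions that the docking methods above sample, while the warning about higher-frequency Cartesian modes distorting bonds reflects the fact that nothing in this model constrains bond lengths or angles.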
iATTRACT served as a refinement stage and improved the fraction of native contacts predicted by 70%. For targets with unbound to bound interface RMSD over 4 Å, iATTRACT can achieve acceptable quality models [56]. Population-based methods such as particle swarm optimization (PSO) have also [57, 4]. Extending the swarm intelligence methods, the LightDock algorithm uses a "glowworm" swarm optimization to sample different backbone conformations in local regions of the protein surface with an anisotropic network model [58]. LightDock additionally uses multiscale modeling to combine all-atom and coarse-grained scoring functions. While normal modes have typically been used on individual protein partners prior to docking, Oliwa and Shen introduced complex NMA in docking to also sample fluctuations of the molecular complex [60]. By calculating modes of an encounter complex, this approach focuses on the binding region as it reduces the dimensionality of the search space [61]. One problem with NMA is that higher-frequency modes often distort protein bonds. To overcome this limitation, Frezza and Lavery developed the internal coordinate NMA (iNMA) approach to move in torsion angle space, that is, with fixed bond lengths and angles (Figure 4) [62]. With a reduced protein model in an internal coordinate space, they captured larger conformational changes from eigenvectors of low-frequency modes [59]**. iNMA can generate structures within 3 Å of the bound state when starting from the unbound for 39% of single-domain and 45% of multi-domain proteins in their benchmark.

Although protein folding has been one prime focus of deep learning methods in biology (e.g., AlphaFold [63, 64] and RaptorX [65]), in recent years, a few studies have explicitly addressed challenges relevant to protein docking [66]. Protein binding sites can be thought of as an information-rich molecular space that can be mined for elucidating protein interactions [67, 68, 69].
One approach is to use this information to create score functions for use with traditional docking approaches. For example, Geng et al. used graph representations to train a support vector machine (SVM) on native and non-native protein complex structures to develop a scoring potential (GraphRank) to rank docked poses [70]. iScore, which combines the GraphRank and HADDOCK [71] scores, achieved top performance in CAPRI scoring rounds (medium or high quality structures for nine out of 13 targets). Other teams have used deep learning techniques to identify protein interfaces by extrapolating image recognition tools to protein structures. RaptorX-ComplexContact [69] uses a deep residual neural network trained on single-chain proteins to predict contacts between binding partners, achieving the top contact prediction scores in CASP [72]. Another approach is to characterize interaction environments. Townshend et al. created "voxels," i.e., volumetric pixels with local atomic information for every protein surface residue, and with this 3D representation, they trained a deep 3D convolutional neural network (SASNet) on a curated database of bound protein complex structures [73]. Pittala et al. employed graph convolutions with the nodes representing the amino acid residues and edges connecting residues with a Cβ-Cβ distance under 10 Å [74]. They placed geometric and chemical features on both nodes and edges and used a graph neural network to predict epitopes and paratopes in antigen-antibody interfaces. In a unique approach by Gainza et al., a geometric deep learning model (MaSIF) used molecular interaction "fingerprints" calculated using geometric and chemical features of protein surfaces [14]** (2). Their deep network was composed of geodesic convolutional layers, and they used it to predict binding sites, evaluate alternate docked interfaces, and assess the likelihood of a given protein-protein interaction.
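The residue-graph representation described for the graph-convolution approach can be sketched in a few lines. The example below takes one Cβ coordinate per residue, connects residues whose Cβ-Cβ distance is under 10 Å, and applies a single mean-aggregation graph convolution to per-residue feature vectors. Only the node and edge definition comes from the text; the layer form, the ReLU nonlinearity, the feature and weight matrices and the function names are hypothetical stand-ins and should not be read as the architecture of Pittala et al.

```python
import numpy as np


def residue_graph(cb_coords, cutoff=10.0):
    """Boolean adjacency matrix linking residues closer than `cutoff`.

    cb_coords : (n, 3) array with one C-beta position per residue (Angstrom)
    Returns the adjacency matrix (no self-loops) and the distance matrix.
    """
    coords = np.asarray(cb_coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    adjacency = (dist < cutoff) & ~np.eye(len(coords), dtype=bool)
    return adjacency, dist


def graph_conv(features, adjacency, weight):
    """One generic graph convolution: average over self and neighbors,
    apply a linear map, then a ReLU. An illustrative layer only."""
    a_hat = adjacency + np.eye(len(adjacency))          # add self-loops
    a_hat = a_hat / a_hat.sum(axis=1, keepdims=True)    # row-normalize
    return np.maximum(a_hat @ features @ weight, 0.0)
```

In a trained model the weight matrix would be learned and several such layers stacked, with a final per-residue output interpreted as an interface probability; edge features such as the distances returned here would enter through a richer layer than this sketch includes.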
Relative to conventional rigid docking methods on protein targets, MaSIF-search can scan for true binders with similar accuracy but far faster (4 CPU-minutes vs. 45 hours for PatchDock and 93 days for ZDOCK to evaluate a benchmark of 100 bound protein complexes).

In a study to explore how neural networks might be used to generate structures with considerable backbone motion, Degiacomi trained an autoencoder with conformations from MD simulations, compressing the protein motion into a low-dimensional latent space [75]*. By training with simulations of both closed (bound) and apo conformations of a target protein, the autoencoder generated an intermediate closed-apo conformation at 0.8 Å RMSD [75] from the native state. However, when the autoencoder was trained only with open conformations, the generator could only create structures far from the closed state (over 4.2 Å), limiting the utility of this approach for blind docking. In an approach suitable for blind cases, Cao and Shen developed a Bayesian active learning (BAL) model to quantify uncertainty in protein structure quality, and then they extended their model to flexible protein docking [76]*. The Bayesian framework determines the posterior probability as it samples backbone conformations [60]. Flexibility is captured with low-frequency complex-NMA modes, and in principle it can be extended to higher frequencies that capture loop and hinge motions. Compared to ZDOCK [17] and PSO, BAL improves the interface RMSD of the near-native predictions by 0.5 Å.

In conjunction with experimental data, docking has advanced a range of biological and health applications (e.g., Alzheimer's disease [77], celiac disease [78], SARS-CoV-2 [79], influenza [80], cancer [81], and heart disease [82], to name just a few).
Over the past few years, docking success rates have improved on "difficult" blind prediction targets, but rates need to be higher for docking to be a reliable stand-alone tool in all cases. Clearly, a diverse and impressive array of tools has steadily advanced toward reliably capturing large conformational changes in protein docking. Docking will be even more impactful when the field finally overcomes this challenge.

[1] Recent Progress and Future Directions in Protein-Protein Docking
[2] CAPRI: A critical assessment of PRedicted interactions
[3] The challenge of modeling protein assemblies: the CASP12-CAPRI experiment
[4] Blind prediction of homo- and hetero-protein complexes: The CASP13-CAPRI experiment
[5] Modeling protein-protein, protein-peptide, and protein-oligosaccharide complexes: CAPRI
[6] Updates to the Integrated Protein-Protein Interaction Benchmarks: Docking Benchmark Version 5 and Affinity Benchmark Version 2
[7] Dockground: A comprehensive data resource for modeling of protein complexes
[15] ClusPro: A fully automated algorithm for protein-protein docking
[16] The ClusPro web server for protein-protein docking
[17] ZDOCK: An initial-stage protein-docking algorithm
[18] Interactive docking prediction of protein-protein complexes and symmetric multimers
[19] PIPER: An FFT-based protein docking program with pairwise potentials
[20] HexServer: An FFT-based protein docking server powered by graphics processors
[21] Efficient search for the possible mutual arrangements of two rigid bodies with the use of the generalized five-dimensional Fourier transform
[22] Prediction of protein-protein interactions by docking methods
[23] Protein-protein docking using region-based 3D Zernike descriptors
[24] Integrating Cross-Linking Experiments with Ab Initio Protein-Protein Docking
[25] Ultra-fast Filtering Using Small-Angle X-ray Scattering Data in Protein Docking
[26] Modeling of protein complexes in CAPRI Round 37 using template-based approach combined with model selection
[27] Performance and enhancement of the LZerD protein assembly pipeline in CAPRI 38-46
[28] Atomic-Level Characterization of the Structural Dynamics of Proteins
[29] Implicit flexibility in protein docking: crossdocking and local refinement
[30] Complete protein-protein association kinetics in atomic detail revealed by molecular dynamics simulations and Markov modelling
[31] Accelerated flexible protein-ligand docking using Hamiltonian replica exchange with a repulsive biasing potential. Hamiltonian replica exchange (H-REMD) modifies parts of the force field across different replicas. In this paper, a repulsive potential between receptor and ligand surface residues promotes transient dissociation on switching replicas, accelerating exploration of the protein surface to identify possible binding sites.
[32] Raising the Bar for Performance and Programmability in a Special-Purpose Molecular Dynamics Supercomputer
[33] Atomic-level characterization of protein-protein association. With long timescale MD simulations using a "tempered binding" protocol that scales a soft-core energy across replicas to promote dissociation of long-lived states, this work found that protein binding occurs through repeated association-dissociation events rather than prolonged in-contact exploration.
[34] Prediction of protein-protein complexes using replica exchange with repulsive scaling. Using a novel replica exchange scheme with variable van der Waals radii for interface residue atoms, the RS-REMD approach promotes dissociation in some replicas, which improves sampling for both global searches and refinement.
[35] Replica exchange with solute tempering: A method for sampling biological systems in explicit water
[36] Replica Exchange Improves Sampling in Low-Resolution Docking Stage of RosettaDock
[37] Umbrella sampling
[38] Funnel metadynamics as accurate binding free-energy method
[39] Holo-like and Druggable Protein Conformations from Enhanced Sampling of Binding Pocket Volume and Shape
[40] High-resolution protein-protein docking
[41] Sampling and scoring: A marriage made in heaven
[42] Protein-Protein Docking with Backbone Flexibility
[43] Conformer Selection and Induced Fit in Flexible Backbone Protein-Protein Docking Using Computational and NMR Ensembles
[44] Monte Carlo replica-exchange based ensemble docking of protein conformations
[45] Pushing the Backbone in Protein-Protein Docking
[46] Anisotropy of fluctuation dynamics of proteins with an elastic network model
[47] Backrub-like backbone simulation recapitulates natural protein conformational variability and improves mutant side-chain prediction
[48] Alternate states of proteins revealed by detailed energy landscape mapping
[49] Harmonic modes as variables to approximately account for receptor flexibility in ligand-receptor docking simulations: Application to DNA minor groove ligand complex
[50] Flexibility and Conformational Entropy in Protein-Protein Binding
[51] Accounting for conformational changes during protein-protein docking
[52] Flexible docking and refinement with a coarse-grained protein model using ATTRACT
[53] FiberDock: Flexible induced-fit backbone refinement in molecular docking
[54] SwarmDock and the use of normal modes in protein-protein docking
[55] Flexible protein docking refinement using pose-dependent normal mode analysis
[56] iATTRACT: Simultaneous global and local interface optimization for protein-protein docking refinement
[57] Enhanced sampling of protein conformational states for dynamic cross-docking within the protein-protein docking server SwarmDock. * A hybrid conformational-selection/induced-fit approach for dynamic crossdocking in SwarmDock, a particle swarm optimization algorithm. Ensembles are pre-generated with NMA and undergo cross-docking while sampling alternate protein conformations using low frequency normal modes.
[58] LightDock: A new multi-scale approach to protein-protein docking
[59] Internal Coordinate Normal Mode Analysis: A Strategy to Predict Protein Conformational Transitions. This work employs NMA in the internal coordinate space with a reduced protein model to capture large conformational changes of proteins with a faster compute time and no distortion of protein bonds.
[60] cNMA: A framework of encounter complex-based normal mode analysis to model conformational changes in protein interactions
[61] Predicting protein conformational changes for unbound and homology docking: learning from intrinsic and induced flexibility
[62] Internal normal mode analysis (iNMA) applied to protein conformational flexibility
[63] Protein structure prediction using multiple deep neural networks in the 13th Critical Assessment of Protein Structure Prediction (CASP13)
[64] Improved protein structure prediction using potentials from deep learning
[65] Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model
[66] Deep learning in protein structural modeling and design (2020)
[67] Recognition of functional sites in protein structures
[68] Protein interface prediction using graph convolutional networks
[69] ComplexContact: A web server for interprotein contact prediction using deep learning
[70] iScore: a novel graph kernel-based function for scoring protein-protein docking models
[71] HADDOCK: A protein-protein docking approach based on biochemical or biophysical information
[72] Critical assessment of methods of protein structure prediction (CASP)-Round XIII
[73] End-to-End Learning on 3D Protein Structure for Interface Prediction
[74] Learning context-aware structural representations to predict antigen and antibody binding interfaces
[75] Coupling Molecular Dynamics and Deep Learning to Mine Protein Conformational Space. This paper describes a unique method of generating plausible motions of a protein using a generative neural network (autoencoder).
[76] Bayesian Active Learning for Optimization and Uncertainty Quantification in Protein Docking. * With a framework to quantify uncertainty in docked models, the Bayesian approach uses a posterior distribution to guide sampling to likely low-energy conformations.
[77] From monomer to fibril: Abeta-amyloid binding to Aducanumab antibody studied by molecular dynamics simulation
[78] Plasma Cells Are the Most Abundant Gluten Peptide MHC-expressing Cells in Inflamed Intestinal Tissues From Patients With Celiac Disease
[79] DNA Aptamers Block the Receptor Binding Domain at the Spike Protein of SARS-CoV-2, chemRxiv
[80] Characterizing receptor flexibility to predict mutations that lead to human adaptation of influenza hemagglutinin
[81] Targeting the CoREST complex with dual histone deacetylase and demethylase inhibitors
[82] Protein docking and steered molecular dynamics suggest alternative phospholamban-binding sites on the SERCA calcium transporter

This work was supported by the National Institutes of Health through grant R01-GM078221. We thank Marc Lensink for generously providing us with data from CAPRI and Sai Pooja Mahajan and Sudhanshu Shanker for helpful comments on the manuscript.

Dr. Jeffrey J. Gray is an unpaid board member of the Rosetta Commons. Under institutional participation agreements between the University of Washington, acting on behalf of the Rosetta Commons, Johns Hopkins University may be entitled to a portion of revenue received on licensing Rosetta software including applications mentioned in this review. As a member of the Scientific Advisory Board, Dr. Gray has a financial interest in Cyrus Biotechnology. Cyrus Biotechnology distributes the Rosetta software, which may include methods mentioned in this review.