key: cord-0287777-fn7l93wh authors: Magar, Rishikesh; Yadav, Prakarsh; Farimani, Amir Barati title: Potential Neutralizing Antibodies Discovered for Novel Corona Virus Using Machine Learning date: 2020-03-20 journal: bioRxiv DOI: 10.1101/2020.03.14.992156 sha: 700c82ed5dd467d2c2551cf3602ec153d392ea47 doc_id: 287777 cord_uid: fn7l93wh

Fast and untraceable virus mutations take the lives of thousands of people before the immune system can produce an inhibitory antibody. The recent outbreak of the novel coronavirus has infected and killed thousands of people around the world. Rapid methods for finding peptides or antibody sequences that can inhibit the viral epitopes of COVID-19 will save thousands of lives. In this paper, we devised a machine learning (ML) model to predict possible inhibitory synthetic antibodies for the coronavirus. We collected 1933 virus-antibody sequences and their clinical patient neutralization response and trained an ML model to predict the antibody response. Using graph featurization with a variety of ML methods, we screened thousands of hypothetical antibody sequences and found 8 stable antibodies that potentially inhibit COVID-19. We combined bioinformatics, structural biology, and Molecular Dynamics (MD) simulations to verify the stability of the candidate antibodies that can inhibit the coronavirus.

The biomolecular process for recognition and neutralization of viral particles proceeds through viral antigen presentation and the recruitment of appropriate B cells to synthesize the neutralizing antibodies. 1 Theoretically, this process allows the immune system to stop any viral invasion, but the response is slow and often requires days or even weeks before an adequate immune response is achieved. 2, 3 This poses a challenging question: Can the process of antibody discovery be accelerated to counter highly infective viral diseases?
With the rapid expansion of available biological data, such as DNA/protein sequences and structures 4 , it is now possible to model and predict complex biological phenomena through machine learning (ML) approaches. Given sufficient training data, ML can be used to learn a mapping between the viral epitope and the effectiveness of its complementary antibody. 5 Once such a mapping is learnt, it can be used to predict a potentially neutralizing antibody for a given viral sequence 6 . ML can essentially learn the complex antigen-antibody interactions faster than the human immune system, leading to the generation of synthetic inhibitory antibodies that act as a bridge over the latency between viral infection and the human immune response. This bridge can potentially save many lives, especially during an outbreak or pandemic. One such instance is the spread of coronavirus disease (COVID-19) 7 . With its high infectivity and mortality rate, COVID-19 has become a global scare. 8, 9 To compound the problem, there are no proven therapeutics to aid the suffering patients 2, 8, 10-14 . The only viable treatment at the moment is symptomatic, and there is a desperate need to develop therapeutics to counter COVID-19.

Recently, the proteomic sequences of the 'WH-Human 1' coronavirus became available through metagenomic RNA sequencing of a patient in Wuhan. 4, 15 WH-Human 1 is 89.1% similar to a group of SARS-like coronaviruses. 4 With this sequence available, it is possible to find potential inhibitory antibodies by scanning thousands of antibody sequences and discovering the neutralizing ones 16-18 . However, discovering the inhibitory responses to the coronavirus in a timely manner in this way requires very expensive and time-consuming experimentation. In addition, computational and physics-based models require the bound crystal structure of the antibody-antigen complex; however, only a few of these structures have become available. 19, 20, 21, 22 In the case of COVID-19, the bound antigen-antibody crystal structure is not available to date 23, 24 . Given this challenge and the fact that ML models require a large amount of data, the ML approach should rely on the sequences of the antibody-antigen pairs rather than on crystal structures 25 .

In this paper, we collected a dataset comprising antibody-antigen sequences of a variety of viruses, including HIV, Influenza, Dengue, SARS, Ebola, Hepatitis, etc., with their patient clinical/biochemical IC50 data. Using this dataset (which we call VirusNet), we trained and benchmarked different shallow and deep ML models and selected the best-performing model. Based on the SARS 2006 neutralizing antibody scaffold 26 , we created thousands of potential antibody candidates by mutation and screened them with our best-performing ML model. Finally, molecular dynamics (MD) simulations were performed on the neutralizing candidates to check their structural stability. We predict 8 structures that were stable over the course of the simulation and are potential neutralizing antibodies for COVID-19. In addition, we interpreted the ML model to understand which alterations in the sequence of the antibody binding region would most effectively counter the viral mutation(s) and restore the ability of the antibody to bind to the virus 27 . This information is critical for antibody design and engineering and for reducing the dimension of the combinatorial mutations needed to find a neutralizing antibody.

The majority of the data in the training set is composed of HIV antibody-antigen complexes (1887 samples). Most of the samples for the HIV training set were obtained from the Compile, Analyze and Tally NAb Panels (CATNAP) database from the Los Alamos National Laboratory (LANL) 28, 29 . From CATNAP, data were collected for the monoclonal antibodies 2F5, 4E10 and 10E8, which bind to GP41 30-32 .
Using CATNAP's functionality for identifying epitope alignment, we selected the FASTA sequence of the antigen corresponding to the site of alignment in the antibody.

To make the dataset more diverse and train a more robust ML model, we included further available antibody-antigen sequences and their neutralization potential. To do this, we compiled the sequences of Influenza, Dengue, Ebola, SARS, Hepatitis, etc. 26,33-86 by searching the keywords "virus, antibody" on the RCSB server 87 and selected the neutralizing complexes by reading their corresponding publications. Furthermore, for each neutralizing complex, the contact residues at the interface of antibody and antigen were selected. To select the antigen contact sequences, all amino acids within 5 Å of the corresponding antibody were chosen (Supporting Information). To select the antibody contact sequences, all amino acids within 5 Å of the antigen were chosen. In total, 102 sequences of antibody-antigen complexes were mined and added to the 1831 samples, resulting in a total of 1933 training samples.

For an effective representation of the molecular structure of the amino acids, the individual atoms of the antibody and antigen amino acids were treated as an undirected graph, where the atoms are nodes and the bonds are edges 88 . It has been shown that a graph representation conveys the chemistry and topology of a molecular structure better than Extended Connectivity Fingerprints (ECFP) 88, 89 . We construct these molecular graphs using RDKit 90 . Embeddings are generated to encode relevant features of the molecular graph 91, 92 . These embeddings encode information such as the type of atom, the valency of an atom, hybridization state, aromaticity, etc. First, each antibody and antigen was encoded into a separate embedding, and the two were then concatenated into a single embedding for the entire antibody-antigen complex.
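The 5 Å contact criterion used above to pick interface residues can be illustrated with a minimal sketch. It assumes each atom is given as a (residue id, coordinates) pair and that a residue counts as a contact if any of its atoms lies within the cutoff of any atom of the partner chain; the function name `contact_residues`, the data layout, and the toy coordinates are hypothetical and are not taken from the authors' pipeline.

```python
import math

CUTOFF_ANGSTROM = 5.0  # contact distance quoted in the text


def contact_residues(chain_atoms, partner_atoms, cutoff=CUTOFF_ANGSTROM):
    """Return sorted residue ids of chain_atoms that touch partner_atoms.

    Each atom is a tuple (residue_id, (x, y, z)) with coordinates in Å.
    A residue is kept if any of its atoms is within `cutoff` of any
    partner atom (assumed interpretation of "within 5 Å").
    """
    hits = set()
    for res_id, xyz in chain_atoms:
        if res_id in hits:
            continue  # residue already selected through another atom
        if any(math.dist(xyz, other) <= cutoff for _, other in partner_atoms):
            hits.add(res_id)
    return sorted(hits)


# Toy, made-up coordinates (not from any PDB entry).
antibody = [
    (1, (0.0, 0.0, 0.0)),
    (1, (1.5, 0.0, 0.0)),
    (2, (12.0, 0.0, 0.0)),
    (3, (4.0, 3.0, 0.0)),
]
antigen = [
    (10, (0.0, 4.0, 0.0)),
    (11, (30.0, 0.0, 0.0)),
]

antibody_contacts = contact_residues(antibody, antigen)  # paratope side
antigen_contacts = contact_residues(antigen, antibody)   # epitope side
print(antibody_contacts, antigen_contacts)
```

Running the same function in both directions mirrors the text, where the antigen contacts and the antibody contacts are selected separately with the same cutoff. For real structures a spatial index would be preferable to this all-pairs loop.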
We then apply mean pooling over the features of this concatenated embedding to ensure dimensional consistency across the training data. The pooled information is then passed to classifier algorithms such as XGBoost 93 , Random Forest 94 , Multilayer Perceptron, Support Vector Machine (SVM) 95 and Logistic Regression, which predict whether the antibody is capable of neutralizing the virus.

In order to find potential antibody candidates for COVID-19, 2589 different mutant antibody sequences were generated based on the sequences of SARS neutralizing antibodies. We selected these antibodies as initial scaffolds because the genome of COVID-19 4 is 79.8% identical to the "Tor2" isolate of SARS (Accession number: AY274119) 96 . To find the binding region of these antibodies for sequence generation, all amino acids within 5 Å of their respective antigen were chosen. To assess the biological feasibility of these mutant sequences, we scored each mutation using the BLOSUM62 matrix 97 .

To assess the stability of the proposed antibody structures, we performed molecular dynamics (MD) simulations of each antibody structure in a solvated environment 98 . The simulation of the solvated antibody was carried out using GROMACS-5.1.4 99-101 , and topologies for each antibody were generated according to the GROMOS 54a7 102 force field. The protein was centered in a box extending 1 nanometer from the surface of the protein. This box was then solvated with SPC216 model water molecules, pre-equilibrated at 300 K. The antibody systems generally carried a net positive charge, which was neutralized with counterions. Energy minimization was carried out using the steepest descent algorithm while restraining the peptide backbone, to remove steric clashes between atoms and subsequently optimize the solvent molecule geometry. The stopping criteria for this minimization were a maximum force below 100.0 kJ/mol/nm or a number of steps exceeding 50,000.
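The mutant generation and BLOSUM62 scoring steps described above can be sketched as follows. This is only an illustration under stated assumptions: the dictionary holds a small excerpt of BLOSUM62 entries, the two-residue "binding region", the candidate alphabet, and the `min_score` threshold are all hypothetical, and the text does not say how the scores were turned into an accept/reject decision.

```python
# Excerpt of the symmetric BLOSUM62 matrix; only the pairs used below.
BLOSUM62_EXCERPT = {
    ("L", "I"): 2, ("L", "M"): 2, ("L", "D"): -4, ("L", "R"): -2,
    ("K", "I"): -3, ("K", "M"): -1, ("K", "D"): -1, ("K", "R"): 2,
}


def blosum_score(wild_type, new_residue, table=BLOSUM62_EXCERPT):
    """Look up a substitution score, using the symmetry of the matrix."""
    if (wild_type, new_residue) in table:
        return table[(wild_type, new_residue)]
    return table[(new_residue, wild_type)]


def point_mutants(seq, alphabet):
    """Yield (mutant, position, wild_type, new_residue) for each single substitution."""
    for i, wild_type in enumerate(seq):
        for residue in alphabet:
            if residue != wild_type:
                yield seq[:i] + residue + seq[i + 1:], i, wild_type, residue


def feasible_mutants(seq, alphabet, min_score=0):
    """Keep point mutants whose substitution score is at least min_score.

    The threshold is a hypothetical filter chosen for illustration.
    """
    kept = []
    for mutant, _, wild_type, residue in point_mutants(seq, alphabet):
        score = blosum_score(wild_type, residue)
        if score >= min_score:
            kept.append((mutant, score))
    return kept


# Toy binding region "LK" mutated over a four-letter alphabet.
candidates = feasible_mutants("LK", "IMDR")
print(candidates)
```

Conservative substitutions such as L to I or K to R receive positive scores and survive the filter, while substitutions such as L to D are penalized. A full implementation would load the complete 20 x 20 matrix rather than an excerpt.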
This minimized structure was then subjected to two rounds of equilibration at 300 K. The first was an NVT ensemble for 50 picoseconds with a 2-femtosecond time step. A leapfrog dynamics integrator was used with the Verlet scheme, and the neighbor list was updated every 10 steps. All ensembles were under periodic boundary conditions, and harmonic constraints were applied by the LINCS algorithm 103 ; under this scheme, the long-range electrostatic interactions were computed by the Particle Mesh Ewald (PME) algorithm 104 . The Berendsen thermostat was used for temperature coupling, and pressure coupling was done using the Parrinello-Rahman barostat 105, 106 . The last round, an NPT simulation, ensures that the simulated system is at physiological temperature and pressure. The system volume was free to change in the NPT ensemble but did not change significantly during the course of the simulation. Following the rounds of equilibration, the production run for the system was carried out in the NPT ensemble without constraints for a total of 15 nanoseconds, under identical simulation parameters.

The flowchart of COVID-19 antibody discovery using ML has four major steps.

The out-of-class results demonstrate that our model is capable of generalizing its predictions to a completely novel virus epitope. Since COVID-19 is a completely new virus, we can conclude that our model's prediction performance will be accurate. The fact that our model's prediction accuracy is 100% on the SARS out-of-class test demonstrates its capability to effectively predict antibodies for COVID-19, which is from the SARS family.

To be more comprehensive, we created co-mutations out of 5 stable point mutations (C3, C7, C14, C17, C18; see Table S1 in Supporting Information for the list of all 18 candidates). This resulted in 5 new structures (Co1, Co2, Co3, Co4, Co5 in Table S2) that were screened for neutralization using XGBoost. Of the 5 co-mutations, only Co5 was predicted not to neutralize.
To check the stability of the 4 neutralizing co-mutations, MD simulations were performed, and Co1, Co2 and Co4 were found to be stable (Figure 4b). The list of the final 8 stable mutations and co-mutations is tabulated in Table 1, and the PDB structures are available as Supporting Information.

We have developed a machine learning model for high-throughput screening of synthetic antibodies to discover antibodies that potentially inhibit COVID-19. Our approach can be widely applied to other viruses for which only the sequences of viral coat protein-antibody pairs can be obtained. The ML models were trained on 14 different virus types and achieved over 90% fivefold test accuracy. The out-of-class prediction accuracy is 100% for SARS and 84.61% for Influenza, demonstrating the power of our model for predicting antibody neutralization of novel viruses like COVID-19. Using this model, the neutralization of thousands of hypothetical antibodies was predicted, and 18 antibodies were found to be highly efficient in neutralizing COVID-19. Using MD simulations, the stability of the predicted antibodies was checked, and 8 stable antibodies were found that can neutralize COVID-19. In addition, the interpretation of the ML model revealed that mutating to Methionine and Tyrosine is highly efficient in enhancing antibody affinity.

This work is supported by the Center for Machine Learning in Health (CMLH) at Carnegie Mellon University (https://www.cs.cmu.edu/cmlh-cfp) and a start-up fund from the Mechanical Engineering Department at CMU. The authors would like to thank Prof. Reeja Jayan for her support and Junhan Li for his help in collecting the data.
The RMSD and contact distance plots for all the trajectories versus time, the structure of the virus-antibody complex and the residues at the contact region, native contacts in the antigen-antibody complex, the interaction of the COVID-19 epitope with the 2GHW antibody, tables of all neutralizing point mutations and co-mutations and their neutralization potentials, the structures of the stable antibodies in PDB format, and the IC50 data interpretation are available online. The VirusNet dataset will be available upon request.

Designing antibodies or peptide sequences that can inhibit the COVID-19 virus requires high-throughput experimentation on vastly mutated sequences of potential inhibitors. Screening thousands of available antibody strains is prohibitively expensive and not feasible due to the lack of available structures. However, machine learning models can enable the rapid and inexpensive exploration of a vast sequence space on the computer in a fraction of a second. We collected 1933 virus-antibody sequences with clinical patient IC50 data. Graph featurization of antibody-antigen sequences creates a unique molecular representation. Using the graph representation, we benchmarked a variety of shallow and deep learning models and selected XGBoost because of its superior performance and interpretability. We trained our model using a dataset of 1,933 diverse virus epitopes and their antibodies. To generate the hypothetical antibody library, we mutated the SARS scaffold antibody of 2006 (PDB:2GHW) and generated thousands of possible candidates. Using the ML model, we classified these sequences and selected the top 18 sequences that are predicted with high confidence to neutralize COVID-19. We used MD simulations to check the stability of the 18 sequences and rank them based on their stability.
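The RMSD-based stability check referred to above can be sketched in a few lines. This is a simplified illustration, not the authors' analysis: it computes RMSD without first superimposing each frame on the reference (trajectory tools normally perform that fit), and the plateau test in `is_stable`, with its `window` and `tolerance` parameters, is a hypothetical criterion, since the text reports stability from inspection of RMSD plots.

```python
import math


def rmsd(frame, reference):
    """Root-mean-square deviation between two equally long lists of (x, y, z).

    No rotational or translational fit is applied in this sketch.
    """
    if len(frame) != len(reference):
        raise ValueError("frame and reference must have the same number of atoms")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(frame, reference))
    return math.sqrt(squared / len(reference))


def is_stable(rmsd_series, window=3, tolerance=0.05):
    """Call a trajectory stable if its last `window` RMSD values plateau.

    The spread (max - min) of the tail must not exceed `tolerance`,
    expressed in the same unit as the RMSD values (assumed criterion).
    """
    tail = rmsd_series[-window:]
    return max(tail) - min(tail) <= tolerance


# Toy two-atom structure and a rigidly shifted copy (made-up numbers).
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(0.0, 3.0, 4.0), (1.0, 3.0, 4.0)]
print(rmsd(shifted, reference))

# Toy RMSD time series: one that levels off and one that keeps drifting.
plateau = [0.10, 0.21, 0.25, 0.26, 0.25]
drifting = [0.10, 0.30, 0.50, 0.70]
print(is_stable(plateau), is_stable(drifting))
```

Ranking candidates by stability could then use, for example, the mean RMSD over the plateau window, although the ranking measure actually used is not specified in the text.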
Antibodies and B Cell Memory in Viral Immunity Development and Clinical Application of A Rapid IgM-IgG Combined Antibody Test for SARS-CoV-2 Infection Diagnosis The MHC Class I Antigen Presentation Pathway: Strategies for Viral Immune Evasion A New Coronavirus Associated with Human Respiratory Disease in China A Deep Learning Approach to Antibiotic Discovery Continuous cultures of fused cells secreting antibody of predefined specificity The Species Severe Acute Respiratory Syndrome-Related Coronavirus : Classifying 2019-NCoV and Naming It SARS-CoV-2 Understanding SARS-CoV-2-Mediated Inflammatory Responses: From Mechanisms to Potential Therapeutic Tools Secondary attack rate and superspreading events for SARS-CoV-2 -The Lancet Receptor ACE2 in Different Populations Potential Therapeutic Agents for COVID-19 Based on the Analysis of Protease and RNA Polymerase Docking SARS-CoV-2 Cell Entry Depends on ACE2 and TMPRSS2 and Is Blocked by a Clinically Proven Protease Inhibitor Virus against Virus: A Potential Treatment for 2019-NCov (SARS-CoV-2) and Other RNA Viruses Potent Binding of 2019 Novel Coronavirus Spike Protein by a SARS Coronavirus-Specific Human Monoclonal Antibody Genomic Characterisation and Epidemiology of 2019 Novel Coronavirus: Implications for Virus Origins and Receptor Binding Virtual Screening of the Inhibitors Targeting at the Viral Protein 40 of Ebola Virus Initiating a Watch List for Ebola Virus Antibody Escape Mutations Structures of Protective Antibodies Reveal Sites of Vulnerability on Ebola Virus Advances in Antibody Design Computational predictions of protein structures associated with COVID-19 /research/opensource/computational-predictions-of-protein-structures-associated-with-COVID-19 Combining Physics-Based and Evolution-Based Methods to Design Antibodies Against an Evolving Virus Computational Approach to Designing Antibody for Ebola Virus Structural Basis for the Recognition of the SARS-CoV-2 by Full-Length Human ACE2 Novel antibody 
epitopes dominate the antigenicity of spike glycoprotein in SARS-CoV-2 compared to SARS-CoV JCI Insight -Predicting the broadly neutralizing antibody susceptibility of the HIV reservoir Structural Basis of Neutralization by a Human Anti-Severe Acute Respiratory Syndrome Spike Protein Antibody, 80R Protein Structure and Sequence Re-Analysis of 2019-NCoV Genome Does Not Indicate Snakes as Its Intermediate Host or the Unique Similarity between Its Spike Protein Insertions and HIV CATNAP: A Tool to Compile, Analyze and Tally Neutralizing Antibody Panels CATNAP Tools Structure and Mechanistic Analysis of the Anti-Human Immunodeficiency Virus Type 1 Antibody 2F5 in Complex with Its Gp41 Epitope Optimization of the Solubility of HIV-1-Neutralizing Antibody 10E8 through Somatic Variation and Structure-Based Design Crystallographic Identification of Lipid as an Integral Component of the Epitope of HIV Broadly Neutralizing Antibody 4E10 A Complex of Influenza Hemagglutinin with a Neutralizing Antibody That Binds Outside the Virus Receptor Binding Site A Conformational Switch in Human Immunodeficiency Virus Gp41 Revealed by the Structures of Overlapping Epitopes Recognized by Neutralizing Antibodies A Highly Conserved Neutralizing Epitope on Group 2 Influenza A Viruses A Multiply Substituted G-H Loop from Foot-and-Mouth Disease Virus in Complex with a Neutralizing Antibody: A Role for Water Molecules Lanzavecchia, A. 
A Neutralizing Antibody Selected from Plasma Cells That Binds to Group 1 and Group 2 Influenza A Hemagglutinins A Potent and Broad Neutralizing Antibody Recognizes and Penetrates the HIV Glycan Shield A Shared Structural Solution for Neutralizing Ebolaviruses An Antibody That Prevents the Hemagglutinin Low PH Fusogenic Transition An Epidemiologically Significant Epitope of a 1998 Human Influenza Virus Neuraminidase Forms a Highly Hydrated Interface in the NA-Antibody Complex Antibody Recognition of a Highly Conserved Influenza Virus Epitope Antigen Distortion Allows Influenza Virus to Escape Neutralization Binding of a Neutralizing Antibody to Dengue Virus Alters the Arrangement of Surface Glycoproteins Broadly Neutralizing Anti-Hepatitis B Virus Antibody Reveals a Complementarity Determining Region H3 Lid-Opening Mechanism Complex of a Protective Antibody with Its Ebola Virus GP Peptide Epitope: Unusual Features of a Vλx Light Chain Computation-Guided Backbone Grafting of a Discontinuous Motif onto a Protein Scaffold Cross-Neutralization of Influenza A Viruses Mediated by a Single Antibody Loop Cryo-EM Structure of a Fully Glycosylated Soluble Cleaved HIV-1 Envelope Trimer Crystal Structure of a Hydrophobic Immunodominant Antigenic Site on Hepatitis C Virus Core Protein Complexed to Monoclonal Antibody 19D9D6 Crystal Structure of a Human Immunodeficiency Virus Type 1 Neutralizing Antibody, 50.1, in Complex with Its V3 Loop Peptide Antigen Crystal Structure of Dimeric HIV-1 Capsid Protein Crystal Structures of Human Immunodeficiency Virus Type 1 (HIV-1) Neutralizing Antibody 2219 in Complex with Three Different V3 Peptides Reveal a New Binding Mode for HIV-1 Cross-Reactivity Crystallographic Definition of the Epitope Promiscuity of the Broadly Neutralizing Anti-Human Immunodeficiency Virus Type 1 Vaccine Design Implications Design and Characterization of Epitope-Scaffold Immunogens That Present the Motavizumab Epitope from Respiratory Syncytial Virus Distinct 
Conformational States of HIV-1 Gp41 Are Recognized by Neutralizing and Non-Neutralizing Antibodies Focused Evolution of HIV-1 Neutralizing Antibodies Revealed by Structures and Deep Sequencing Germline V-Genes Sculpt the Binding Site of a Family of Antibodies Neutralizing Human Cytomegalovirus Increasing the Potency and Breadth of an HIV Antibody by Using Structure-Based Rational Design Mechanism of Dengue Virus Broad Cross-Neutralization by a Monoclonal Antibody NMR Structure of an Anti-Gp120 Antibody Complex with a V3 Peptide Reveals a Surface Important for Co-Receptor Binding Structural Analysis of a Dengue Cross-Reactive Antibody Complexed with Envelope Domain III Reveals the Molecular Basis of Cross-Reactivity Structural Bases of Coronavirus Attachment to Host Aminopeptidase N and Its Inhibition by Neutralizing Antibodies Structural Basis for Broad and Potent Neutralization of HIV-1 by Antibody VRC01. Science Structural Basis for Diverse N-Glycan Recognition by HIV-1-Neutralizing V1-V2-Directed Antibody PG16 Structural Basis for HIV-1 Neutralization by a Gp41 Fusion Intermediate-Directed Antibody Structural Basis for the Antibody Neutralization of Herpes Simplex Virus Structural Basis for the Binding of the Neutralizing Antibody, 7D11, to the Poxvirus L1 Protein Structural Basis for the Neutralization and Genotype Specificity of Hepatitis E Virus Structural Basis for the Preferential Recognition of Immature Flaviviruses by a Fusion-Loop Antibody Structural Basis of Differential Neutralization of DENV-1 Genotypes by an Antibody That Recognizes a Cryptic Epitope Structural Basis of Hepatitis C Virus Neutralization by Broadly Neutralizing Antibody HCV1 Structural Basis of West Nile Virus Neutralization by a Therapeutic Antibody Structural Evidence for Recognition of a Single Epitope by Two Distinct Antibodies Structural Insights into Immune Recognition of the Severe Acute Respiratory Syndrome Coronavirus S Protein Receptor Binding Domain Structural Insight into 
Distinct Mechanisms of Protease Inhibition by Antibodies Structural Insights into the Neutralization Mechanism of a Higher Primate Antibody against Dengue Virus Structure of a Core Fragment of Glycoprotein H from Pseudorabies Virus in Complex with Antibody Structure of a Major Antigenic Site on the Respiratory Syncytial Virus Fusion Glycoprotein in Complex with Neutralizing Antibody 101F Structure of Severe Acute Respiratory Syndrome Coronavirus Receptor-Binding Domain Complexed with Neutralizing Antibody Structure of the Ebola Virus Glycoprotein Bound to an Antibody from a Human Survivor Structure of Simian Immunodeficiency Virus Envelope Spikes Bound with CD4 and Monoclonal Antibody 36D5 Structures of the CCR5 N Terminus and of a Tyrosine-Sulfated Antibody with HIV-1 Gp120 and CD4 The Structure of a Complex between the NC10 Antibody and Influenza Virus Neuraminidase and Comparison with the Overlapping Binding Site of the NC41 Unexpected Receptor Functional Mimicry Elucidates Activation of Coronavirus Fusion The Protein Data Bank Convolutional Networks on Graphs for Learning Molecular Fingerprints Extended-Connectivity Fingerprints MoleculeNet: A Benchmark for Molecular Machine Learning XGBoost: A Scalable Tree Boosting System KDD '16 Enhanced SVM-KPCA Method for Brain MR Image Classification Analysis of Multimerization of the SARS Coronavirus Nucleocapsid Protein Amino Acid Substitution Matrices from Protein Blocks Molecular Dynamics Simulations of Biomolecules GROMACS: A Message-Passing Parallel Molecular Dynamics Implementation High Performance Molecular Simulations through Multi-Level Parallelism from Laptops to Supercomputers Definition and Testing of the GROMOS Force-Field Versions 54A7 and 54B7 LINCS: A Linear Constraint Solver for Molecular Simulations Particle Mesh Ewald: An N⋅log(N) Method for Ewald Sums in Large Systems Hot-Solvent/Cold-Solute" Problem Revisited Simulations of proteins with inhomogeneous degrees of freedom: The effect of thermostats 
-Mor -2008 Coot: model-building tools for molecular graphics A Modern Open Library for the Analysis of Molecular Dynamics Trajectories Native contacts determine protein folding mechanisms in atomistic simulations The authors gratefully acknowledge the use of the supercomputing resource Arjuna provided by the Pittsburgh Supercomputing Center (PSC). This work is supported by Center for Machine Learning