key: cord-0486715-0b77duc8 authors: Hofmarcher, Markus; Mayr, Andreas; Rumetshofer, Elisabeth; Ruch, Peter; Renz, Philipp; Schimunek, Johannes; Seidl, Philipp; Vall, Andreu; Widrich, Michael; Hochreiter, Sepp; Klambauer, Gunter title: Large-scale ligand-based virtual screening for SARS-CoV-2 inhibitors using deep neural networks date: 2020-03-25 journal: nan DOI: nan sha: f601b499ccc36b3c8982aefff48a8e2bfec00d5c doc_id: 486715 cord_uid: 0b77duc8

Due to the current severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic, there is an urgent need for novel therapies and drugs. We conducted a large-scale virtual screening for small molecules that are potential CoV-2 inhibitors. To this end, we utilized "ChemAI", a deep neural network trained on more than 220M data points across 3.6M molecules from three public drug-discovery databases. With ChemAI, we screened and ranked one billion molecules from the ZINC database for favourable effects against CoV-2. We then reduced the result to the 30,000 top-ranked compounds, which are readily accessible and purchasable via the ZINC database. Additionally, we screened the DrugBank using ChemAI to allow for drug repurposing, which would be a fast way towards a therapy. We provide these top-ranked compounds of ZINC and DrugBank as a library for further screening with bioassays at https://github.com/ml-jku/sars-cov-inhibitors-chemai.

Introduction. Due to the current world-wide crisis of SARS-CoV-2 virus infections, there is a strong need for new therapies. While many efforts are focused on repurposing existing drugs (Wang et al., 2020; Ton et al., 2020), we suggest testing new molecules with potentially higher efficacy. Therefore, we performed a large-scale ligand-based virtual screening run, which resulted in 30,000 potential SARS-CoV-2 inhibitors with favorable properties. We invite the scientific community to test these molecules and to consider them as a custom-designed chemical library.
Most current virtual screens are structure-based and use docking methods (Huang et al., 2020; Haider et al., 2020; Wang et al., 2020; Fischer et al., 2020; Chen et al., 2020; Ton et al., 2020; Senathilake et al., 2020; Ruan et al., 2020; Jin et al., 2020; Zhang et al., 2020; Gorgulla et al., 2020), while only one screen is ligand-based and uses a similarity-based approach (Zhu et al., 2020). The largest docking studies screen databases with sizes ranging from roughly 700 million (Fischer et al., 2020) to 1.3 billion (Ton et al., 2020) molecules. Our study operates on databases of similar size: we perform a ligand-based virtual screening of a collection of one billion molecules from the ZINC database.

Deep ligand-based virtual screening. "ChemAI" is a deep neural network trained to simultaneously predict a large number of biological effects (Mayr et al., 2018; Preuer et al., 2019). In more detail, the network is of the type SmilesLSTM (Mayr et al., 2018; Hochreiter & Schmidhuber, 1997) and was trained on a data set comprising ChEMBL (Gaulton et al., 2017), ZINC (Sterling & Irwin, 2015) and PubChem (Kim et al., 2016), which is similar to the data set used by Preuer et al. (2018). ChemAI predicts 6,269 biological outcomes, such as binding to targets, inhibitory or toxic effects. The network was trained in a multi-task setting, in which data from other bioassays was used to enhance the predictive power for SARS-CoV inhibitory effects. Each modelled biological effect is represented by an output neuron of the neural network. To rank compounds, we utilized a small set of output neurons associated with SARS-CoV inhibition and a set of output neurons associated with toxic effects. We screen the ZINC database because it contains a large set of diverse molecules and additionally provides links to vendors from which those molecules can be purchased and physically obtained.
We downloaded 898,196,375 molecules from ZINC and converted them to canonical SMILES (Weininger, 1988) using RDKit (Landrum, 2006). We then performed inference with ChemAI to obtain predictions for each of those roughly one billion molecules.

Table 1. Overview of the main biological effects considered for ranking the molecules of the virtual screen. "#actAll" and "#inact" report the number of actives and inactives in the training set, respectively. All assays are based on inhibition of proteins of SARS-CoV-1.

Selecting bioassays for multiple targets of SARS-CoV. SARS-CoV-2 has two main proteases that are critical for its replication, namely 3CLpro (3C-like protease) and PLpro (papain-like protease), encoded in an open reading frame (Macchiagodena et al., 2020). A compound that inhibits both proteases could be a promising drug candidate (Ledford, 2009; Collison, 2019). The virus proteases are also strikingly similar to those in SARS-CoV-1 (Macchiagodena et al., 2020), a similarity that docking-based approaches also assume implicitly. We therefore select two groups of assays, one of which measures the inhibition of 3CLpro and the other the inhibition of PLpro (see Table 1). For each of the four selected assays, ChemAI possesses an output unit, which models the ability of small molecules to exhibit the effect measured by the assay. Thus, using the predictions yielded by ChemAI, it is possible to rank compounds by their predicted ability to inhibit the two main proteases of SARS-CoV-1, which can serve as a proxy for the inhibitory potential against SARS-CoV-2.

Consensus ranking. We developed a library of compounds that is enriched for molecules with the ability to inhibit both proteases of SARS-CoV-2. To score the multi-target effect, we calculated a consensus score for each molecule as the average rank of the predictions over the four selected assays (see Table 1). We then ranked all compounds by this consensus score.
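To make the consensus score concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that rank 1 denotes the highest predicted activity within an assay, that tied predictions share their average rank, and that a lower consensus score is better; the text states none of these conventions. The assay names and prediction values are hypothetical toy data.

```python
from statistics import mean


def rank_descending(scores):
    """Return 1-based ranks (1 = highest score); ties share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        # Extend 'end' over the run of molecules tied with the one at 'pos'.
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    return ranks


def consensus_scores(predictions):
    """Average each molecule's rank over all assays.

    predictions maps an assay name to a list of predicted activities,
    with molecules in the same order for every assay.
    """
    per_assay_ranks = [rank_descending(p) for p in predictions.values()]
    return [mean(molecule_ranks) for molecule_ranks in zip(*per_assay_ranks)]


# Hypothetical predictions for three molecules on four assays.
predictions = {
    "3CLpro_assay_A": [0.9, 0.2, 0.5],
    "3CLpro_assay_B": [0.8, 0.1, 0.6],
    "PLpro_assay_A": [0.7, 0.3, 0.4],
    "PLpro_assay_B": [0.2, 0.9, 0.5],
}

scores = consensus_scores(predictions)
# Molecule indices ordered from best (lowest average rank) to worst.
ranking = sorted(range(len(scores)), key=scores.__getitem__)
print(ranking)  # [0, 2, 1]
```

Averaging ranks rather than raw predictions keeps assays with differently scaled outputs comparable, which is one plausible reason for choosing a rank-based consensus.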
For each of the top-ranked compounds, we also calculated the minimal distance to actives in the training set to be able to identify novel chemical structures. Furthermore, for each compound we also report its number of potential toxic effects (Mayr et al., 2016). For the distance metric, we used the Jaccard distance based on binary ECFP4 fingerprints folded to a length of 1024, which yields values in the interval [0, 1]. For potential toxic effects, we used 75 output units of ChemAI with high predictive quality, concretely an area under the ROC curve (AUC) larger than 0.80, and counted how many of those output units indicated a toxic effect. This value is reported in Table 2 (column "tox"). Furthermore, we report the clinical toxicity probability predicted by an independent multi-task neural network fitted on the ClinTox dataset (Wu et al., 2018). These probability values are calibrated by Platt scaling (Platt et al., 1999) and reported in Table 2 (column "ct"). The additional information contained in these values can be used to obtain a refined ranking for testing the molecules.

We implemented the overall process as a two-step approach. In the first step, we reduced the ZINC database of one billion molecules to a smaller set, where we kept all molecules that exhibited some predicted activity on any of the four assays (precisely, at least one of the predictions had to reside in the top-1% quantile). In this way, we obtained an intermediate dataset of 5,672,501 molecules. For those molecules, the consensus score, the toxicity flags and the distance to known actives were calculated. In the second step, we reduced the dataset to the top-ranked 30,000 molecules by the consensus score.

Results. With the above-mentioned approach, we assembled a library of potential inhibitors of SARS-CoV-2. We report three metrics for each compound: a) predicted inhibitory effect on SARS-CoV proteases, b) potential toxicities, and c) distance to known actives.
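As an illustration of metric c), the minimal Jaccard distance to known actives can be sketched in Python as below. This is a sketch under stated assumptions, not the authors' code: each binary fingerprint is represented as the set of its on-bit indices, two empty fingerprints are treated as identical, and the bit indices are hypothetical toy values. In practice the on-bit sets might come from a cheminformatics toolkit such as RDKit; the text names RDKit for SMILES canonicalization but does not say how the fingerprints were computed.

```python
def jaccard_distance(fp_a, fp_b):
    """Jaccard distance between two binary fingerprints.

    Each fingerprint is a set of on-bit indices. The result lies in [0, 1]:
    0 for identical bit sets, 1 for bit sets with no bit in common.
    """
    union = fp_a | fp_b
    if not union:
        # Assumption: two all-zero fingerprints count as identical.
        return 0.0
    return 1.0 - len(fp_a & fp_b) / len(union)


def min_distance_to_actives(query_fp, active_fps):
    """Distance from a query fingerprint to its nearest known active."""
    if not active_fps:
        raise ValueError("at least one active fingerprint is required")
    return min(jaccard_distance(query_fp, fp) for fp in active_fps)


# Hypothetical on-bit indices of 1024-bit fingerprints.
query = {1, 5, 9, 200}
actives = [{1, 5, 9, 300}, {2, 3}]

# The first active shares 3 of 5 distinct bits with the query,
# the second shares none, so the first one is the nearest.
print(min_distance_to_actives(query, actives))
```

A value close to 1 indicates a compound that is structurally far from every known active, which is how the "dist" column can be used to spot novel chemical structures.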
This led to a ranked list of compounds of which we provide the top 30,000 as a screening library. The top-ranked molecules are given in Table 2 and Figure 1. We also checked whether molecules suggested by other publications can be confirmed by ChemAI. Overall, some suggested molecules show at least mild predicted activity against SARS-CoV (see Table 3).

Table 2. Top-ranked molecules by ChemAI. All compounds have a high activity predicted on all four assays (column "score") and are relatively distant (column "dist") to currently known inhibitors. The distance measure is the Jaccard distance based on binary ECFP4 fingerprints and resides in the interval [0, 1]. Some of the presented molecules might exhibit a number of toxic effects (column "tox"). Here the number of models indicating a toxic effect is reported, where the total number of toxicity models was 75. We also report the estimated probability to exhibit clinical toxicity (column "ct").

Drug repurposing of DrugBank molecules. With the same procedure employed for ZINC, we also screened the DrugBank (Wishart et al., 2018), a database of ≈10,000 drugs. Again, we predicted inhibitory effects on the two viral proteases using ChemAI, and then calculated a consensus score for each drug. In this way, we obtained a ranked list of drugs that could be potential SARS-CoV inhibitors and could offer a fast route to therapies via drug repurposing (Ashburn & Thor, 2004). We provide this list freely to research institutions (see Availability).

Discussion. In this work, we presented the construction of a screening library of small molecules that are potential inhibitors of SARS-CoV-2. Our ligand-based approach uses a neural network trained to predict the outcomes of bioassays. From this multi-task model, four tasks were selected to predict the inhibitory potential against SARS-CoV-1. A consensus between these predictions was used to rank compounds from the ZINC database, of which the 30,000 top-ranked are reported.
The approach is limited by the predictive quality of the underlying machine learning method, which, evaluated via AUC, reaches values in the range of 0.69 to 0.78. While these results are very promising, improved data quality, larger amounts of data, or improved machine-learning approaches could lead to increased predictive performance and quality of the library. A promising direction is also to enrich the representation of molecules via already available biological modalities (Simm et al., 2018; Hofmarcher et al., 2019). We expect that the data for SARS-CoV-1 already has high predictive power for inhibitory effects of compounds on SARS-CoV-2. However, the current predictions can be further adjusted toward SARS-CoV-2 via transfer learning and the incorporation of new data from SARS-CoV-2. In particular, few-shot learning may be applied to the first measurements for SARS-CoV-2, thus adjusting the multi-task model toward SARS-CoV-2.

Availability. The library of molecules is available at https://github.com/ml-jku/sars-cov-inhibitors-chemai.

Macchiagodena, M., Pagliai, M., and Procacci, P. Inhibition of the main protease 3cl-pro of the coronavirus disease 19 via structure-based ligand design and molecular modeling. arXiv preprint arXiv:2002.09937, 2020.
Drug repositioning: identifying and developing new uses for existing drugs
Prediction of the sars-cov-2 (2019-ncov) 3c-like protease (3cl pro) structure: virtual screening reveals velpatasvir, ledipasvir, and other drug repurposing candidates
Two targets are better than one
Inhibitors for novel coronavirus protease identified by virtual screening of 687 million compounds
Virtual screening for potential inhibitors of Mcl-1 conformations sampled by normal modes, molecular dynamics, and nuclear magnetic resonance
An open-source drug discovery platform enables ultra-large virtual screens
In silico discovery of novel inhibitors against main protease (mpro) of sars-cov-2 using pharmacophore and molecular docking based virtual screening from zinc database
Long short-term memory
Accurate prediction of biological assays with high-throughput microscopy images and convolutional networks
Virtual screening and molecular dynamics on blockage of key drug targets as treatment for covid-19 caused by sars-cov-2
Structure-based drug design, virtual screening and high-throughput screening rapidly identify antiviral leads targeting covid-19
Pubchem substance and compound databases
RDKit: Open-source cheminformatics
One drug, two targets
Identification of a zika ns2b-ns3pro pocket susceptible to allosteric inhibition by small molecules including qucertin rich in edible plants. bioRxiv
Deeptox: toxicity prediction using deep learning
Large-scale comparison of machine learning methods for drug target prediction on chembl
Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods
Fréchet chemnet distance: a metric for generative models for molecules in drug discovery
Interpretable deep learning in drug discovery
Potential inhibitors targeting rna-dependent rna polymerase activity (nsp12) of sars-cov-2
Virtual screening of inhibitors against spike glycoprotein of 2019 novel corona virus: a drug repurposing approach
Repurposing high-throughput image assays enables biological activity prediction for drug discovery
Zinc 15-ligand discovery for everyone
Rapid identification of potential inhibitors of sars-cov-2 main protease by deep docking of 1.3 billion compounds
Virtual screening of approved clinic drugs with main protease (3clpro) reveals potential inhibitory effects on sars
SMILES, a chemical language and information system. 1. introduction to methodology and encoding rules
Drugbank 5.0: a major update to the drugbank database for
Analysis of therapeutic targets for sars-cov-2 and discovery of potential drugs by computational methods
MoleculeNet: A benchmark for molecular machine learning
Discovery of anti-sars-cov-2 agents from commercially available flavor via docking screening
Network-based drug repurposing for novel coronavirus 2019-ncov/sars-cov-2
D3similarity: A ligand-based approach for predicting drug targets and for virtual screening of active compounds against covid-19. 2020.
Table 3. Compounds suggested in related publications for potential activity against SARS-CoV-2 and which also exhibit at least mild predicted activity against SARS-CoV proteases by ChemAI.

ZINC ID | Trivial name(s) | Canonical SMILES | Publications
ZINC00057060 | Melatonin | COc1ccc2… | …
… | Quercetin | O=c1c(O)c(-c2ccc(O)c(O)c2)oc2cc(O)cc(O)… | …
… | … | …C)O3)[C@H](C)O2) c2c(cc3c(c2O)C(=O)c2c(O) cccc2C3=O) | …
ZINC03794794 | Mitoxantrone | C1=CC(=C2C(=C1NCCNCCO)C(=O)C3=C(C=CC(=C3C2=O)O)O)NCCNCCO | …
ZINC03830332 | E155 | C1=CC=C2C(=C1)C(=CC=C2S(=O)(=O)O)NN=C3C=C(C(=O)C(=NNC4=CC=C(C5=CC=CC=C54)S(=O)(=O)O)C3=O)CO | …
ZINC14879972 | Gar-936 | CN(C)c1cc(NC(=O)CNC(C) C)c(O)c2c1C[C@H]1C… | …
ZINC00001645 | Magnolol | C=CCc1ccc(O)c(-c2cc(CC=C)ccc2O)c1 | …
ZINC00014036 | Piceatannol | Oc1cc(O)cc(/C=C/c2ccc(O)c(O)c2… | … 2020)
ZINC16052277 | Doxycycline | C[C@H]1c2cccc(O)c2C(=O) C2=C(O)… | …
ZINC3920266 | Idarubicin | CC(=O)[C@]1(O)Cc2c(O) O2)C1)C(=O)… | …

Funding by the Institute for Machine Learning (JKU). All authors contributed equally to this work.