key: cord-0941832-es2m2f4u authors: Haneczok, Jacek; Delijewski, Marcin title: Machine learning enabled identification of potential SARS-CoV-2 3CLpro inhibitors based on fixed molecular fingerprints and Graph-CNN neural representations date: 2021-05-28 journal: J Biomed Inform DOI: 10.1016/j.jbi.2021.103821 sha: 59b185cb928eed0cdb4e6e25ca1643dd3b69cd86 doc_id: 941832 cord_uid: es2m2f4u Aim: Rapidly developing AI and machine learning (ML) technologies can expedite therapeutic development, and during the current pandemic their merits are particularly in focus. The purpose of this study was to explore various ML approaches for molecular property prediction and illustrate their utility for identifying potential SARS-CoV-2 3CLpro inhibitors. Materials and methods: We perform a series of drug discovery screenings based on supervised ML models operating in different ways on molecular representations, encompassing shallow learning methods based on fixed molecular fingerprints, a Graph Convolutional Neural Network (Graph-CNN) with its self-learned molecular representations, as well as ML methods based on combining fixed and Graph-CNN learned representations. Results: The results of our ML models are compared both with respect to the aggregated predictive performance in terms of ROC-AUC based on the scaffold splits, and on the granular level of individual predictions, corresponding to the top ranked repurposing candidates. This comparison both reveals a certain characteristic homogeneity regarding chemical and pharmacological classification, with a prevalence of sulfonamides and anticancer drugs, and identifies novel groups of potential drug candidates against COVID-19. Conclusions: A series of ML approaches for molecular property prediction enables drug discovery screenings, illustrating the utility for COVID-19.
We show that the obtained results correspond well with the already published research on COVID-19 treatment, as well as provide novel insights on potential antiviral characteristics inferred from in vitro data.

Keywords: AI drug repurposing, machine learning, Graph Convolutional Neural Network, molecular property prediction, SARS-CoV-2, COVID-19

[…] candidates can be identified [1]. In the current urgent need to fight the global COVID-19 pandemic, the merits of AI and ML are particularly in focus [2, 3, 4, 5, 6, 7, 8, 9, 10], taking into account that in silico results are still subject to additional in vitro and in vivo experiments and further clinical trials to ensure their safety and efficacy [11].

The current pandemic crisis is caused by a novel coronavirus, named se- […] [16], which has a very low success rate, reaching about 6.2% [1, 17] while taking typically 12 to 15 years [18].
The purpose of our study was to identify the best repurposing candidates among the Food and Drug Administration (FDA) approved drugs, based on their predicted antiviral activity against SARS-CoV-2. To this end we have trained supervised machine learning models based on data from a large crystallographic fragment screen against SARS-CoV-2 3CL protease (3CLpro). The 3CLpro of SARS-CoV-2, known as the main protease, is an enzyme which has an essential role in processing the polyproteins that are translated from the viral RNA. The 3CLpro operates at no fewer than 11 cleavage sites on the […]

The main contribution of our study is twofold. Firstly, we explore various […] Secondly, we describe a series of drug discovery screenings based on these approaches, and illustrate their utility for identifying novel groups of potential drug candidates against COVID-19. We show that the obtained results both […]

[…] set of 80 hit structures which were fully modelled and refined [28]. Due […] predictions offer structural and reactivity information for on-going structure-based drug design against SARS-CoV-2 3CLpro [28].

For a screening set of candidate molecules, among which the best repurposing candidates are identified based on their predicted inhibiting potential against SARS-CoV-2, we employed the FDA set of all approved drugs [29]. The set of FDA approved drugs is an important resource for medical practice and consists of compounds that are safe and efficacious drug products, […]

[…] algorithm, and $y_i$ is its property or activity score, in our context a binary label corresponding to the activity or inactivity of the compound.

[…] where $C > 0$ is a hyper-parameter corresponding to the inverse of the regularization strength.
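To make the role of $C$ concrete, the following is a minimal sketch of L2-regularized logistic regression fitted by plain gradient descent. It assumes the objective $\frac{1}{2}\lVert w \rVert^2 + C \sum_i \log\left(1 + e^{-y_i (w^\top x_i + b)}\right)$ with labels coded as $y_i \in \{-1, +1\}$, which is one common parameterization and not necessarily the exact form used in this study. The function name `fit_logreg`, the toy fingerprint bits and the step-size settings are illustrative assumptions.

```python
import math
from typing import List, Tuple


def fit_logreg(X: List[List[float]], y: List[int], C: float,
               lr: float = 0.01, epochs: int = 2000) -> Tuple[List[float], float]:
    """Minimize 0.5*||w||^2 + C * sum_i log(1 + exp(-y_i * (w.x_i + b)))."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)   # gradient of the penalty 0.5*||w||^2 is w itself
        grad_b = 0.0       # the intercept is not penalized
        for x_i, y_i in zip(X, y):
            margin = y_i * (sum(wj * xj for wj, xj in zip(w, x_i)) + b)
            # derivative of C*log(1 + exp(-margin)) w.r.t. the linear score
            coeff = -C * y_i / (1.0 + math.exp(margin))
            for j, xj in enumerate(x_i):
                grad_w[j] += coeff * xj
            grad_b += coeff
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def l2_norm(w: List[float]) -> float:
    return math.sqrt(sum(wj * wj for wj in w))


# Toy binary "fingerprints" (assumed data): bit 0 marks actives, bit 3 inactives.
X = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 1, 1, 0],
     [0, 0, 1, 1], [0, 1, 0, 1], [0, 0, 0, 1]]
y = [1, 1, 1, -1, -1, -1]

w_small_C, _ = fit_logreg(X, y, C=0.01)   # strong regularization
w_large_C, _ = fit_logreg(X, y, C=10.0)   # weak regularization
print(l2_norm(w_small_C), l2_norm(w_large_C))
```

With a small $C$ the penalty dominates and the weights are shrunk toward zero; increasing $C$ lets the fit follow the training labels more closely.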
where the margin maximization problem can be conveniently reformulated as the following convex optimization problem […] kernel evaluations for all pairs of training points and $\lambda = 1/C$ with a tuning parameter $C > 0$ controlling the cost of violation to the separation margin. Given the solutions $\hat{\alpha}_0$ and $\hat{\alpha}$, the decision function is given by

$$\operatorname{sign}(f(x)) = \operatorname{sign}\Big(\hat{\alpha}_0 + \sum_{i=1}^{n} \hat{\alpha}_i \, k(x, x_i)\Big).$$

Footnote 1: Every feature map $\phi$ defines a positive definite kernel via $k(x, x') = \langle \phi(x), \phi(x') \rangle_F$, which is typically interpreted as a measure of similarity between the inputs $x$ and $x'$.

The approach based on Gradient Boosted Trees (GBT) adopted in this study is based on [41] and relies on ensembling $m = 1, \dots, M$ decision trees, whose predictions are given by […] where R […]

[…] representing the edge (bond) between nodes $v$ and $w$, and the forward pass consists of two phases: a message passing phase consisting of $T$ steps (convolutions) creating the molecular representations (self-learned fingerprints) and a readout phase using the final representations for making predictions. The message passing phase is initiated by mapping atom features $x_v$ to another set of vectors $h_v^0$ termed hidden states. In the $t$-th step a message $m_v^{t+1}$ is created according to […] where $N(v)$ is the set of neighbors of $v$ in graph $G$ and $M_t$ is a message function, and used further for updating the hidden states by […] where $U_t$ is a vertex update function. The readout phase uses a readout function $R$ to map the final hidden states representing the molecule to the final output of the neural network […]

We […] The motivation for this model design is to avoid messages that loop back to their preceding node, which can introduce noise [47, 46].
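The two-phase forward pass described above can be illustrated with a small sketch. It implements a generic atom-based message passing scheme, not the edge-based variant adopted in this study, and the concrete choices are assumptions made only for illustration: the message function $M_t$ forwards the neighbor's hidden state, the update function $U_t$ is $\mathrm{ReLU}(h_v^t + m_v^{t+1})$ without learned weights, and the readout $R$ sums the final hidden states. The names `message_passing`, `atoms` and `bonds` are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def relu(v: Vector) -> Vector:
    return [max(0.0, a) for a in v]


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def message_passing(atom_features: Dict[int, Vector],
                    neighbors: Dict[int, List[int]],
                    steps: int) -> Vector:
    # Initial hidden states h_v^0: here simply a copy of the atom features (assumption).
    hidden = {v: list(x) for v, x in atom_features.items()}
    for _ in range(steps):
        # Message phase: m_v^{t+1} = sum over w in N(v) of M_t(...),
        # with M_t returning the neighbor state h_w^t (assumption).
        messages = {}
        for v in hidden:
            m = [0.0] * len(hidden[v])
            for w in neighbors[v]:
                m = add(m, hidden[w])
            messages[v] = m
        # Update phase: h_v^{t+1} = U_t(h_v^t, m_v^{t+1}) = ReLU(h_v^t + m_v^{t+1}) (assumption).
        hidden = {v: relu(add(hidden[v], messages[v])) for v in hidden}
    # Readout phase: R sums the final hidden states over all atoms.
    readout = [0.0] * len(next(iter(hidden.values())))
    for h in hidden.values():
        readout = add(readout, h)
    return readout


# Toy three-atom chain 0 - 1 - 2 with two-dimensional atom features (assumed data).
atoms = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 0.0]}
bonds = {0: [1], 1: [0, 2], 2: [1]}
print(message_passing(atoms, bonds, steps=2))
```

In a trained network the message, update and readout functions carry learnable parameters; the fixed functions here only show how information spreads one bond further with every step $t$.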
For similarities between this edge-based message passing design and belief propagation in probabilistic graphical models see [45, 48].

[…] where ReLU is the rectifier activation function, $W_0 \in \mathbb{R}^{h \times h_0}$ is a weight matrix and $[x_v, e_{vw}] \in \mathbb{R}^{h_0}$ is the concatenation of the atom features $x_v$ and bond features $e_{vw}$. The message passing update equation is taken as […] $h^{t}_{kv}$ […] and the hidden state updates are calculated using the same function at each step $t$ according to […] where $W_m \in \mathbb{R}^{h \times h}$ is a weight matrix. After the final convolution step $T$ the final representation of the $v$-th atom of the molecule is calculated as […] where $f$ is a feed-forward neural network.

[…] model or level-1 model. We employ the following basic stacking algorithm using 2-fold cross-validation: […] Python library [50].
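A generic version of stacked generalization with 2-fold cross-validation can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of this study: the level-0 learners are toy nearest-centroid scorers standing in for the fingerprint-based and Graph-CNN models, the level-1 model is a small logistic regression, and the names `centroid_learner`, `logistic_learner` and `fit_stacking`, as well as the toy data, are hypothetical.

```python
import math
from typing import Callable, List

Vector = List[float]
Model = Callable[[Vector], float]                     # features -> score in (0, 1)
Learner = Callable[[List[Vector], List[int]], Model]  # training data -> fitted model


def centroid_learner(columns: List[int]) -> Learner:
    """Toy level-0 learner restricted to one feature view (hypothetical stand-in)."""
    def fit(X: List[Vector], y: List[int]) -> Model:
        def centroid(label: int) -> Vector:
            rows = [x for x, t in zip(X, y) if t == label]
            return [sum(r[c] for r in rows) / len(rows) for c in columns]
        pos, neg = centroid(1), centroid(0)

        def predict(x: Vector) -> float:
            d_pos = sum((x[c] - p) ** 2 for c, p in zip(columns, pos))
            d_neg = sum((x[c] - q) ** 2 for c, q in zip(columns, neg))
            return 1.0 / (1.0 + math.exp(d_pos - d_neg))  # closer to actives -> higher
        return predict
    return fit


def logistic_learner(lr: float = 0.5, epochs: int = 500) -> Learner:
    """Toy level-1 learner: logistic regression fitted by stochastic gradient descent."""
    def fit(Z: List[Vector], y: List[int]) -> Model:
        w, b = [0.0] * len(Z[0]), 0.0

        def predict(z: Vector) -> float:
            return 1.0 / (1.0 + math.exp(-(sum(wi * zi for wi, zi in zip(w, z)) + b)))
        for _ in range(epochs):
            for z, t in zip(Z, y):
                g = predict(z) - t                     # gradient of the log-loss
                w[:] = [wi - lr * g * zi for wi, zi in zip(w, z)]
                b -= lr * g
        return predict
    return fit


def fit_stacking(X: List[Vector], y: List[int], base_learners: List[Learner],
                 meta_learner: Learner, n_folds: int = 2) -> Model:
    n = len(X)
    folds = [list(range(k, n, n_folds)) for k in range(n_folds)]  # interleaved folds
    Z = [[0.0] * len(base_learners) for _ in range(n)]
    for held_out in folds:
        train = [i for i in range(n) if i not in held_out]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        for j, learner in enumerate(base_learners):
            model = learner(X_tr, y_tr)
            for i in held_out:
                Z[i][j] = model(X[i])                  # out-of-fold level-0 prediction
    meta = meta_learner(Z, y)                          # level-1 model on out-of-fold scores
    base = [learner(X, y) for learner in base_learners]  # refit level-0 models on all data
    return lambda x: meta([m(x) for m in base])


# Toy data (assumed): columns 0-1 mimic fingerprint bits, columns 2-3 a learned embedding.
X = [[1, 1, 0.9, 0.8], [0, 0, 0.1, 0.2], [0, 1, 0.2, 0.1], [1, 0, 0.8, 0.9],
     [1, 1, 0.7, 0.9], [0, 0, 0.3, 0.1], [0, 0, 0.2, 0.3], [1, 0, 0.9, 0.7]]
y = [1, 0, 0, 1, 1, 0, 0, 1]
stacked = fit_stacking(X, y, [centroid_learner([0, 1]), centroid_learner([2, 3])],
                       logistic_learner())
print(stacked([1, 1, 0.9, 0.9]), stacked([0, 0, 0.1, 0.1]))
```

Training the level-1 model only on out-of-fold predictions keeps it from rewarding a level-0 model that merely memorized its own training labels.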
Regularized logistic regression is implemented using […]

[…]zanilides) and melphalan (alkylating nitrogen mustard), which both belong […]

[…] Table 2. Here, as deeper interaction effects between the molecular features are taken into account, we observe a greater chemical and pharmacological diversity among the obtained repurposing candidates for COVID-19. Besides the so far identified characteristic drug classes, the novel pharmaco- […]

First, we report on the results based on LogReg and GBT models enhanced with Graph-CNN self-learned embeddings, as described in Section 2.6.2. These results remain generally consistent in terms of the prevalence of sulfonamides and anticancer drugs.
Stacking ensembles combining Graph-CNN with the GBT and the LogReg models, respectively, as described in […]

[…] to reducing production of proinflammatory and chemoattractant cytokines [60]. It may be worth mentioning that ibrutinib has been among the best […]

[…] potential of improving hemodynamics and prevention of lung fibrosis by reducing the levels of profibrotic and proinflammatory cytokines such as IL-2, IL-6, IL-8 and IFN-γ.

We would also like to underline that another promising candidate identi- […]

[…] deep learning approaches based on self-learned embeddings.
Among the considered types of fixed fingerprints, we observe that for the predictive task […]
References:
Deep Learning Based Drug Screening for Novel Coronavirus 2019-nCov
Application of Artificial Intelligence in COVID-19 drug repurposing
Virtual screening and repurposing of FDA approved drugs against COVID-19 main protease
Utilizing drug repurposing against COVID-19: efficacy, limitations, and challenges
World Health Organization, Coronavirus disease (COVID-19) weekly epidemiological update and weekly operational update
A novel coronavirus outbreak of global health concern
The architecture of SARS-CoV-2 transcriptome
Perspectives for repurposing drugs for the coronavirus disease 2019
Estimation of clinical trial success rates and related parameters
Drug development costs when financial risk is measured using the Fama-French three-factor model
Crystal structure of SARS-CoV-2 main protease provides a basis for design of improved α-ketoamide inhibitors
Predicting novel drugs for SARS-CoV-2 using machine learning from a > 10 million chemical space
Repurposing potential of FDA approved and investigational drugs for COVID-19 targeting SARS-CoV-2 spike and main protease and validation by machine learning algorithm
Main protease structure and XChem fragment screen, Diamond Light Source, Harwell Science and Innovation Campus
New tools and functions in data-out activities at Protein Data Bank Japan (PDBj)
Protein Data Bank Japan (PDBj): updated user interfaces, resource description framework, analysis tools for large structures
Oldfield, OneDep: unified wwPDB system for deposition, biocuration, and validation of macromolecular structures in the PDB archive
Structure of Mpro from SARS-CoV-2 and discovery of its inhibitors
Production of authentic SARS-CoV Mpro with enhanced activity: application as a novel tag-cleavage endopeptidase for protein overproduction
Crystallographic and electrophilic fragment screening of the SARS-CoV-2 main protease
ZINC 15 - ligand discovery for everyone
An overview of FDA-approved new molecular entities
Molecular property prediction: recent trends in the era of artificial intelligence
SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules
The entrance of informatics into combinatorial chemistry
Deep learning
Analyzing Learned Molecular Representations for Property Prediction
Reoptimization of MDL keys for use in drug discovery
Daylight theory: Fingerprints
The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service
The elements of statistical learning: data mining, inference, and prediction
Support-vector networks
XGBoost: A scalable tree boosting system
A new model for learning in graph domains
Geometric deep learning: going beyond Euclidean data
Neural message passing for quantum chemistry
Discriminative embeddings of latent variable models for structured data
Analyzing learned molecular representations for property prediction
Extensions of marginalized graph kernels
Probabilistic graphical models: principles and techniques
Stacked generalization
Vanderplas, Machine learning in Python
Adam: A method for stochastic optimization
Attention is all you need
MoleculeNet: a benchmark for molecular machine learning
Dual-histamine receptor blockade with cetirizine-famotidine reduces pulmonary symptoms in COVID-19 patients
Why not consider an endothelin receptor antagonist against SARS-CoV-2?
Imatinib for COVID-19: a case report
Potential inhibitors for novel coronavirus protease identified by virtual screening of 606 million compounds
Three novel prevention, diagnostic and treatment options for COVID-19 urgently necessitating controlled randomized trials
Hypothesis of COVID-19 therapy with sildenafil
The BTK-inhibitor ibrutinib may protect against pulmonary injury in COVID-19 infected patients
Clinical evidences on the antiviral properties of macrolide antibiotics in the COVID-19 era and beyond
Anti-inflammatory properties of antidiabetic drugs: a "promised land" in the COVID-19 era?
Doxycycline treatment of high-risk COVID-19-positive patients with comorbid pulmonary disease
Dapsone and doxycycline could be potential treatment modalities for COVID-19
BTK/ITK dual inhibitors: Modulating immunopathology and lymphopenia for COVID-19 therapy
Inhibitors targeting Bruton's tyrosine kinase in cancers: drug development advances
Protective role of Bruton tyrosine kinase inhibitors in patients with chronic lymphocytic leukaemia and COVID-19
Ibrutinib treatment improves T cell number and function in CLL patients
Case fatality rate of cancer patients with COVID-19 in a New York hospital system
Concomitant imatinib and ibrutinib in a patient with chronic myelogenous leukemia and chronic lymphocytic leukemia
Sulfonamides: Antiviral strategy for neglected tropical disease virus
Antiviral sulfonamide derivatives
Famotidine use and quantitative symptom tracking for COVID-19 in non-hospitalised patients: a case series
Ivermectin, famotidine, and doxycycline: A suggested combinatorial therapeutic for the treatment of COVID-19
The potential effects of clinical antidiabetic agents on SARS-CoV-2
AI drug discovery screening for COVID-19 reveals zafirlukast as a repurposing candidate

We are grateful to Simone Vettorel for the insightful discussions and invaluable comments.