key: cord-0198535-mts7xvh7
authors: Ward, Logan; Bilbrey, Jenna A.; Choudhury, Sutanay; Kumar, Neeraj; Sivaraman, Ganesh
title: Benchmarking Deep Graph Generative Models for Optimizing New Drug Molecules for COVID-19
date: 2021-02-09
journal: nan
DOI: nan
sha: 2e91f46d17e3acc19a68751c1068a3e7a728e2b0
doc_id: 198535
cord_uid: mts7xvh7

Design of new drug compounds with target properties is a key area of research in generative modeling. We present a small drug molecule design pipeline based on graph-generative models and a comparison study of two state-of-the-art graph generative models for designing COVID-19 targeted drug candidates: 1) a variational autoencoder-based approach (VAE) that uses prior knowledge of molecules that have been shown to be effective for earlier coronavirus treatments and 2) a deep Q-learning method (DQN) that generates optimized molecules without any proximity constraints. We evaluate the novelty of the automated molecule generation approaches by validating the candidate molecules with drug-protein binding affinity models. The VAE method produced two novel molecules with similar structures to the antiretroviral protease inhibitor Indinavir that show potential binding affinity for the SARS-CoV-2 protein target 3-chymotrypsin-like protease (3CL-protease).

Deep learning has demonstrated the potential to revolutionize drug design by reducing the initial search space in the early stages of discovery [25, 33, 49, 51] . By applying the appropriate algorithm trained on the appropriate data, novel molecules can be generated with target properties [13, 24, 29, 43] . Here, we evaluate methods for the high-performance intelligent design of small-molecule drug candidates with anti-SARS activity, with a specific focus on SARS-CoV-2, otherwise known as COVID-19. However, discovering potential lead candidates for COVID-19 presents a challenge to the scientific community due to the long timescale of the drug development process. There is a need to accelerate the design process through AI-driven workflows for effective drug compound development.

The potential drug space is composed of over 10 20 compounds, and candidates with suitable activity against specific proteins only narrows the search space to 10 4 − 10 5 structures. Machine learning (ML) methods are actively used in this search-and-screen process. Candidate molecules generated by ML methods are passed to downstream verification via virtual high-throughput drug-protein binding techniques, drug synthesis, biological assay, and finally clinical trials [5, 16, 44] .

Heterogeneous graphs provide a natural representation for smallmolecule organic compounds, with nodes representing atoms in the molecular structure and edges representing bonds between the atoms [48] . This approach motivated the exploration of graphgenerative models such as graph convolutional policy networks [49] , variational autoencoders [26, 27, 40] , and variants of deep reinforcement learning [45, 52] for the target-driven optimization of drug molecules. The drug-molecule design task is defined as generating a set of graphs such that for each graph ∈ , ( ) ≥ , where is a property optimization function and is a user-specified threshold. Most methods optimize target molecules for properties that can be derived from the molecular structure, such as the octanol-water partition coefficient (logP) [2] and quantitative estimate of druglikeness (QED) [7] .

The focus of our work is two-fold to generate compounds for drug discovery, specifically for SARS-CoV-2. In the first similarity approach, a junction-tree based variational autoencoder (JT-VAE) [27] is trained on a database of known drug molecules that has pIC 50 activity [15] . The trained VAE is used to generate novel molecules optimized towards specific properties using Bayesian optimization (BO). The second approach extends a graph-based deep reinforcement learning (DQN) method [52] to generate molecules that are not constrained by their proximity to past anti-SARS compounds. For comparison, we use the same set of property scoring functions ( ) as the respective optimization and reward functions for benchmarking the JT-VAE and DQN approaches.

Contributions. 1) The goal of our study is to perform multiobjective optimization for generating molecular structures by considering critical bioactivity properties that are typically not considered in structure-activity relationship approaches. Specifically, we focus on pIC 50 [42] , which captures the potency of the drug towards a protease target and cannot be accurately estimated from the chemical structure [3] . 2) In this context, we assembled a new protease dataset with molecules that are active against various enzymatic assays. This is considered to be one of the key properties while generating new drug molecules. These molecules are filtered from experimental pharmacology data in CheMBL, BindingDB, and ToxCat [6] . 3) Finally, we validate all top-ranking molecules against a Drug Target Binding Affinity (DBTA) classifier [25] to asses potential anti-SARS-CoV-2 activity. Full details of our implementation and source code are available at https://github.com/exalearn/coviddrug-design. 

In response to the COVID-19 global pandemic, researchers have pushed to identify marketed drugs that can be repurposed for SARS-CoV-2 treatment [18, 20, 35, 44, 51] . Born et al. amended their Pac-cMann RL approach, originally intended to generate anticancer drugs, to generate molecules with affinity to viral target proteins and controlled toxicity [10] . Batra et al. applied an ML-based screening approach to find known compounds with binding affinity to either the isolated SARS-CoV-2 S-protein at its host receptor region or to the S-protein-human ACE2 interface complex [5] . Huang et al. developed a deep learning toolkit for drug repurposing, called Deep-Purpose, with the goal of recommending candidates with high binding affinity to target amino acid sequences from known drugs [25] . These approaches screen large databases of known compounds, which have the potential to miss novel molecules with anti-SARS activity. Chenthamarakshan et al. developed the generative modeling framework CogMol to design drug candidates specific to a given target protein [13] .

Here we discuss which properties should be considered during lead optimization [14] . LogP is a measure of lipophilicity, which provides an understanding of the behavior of a drug in the body. Compounds intended for oral administration should have a logP no greater than 5, according to Lipinski's Rule of 5 [34] . Further analysis has shown that logP values between 1 and 3 may be more appropriate considering the effect of logP on absorption, distribution, metabolism, elimination, and toxicology (ADMET) properties [47] . Though oral bioavailability is an important factor, a sole focus on logP has the potential to screen out otherwise useful compounds [50] . QED has been proposed as a more holistic druglikeness metric [8] , from 0 (low) to 1 (high). Druglikeness provides a general metric for ADMET properties, but does not indicate the activity or effectiveness of a drug towards a specific target. The half maximal inhibitory concentration (IC 50 ), on the other hand, provides a quantitative measure of the potency of a compound to inhibit a specific biological process. IC 50 is obtained by measurement, and no universal ab initio method of computing its value exists. A number of methods have been developed to approximate IC 50 , many based on QSPR and recently some based on machine learning [1, 3, 9, 38] . Similarity to known drugs is also an important factor in drug discovery [30, 36] , as is the ability to synthesize the molecule, which can be estimated by the synthetic accessibility (SA) score [from 1 (easy) to 10 (difficult)] [17] .

We trained a message-passing neural network (MPNN) [19, 41] to predict pIC 50 (the inverse log of IC 50 ) for a given molecular structure. Following the formalism of Gilmer et al. [19] , our network is composed of message, update and readout operations (eqns. 1-3) and our choices for these functions are based on networks developed by St. John et al. for polymer property prediction. [28] The original state of each atom (ℎ ) and bond ( ) in our molecule ( ) is a 256-length vector with values defined by an embedding table based on the atomic number and bond type (e.g., single, double, aromatic). The states of these atoms are modified by eight successive "message" layers. Each message layer uses a two-layer multi-layer perceptron (MLP) with sigmoid activations to compute a message that uses the state of an atom (ℎ ), the state of the neighboring atom (ℎ ) and the bond which joins them ( ). The atom and bond states are updated according to the following equations:

The atom states output from the last layer (ℎ ) are used to predict the 50 of the molecule using a "readout" function ( ).

We use several variants of the readout function in our study. We tested both "molecular fingerprints," where the states of each node are combined before using a multi-layer perceptron (MLP) to reduce to compute pIC 50 , and an "atomic contribution, " where we combine the nodes after MLP to compute a per-node contribution to pIC 50 . We experimented with the use of five different functions to reduce the atomic state/contributions to a single value for each graph: summation, mean, maximum, softmax, and attention. The attention functions are created by learning an attention map by passing the node states through a MPNN. We tested all combinations of "molecular fingerprint" vs. "atomic contribution" and the five readout functions, for a total of 10 networks, training each on network the same 90% of our pIC 50 dataset and comparing its performance on the withheld 10% of the data. We used an MPNN that uses attention maps to reduce contributions from each atom to a single pIC 50 of a molecule in all subsequent experiments.

We use a junction-tree (JT) based variational autoencoder (VAE) method [27] for generating molecules with high proximity to anti-SARS drug molecules. This model generates novel molecular graphs by laying a tree-structured scaffold over substructures in the molecule, which are then combined into a valid molecule using a MPNN. The JT-VAE model allows for the incremental expansion of a molecular graph through the addition of subgraphs, or "vocabulary" of valid components (denoted ), derived from the training set ( Figure 1 ). The subgraphs are used to encode a molecule into its vector representation and decode latent vectors into valid molecular graphs. The use of subgraphs, rather than building a molecule atom-by-atom, maintains chemical validity at each step, while also incorporating functional groups common to the training set. Chemical graphs generated from the vocabulary will be structurally similar to those in the training set, which is a benefit when attempting to design drugs with similar properties to those in the training set.

Given a molecular graph = ( , ), JT-VAE coarsens into a junction tree data structure T = ( , ) such that each node ∈ in T represents a subgraph of . The subgraphs can be coarsened by using either automated JT construction algorithms from the graphical model literature [32] or a vocabulary based mapping approach that reduces to important building blocks of chemical structures.

Next, the JT-VAE method uses a message-passing network (as described in section 3.1) to encode T ( Figure 1 ) into a vector representation Z T . We refer the reader to [27] for specific implementations of the MPNN-based encoders for the input graph and the junction tree. The decoder component of the VAE learns to generate the same junction tree structure from Z T by maximizing the likelihood function (T |Z T ) .

Once the VAE model is trained, the next phase involves a twostep process for generating a molecular graph structure that is optimized for a target property. The first step involves drawing a sample from the latent space learnt by the VAE and transforming the sampled vector representation into a corresponding junction tree structure. The second step pursues a Bayesian optimization (BO) strategy to map the junction tree to a molecular graph that maximizes the target properties. Using the JT-VAE trained on our SARS-related database, we perform Bayesian optimization (BO), using the method of Kusner et al. [31] to produce novel molecules with target properties described in section 3.4.

We follow the Q-learning approach of Zhou et al. for our deep reinforcement learning approach [52] . We consider tasks in which an agent interacts with an environment E represented as a molecular graph. The agent starts with a null graph. At each time-step the agent selects an action from an action space A that includes addition of single atoms, changing the type of bonds or removing bonds from the graph. The agent also receives a reward R at each time step depending on a property scoring function. All the property scoring functions are described in section 3.4.

In this setting, we cast the molecule generation problem as a Markov Decision Process (MDP) [37] to learn a policy network that determines the best sequence of actions that start with an initial molecular graph and transform it a larger graph with desirable properties in a step-by-step fashion (Figure 1 ). At each step, we enumerate all possible actions and then select those which produce valid molecules (e.g., respect valency rules). Next, we train a multilayer perceptron (MLP) to predict to predict the value [37] of a certain action by passing the Morgan fingerprints [39] as input. The MLP approximates the value of an action computed using the Bellman equation, where the score of a state and the maximum score of the subsequent state is multiplied by a decay factor. As established with other Deep Q-Learning approaches, the addition of the value of the next state increases the value of moves which will lead to higher future rewards.

The scoring functions described in this section are used for both Bayesian optimization in the VAE based approach and reward computation for the deep reinforcement-learning based approach. Following Jin et al. [27] , we first compute a score that penalizes logP by the SA score (recall from section 2 that higher SA values are discouraged) and number of cycles with more than 6 atoms (eqn. 5). Considering that QED is a more comprehensive heuristic than logP, we also use a similar scoring function composed of QED penalized by the SA score and number of long cycles (eqn. 6). We then examine the utility of a SARS-specific scoring function based on the pIC 50 predicted by our MPNN penalized by the SA score and number of long cycles. Finally, we examine a multi-objective scoring function that accommodates both pIC 50 

We conduct experiments to answer two questions: Q1: What are the best possible ways to generate molecules with targeted activity towards SARS-CoV-2? Q2: How do we evaluate the novelty of our generated molecules?

We assembled a dataset with molecules active against various enzymatic assays filtered from experimental pharmacology data logged in CheMBL, BindingDB, and ToxCat [6] . Details of this dataset are provided in section 5.1. Following the workflow in Figure 1 , we first train a pIC 50 model, and subsequently use it to train the JT-VAE and DQN models. We begin with a discussion of the pIC 50 model given its key role in training the JT-VAE and DQN models.

We tested multiple variants of the pIC 50 prediction model. We found that MPNNs that limited the computation of the pIC 50 to contributions from only a few specific atoms in the molecule performed the best. As shown in Table S1 , the models which use max and softmax functions for aggregating the atom-level representations to a molecule-level representation (equation 4) have higher 2 scores than those which use summation or mean outputs. The relative performance of different networks can be explained by the physical mechanism behind the performance of anti-viral drugs. The presence or absence of a specific pattern in the molecular structure (e.g., functional groups, substituents) controls whether the molecule will bind with a certain portion of a target virus protein. We hypothesize that the "maximum" function, in particular, models this "all-or-nothing" physics well. The other atoms in the molecule play a role in determining whether the molecule will stay affixed at the target site. The reduced but non-negligible effect of the side groups could explain why the "atomic"-contribution model, which only uses the contribution from a single atom, performs less well than the whole-molecule fingerprint. These trends give us confidence that the MPNNs are operating based on well-known physics, but we would require comparison of which atoms are contributing to the predicted pIC 50 to results from molecular binding simulations to determine if the networks are indeed interpretable.

Molecule generation setup using JT-VAE. We trained JT-VAE for 8,300 iterations on the full database, with the following hyperparameters: hidden state dimension of 450, latent code dimension of 56, and graph message passing depth of 3. Analysis of the trained JT-VAE is given in Figure S1 . To optimize towards the specified scoring functions, we trained a sparse Gaussian process (SGP) to predict a score given the latent representation learned by JT-VAE and then perform 10 iterations of batched BO (sampling = 50) based on the expected improvement. Molecule generation setup using DQN. Our DQN approach builds up molecules from a single atom to large molecules atom-byatom and bond-by-bond. Each "episode" starts with a blank slate and the RL agent is allowed up to 40 steps. We update a model of the Q-function to predict the value of each move after each step in each episode and, as this model improves during training, we smoothly turn down the probability that we will choose a random move over the prediction of this model.

The DQN finds tens of thousands of candidate molecules with high pIC 50 , as shown in Figure 2 and Table 1 . Using the multiobjective reward function led to fewer molecules with pIC 50 > 8, but with increased druglikeness. Approximately two-thirds of the molecules with pIC 50 > 8 found with the multi-objective reward also show QED> 0.5. In contrast, only 5% of the molecules in the pIC 50based search show QED > 0.5. The addition of QED reduces the total number of high-pIC 50 molecules found by 25%, but increases the number of high-pIC 50 molecules found by over 10 times. Therefore, we recommend incorporating synthesizability and/or druglikeness into RL-driven searches for drug-like molecules.

Comparing JT-VAE and DQN. Table 1 shows the three highestscoring molecules generated by the two generative models with each scoring function. The DQN models always outperform JT-VAE in finding a molecule with a superior value of the scoring function being optimized. The performance disparity is particularly apparent when optimizing for logP: the maximum logP from DQN is 12.6 compared to only 4.1 for JT-VAE. We attribute the difference in optimization performance to JT-VAE implicitly sampling from a distribution of drug-like molecules and DQN having no such constraints.

The candidate molecules generated by JT-VAE have consistently better druglikeness and SA scores even when those values are not explicitly optimized for. When optimizing towards logP, the top-3 molecules generated by JT-VAE have moderate-to-high QEDs, while the top-3 from DQN are below 0.11. We attribute the exceptionally large disparity in optimal logP and associated QED values between the two methods to the fact that drugs typically have logP values between -0.4 and 5.6. The molecules from which JT-VAE was trained were all drug-like molecules, which makes it improbable to sample molecules with logP values far outside that range. Similarly, the molecules in the JT-VAE training set were experimentally synthesized. The SA score for these molecules is low, which could explain why the optimized molecules from the JT-VAE are also low even when this property was not optimized for. The RL agent uses no information about the space of experimentally studied drug molecules during its training process and, accordingly, finds molecules far from it. Overall, we find two different purposes for JT-VAE and RL-based molecular optimization. JT-VAE implicitly uses the distribution of molecules in its training set to bias towards realistic molecules, albeit at the expense of finding better candidates. The RL-based approach lacks such constraints and, for better or worse, can optimize without even implicitly regarding synthesizability or any other characteristic not explicitly encoded in the scoring function.

We observed an interesting structural trend in the molecules generated by JT-VAE when using pIC 50 as the scoring function. Figure  3 shows the structures of both the molecules with pIC 50 > 8 and the anti-HIV drug Indinavir, an antiretroviral protease inhibitor. A common backbone is shared between Indinavir and the top-6 predictions. The Tanimoto similarity scores [4] of these six generated molecules against Indinavir range from 0.65 to 0.91. Indinavir has been proposed as a drug to treat SARS-CoV-2 due to favorable docking to the coronavirus 3-chymotrypsin-like protease (3CL-protease), a promising drug target for combating coronavirus infections [11, 21, 22] . Notably, three of the generated molecules have a higher predicted pIC 50 than does Indinavir. Experimental confirmation of Drug-Target-Interaction (DTI) is challenging and time-consuming [46] . In silico Drug-Target Binding Affinity (DTBA) methods offer an alternative to evaluate DTI [23] . We employ a ML-based DTBA model to validate the interaction of molecules generated by JT-VAE against a target SARS-CoV2-L protease [12] . We trained a DBTA binary classification model using extended connectivity fingerprint [39] encoding for the drug molecule and the target protease sequence encoding using a Convolution Neural Network (CNN) as implemented in the DeepPurpose toolkit [25] . The default hyperparameters provided in the DeepPurpose toolkit were found to be sufficient. The DBTA model classified four of the top 11 molecules (including the top two in Figure 3 ) with probability > 0.5 to have interaction with SARS-CoV2-3CL protease.

We compared two graph generative models, JT-VAE and DQN, for the task of discovering potential small-molecule candidates with activity against SARS-CoV-2. DQN always outperformed JT-VAE in finding a molecule with a superior value of the scoring function being optimized. However, JT-VAE generated molecules that were more structurally similar to those in the database due to substructure representation, which produced a lower SA score and logP < 5. JT-VAE tended to produce what looked to be drug molecules, while DQN produced precursor-like candidates with optimized properties, which could be used as starting structures to add additional substituents aimed at the specific target.

Preparation. We prepared and assembled the protease datasets with molecules active against various protease in enzymatic assays filtered from experimentally pharmacology data such as CheMBL, BindingDB, and ToxCat [6] . The database was filtered out with the IC 50 activity standard types and their potency. Molecules with size larger than 1000 Dalton were removed due to the limitation of the representation of large molecules in cheminformatics. We also filtered out non-drug like molecules containing metals and polypeptides. The curated data was standardized using the logarithmic scale -log10 of a numeric value in nM for all compounds. We used the mean average for a molecule with more than one IC 50 value. The resulting dataset contains 6545 unique molecules accompanied by their SMILES strings and experimental IC 50 values. Quantitative Characterization. We computed the metrics included in our scoring functions for each molecule in the database. The range of values can be seen in Figure S1 . The highest pIC 50 We computed the Tanimoto similarity for all pairs of compounds to gain insight into the structural diversity of molecules in our database ( Figure S2 ). The entries in the matrix were ordered in increasing pIC 50 values. The similarity is represented by the color bar, with yellow representing low similarity (0) and red high similarity (1). We observe that structures tend to become more similar to their neighbors as pIC 50 increases, indicating that compounds with high pIC 50 values tend to be structurally similar, supporting the consideration of molecular similarity during drug discovery. Figure S1 : We generated 1,000 molecules using the trained JT-VAE, of which 560 were unique. The figure shows the comparison of the QED, logP, SA score, and pIC 50 of compounds in the database and those generated by the JT-VAE. The JT-VAE reproduced the range of values present in the database, minus outliers. The similar values indicate that the JT-VAE is able to reproduce the wide range of structures present in the database. The pIC 50 values for generated molecules were estimated by our MPNN. Figure S2 : Heat map of the Tanimoto similarity between all compounds in the database. Entries are ordered in increasing pIC 50 . The Tanimoto similarity is generally higher among structures with high pIC 50 (located towards bottom right of the matrix), indicating the importance of considering structural similarity in drug discovery.

Table S1: 2 scores for MPNN models trained to predict the pIC 50 of drugs from their molecular structure. Each model was trained using a different readout function (columns) to combine atomic contributions to pIC 50 or to create a single molecular fingerprint. Bold indicates the model used in our experiments and underscore indicates the best-performing model. 

Classification of drug molecules considering their IC50 values using mixedinteger linear programming based hyper-boxes method

pH-Metric logP 10. Determination of liposomal membrane-water partition coefficients of lonizable drugs

Development of Quantum Chemical Method to Calculate Half Maximal Inhibitory Concentration (IC50)

Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations

Screening of Therapeutic Agents for COVID-19 using Machine Learning and Ensemble Docking Simulations

The ChEMBL bioactivity database: an update

Quantifying the chemical beauty of drugs

Sorel Muresan, and Andrew L. Hopkins. 2012. Quantifying the chemical beauty of drugs

SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules

PaccMannRL on SARS-CoV-2: Designing antiviral candidates with conditional generative models

Potential Therapeutic Agents for COVID-19 Based on the Analysis of Protease and RNA Polymerase Docking

Prediction of the SARS-CoV-2 (2019-nCoV) 3C-like protease (3CL pro) structure: virtual screening reveals velpatasvir, ledipasvir, and other drug repurposing candidates

Target-specific and selective drug design for covid-19 using deep generative models

Drug-target residence time and its implications for lead optimization

SARS and MERS: recent insights into emerging coronaviruses

Biomolecular simulation: a computational microscope for molecular biology

Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions

Repurposing of clinically approved drugs for treatment of coronavirus disease 2019 in a 2019-novel coronavirusrelated coronavirus model

Neural message passing for quantum chemistry

Potential inhibitors of coronavirus 3-chymotrypsin-like protease (3CL(pro)): an in silico screening of alkaloids and terpenoids from African medicinal plants

Coronavirus puts drug repurposing on the fast track

SimBoost: a read-across approach for predicting drug-target binding affinities using gradient boosting machines

Molecular Design in Synthetically Accessible Chemical Space via Deep Reinforcement Learning

Deep-Purpose: a Deep Learning Based Drug Repurposing Toolkit

Multi-Objective Molecule Generation using Interpretable Substructures. arXiv e-prints

Junction Tree Variational Autoencoder for Molecular Graph Generation

Message-passing neural networks for high-throughput polymer screening

DeepGraphMol, a multi-objective, computational strategy for generating molecules with desirable properties: a graph convolution and reinforcement learning approach

Advances in the Development of Shape Similarity Methods and Their Application in Drug Discovery

Local computations with probabilities on graphical structures and their application to expert systems

Multi-objective de novo drug design with conditional graph generative model

Lead-and drug-like compounds: the rule-of-five revolution

Drug treatment options for the 2019-new coronavirus (2019-nCoV)

Molecular similarity in medicinal chemistry: miniperspective

Playing atari with deep reinforcement learning

Prediction of IC50 Values for ACAT Inhibitors from Molecular Structure

Extended-connectivity fingerprints

Nevae: A deep generative model for 6 molecular graphs

The graph neural network model

Guidelines for accurate EC50/IC50 estimation

A machine learning workflow for molecular analysis: application to melting points

Repurposing Therapeutics for COVID-19: Supercomputer-Based Docking to the SARS-CoV-2 Viral Spike Protein and Viral Spike Protein-Human ACE2 Interface

Deep reinforcement learning for multiparameter optimization in de novo drug design

Comparison Study of Computational Prediction Tools for Drug-Target Binding Affinities

Lipophilicity in drug discovery

SMILES. 3. DEPICT. Graphical depiction of chemical structures

Graph convolutional policy network for goal-directed molecular graph generation

Drug discovery beyond the 'rule-of-five'

Network-based drug repurposing for novel coronavirus 2019-nCoV/SARS-CoV-2

Optimization of Molecules via Deep Reinforcement Learning