key: cord-0127414-nvlcmgb8
authors: Atkinson, Timothy; Saremi, Saeed; Gomez, Faustino; Masci, Jonathan
title: Automatic design of novel potential 3CL$^{\text{pro}}$ and PL$^{\text{pro}}$ inhibitors
date: 2021-01-28
journal: nan
DOI: nan
sha: 2703d4698ab352b98cc845ed3e0f8804f5c0c3f5
doc_id: 127414
cord_uid: nvlcmgb8

With the goal of designing novel inhibitors for SARS-CoV-1 and SARS-CoV-2, we propose the general molecule optimization framework, Molecular Neural Assay Search (MONAS), consisting of three components: a property predictor which identifies molecules with specific desirable properties, an energy model which approximates the statistical similarity of a given molecule to known training molecules, and a molecule search method. In this work, these components are instantiated with graph neural networks (GNNs), Deep Energy Estimator Networks (DEEN) and Monte Carlo tree search (MCTS), respectively. This implementation is used to identify 120K molecules (out of 40 million explored) which the GNN determined to be likely SARS-CoV-1 inhibitors and which, at the same time, are statistically close to the dataset used to train the GNN.

Over the past year, the search for molecules which may inhibit key receptor sites of Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2) has emerged as a central research objective within the scientific community [22]. The already widespread use of Deep Learning (DL) techniques as predictors in molecular chemistry [15, 17] has facilitated their rapid redeployment to this task, e.g. [3, 7, 55] (see [32] for a comprehensive review). There are a number of structural similarities between the main 3CL$^{\text{pro}}$ protease and the PL$^{\text{pro}}$ protease of SARS-CoV-2, both used in viral reproduction, and those of Severe Acute Respiratory Syndrome Coronavirus-1 (SARS-CoV-1) [34]. Based on these similarities, it has been suggested that the lack of available large-scale data for SARS-CoV-2 inhibition can be overcome using existing datasets of SARS-CoV-1 inhibitors [19].

This paper proposes Molecular Neural Assay Search (MONAS), a general data-driven automatic molecule design framework consisting of three core components (Figure 1):
• MONAS-P: Property predictor. A predictive model of molecular activity which is used to identify new molecules with desirable properties.
• MONAS-E: Energy model. A general probabilistic model that learns the distribution of molecules in an unsupervised manner. This energy model assigns a scalar (energy) to any molecule and effectively provides a measure of statistical similarity between generated molecules and molecules in the dataset.
• MONAS-S: Search. A search procedure to explore the space of possible molecules, guided by the inhibition predictor and the energy model. This component discovers new molecules to which the other two models assign both a high likelihood of inhibition and a high similarity to known molecules.
In the particular instantiation of the framework used in this work, MONAS-P is a Graph Neural Network (GNN; [2, 5, 18, 43]), MONAS-E is a Deep Energy Estimator Network [41, 42], and Monte Carlo Tree Search (MCTS; [6]) is used as MONAS-S to search over derivations of molecules in a Backus-Naur form grammar [30]. However, we stress that this is only one possible implementation of this general framework. For example, MONAS-P could instead use an LSTM to predict behaviour from SMILES strings [35], DEEN could be replaced with a variational autoencoder [26] for MONAS-E, and MONAS-S could be performed by a genetic algorithm [23] or deep reinforcement learning [45]; a minimal sketch of the component interfaces is given below.
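To make the division of labour between the three components concrete, the following is a minimal, hypothetical sketch of the MONAS interfaces in Python. The class and function names are illustrative only, and the additive way the reward combines the two models is an assumption for illustration; the reward actually used in this work is defined in Section 5.

```python
# Hypothetical sketch of the three MONAS components and how a search could be
# driven by them. Names, signatures and the reward combination are illustrative
# only; they do not correspond to a released codebase.
from abc import ABC, abstractmethod
from typing import Any, Callable, List


class PropertyPredictor(ABC):  # MONAS-P
    @abstractmethod
    def predict(self, molecule: Any) -> float:
        """Predicted probability that the molecule has the desired property."""


class EnergyModel(ABC):  # MONAS-E
    @abstractmethod
    def energy(self, molecule: Any) -> float:
        """Scalar energy: lower means statistically closer to the training data."""


class MoleculeSearch(ABC):  # MONAS-S
    @abstractmethod
    def search(self, reward_fn: Callable[[Any], float], n_samples: int) -> List[Any]:
        """Explore molecule space, guided by the reward function."""


def run_monas(predictor: PropertyPredictor, energy: EnergyModel,
              searcher: MoleculeSearch, n_samples: int) -> List[Any]:
    # Reward molecules that are both predicted active and low-energy
    # (i.e. similar to known molecules); this additive combination is
    # purely illustrative, not the reward used later in the paper.
    def reward_fn(molecule: Any) -> float:
        return predictor.predict(molecule) - energy.energy(molecule)

    return searcher.search(reward_fn, n_samples)
```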
Further, while in this work we focus particularly on inhibition of SARS-CoV-1 and SARS-CoV-2, in general this framework may be used for any data-driven molecular optimization task.

Existing approaches to the identification of SARS-CoV-2 inhibitors and reactants generally use DL to predict inhibition in order to screen sets of known inhibitors [3, 14, 19, 47, 53]. In contrast, we propose a de novo approach to identifying and designing novel inhibitors for SARS-CoV-2. There are two main motivations for doing this:
1. The combinatorial space of small drug-like molecules is huge (approximately $10^{33}$ [38]), so it is unlikely that even datasets with billions of molecules will contain the best SARS-CoV-2 inhibitors. Therefore, directed search through the full space may yield higher-quality results.
2. It allows the system to discover potential inhibitors which are previously unknown molecules and therefore not patented. If any discovered molecule is eventually tested and found viable for treatment, this may substantially improve the general availability of treatment.
Unlike other de novo approaches, which utilize docking simulations [9] or structural information [46], MONAS maximizes the output of a neural model trained to predict SARS-CoV-1 inhibition. The use of a predictive network greatly reduces the cost of sampling, so that millions of molecules can be explored efficiently.

The rest of this work is organized as follows. In Section 2, the four datasets used to train a GNN-based inhibition predictor are described. The core concepts of GNNs and the specific architecture used for MONAS-P are described in Section 3. A brief overview of neural empirical Bayes, which is used to learn an energy function on the feature space of the GNN (i.e. MONAS-E), is provided in Section 4. The directed search MONAS-S over molecules, guided by these neural models, is described in Section 5. The performance of the neural models and the results of the inhibitor design experiments are provided in Section 6. Finally, our findings are summarized in Section 7.

Assay ID   Target protease      #Inactive molecules   #Active molecules
1,706      3CL$^{\text{pro}}$   290,321               405
1,879      3CL$^{\text{pro}}$   244                   136
485,353    PL$^{\text{pro}}$    322,433               602
652,038    PL$^{\text{pro}}$    735                   198
Total      -                    331,480               1,095

Table 1: Assays used throughout experiments. All assays are taken from PubChem [24]. The calculation of the totals takes into account that the same molecule may appear in multiple assays. The total number of active molecules is the number of molecules which are active in at least one assay.

As in Hofmarcher et al. [19], four publicly available assays, taken from PubChem [24], were used. Each assay provides a set of molecules (represented by their SMILES strings [50]) and the results of tests to determine whether or not each molecule inhibits a given protease of SARS-CoV-1. Each assay can be viewed as a mapping from its associated set of molecules $M_A$ to $\{0, 1\}$, where 0 indicates no inhibition and 1 indicates inhibition. A summary of the assays used is provided in Table 1. To use these assays with a GNN, each molecule is formatted as a graph as follows (a conversion sketch is given after the list):
1. Convert its SMILES representation to a molecular representation using RdKit [31].
2. Convert the molecular representation to a graph:
   (a) For each atom, create a node with the following features: mass, valence, the total number of hydrogens, whether it is aromatic, and formal charge.
   (b) For each bond between two atoms, create an edge in each direction between the corresponding nodes and, additionally, create a self-loop on each node. Each edge is categorized as one of: single bond, double bond, triple bond, aromatic bond, or synthetically added self-loop.
3. Each graph is assigned four binary classifications, one for each of the four assays, maintaining missing values.
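The conversion just described can be sketched with RDKit as follows. The feature ordering and the integer encoding of bond categories are assumptions made for illustration.

```python
# A minimal sketch of the molecule-to-graph conversion described above, using RDKit.
from rdkit import Chem

BOND_CATEGORIES = {
    Chem.BondType.SINGLE: 0,
    Chem.BondType.DOUBLE: 1,
    Chem.BondType.TRIPLE: 2,
    Chem.BondType.AROMATIC: 3,
}
SELF_LOOP = 4  # category for the synthetically added self-loops


def smiles_to_graph(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # invalid SMILES string

    # Node features: mass, valence, total hydrogens, aromaticity, formal charge.
    nodes = [[atom.GetMass(), atom.GetTotalValence(), atom.GetTotalNumHs(),
              float(atom.GetIsAromatic()), atom.GetFormalCharge()]
             for atom in mol.GetAtoms()]

    edges, edge_cats = [], []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        # Rare bond types fall back to the single-bond category in this sketch.
        cat = BOND_CATEGORIES.get(bond.GetBondType(), 0)
        edges += [(i, j), (j, i)]          # one edge in each direction
        edge_cats += [cat, cat]
    for i in range(mol.GetNumAtoms()):     # self-loop on every node
        edges.append((i, i))
        edge_cats.append(SELF_LOOP)

    return nodes, edges, edge_cats
```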
Graph neural networks are a relatively recent form of deep learning architecture for reasoning about structured, relational data. As molecules are naturally structured, with atoms as nodes and bonds as edges, there is a clear affinity between graph neural networks and predictive molecular chemistry that has led to a number of developments in recent years [16, 20, 44]. In this work, the predictive component MONAS-P is instantiated as a GNN architecture.

While there are a variety of GNN designs (see [56]), a typical construction is a layer which takes a directed labelled multi-graph $G_K = (V_K, E_K)$ and computes new node and edge features based on the existing features and adjacency, yielding a new graph $G_{K+1} = (V_{K+1}, E_{K+1})$ with the same structure but new labels. Here, $V_K = \{v_i\}_{i=1:N_V}$ is the set of $N_V$ nodes, where each $v_i$ is the $i$th node's features, and $E_K = \{(e_j, s_j, t_j)\}_{j=1:N_E}$ is the set of $N_E$ edges, where each $e_j$ is the $j$th edge's features, $s_j \in V_K$ is the $j$th edge's source node and $t_j \in V_K$ is the $j$th edge's target node. The transformed $G_{K+1}$ can then be passed to another graph neural network layer, have its node features used to classify nodes [28, 37] or predict edges [27, 49], or have its features pooled and passed to some other neural network to perform classification or regression on the entire input graph [11, 54]. In the following it will be helpful to refer to the neighborhood of a given node $v_i$, i.e. the set of edges which target it, defined as $N_i = \{ j \mid (e_j, s_j, t_j) \in E_K \text{ and } t_j = v_i \}$.

In this work, a simple model of graph neural network layers of the form in Battaglia et al. [2] is used. To compute $G_{K+1}$ from $G_K$, first compute $E_{K+1} = \{(e'_j, s_j, t_j)\}_{j=1:N_E}$, where each new edge feature $e'_j$ is a function of the edge's features and its source and target node features,
$e'_j = \Phi_E(e_j, v_{s_j}, v_{t_j}),$
where $\Phi_E$ is a multi-layer perceptron (MLP). Then the mean of the new edge features over the neighborhood of each node $v_i$ is computed as
$\bar{e}_i = \frac{1}{|N_i|} \sum_{j \in N_i} e'_j.$
The new node features $V_{K+1}$ are then computed as a function of each node's features and its mean-aggregated neighborhood,
$v'_i = \Phi_V(v_i, \bar{e}_i),$
giving the updated graph $G_{K+1}$. As before, $\Phi_V$ is a multi-layer perceptron. An important property of the above construction is that, because each update is made in the context of a node or edge's adjacency, and the permutation-invariant mean aggregation of the neighborhood is used, the entire layer is invariant to permutations of the node and edge sets. Therefore, two isomorphic graphs will be updated in the same manner irrespective of the order in which the nodes (and edges) are indexed.
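The layer just described can be sketched in PyTorch as follows. Hidden sizes, the MLP depth and the omission of batch normalization are placeholders rather than the exact configuration reported later for the GEC blocks.

```python
# A minimal PyTorch sketch of the graph network layer described above
# (edge update, mean aggregation over incoming edges, node update).
import torch
import torch.nn as nn


class GNLayer(nn.Module):
    def __init__(self, node_dim: int, edge_dim: int, hidden: int = 96):
        super().__init__()
        self.phi_e = nn.Sequential(                       # Phi_E
            nn.Linear(edge_dim + 2 * node_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden))
        self.phi_v = nn.Sequential(                       # Phi_V
            nn.Linear(node_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden))

    def forward(self, v, e, src, tgt):
        # v: [N_V, node_dim], e: [N_E, edge_dim], src/tgt: [N_E] edge endpoint indices
        e_new = self.phi_e(torch.cat([e, v[src], v[tgt]], dim=-1))   # e'_j
        # Mean of incoming new edge features for each node (neighborhood N_i).
        agg = torch.zeros(v.size(0), e_new.size(-1), device=v.device)
        agg.index_add_(0, tgt, e_new)
        count = torch.zeros(v.size(0), device=v.device).index_add_(
            0, tgt, torch.ones_like(tgt, dtype=v.dtype)).clamp(min=1)
        agg = agg / count.unsqueeze(-1)
        v_new = self.phi_v(torch.cat([v, agg], dim=-1))              # v'_i
        return v_new, e_new
```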
When classifying graphs, it is often helpful to 'coarsen' the graph by merging nodes [56, Section V.C] to reduce the number of parameters and avoid over-fitting. In this work, the edge contraction pooling operator EdgePool [10], extended here to support edge features, is used to achieve this effect. EdgePool allows the graph coarsening to be learned through parameters $W$ and $b$, such that a raw score can be assigned to each edge $e_j$ from source node $s_j$ to target node $t_j$ as
$r(e_j) = W\big( v_{s_j} \oplus v_{t_j} \oplus e_j \big) + b,$
where $\oplus$ denotes concatenation. This is transformed into a score value $s(e_j)$ by taking the softmax over the neighborhood of the source node $s_j$: $s(e_j) = \mathrm{softmax}_{N_{s_j}}\, r(e_j)$. Edges are then iteratively contracted according to their score. That is, starting at the edge with the highest score and continuing in descending order, an edge is contracted if and only if its source and target nodes have not been involved in any other edge contraction. When an edge $e_j$ is contracted, it is removed, and its source node $s_j$ and target node $t_j$ are merged, replacing all three items with a single node $\hat{v}_j$ whose features are the average of the two node features multiplied by the edge's associated score,
$\hat{v}_j = s(e_j) \cdot \tfrac{1}{2}\big( v_{s_j} + v_{t_j} \big).$
Any incoming or outgoing edges of either $s_j$ or $t_j$ are instead connected to $\hat{v}_j$. Furthermore, any edges which have become parallel as a result of this merge have their features merged, again by averaging. The result of these steps is a process whereby the graph is coarsened in a learned fashion while connectivity is preserved. In contrast, in alternative learnable pooling approaches such as Top-K pooling [8] and Self-Attention Graph pooling [33], the fact that nodes are dropped, rather than merged, means that connectivity is lost.
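A rough sketch of the edge scoring and greedy contraction steps, in the spirit of EdgePool [10] extended with edge features, is given below. The way the softmax is normalised over edges sharing a source node, and the omission of the actual node merging and rewiring, are simplifying assumptions.

```python
# Sketch of EdgePool-style edge scoring and greedy selection of edges to contract.
import torch
import torch.nn as nn


class EdgeScorer(nn.Module):
    def __init__(self, node_dim: int, edge_dim: int):
        super().__init__()
        self.lin = nn.Linear(2 * node_dim + edge_dim, 1)   # parameters W, b

    def forward(self, v, e, src, tgt):
        raw = self.lin(torch.cat([v[src], v[tgt], e], dim=-1)).squeeze(-1)  # r(e_j)
        # Softmax of r over edges sharing the same source node (one reading of
        # the softmax over the neighborhood of s_j).
        score = torch.zeros_like(raw)
        for s in src.unique():
            mask = src == s
            score[mask] = torch.softmax(raw[mask], dim=0)                   # s(e_j)
        return score


def greedy_contraction(score, src, tgt):
    """Return indices of edges to contract: highest score first, skipping any
    edge whose endpoints were already merged. The merged node would take the
    features s(e_j) * (v_s + v_t) / 2, as described in the text."""
    used, chosen = set(), []
    for j in torch.argsort(score, descending=True).tolist():
        s, t = int(src[j]), int(tgt[j])
        if s not in used and t not in used and s != t:
            chosen.append(j)
            used.update((s, t))
    return chosen
```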
We treat the prediction of inhibition as multi-task and create a single network which predicts inhibition for all four assays. The architecture is shown in Figure 2. Initially, the input graph G is passed through three 'GNN Layer - Edge Contraction' (GEC) blocks. Each GEC block is a graph neural network layer, consisting of a node MLP $\Phi_V$ and an edge MLP $\Phi_E$, followed by edge contraction pooling.

Figure 2: The multi-task graph neural network used to predict SARS-CoV-1 protease inhibition activity for molecules. G is an input graph. Each GEC block is decomposed into a GNN layer followed by edge contraction pooling, with outputs passed to the next block. The output of each block is also passed through global mean and max pooling, with the results of all pooling concatenated together to form a vector representation of the input graph. This vector is passed through an MLP for each target assay, yielding a binary prediction for each task, f(G). The latent representation X is then passed to the Deep Energy Estimator Network (DEEN) to generate an energy $\varphi(X)$; see Section 4.

In every GEC block, both $\Phi_V$ and $\Phi_E$ are 2-layer MLPs with 96 neurons per layer, with batch normalization applied after each layer followed by ReLU activation. Each edge contraction pooling operator removes approximately half of the edges from the graph, so that the graph, after the third GEC block, is expected to have 1/8th of the edges of the input graph. After each GEC block, global mean and global max pooling are applied to the 96-dimensional node features, yielding a 192-dimensional representation of the input. The three representations from the GEC blocks are concatenated together, yielding the 576-dimensional latent representation of the input, denoted X. For each assay, a 'head' is used: a specialist MLP that predicts inhibition only for that assay. Each head is provided with the latent representation X and applies two 128-neuron feed-forward layers, with dropout applied before each layer (p = 0.25 and p = 0.5, respectively), and batch normalization and ReLU activation after. The resulting 128-dimensional vector is then passed to a single logit neuron which gives a prediction for the given assay. The per-assay predictions are passed through a sigmoid and concatenated, yielding the prediction vector f(G).

To handle the multi-task context while maintaining efficiency, the collective 331,480 molecules are treated as a single dataset and passed through the GNN in a mini-batch fashion. However, adjustments are made to the standard binary cross-entropy loss function to ensure that the network is not biased towards any given assay and that missing data is appropriately handled. For a given graph (molecule) G with ground truth $K_A$ for assay A, the loss computed for the A-th head of the model, $f_A(G)$, is
$\ell_A(G) = -\alpha_A \big[ \beta_A\, K_A \log f_A(G) + (1 - K_A) \log\big(1 - f_A(G)\big) \big],$
the loss for one single molecule/graph across all assays is given by
$\ell(G) = \sum_{A} \ell_A(G),$
where the sum runs over the assays for which G has a recorded label, and the total loss $\ell(\theta)$ to be minimized is the sum of $\ell(G)$ over all training molecules, where $\theta$ are the parameters of the GNN (omitted above for cleaner notation). The factors $\alpha_A$ and $\beta_A$ are weights towards each task and towards positive instances of that task, respectively, to handle the extreme imbalances of the data (see Table 1). For assay A with I training inhibitors, J training non-inhibitors and N total training samples, $\alpha_A = N/(I + J)$ and $\beta_A = (I + J)/I$. The role of $\alpha_A \geq 1$ is to ensure that each assay contributes equally to the total loss averaged across all N training samples, whereas $\beta_A \geq 1$ ensures that positive and negative samples for each assay contribute equally to the loss averaged across all I + J training molecules for assay A.
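A sketch of the weighted, missing-label-aware multi-task loss described above is given below, assuming missing assay labels are encoded as NaN; the tensors `alpha` and `beta` hold the per-assay weights $\alpha_A$ and $\beta_A$, and the final reduction over observed labels is an assumption.

```python
# Sketch of the weighted multi-task binary cross-entropy with missing labels masked out.
import torch


def multitask_bce(logits, labels, alpha, beta):
    """logits, labels: [batch, n_assays]; labels are 0/1 with NaN for missing.
    alpha, beta: [n_assays] tensors with alpha_A = N/(I+J) and beta_A = (I+J)/I."""
    mask = ~torch.isnan(labels)                       # which (molecule, assay) labels exist
    y = torch.nan_to_num(labels)                      # replace NaN so the arithmetic is safe
    p = torch.sigmoid(logits).clamp(1e-7, 1 - 1e-7)
    per_entry = -(beta * y * torch.log(p) + (1 - y) * torch.log(1 - p))  # weighted BCE
    per_entry = alpha * per_entry * mask              # task weight, missing labels zeroed
    return per_entry.sum() / mask.sum().clamp(min=1)  # average over observed labels
```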
Constructing a probabilistic model for the molecules, and a statistical measure of how similar two molecules are based on dataset samples, faces two main challenges: (i) in high dimensions, classical techniques like kernel density estimation break down because the volume of the space grows exponentially (the curse of dimensionality), and nonparametric methods based on a Euclidean similarity metric become inadequate; (ii) one typically resorts to parameterizing distributions in directed or undirected graphical models, but, in both cases, inference over the latent variables is in general intractable. At the root of the problem lies the estimation of the normalizing constant (a.k.a. the partition function) of probability distributions in high dimensions. There is, however, a class of models, called energy models, that are formulated around learning unnormalized densities or energy functions. For MCTS, where only the likelihood of generated molecules relative to the likelihood of the molecules in the dataset is of importance (this will become clearer in the next section), the distribution need only be known up to a constant. Therefore, learning energy functions is sufficient. In this section, we review a recent development [41] that formulates the problem of learning energy functions in terms of the empirical Bayes methodology [36, 40]. The framework is referred to as "DEEN" due to its origin in Deep Energy Estimator Networks [42]. In this work, the problem is further simplified by learning the energy function on the feature space $X = \mathbb{R}^{d_X}$ of the GNN (here $d_X = 4018$). Instead of learning a probabilistic model for the i.i.d. sequence $G_1, \dots, G_n$, we study its representation in the feature space, $X_1, \dots, X_n$ (see Figure 2).

This simplification is made for technical reasons, since the denoising methodology of empirical Bayes is formulated in Euclidean space, where specifically the isotropic Gaussian $\mathcal{N}(0, \sigma^2 I_d)$ that defines the noise model plays a central role. Geometrically, the model is designed such that the negative gradient of the energy function evaluated on the noisy data is directed towards the clean data (see Fig. 3). In other words, learning can be viewed as "shaping" the energy function $\varphi$ such that $-\nabla\varphi$ (known as the score function [21]) points toward the data manifold. Some technical aspects of DEEN are reviewed next.

Consider the Gaussian noise model and corrupt the i.i.d. samples $X_1, \dots, X_n$ as follows:
$Y_{ij} = X_i + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2 I_d).$
An important result in empirical Bayes is that the Bayes estimator of X, given a noisy measurement Y = y, is given by $\hat{x}(y) = y + \sigma^2 \nabla \log p(y)$, where $\nabla$ is the gradient with respect to the noisy data y, and p(y) is the probability density function of the random variable $Y = X + \mathcal{N}(0, \sigma^2 I_d)$. The key step in DEEN is to parameterize the energy function of Y with a neural network $\varphi_\vartheta: \mathbb{R}^d \to \mathbb{R}$, for which the Bayes estimator takes the parametric form $\hat{x}_\vartheta(y) = y - \sigma^2 \nabla \varphi_\vartheta(y)$. Since the Bayes estimator is the least-squares estimator, the learning objective follows immediately:
$\mathcal{L}(\vartheta) = \mathbb{E}\,\big\| X - \hat{x}_\vartheta(Y) \big\|^2 = \mathbb{E}\,\big\| X - Y + \sigma^2 \nabla \varphi_\vartheta(Y) \big\|^2,$
where the expectation is over the empirical distribution of the samples $(X_i, Y_{ij})$ defined above and $\|\cdot\|$ is the Euclidean norm. The main appeal of the algorithm is that it transforms learning the energy function into an optimization problem, because it sidesteps posterior inference or MCMC approximations in latent variable models [48]. At optimality, given a neural network with sufficient capacity and enough unlabelled samples ($n \gg 1$), a good approximation to the energy function, $\varphi(y, \vartheta^*) \approx -\log p(y)$, is theoretically guaranteed (modulo a constant). A sketch of a single training step under this objective is given after the summary below.

Summarizing, in terms of learning the distribution of graph-valued molecules, three simplifications are made:
1. The problem is transformed into learning the distribution in the Euclidean space X defined by the GNN classifier. This simplification is well suited in this context, and is in line with the desire of having the molecules generated by MCTS be classified as inhibitors.
2. The statistical model is un-normalized, i.e. the energy function is defined modulo an additive constant. This is allowed since only the difference between energies appears in the reward function (introduced next), and the constant cancels out.
3. The algorithm is designed around learning the smoothed density associated with noisy data, with the hyperparameter σ. This relaxation is in fact key for regularizing the learning [41], but is not critical in the MCTS reward function, as will become clear.
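The following is a sketch of a single DEEN training step on the GNN feature vectors, following the least-squares objective above; the batching and optimizer details are placeholders.

```python
# Sketch of one DEEN training step: corrupt features with Gaussian noise and
# minimise the empirical-Bayes least-squares objective. Note the gradient of the
# energy with respect to the noisy input (double backpropagation).
import torch


def deen_step(energy_net, x, sigma, optimizer):
    y = x + sigma * torch.randn_like(x)           # Y = X + N(0, sigma^2 I)
    y.requires_grad_(True)
    phi = energy_net(y).sum()                     # scalar energy, summed over the batch
    grad_y = torch.autograd.grad(phi, y, create_graph=True)[0]   # grad_y phi(y)
    x_hat = y - sigma ** 2 * grad_y               # Bayes estimator of X given Y = y
    loss = ((x - x_hat) ** 2).sum(dim=-1).mean()  # || X - x_hat(Y) ||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```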
To automatically design new inhibitors, the conventional Upper Confidence Trees (UCT) variant of MCTS [29] is used to search over a BNF grammar of SMILES strings [30]. MCTS searches a tree of derivations, i.e. sequences of steps which generate valid SMILES strings. In each iteration of MCTS, the following steps are taken (Figure 4 depicts each of the four steps schematically):
1. Selection: starting from a root node with an associated initial non-terminal ('S'), successive child nodes are selected according to the UCB1 formula until a previously unvisited node is encountered. In Figure 4, the root node is initially unexpanded, so it is selected.
2. Expansion: the selected node is expanded, adding a child node for each grammar production applicable to its left-most non-terminal.
3. Rollout: a random derivation is generated for the selected child, such that the partially completed SMILES string at the selected node is completed at random, yielding a valid SMILES string. In our case, the rollout is performed uniformly at random, except when a terminal symbol is available, in which case decisions are made uniformly at random over the available terminal symbols. The completed SMILES string is then converted into a molecular representation and evaluated using the reward function w.
4. Backpropagation: the evaluated reward $w_\beta(G)$ of the molecule G is backpropagated through the tree. Every node that was visited has its average reward $\bar{w}$ and maximum reward $w^{\max}$ updated. Additionally, the visit count n of each visited node is incremented.

We use a mild adjustment of the UCB1 (Upper Confidence Bound 1 applied to trees) formula [1] to determine the appropriate node/production to use in each step of the selection process. The modified formula augments the usual exploitation term with the maximum observed reward $w^{\max}_i$, where, for node i, $w^{\max}_i$ is the maximum reward observed, $\bar{w}_i$ is the average reward observed, $n_i$ is the total number of samples, $N_i$ is the total number of samples at the parent of node i, and c is the exploration/exploitation parameter. The introduction of the maximum reward term $w^{\max}_i$ is simply to help the system pursue particularly promising samples which would otherwise be avoided due to a few poor-quality samples from the same node. Once all non-terminals have been cleared, either by selection or by rollout, a complete SMILES string is generated for evaluation. As stated, the rollout policy is random, but prioritizes terminals where possible in order to bias sampling towards smaller molecules.

Every time MCTS generates a SMILES string, it is converted to a molecule and then to a graph, as described in Section 2. By passing the sample molecule G through the GNN, a prediction of inhibition $f_A(G)$ is obtained for assay A, together with a latent representation X of the molecule, which is further passed through the DEEN model to yield an energy $\varphi(X)$. With $\varphi_{\min}$ defined as the minimum energy of any known inhibitor in the test set, the reward $w_\beta(G)$ returned to MCTS combines the inhibition prediction $f_A(G)$ with a penalty that grows with the excess energy $\varphi(X) - \varphi_{\min}$, meaning that the sample molecule is heavily penalized for either a low inhibition prediction or a high energy. The β term is a hyperparameter controlling the smoothness of the energy component of the reward function, with a default value determined by the maximum and minimum energies computed over the test set. A simple but important property of the reward function is that it is invariant under $\varphi \to \varphi + C$. The reward function must be invariant to this transformation since the energy function itself is defined modulo an additive constant.
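The two guidance signals can be sketched as follows. The exact forms of the reward (Eq. 14 in the original) and of the modified UCB1 formula are not reproduced here; the exponential energy penalty and the equal weighting of maximum and mean reward below are assumptions that merely respect the properties stated in the text (penalising high energy, shift invariance in $\varphi$, and favouring nodes with a high maximum reward).

```python
# Hypothetical sketch of the MCTS reward and selection score; the exact formulas
# used in the paper are not reproduced, only their stated properties.
import math


def reward(f_A: float, phi: float, phi_min: float, beta: float) -> float:
    """f_A: predicted inhibition probability; phi: energy of the latent X."""
    energy_penalty = math.exp(-beta * max(0.0, phi - phi_min))  # invariant under phi -> phi + C
    return f_A * energy_penalty


def ucb1_modified(w_max: float, w_mean: float, n: int, N: int,
                  c: float = math.sqrt(2)) -> float:
    """Selection score for a node with max reward w_max, mean reward w_mean,
    n visits and N visits at its parent."""
    exploit = 0.5 * (w_max + w_mean)           # assumed blend of max and mean reward
    explore = c * math.sqrt(math.log(N) / n)
    return exploit + explore
```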
The automatic molecule design framework comprises three steps:
1. The ensemble of GNNs is trained to predict inhibition for the four assays (§6.1).
2. DEEN is then trained on the feature space of molecules generated by the ensemble of GNNs, to learn an energy model of the dataset (§6.2).
3. The GNN ensemble and the DEEN model are used in the reward function to guide MCTS, generating potential novel inhibitors (§6.3).

To further boost performance, given the relatively small number of positive samples, a simple bootstrap-aggregating ensemble of models [4] was used, in which the outputs of the ensemble members are averaged to generate predictions. To build the ensemble, the training data is split into five non-overlapping folds of equal size. Each model is trained on a different set of four folds in a standard cross-fold manner, yielding five models which have been trained under different conditions. For each model, the unused fold serves as a validation set. The parameters of each model at the epoch with the lowest validation loss are used to construct the final ensemble.

The data was first split into two disjoint subsets: 80% of the data for training and 20% for testing. The training data is then split into five folds, and a network is trained on each subset of four folds, yielding five models. The various subsets of data are stratified such that each contains approximately the same number of positive and negative samples for each task. Each model was trained using the ADAM optimizer [25], with an initial learning rate of $2 \times 10^{-5}$, a batch size of 128 and weight decay of $1 \times 10^{-4}$. Each model was trained until the validation loss had not improved for 20 epochs; here, the validation loss is computed on the unused fold for each model. A typical training cycle is shown in Figure 5; the lowest validation loss for this model was observed at epoch 15.

Figure 5: Losses of a single ensemble member (see Table 2), measured during training and reported for each assay. The validation loss is lowest at epoch 15.

A full listing of ROC AUC scores of the models is given in Table 2, alongside those of the ensemble, computed by averaging the outputs of the member models. The ensemble achieves higher scores than the average score of its individual members across all assays. In general, the ensemble has a 75-80% likelihood of ranking an inhibitor higher than a non-inhibitor across all tasks. Although there is room for further improvement, meaningful classification performance is clearly demonstrated.

To validate the MONAS-P architecture, it was compared empirically with ensembles of other GNN types found in the molecular chemistry literature: Message Passing Neural Networks (MPNN; [16]), Directed Message Passing Neural Networks (D-MPNN; [52]), Attentive FingerPrints (Attentive FP; [51]), and the full Graph Networks architecture of Battaglia et al. [2] with global conditioning. The implementation details for each method are provided in the Appendix. The results of these comparisons are shown in Table 3, all for the same reserved subset of test data. MONAS-P outperforms all other approaches studied, except on assay 652,038, where it is comparable to Attentive FP and Graph Networks and is marginally outperformed by D-MPNN. Furthermore, MONAS-P provides the highest ROC AUC score averaged across all four assays, confirming that it is well suited to the problem. Note that the intent of these experiments is not to conduct an exhaustive empirical comparison on this task, but instead to provide context and verify that MONAS-P is both appropriate and comparable to other state-of-the-art techniques.

Table 3: Inhibition prediction performance comparison. For each assay the best performing model's ROC AUC score is highlighted in bold. Each result is reported for a bagging ensemble of five models, each trained on different subsets of the training data.

DEEN is trained on the feature space of the training molecules with Gaussian noise of σ = 0.25. The 576-dimensional latent representations taken across all five members of the ensemble are concatenated to yield a single 2880-dimensional latent representation (labeled X in Section 4). An MLP was used with hidden layers of 3072, 2048 and 1024 neurons, followed by a single linear output neuron. Each layer, except for the first and last, uses a skip connection so that it takes as input the concatenation of the two previous layers' outputs. All neurons use the smoothed ReLU activation function $u/(1 + e^{-u})$, which is known by two different names: SiLU [12] and Swish [39]. The choice of a smooth activation function is important here due to the double backpropagation (one in y to compute the loss, the other in ϑ to compute the gradient of the loss) required for a single SGD update. The ADAM optimizer is again used, with a learning rate of $10^{-5}$ and a batch size of 128. DEEN is trained for 200 epochs.
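The DEEN energy network described above can be sketched as follows. The exact wiring of the skip connections (concatenating the two previous layers' outputs) is interpreted from the text, and the input width assumes the 2880-dimensional concatenated ensemble features.

```python
# Sketch of the DEEN energy network: an MLP with SiLU activations where each
# hidden layer after the first receives the concatenation of the two previous
# layers' outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EnergyNet(nn.Module):
    def __init__(self, in_dim: int = 2880, widths=(3072, 2048, 1024)):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, widths[0])
        self.fc2 = nn.Linear(in_dim + widths[0], widths[1])
        self.fc3 = nn.Linear(widths[0] + widths[1], widths[2])
        self.out = nn.Linear(widths[2], 1)        # single linear output neuron

    def forward(self, x):
        h1 = F.silu(self.fc1(x))
        h2 = F.silu(self.fc2(torch.cat([x, h1], dim=-1)))
        h3 = F.silu(self.fc3(torch.cat([h1, h2], dim=-1)))
        return self.out(h3).squeeze(-1)           # scalar energy per sample
```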
For each assay, MCTS optimizes the reward function for one million samples. This process is repeated ten times, yielding a total search of ten million molecules per assay. Once MCTS has terminated, the top 30K unique sampled molecules are identified for each assay, where uniqueness is determined by comparing canonical SMILES strings. Additionally, in order to bias the search toward smaller molecules, any derivation branch which reaches a depth ≥ 30 immediately prioritizes terminals to pursue termination of the derivation. The exploration constant c is set to √2.

Figure 7 shows eight known inhibitors for assay 1,706 in comparison to eight inhibitors discovered with and without energy regularization. This result clearly indicates the importance of the energy model in the reward function. The inhibitors discovered with energy regularization enabled closely resemble molecules found within the dataset. In contrast, the inhibitors discovered without energy regularization (β = 0) are often unusually simple and linear. To highlight this point, Figure 8 compares the energies (as computed by the energy model) of the inhibitors discovered with and without energy regularization, and of known (testing) inhibitors. Many of the proposed inhibitors generated without energy regularization are far from the learned manifold of the known inhibitors, while the distribution of inhibitors discovered with regularization is centered within the range of known-inhibitor energy values. In fact, they are much more concentrated than the known inhibitors, suggesting that perhaps the best explored inhibitors lie within a similar subspace of the overall combinatorial space of molecules. Figure 9 demonstrates that the energy model has also captured the general 'synthesizability' property of the original dataset. In this figure, the estimated synthetic accessibility score [13] of each discovered potential inhibitor is plotted against its energy for assay 1,706, both with (red) and without (blue) energy regularization. Molecules discovered with regularization tend to have substantially lower estimated synthetic accessibility scores, meaning that it is more likely that they are practically synthesizable.

With the goal of designing novel inhibitors for SARS-CoV-1 and SARS-CoV-2, MONAS brings together GNNs for discriminating 3CL$^{\text{pro}}$ and PL$^{\text{pro}}$ inhibitors based on a publicly available database, DEEN for learning the energy function of molecules in the feature space of the trained GNN, and MCTS for searching within the space of inhibitors. MCTS was guided via a novel reward function whose form was dictated by both the GNN's learned discriminator and DEEN's learned energy function. The use of an energy function here can be informally thought of as a form of soft constraint satisfaction: the "constraint" is to be statistically close to the space of molecules in the dataset, and both β and σ control the "softness".
Apart from hyperparameter search, which is inherently problem specific, the machinery developed here is quite general, and could be applied to any labeled dataset of molecules, for drug discovery in particular and in biotechnology at large.

Figure 7: The inhibitors discovered with energy regularization visually resemble molecules found within the dataset. In contrast, those discovered without energy regularization are often simple and linear.
Figure 8: Energies of the molecules discovered with and without energy regularization for assay 1,706, compared to those of known (testing) inhibitors.
Figure 9: Energy vs. estimated synthetic accessibility [13] with (red) and without (blue) energy regularization for assay 1,706.

References
[1] Finite-time analysis of the multiarmed bandit problem.
[2] Relational inductive biases, deep learning, and graph networks.
[3] Predicting commercially available antiviral drugs that may act on the novel coronavirus (SARS-CoV-2) through a drug-target interaction deep learning model.
[4] Bagging predictors. Machine Learning.
[5] Geometric deep learning: Going beyond Euclidean data.
[6] A survey of Monte Carlo tree search methods.
[7] Deep learning system to screen coronavirus disease 2019 pneumonia.
[8] Towards sparse hierarchical graph classifiers.
[9] Evolutionary multi-objective design of SARS-CoV-2 protease inhibitor candidates.
[10] Edge contraction pooling for graph neural networks.
[11] Convolutional networks on graphs for learning molecular fingerprints.
[12] Sigmoid-weighted linear units for neural network function approximation in reinforcement learning.
[13] Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions.
[14] Potential inhibitors for novel coronavirus protease identified by virtual screening of 606 million compounds.
[15] Deep learning in drug discovery.
[16] Neural message passing for quantum chemistry.
[17] Deep learning for computational chemistry.
[18] Representation learning on graphs: Methods and applications.
[19] Large-scale ligand-based virtual screening for SARS-CoV-2 inhibitors using deep neural networks.
[20] Predicting molecular properties with covariant compositional networks.
[21] Estimation of non-normalized statistical models by score matching.
[22] JEDI grand challenge stage 1: Ultimate list of lead compounds by screening a billion molecules against COVID-19.
[23] A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space.
[24] PubChem 2019 update: Improved access to chemical data.
[25] Adam: A method for stochastic optimization.
[27] Neural relational inference for interacting systems.
[28] Semi-supervised classification with graph convolutional networks.
[29] Bandit based Monte-Carlo planning.
[30] Grammars and reinforcement learning for molecule optimization.
[31] RDKit: Open-source cheminformatics.
[32] Leveraging data science to combat COVID-19: A comprehensive review.
[33] Self-attention graph pooling.
[34] Inhibition of the main protease 3CL-pro of the coronavirus disease 19 via structure-based ligand design and molecular modeling.
[35] Large-scale comparison of machine learning methods for drug target prediction on ChEMBL.
[36] An empirical Bayes estimator of the mean of a normal population.
[37] DeepWalk: Online learning of social representations.
[38] Estimation of the size of drug-like chemical space based on GDB-17 data.
[39] Swish: A self-gated activation function.
[40] An empirical Bayes approach to statistics.
[41] Neural empirical Bayes.
[42] Deep energy estimator networks.
[43] The graph neural network model.
[44] Message-passing neural networks for high-throughput polymer screening.
[45] Deep reinforcement learning for multiparameter optimization in de novo drug design.
[46] AI-aided design of novel targeted covalent inhibitors against SARS-CoV-2. bioRxiv.
[47] Rapid identification of potential inhibitors of SARS-CoV-2 main protease by deep docking of 1.3 billion compounds.
[48] Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning.
[49] Dynamic graph CNN for learning on point clouds.
[50] SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules.
[51] Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism.
[52] Analyzing learned molecular representations for property prediction.
[53] Deep learning based drug screening for novel coronavirus 2019-nCoV.
[54] Hierarchical graph pooling with structure learning.
[55] Potential non-covalent SARS-CoV-2 3C-like protease inhibitors designed using generative deep learning approaches and reviewed by human medicinal chemist in virtual reality.
[56] Graph neural networks: A review of methods and applications.
Appendix

Each ensemble of networks is trained using the same five folds of training data, with each member of the ensemble using a different fold as a validation set for early stopping. In each case the final model chosen is that which achieved the lowest loss on its validation set, and early stopping is triggered when the validation loss has not decreased for 20 epochs. For each network variant, the hyper-parameters were chosen by hand-tuning the parameter settings.
• Weight decay = 10⁻

Graph Network [2] parameters:
• Node, edge and global dimensionality = 96
• Number of message passing layers T = 3
• Dropout on output layers p = 0.25 and p = 0.5, respectively
• Batch size = 128
• Learning rate = $2 \times 10^{-5}$