GEOM: Energy-annotated molecular conformations for property prediction and molecular generation
Simon Axelrod and Rafael Gomez-Bombarelli
2020-06-09

Machine learning outperforms traditional approaches in many molecular design tasks. Although molecules are often thought of as 2D graphs, they in fact consist of an ensemble of inter-converting 3D structures called conformers. Molecular properties arise from the contribution of many conformers, and in the case of a drug binding a target, may be due mainly to a few distinct members. Molecular representations in machine learning are typically based on either one single 3D conformer or on a 2D graph that strips geometrical information. No reference datasets exist that connect these graph and point cloud ensemble representations. Here, we use first-principles simulations to annotate over 400,000 molecules with the ensemble of geometries they span. The Geometrical Embedding Of Molecules (GEOM) dataset contains over 33 million molecular conformers labeled with their relative energies and statistical probabilities at room temperature. This dataset will assist benchmarking and transfer learning in two classes of tasks: inferring 3D properties from 2D molecular graphs, and developing generative models to sample 3D conformations.

Machine learning outperforms traditional rule-based baselines in many molecule-related tasks, including property prediction and virtual screening [1-3], inverse design using generative models [4-11], reinforcement learning [12-15], differentiable simulators [10, 16, 17], and synthesis planning and retrosynthesis [18, 19]. These applications have been enabled by reference datasets and tasks [20], and by algorithmic improvements, especially in representation learning. In particular, graph convolutional [21-23] and, more recently, equivariant [24-26] neural network architectures achieve state-of-the-art performance in a variety of tasks.

Unlike other data structures, molecules do not have an obvious basic representation. Strictly, they exist as ensembles of 3D point clouds [27]. In chemistry, they are typically represented as graphs with domain-specific annotations to describe spatial arrangement. Molecular representations in machine learning, and the existing reference datasets, typically use either graphs [28] or a single point cloud per molecule [29]. Because of the non-overlapping datasets and tasks, the interplay between graph and 3D features remains unexplored from a representation learning perspective.

Molecular representations geared for processing by humans are well-studied [27]. A molecule is a stable spatial arrangement of atoms. This arrangement fluctuates in time, as it exists on an energy surface with many local minima. It is generally possible to identify chemical bonds that connect pairs of nearby atoms in a molecule. These bonds can be classified into qualitative classes (single, double, etc.). Molecules are represented in chemistry through 2D projections of the atom point cloud onto a plane, with bonds symbolized as lines and atoms at the nodes represented by their atomic symbols. The stereochemical formula utilizes perspective and node and edge notation to capture 3D orientation beyond a simple undirected graph, and can also be transformed into a string representation like SMILES [30] or InChI [31].
Multiple 3D structures can have the same connectivity but different spatial arrangements. The stable spatial structures that can inter-convert at room temperature are called conformers. The ensemble of conformers a molecule can access, and their relative populations, are dictated by their relative energies. Conformers are typically not represented explicitly, and the projection formula is understood to embed all possible conformers. It is possible to annotate a molecular entity with one or more valid geometries through physics-based simulations. Finding the conformer with the lowest possible energy, or enumerating all thermally-accessible ones, is computationally challenging [32]. Indeed, utilizing generative and autoregressive models to guess valid, high-likelihood conformations from a molecular connectivity is an exciting and active area of development [33-42].

Thus, two classes of learning tasks and two corresponding representations are generally of interest. Surrogate modeling tasks aim to replace physics simulators, and relate molecular entities existing in one specific point cloud state (conformer) to an expensive simulation outcome. These are natively 3D point cloud tasks. For predicting experimental properties of chemical species, only the stereochemical projection formula is available as input. Graph-based neural networks, however, cannot process stereochemical features natively and flatten the representation into a plain graph, and thus struggle to differentiate between stereoisomers. Since low-energy conformers will be most abundant in the ensemble, it is common to annotate chemical species data with one arbitrary conformer to add the stereochemical data back. These point cloud annotations of the graph are obtained from cheap simulations with poor guarantees of finding representative minima. Furthermore, this addition does not result in significant performance gains [43]. This may be because the predicted property emerges from the ensemble of conformers, and not from one member.

A number of open questions thus arise in relating these disparate tasks and representations, such as (i) the capacity of graph-based architectures to infer the ensemble of 3D conformers from which properties arise, (ii) novel methods to embed the stereochemical formula without relying on explicitly sampling 3D conformations, (iii) the ability of generative models to replace expensive physics-based generation of conformations, and (iv) whether exhaustive annotation with a complete conformer ensemble, rather than a single conformer, can enhance property prediction.

Because a reference dataset is needed to probe such questions, we report the GEOM (Geometrical Embedding Of Molecules) dataset of conformer ensembles annotated with their relative energies and populations. The dataset covers a broad chemical space of combinatorially generated small molecules and drug-like molecules. We propose synthetic regression tasks on properties of the conformational ensemble spanned by each stereochemical formula, and report the performance of existing and modified graph- and 3D-based baselines. In addition to benchmarks for regression models, GEOM provides data for pre-training on 3D-related tasks and for training generative models for molecular structure. A number of reference datasets exist for surrogate simulators and for chemical species property prediction.
Simulated data on molecular entities include QM9 [44], with density functional theory (DFT) calculations of molecular properties for one low-energy equilibrium conformer of molecules with up to 9 heavy atoms, and the ANI suite of datasets, which samples non-equilibrium conformations for QM9 and larger molecules, as well as higher levels of theory [45-47]. MD-17 provides thoroughly sampled conformations for a small number of molecules, obtained with ab initio wavefunction methods and long molecular dynamics simulations [48]. Datasets for predicting macroscale experimental properties of molecules include quantitative (ESOL [49], FreeSolv [50], PDBbind-F [51]) and categorical (BACE, HIV, ClinTox, Tox21, SIDER, BBBP) benchmarks. MoleculeNet [52] and DeepChem [20] collect and host many of these. Unlabeled molecule sets for generation tasks, such as ChEMBL [53], ZINC [54] and its subsets [5], and related benchmarks [55, 56], have also been reported.

The dataset is available online at [57]. A tutorial for loading the data can be found at [58]. We used the CREST [59] software to generate conformers for 292,035 drug-like molecules and 133,318 molecules from the QM9 dataset. The drug-like molecules were accessed as part of AICures [60], an open machine learning challenge to predict which drugs can be repurposed to treat COVID-19 and related illnesses. In particular, we generated conformers for 278,622 drugs that have been tested for in vitro inhibition of the SARS-CoV 3CL protease [61] (data accessed from [62]; 411 hits), 5,755 drugs from the Broad Repurposing Hub [63] (data accessed from [64]; SARS-CoV 3CL activity treated here as unknown), and 218,632 molecules tested for in vitro inhibition of the SARS-CoV PL protease [65, 66] (660 hits, with 98% of the molecules also contained in [61]). Finally, the dataset contains 2,062 molecules that have been screened for growth inhibition of Pseudomonas aeruginosa (data accessed by request from [67]; 23 hits) and 1,580 molecules screened for E. coli inhibition [1, 68] (57 hits). Secondary infections of COVID-19 patients can be caused by both of these pathogens.

Statistics of molecular descriptors for the dataset are given in Table 1. The dataset of drug-like molecules consists of medium-sized organic compounds, containing an average of 44.2 atoms (24.8 heavy atoms), up to a maximum of 181 atoms (91 heavy atoms). They span a wide range of flexibility, as demonstrated by the mean (6.5) and maximum (39) number of rotatable bonds. 15% (43,509) of the molecules have specified stereochemistry, while 26% (75,612) contain stereocenters, whether specified or not. The QM9 dataset is limited to 9 heavy atoms (29 total atoms), with a much smaller molecular mass and few rotatable bonds. 72% (95,734) of the QM9 species have specified stereochemistry.

Generation of conformers ranked by energy is computationally complex. The exhaustive method is to enumerate all possible rotations around every bond, but this approach scales exponentially [69, 70]. Basic algorithms are available in cheminformatics packages such as RDKit [71], but suffer from two flaws. First, they explore conformational space very sparsely through a combination of pre-defined distances and stochastic samples [72], and can miss many low-energy conformations. Second, conformer energies are determined with classical force fields, which are rather inaccurate [73].
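Descriptors like those summarized in Table 1 (atom counts, rotatable bonds, molecular mass) are straightforward to compute with RDKit. Below is a minimal illustrative sketch; the SMILES strings are hypothetical placeholders, not dataset entries.

```python
# Illustrative only: computing Table 1-style descriptors with RDKit.
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

smiles_list = ["CCO", "CC(=O)Nc1ccc(O)cc1"]  # hypothetical placeholder molecules

for smiles in smiles_list:
    mol = Chem.MolFromSmiles(smiles)
    mol_h = Chem.AddHs(mol)  # add explicit hydrogens for the total atom count
    print(
        smiles,
        mol_h.GetNumAtoms(),              # total number of atoms
        mol.GetNumHeavyAtoms(),           # number of heavy atoms
        Lipinski.NumRotatableBonds(mol),  # number of rotatable bonds
        Descriptors.MolWt(mol),           # molecular mass (g/mol)
    )
```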
By contrast, molecular dynamics simulations, in particular meta-dynamics approaches, can sample conformational space more exhaustively, but need to evaluate an energy function many times. Likewise, ab initio methods, such as DFT, can accurately assign energies to conformers, but are also orders of magnitude more computationally demanding than force fields. An efficient balance is offered by the newly developed CREST software [59]. This program uses semi-empirical tight-binding density functional theory (GFN2-xTB) for energy calculation. The predicted energies are significantly more accurate than classical force fields, accounting for electronic effects, rare functional groups, and the breaking and formation of labile bonds, but are computationally less demanding than full DFT. Moreover, the search algorithm is based on metadynamics, a well-established thermodynamic sampling approach that can efficiently explore the low-energy search space. Finally, the CREST software identifies and groups rotamers, conformers that are identical except for atom re-indexing. It then assigns each conformer a probability through
$$p_i = \frac{d_i \, e^{-E_i / k_B T}}{\sum_j d_j \, e^{-E_j / k_B T}}.$$
Here $p_i$ is the statistical weight of the $i$th conformer, $d_i$ is its degeneracy (i.e., how many chemically and permutationally equivalent rotamers correspond to the same conformer), $E_i$ is its energy, $k_B$ is the Boltzmann constant, $T$ is the temperature, and the sum is over all conformers.

CREST runs on the drug dataset took an average of 2.8 hours of wall time on 32 cores on Knights Landing (KNL) nodes (89.1 core hours), and 0.63 hours on 13 cores on Cascade Lake and Sky Lake nodes (8.2 core hours). QM9 jobs were only performed on the latter two node types, and took an average of 0.04 wall hours on 13 cores (0.5 core hours). In total, 13 million KNL core hours and 1.2 million Cascade Lake/Sky Lake core hours were used.

The GEOM dataset is significant for three key reasons. The first is that it provides high-quality 3D structures, energies, and probabilities for a large number of drug-like molecules. These expensive annotations may result in increased performance in property prediction tasks. If one is interested in drug repurposing, rather than generation of entirely new molecules, the search space of existing drugs is already annotated and no new conformers are needed. Second, the dataset can be used for training generative models to predict conformations. These models can be used to generate conformations of unseen molecules to bypass ab initio simulations. Third and most important, the dataset provides summary statistics for each molecule that are related to conformational degrees of freedom (conformational entropy, Gibbs free energy, average energy, and number of unique conformers). All of these are aggregate properties that represent the 3D ensemble, but emerge from the molecular graph in a known way. Hence, this dataset makes it possible to test representation learning strategies across the graph ⇔ single conformer ⇔ conformer ensemble hierarchy on tasks that are ultimately 3D, but fully emergent from the graph. This is applicable both as a benchmark task for new architectures and as a pre-training strategy to be transferred to low-data 3D-driven tasks like drug-target binding. Here we focus on this third application. We compare different neural network architectures and their ability to predict summary statistics. Moreover, we ask whether limited 3D information, such as that of only the highest-probability conformer, can improve predictive performance.
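As a concrete illustration, the probability formula above can be evaluated in a few lines of NumPy. This is a minimal sketch with made-up energies and degeneracies; since the energies are molar, the gas constant R plays the role of $k_B$.

```python
# Minimal sketch of the conformer probability formula (illustrative values).
import numpy as np

R = 0.0019872  # gas constant in kcal/(mol K); plays the role of k_B for molar energies
T = 298.15     # room temperature in K

energies = np.array([0.0, 0.5, 1.2])  # E_i relative to the lowest conformer (hypothetical)
degeneracies = np.array([1, 2, 1])    # d_i (hypothetical)

weights = degeneracies * np.exp(-energies / (R * T))
probs = weights / weights.sum()  # p_i; sums to 1
print(probs)
```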
Here we discuss various 2D and 3D message-passing neural network architectures [74] used to predict molecular properties. A molecule can be thought of as a graph, consisting of a set of nodes (atoms) connected to each other by a set of edges (bonds). Both the nodes and edges have features. The atoms, for example, can be characterized by their atomic number and partial charge. The bonds can be characterized by bond order. Message-passing neural networks use these node and edge features to create a learned fingerprint (representation) for the molecule. This is called the message passing phase. The fingerprint is used as input to a function that predicts a property. This stage is called the readout phase [75].

The message passing phase consists of $T$ steps, or convolutions. In what follows, superscripts denote the convolution number. The node features of the $v$th node are $x_v$, and the edge features between nodes $v$ and $w$ are $e_{vw}$. The atom features $x_v$ are initially mapped to another set of vectors $h_v^0$, termed hidden states. In the $t$th convolution, a message $m_v^{t+1}$ is created, which combines $h_v$ and $h_w$ for each pair of nodes $v$ and $w$ with edge features $e_{vw}$ [74, 75]:
$$m_v^{t+1} = \sum_{w \in N(v)} M_t\left(h_v^t, h_w^t, e_{vw}\right),$$
where $N(v)$ is the set of neighbors of $v$ in graph $G$, and $M_t$ is a message function. The hidden states are updated using a vertex update function $U_t$:
$$h_v^{t+1} = U_t\left(h_v^t, m_v^{t+1}\right).$$
The readout phase then uses a function $R$ to map the final hidden states to a property $y$, through
$$\hat{y} = R\left(\{h_v^T \mid v \in G\}\right).$$

For 2D graphs we adopt the directed message-passing approach of Ref. [76], with the implementation used in Ref. [75], the latter of which is called ChemProp. The detailed analysis of Ref. [75] showed that ChemProp achieves state-of-the-art performance on a wide range of regression and classification tasks. The ChemProp code was accessed through [77]. In this implementation, hidden states $h_{vw}^t$ and messages $m_{vw}^t$ are used, rather than node-based states $h_v^t$ and messages $m_v^t$. Here the direction matters, as in general $h_{vw}^t \neq h_{wv}^t$ and $m_{vw}^t \neq m_{wv}^t$. This implementation helps to avoid messages that loop back to the original node [75, 78]. The edge hidden states are initialized through
$$h_{vw}^0 = \tau\left(W_i \, \mathrm{cat}(x_v, e_{vw})\right),$$
where $W_i \in \mathbb{R}^{h \times h_i}$ is a learned matrix, $\mathrm{cat}(x_v, e_{vw}) \in \mathbb{R}^{h_i}$ is the concatenation of the atom features $x_v$ for atom $v$ and the bond features $e_{vw}$ for bond $vw$, and $\tau$ is the ReLU activation function [79]. The message passing function is simply
$$m_{vw}^{t+1} = \sum_{k \in N(v) \setminus \{w\}} h_{kv}^t.$$
The edge update function is the same neural network at each step:
$$h_{vw}^{t+1} = \tau\left(h_{vw}^0 + W_m \, m_{vw}^{t+1}\right),$$
where $W_m \in \mathbb{R}^{h \times h}$ is a learned matrix with hidden size $h$. The message-passing phase applies these updates for $t \in \{1, ..., T\}$. After the final convolution, the atom representation of the molecule is recovered through
$$m_v = \sum_{w \in N(v)} h_{vw}^T, \qquad h_v = \tau\left(W_a \, \mathrm{cat}(x_v, m_v)\right),$$
where $W_a$ is a learned matrix. The hidden states are then summed to give a feature vector for the molecule: $h = \sum_{v \in G} h_v$. Properties are predicted through $\hat{y} = f(h)$, where $f$ is a feed-forward neural network.

In ChemProp the atom features are atom type, number of bonds, formal charge, chirality, number of bonded hydrogen atoms, hybridization, aromaticity, and atomic mass. The bond features are the bond type (single, double, triple, or aromatic), whether the bond is conjugated, whether it is part of a ring, and whether it contains stereochemistry (none, any, E/Z or cis/trans). All features are one-hot encodings. Non-learnable features are incorporated through concatenation with $h$ before applying the readout network. Details of architecture hyperparameters can be found in the SM.

A variety of graph convolutional models have been proposed for learning force fields, which map a set of 3D atomic positions of a molecular entity to an energy.
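To make the update equations concrete, the sketch below implements one generic message-passing step in PyTorch. It is illustrative only, not the ChemProp implementation: it uses node-based rather than directed edge-based states, omits the edge features $e_{vw}$, and assumes a GRU cell as the update function $U_t$.

```python
# Generic message-passing step (a sketch, not ChemProp's directed variant).
import torch
import torch.nn as nn

class SimpleMPNNStep(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.message_fn = nn.Linear(2 * hidden_dim, hidden_dim)  # M_t (edge features omitted)
        self.update_fn = nn.GRUCell(hidden_dim, hidden_dim)      # U_t

    def forward(self, h, adj):
        # h: [n_nodes, hidden] hidden states; adj: [n_nodes, n_nodes] 0/1 adjacency
        n = h.size(0)
        pairs = torch.cat(
            [h.unsqueeze(1).expand(n, n, -1),   # h_v, broadcast over w
             h.unsqueeze(0).expand(n, n, -1)],  # h_w, broadcast over v
            dim=-1,
        )
        messages = self.message_fn(pairs)              # M_t(h_v, h_w)
        m = (adj.unsqueeze(-1) * messages).sum(dim=1)  # sum over neighbors N(v)
        return self.update_fn(m, h)                    # h_v^{t+1} = U_t(h_v^t, m_v^{t+1})

h = torch.randn(5, 64)                   # 5 atoms, hidden size 64
adj = (torch.rand(5, 5) > 0.5).float()   # toy adjacency matrix
print(SimpleMPNNStep(64)(h, adj).shape)  # torch.Size([5, 64])
```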
Architectures designed for force fields typically do not incorporate graph information [29, 45, 46, 80], since chemical bonds are broken during reactions and may not be clearly defined. This contrasts with architectures for property prediction, which are typically graph-based [21, 75, 76] but can benefit from 3D information [74]. Here we explore both possibilities. In one case we modify the SchNet force field architecture [29, 80] (code adapted from [81]) to predict properties. In a second case we modify the ChemProp model to include distance-based edge features between bonded and non-bonded atoms. This is in addition to the regular graph edge features between bonded atoms. We call this model ChemProp3D.

In the SchNet model, the feature vector of each atom is initialized with an embedding function. This embedding generates a learnable vector, initialized randomly, that is shared by all atoms with a given atomic number. The edge features at each step $t$ are generated through a so-called filter network $W^t$. The filter network converts the distance between two atoms, $||r_v - r_w||$, into an edge vector $e_{vw}$. This is accomplished by expanding the distance between atoms $v$ and $w$ in a basis of Gaussian functions. The centers of these Gaussians are evenly distributed up to a cutoff radius, taken here to be 5.0 Å. Further linear and non-linear (shifted-softplus) operations are applied. Because only the distance between two atoms is used to create $e_{vw}$, the features produced are invariant to rotations and translations. In each convolution $t+1$, the new messages and hidden vectors are given by
$$m_v^{t+1} = \sum_{w \in N(v)} I_t^{\mathrm{in}}\left(h_w^t\right) \circ W^t(e_{vw}), \qquad h_v^{t+1} = h_v^t + I_t^{\mathrm{out}}\left(m_v^{t+1}\right).$$
Here, $\circ$ denotes element-wise multiplication, and $I_t^{\mathrm{in}}$ and $I_t^{\mathrm{out}}$ together form the so-called interaction block. The interaction block consists of a set of linear and non-linear operations applied atom-wise to the atomic features, before and after multiplication with $W^t$. In the original SchNet implementation, the readout layer converted each atomic feature vector into a single number, and the numbers were summed to give an energy. Consistent with the ChemProp model and the notion of property prediction, we here instead convert the node features into a molecular fingerprint, and then apply the readout function to the fingerprint. Details of our implementation of the SchNet model can be found in the SM.

We follow the spirit of both SchNet and ChemProp to produce the ChemProp3D model. Rather than only considering neighbors bonded to an atom, we consider all neighbors within a 5 Å cutoff. For bonded neighbors, edge features are a concatenation of bond features and distance features. For non-bonded neighbors, edge features are a zero-array concatenated with distance features. Distances are expanded in a set of 50 Gaussian functions, distributed evenly every 0.1 Å up to a maximum of 5 Å, and then passed through a fully-connected layer and activation function. That is, in the $t$th convolution, distances are converted to vectors through
$$e_{vw}^t = \tau\left(W_S^t \left[g_1(r_{vw}), \ldots, g_{50}(r_{vw})\right] + b^t\right),$$
where $r_{vw} = ||r_v - r_w||$ is the distance, $g_i$ is the $i$th Gaussian function, $W_S^t$ is a learned SchNet matrix, $b^t$ is a bias, and $\tau$ is the SchNet activation function. Consistent with the original SchNet paper, we use the shifted softplus for the distance activation, but use the ReLU in all other places. We also use the ring size as an atomic feature for atoms in rings.

[Table 3 header: Learnable | Graph | 3D (lowest state) | 3D (10 lowest states)]

The above discussion applies to molecules associated with one geometry.
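The Gaussian distance expansion is simple to reproduce. A short sketch follows; the Gaussian widths are an assumption, set here to the grid spacing.

```python
# Sketch of the Gaussian distance expansion: 50 centers spaced ~0.1 A up to 5 A.
import torch

def gaussian_expand(distances, n_gaussians=50, cutoff=5.0):
    # distances: [n_pairs] tensor of interatomic distances in Angstrom
    centers = torch.linspace(0.0, cutoff, n_gaussians)  # evenly spaced centers
    width = centers[1] - centers[0]                     # assumed width, ~0.1 Angstrom
    diff = distances.unsqueeze(-1) - centers            # [n_pairs, n_gaussians]
    return torch.exp(-0.5 * (diff / width) ** 2)        # g_i(r_vw)

print(gaussian_expand(torch.tensor([1.5, 3.2])).shape)  # torch.Size([2, 50])
```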
In the GEOM dataset, however, multiple conformers can be used for a single stereochemical formula. There are two immediate possibilities for pooling these conformers. The first, which we call WeightPool, is to create molecular fingerprints for each conformer, multiply each by its statistical weight, and add them. The second, which we call NnPool, is to use the fingerprint and the statistical weight as inputs to a neural network that generates a final fingerprint. The different pooling options are then
$$h_{\mathrm{WeightPool}} = \sum_i p_i \, h_i, \qquad h_{\mathrm{NnPool}} = \tau\left(\sum_i W_{\mathrm{pool}} \, \mathrm{cat}(p_i, h_i) + b_{\mathrm{pool}}\right).$$
The first case multiplies the $i$th fingerprint $h_i$ by its weight $p_i$ and sums the result. The second case multiplies a learned matrix $W_{\mathrm{pool}}$, of dimension $h_i \times (h_i + 1)$, with the concatenation of $p_i$ and $h_i$ before summing the result, adding a bias $b_{\mathrm{pool}}$, and applying a non-linear operation $\tau$. NnPool is of interest for applications in which the target property is dominated by conformers of low statistical weight. This can often be the case in therapeutics, in which a single low-probability conformer can result in high-affinity binding.

We trained different models to predict three quantities related to conformational information. The first quantity is the ensemble entropy, $S = -R \sum_i p_i \log p_i$ [59], where the sum is over the statistical probabilities $p_i$ of the $i$th conformer, and $R$ is the gas constant. The conformational Gibbs free energy is related to $S$ through $G = -TS$ [59]. The conformational entropy is a measure of the conformational degrees of freedom available to a molecule. A molecule with only one conformer has an entropy of exactly 0, while a molecule with equal statistical weight for an infinite number of conformers has infinite conformational entropy. The conformational Gibbs free energy is an important quantity for predicting the binding affinity of a drug to a target. The affinity is determined by the change in Gibbs free energy of the molecule and protein upon binding, which includes the loss of molecular conformational free energy [82]. The second quantity is the average conformational energy, given by $\bar{E} = \sum_i p_i E_i$, where $E_i$ is the energy of the $i$th conformer. Each energy is defined with respect to the lowest-energy conformer. The third quantity is the number of unique conformers for a given molecule, as predicted by CREST within a maximum energy window [59].

The models include varying degrees of 3D information and various levels of learnable molecular embeddings. A summary of the information contained in each approach is given in Table 3. In quantum chemistry one often attempts to optimize a geometry so that its energy is at a global minimum. We asked how much this ground-state geometry could improve training by incorporating its 3D information in the SchNet and ChemProp3D models. We also considered the impact of graph information, which is contained in all models except for SchNet, as well as non-learnable features, through the inclusion of Morgan [83] and E3FP [84] fingerprints. Morgan fingerprints contain only graph information, while E3FP fingerprints also contain 3D information. It is informative to know if limited knowledge of the conformers of a molecule, e.g. through a short MD run, could improve training further. To this end we also incorporated a statistically weighted average of E3FP fingerprints, using only the 10 lowest-energy conformers of each molecule. In this case a fingerprint was produced by multiplying the E3FP fingerprint of each conformer by its statistical weight (properly re-normalized to account for missing conformers) and adding the results. A description of the architecture and training hyperparameters used for each model can be found in the SM.
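A minimal PyTorch sketch of the two pooling operations defined above follows. It is an illustration, not the authors' code; the fingerprint dimension and weights are arbitrary.

```python
# WeightPool and NnPool over per-conformer fingerprints (illustrative sketch).
import torch
import torch.nn as nn

def weight_pool(h_confs, probs):
    # h_confs: [n_confs, dim] fingerprints h_i; probs: [n_confs] weights p_i
    return (probs.unsqueeze(-1) * h_confs).sum(dim=0)

class NnPool(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.W_pool = nn.Linear(dim + 1, dim, bias=False)  # W_pool
        self.b_pool = nn.Parameter(torch.zeros(dim))       # b_pool

    def forward(self, h_confs, probs):
        x = torch.cat([probs.unsqueeze(-1), h_confs], dim=-1)        # cat(p_i, h_i)
        return torch.relu(self.W_pool(x).sum(dim=0) + self.b_pool)  # tau(sum + bias)

h = torch.randn(10, 300)                   # 10 conformers, fingerprint length 300
p = torch.softmax(torch.randn(10), dim=0)  # stand-in Boltzmann weights
print(weight_pool(h, p).shape, NnPool(300)(h, p).shape)
```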
We used published architecture hyperparameter values and optimized dropout rates with SigOpt [85]. 250,000 molecules were used, with 80% for training, 10% for validation, and 10% for testing. The same training, validation, and test splits were used for all models. 2D models were trained for 30 epochs and 3D models for 100 epochs due to slower convergence. We checked that training 2D models past 30 epochs did not improve performance. The mean absolute error (MAE) was used as a performance metric. In all cases the models with the best validation scores were selected for evaluation on the test set.

The model performance is shown in Table 4, and can be contextualized by analyzing the dataset statistics in Table 2. Note that chemical accuracy for energy prediction is typically considered to be 1 kcal/mol, and sub-chemical accuracy to be 0.24 kcal/mol. It is clear that learnable fingerprint embeddings significantly improve performance, since Morgan and E3FP embeddings alone result in poor performance. The SchNet model, trained on the 3D geometry of the lowest-energy conformer, performs comparably to ChemProp in all categories. The combination of ChemProp with 3D information also leads to comparable performance, though it outperforms ChemProp on entropy prediction by 11%. With this learning architecture, one-conformer 3D information can moderately improve performance in some contexts, but the advantage is far from decisive.

Finally, we asked whether 3D models trained on pooled conformers could outperform 2D models. Specifically, we asked whether SchNet, a model built to predict energies, could implicitly learn the entropy and average energy associated with an ensemble of geometries. To this end we re-trained SchNet on a sample of 25,000 species, using up to 25 conformers per molecule, for a total of 625,000 geometries. We used both WeightPool and NnPool for SchNet, and compared results to ChemProp trained on the stereochemical formula of the same species. To differentiate these models from the earlier instance of SchNet, which used only a single geometry, we call them SchNet25-WeightPool and SchNet25-NnPool. Details of the training can be found in the SM.

The results are given in Table 5. Interestingly, we see that conformer pooling leads only to minor improvement over 2D models. In particular, the SchNet25-WeightPool model is only 2% better than ChemProp at predicting the entropy, while the NnPool model is significantly worse. The ordering is reversed for the average energy. To see why this is surprising, consider the average energy task as an example. If the model sees 25 conformers per molecule, as well as the average energy, one would expect it to learn which conformations are high-energy. Therefore, when shown a set of conformers for a new species, one would expect it to identify the high-energy structures. ChemProp, by contrast, has no access to this information and must learn from the graph alone. It is therefore intriguing that 3D information from the conformer set would offer no clear advantage.
Given the similar performance of the WeightPool model to ChemProp, it appears that state-of-the-art approaches to fingerprinting 3D structures can be improved for ensemble prediction tasks. This offers an attractive challenge to the machine learning and chemistry communities. 3D coordinates are important for predicting single-point properties, such as energies and forces, for one conformation of one molecular entity. However, here we have found mixed results regarding their ability to enhance the accuracy of predictions of ensemble-averaged quantities. 2D-based approaches to ensemble property prediction are not significantly improved by 3D information. These results indicate that either 3D information is not useful for these prediction tasks, or that the current models do not leverage 3D information in an optimal way. With access to our dataset, the community can develop improved models for leveraging 3D information for property prediction. Using our data for training, they will also be able to develop generative models that may obviate the need for expensive conformer simulations.

There are four datasets, organized by molecule type (drugs or QM9) and by whether they contain the original CREST information (drugs_crude.msgpack.tar.gz and qm9_crude.msgpack.tar.gz) or post-processed feature information (drugs_featurized.msgpack.tar.gz and qm9_featurized.msgpack.tar.gz). The notebook tutorial at [58] explains how each dataset is organized and how to extract the data. To explain the necessity of the featurized files, consider that CREST simulations allow for reactivity, so not all geometries arising from a calculation correspond to the molecular graph they started with. Indeed, we have found a number of simulations in which bonds are broken and reformed, as in tautomerization. For this reason it was necessary to examine each geometry individually after simulation to determine its molecular graph. To this end we used a locally modified version of xyz2mol [86] (code accessed from [87]) to generate an RDKit mol object [71]. The mol object's graph information is contained in the featurized files through dictionaries of atom and bond features. The SMILES string generated by xyz2mol, as well as the canonical form of this string, are also given.

Additionally, for 3D-based machine learning models, each convolution aggregates atomic information from atoms within a given cutoff radius $r_{\mathrm{cut}}$. The most efficient method of storing this information is to generate a so-called neighbor list for each atom before training. The neighbor list consists of a set of pairs of indices, each of which corresponds to two atoms that are within $r_{\mathrm{cut}}$ of each other. We included a neighbor list with $r_{\mathrm{cut}} = 5$ Å in the featurized files (see the sketch after this list). Two notes are in order:
1. Both the SMILES from xyz2mol and its corresponding canonical form may be different from the original SMILES. This may be because the graph contains more information than the original SMILES (e.g., because the original did not specify stereochemistry, meaning that one random stereoisomer was chosen to seed the CREST simulations), because the SMILES strings are resonance structures, or because the connectivity and bond types are different. The latter is the case when a chemical reaction occurs, such as tautomerism.
2. Not all conformers could be successfully converted to graphs, and so the featurized files contain fewer SMILES strings than the crude files.
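A brute-force neighbor-list construction illustrating the idea (a sketch with random placeholder coordinates, not the dataset's preprocessing code):

```python
# Neighbor list: all atom index pairs within r_cut of each other (sketch).
import numpy as np

def neighbor_list(xyz, r_cut=5.0):
    # xyz: [n_atoms, 3] Cartesian coordinates in Angstrom
    diff = xyz[:, None, :] - xyz[None, :, :]  # pairwise displacement vectors
    dist = np.linalg.norm(diff, axis=-1)      # [n_atoms, n_atoms] distances
    mask = (dist < r_cut) & ~np.eye(len(xyz), dtype=bool)  # exclude self-pairs
    i, j = np.where(mask)
    return np.stack([i, j], axis=1)           # [n_pairs, 2] index pairs

coords = 10.0 * np.random.rand(20, 3)  # hypothetical geometry
print(neighbor_list(coords)[:5])
```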
Our approach was to optimize dropout rates and use published values for other hyperparameters when possible. This was based on our experience with smaller datasets, which showed that dropout rates were the most important factor for 3D models and models with non-learnable fingerprints. In all cases we optimized the natural logarithm of the dropout rate with SigOpt [85], using a data subset of 60,000, an optimization budget of 20, and a train/validation/test split of 80/10/10. The allowed range of log(dropout) was [-5, 0] in all cases. The dropout rate with the lowest test error was selected. The dropout rates were optimized separately for each architecture and for each prediction quantity.

Table S1: Default ChemProp architecture values.
Hidden state dimension: 300
Readout layers: 2
Convolutions: 3
Activation: ReLU

We used the default architecture values given in [77] and shown in Table S1. The hidden state dimension in the two readout layers was reduced according to 300 → 300 → 1. A dropout layer was placed after the activation functions following $W_m$ and $W_a$ (see main text), and before the linear layers in the readout phase, as implemented in [77]. The optimized dropout rates are shown in Table S2. Note that hyperparameters were optimized for the prediction of conformers, rather than $\log_{10}$(conformers). These parameters were then used both for models predicting conformers and for models predicting $\log_{10}$(conformers). We only reported the prediction of $\log_{10}$(conformers) in the main text, as the prediction performance was far better.

The fixed SchNet hyperparameters are given in Table S3. The atomic fingerprint length, convolution activation, Gaussian spacing, and number of readout layers are all those given in [80]. The original SchNet paper used three convolutions and a cutoff radius of 10 Å. However, because ChemProp used only three convolutions, and because bond distances are typically under 2 Å, we reduced the cutoff radius to 5 Å and used only two convolutions. Also as in ChemProp, we used the ReLU activation for the readout and a molecular fingerprint length of 300. Since the atomic and molecular fingerprints had different lengths, a single linear layer was used to convert the summed atomic fingerprints to a molecular fingerprint. As in the original SchNet paper, the fingerprint dimension dim was reduced in the readout layers according to dim → dim/2 → 1. Dropout layers were placed before the linear layers in the readout phase and before the linear layers in the convolution phase. The dropout rates were optimized separately. Optimized values are shown in Table S4. The same hyperparameters were used for SchNet25 as for SchNet, with the exception of the molecular fingerprint length, which was reduced from 300 to 64 to reduce memory use. Dropout rates optimized for SchNet were also used for SchNet25.

Table S4: Optimized SchNet dropout rates.
                        S       E       log10(conformers)
Convolution dropout     0.020   0.015   0.008
Readout dropout         0.007   0.031   0.009

Table S5: Optimized ChemProp3D dropout rates.
                        S       E       log10(conformers)
Convolution dropout     0.255   0.076   0.020
Readout dropout         0.033   0.007   0.100

For ChemProp3D, we used an architecture identical to that of ChemProp, with the addition of 50 Gaussians, a linear layer, and the ReLU activation to map distances to edge features. The linear layer and ReLU activation converted the 50 Gaussians into an edge feature vector of length 64. We also used the dim → dim/2 → 1 SchNet approach in the readout layer.
Dropout layers were placed before the linear layers in the readout phase and before the linear layers in the convolution phase. The dropout rates were optimized separately. Optimized values are shown in Table S5.

In all cases we used the Adam optimizer and the mean squared error loss for training. For ChemProp we used the default training hyperparameters given in [77]. The learning rate scheduler described in [88] was used, with an initial and final learning rate of $10^{-4}$, a maximum learning rate of $10^{-3}$, two warmup epochs, 30 total epochs, and a batch size of 50. We verified that performance did not improve when using more than 30 epochs. For all 3D models we used a batch size of 25 and an initial learning rate of $10^{-4}$. We used a scheduler that decreased the learning rate by half if validation performance had not improved in 10 epochs. 100 epochs were used for all models except SchNet25, which required 200 epochs for convergence. Finally, the presence of a small number of outlier geometries initially led to divergences in the SchNet25 training. To account for this, the optimizer did not take a step if the batch loss was divergent. Our reported results in the main text exclude divergent predictions.
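In PyTorch, the 3D-model schedule described above corresponds to something like the following (an illustrative sketch with a stand-in model, not the authors' training code):

```python
# Sketch of the 3D-model training setup: Adam, MSE loss, halve-on-plateau LR.
import torch

model = torch.nn.Linear(300, 1)  # stand-in for a 3D model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.5, patience=10  # halve LR after 10 epochs without improvement
)
loss_fn = torch.nn.MSELoss()

for epoch in range(100):
    # ... train over batches of size 25, then evaluate on the validation set ...
    val_loss = 0.0  # placeholder validation loss
    scheduler.step(val_loss)
```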
References

A deep learning approach to antibiotic discovery
Design of efficient molecular organic light-emitting diodes by a high-throughput virtual screening and experimental approach
Deep learning enables rapid identification of potent DDR1 kinase inhibitors
Generative Models for Automatic Chemical Design
Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules
Junction Tree Variational Autoencoder for Molecular Graph Generation
MolGAN: An implicit generative model for small molecular graphs
Learning Deep Generative Models of Graphs
Syntax-Directed Variational Autoencoder for Structured Data
Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning
Molecular De Novo Design through Deep Reinforcement Learning
Learning To Navigate The Synthetically Accessible Chemical Space Using Reinforcement Learning
Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models
Deep reinforcement learning for de novo drug design
End-to-End Differentiable Learning of Protein Structure
Learning Protein Structure with a Differentiable Simulator
Planning chemical syntheses with deep neural networks and symbolic AI
Prediction of Organic Reaction Outcomes Using Machine Learning
Deep Learning for the Life Sciences
Convolutional Networks on Graphs for Learning Molecular Fingerprints
Molecular graph convolutions: moving beyond fingerprints
Analyzing Learned Molecular Representations for Property Prediction
Cormorant: Covariant Molecular Neural Networks
Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds
Directional Message Passing for Molecular Graphs
IUPAC Compendium of Chemical Terminology (the "Gold Book")
A Comprehensive Survey on Graph Neural Networks
SchNet: A continuous-filter convolutional neural network for modeling quantum interactions
SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules
InChI, the IUPAC International Chemical Identifier
Complexity of protein folding
Data-Driven Approach to Encoding and Decoding 3-D Crystal Structures
Reinforcement Learning for Molecular Design Guided by Quantum Mechanics
Adversarial Reverse Mapping of Equilibrated Condensed-Phase Molecular Structures
Generating valid Euclidean distance matrices
Deep Generative Models for 3D Linker Design
Molecular Geometry Prediction using a Deep Generative Graph Neural Network
Bayesian optimization for conformer generation
Symmetry-adapted generation of 3d point sets for the targeted discovery of molecules
Generating equilibrium molecules with deep neural networks
Coarse-graining auto-encoders for molecular dynamics
Molecule Attention Transformer
Quantum chemistry structures and properties of 134 kilo molecules
ANI-1, A data set of 20 million calculated off-equilibrium conformations for organic molecules
ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost
Less is more: Sampling chemical space with active learning
Machine learning of accurate energy-conserving molecular force fields
ESOL: Estimating aqueous solubility directly from molecular structure
FreeSolv: A database of experimental and calculated hydration free energies, with input files
The PDBbind database: Collection of binding affinities for protein-ligand complexes with known three-dimensional structures
MoleculeNet: a benchmark for molecular machine learning
ChEMBL: towards direct deposition of bioassay data
ZINC 15 - Ligand Discovery for Everyone
GuacaMol: Benchmarking Models for de Novo Molecular Design
Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models
GEOM: Energy-annotated molecular conformations
Exploration of chemical compound, conformer, and reaction space with meta-dynamics simulations based on tight-binding quantum chemical calculations
QFRET-based primary biochemical high throughput screening assay to identify inhibitors of the SARS coronavirus 3C-like Protease (3CLPro)
qHTS of yeast-based assay for SARS-CoV PLP
qHTS of yeast-based assay for SARS-CoV PLP: Hit validation
Nontargeted metabolomics reveals the multilevel response to antibiotic perturbations
Conformations and 3D pharmacophore searching
Confab - Systematic generation of diverse low-energy conformers
RDKit: Open-source cheminformatics
Conformational analysis using distance geometry methods
A sobering assessment of small-molecule force field methods for low energy conformer predictions
Neural message passing for quantum chemistry
Analyzing learned molecular representations for property prediction
Discriminative embeddings of latent variable models for structured data
Chemprop: Machine Learning for Molecular Property Prediction
Extensions of marginalized graph kernels
Rectified linear units improve restricted Boltzmann machines
SchNet: A deep learning architecture for molecules and materials
Neural networks for atomistic systems
Conformational entropy in molecular recognition by proteins
Extended-connectivity fingerprints
A simple representation of three-dimensional molecular structure
Universal structure conversion method for organic molecules: from atomic connectivity to three-dimensional geometry
Converts an xyz file to an RDKit mol object
Attention is all you need

Acknowledgments

The authors thank the XSEDE COVID-19 HPC Consortium, project CHE200039, for compute time.