key: cord-0145425-5ux5j9ck authors: Clyde, Austin; Ramanathan, Arvind; Stevens, Rick title: Scaffold Embeddings: Learning the Structure Spanned by Chemical Fragments, Scaffolds and Compounds date: 2021-03-11 journal: nan DOI: nan sha: 22b06e336e079cc0b485d8e0069fb296b21220c6 doc_id: 145425 cord_uid: 5ux5j9ck Molecules have seemed like a natural fit to deep learning's tendency to handle a complex structure through representation learning, given enough data. However, this often continuous representation is not natural for understanding chemical space as a domain and is particular to samples and their differences. We focus on exploring a natural structure for representing chemical space as a structured domain: embedding drug-like chemical space into an enumerable hypergraph based on scaffold classes linked through an inclusion operator. This paper shows how molecules form classes of scaffolds, how scaffolds relate to each in a hypergraph, and how this structure of scaffolds is natural for drug discovery workflows such as predicting properties and optimizing molecular structures. We compare the assumptions and utility of various embeddings of molecules, such as their respective induced distance metrics, their extendibility to represent chemical space as a structured domain, and the consequences of utilizing the structure for learning tasks. The enormous design space of chemical compounds, estimated to be about 10 60 [1] , motivates an immediate need for efficient and often automated exploration for synthesis and assay development for various applications, including drug discovery and materials design. Computational enumeration of chemical space is a long-studied problem since the early ages of computing [2] . The current state of the art projects have enumerated around 2 billion drug-like compounds, and GDB has around 166 billion compounds of up to 17 atoms of C, N, O, S, and halogens [3, 4] . Even with these vast libraries, recent work has shown a vast difference between the diversity enumerated in ultra-large libraries and the underlying space [5] . Especially in the context of drug discovery, an emerging need in the cheminformatics community is the ability to navigate this enormous design space in the hopes of generating new molecules (or designs) that can optimally bind to a protein/drug-target of interest or refine molecules based on specific physio-chemical and safety features that make it attractive as a drug that can be formulated for the market. Given the vastness of drug-like chemical space, how can we computationally explore it? In 1875, Caley published a short note on his enumeration of alkanes utilizing a tree structure [6] . Though Caley's enumeration ended up having a few errors, it is a very early account of treating chemical space as a structured mathematical object [7] . Over 100 years later, the ideas of enumerating structurally similar compounds and comparing their activity became known as quantitative structure relationship studies (QSAR/SAR). QSAR/SAR is the standard method in medicinal chemistry for taking an interesting chemical compound to an optimized and potent drug lead. In 1984, Klopman developed Computer-Automated Structure Evaluation (CASE), which "perform[s] automatically all operations related to the structure-activity analysis" [8] . A success in its own right, CASE utilized the graph topology of molecules to gener-ate QSAR studies or predict activity based on fragments. This graph structure naturally leads to studying subgraphs and their relations, such as decomposing the graph into a class of similar molecules sharing a framework (scaffold), linkers connecting rings, and sidechains [9] . Utilizing these ideas, various tool-kits and genetic algorithms have been designed to combine or grow molecular fragments into optimized drugs [10, 11] . While these ideas in organization lay the framework for certain practices of medicinal chemistry, the methods do not address the problem of enumerating compounds in an organized way to find diverse chemical scaffolds. Deep learning (DL) offers a new set of tools and algorithms for generating novel molecular pieces. With the introduction of generative models which can be sampled, such as variational autoencoders [12] , or generative adversarial networks [13] , de-novo molecular generation took hold as a practice in drug discovery [14] . Molecules were embedded into a continuous representation and then given a decoder, sampled from continuous spaceallowing property optimization and molecular generation based on some distance metric in the latent representation. These approaches have had much success. We extend on this work by focusing on computational organization and enumeration specifically-seeking more structure than N(X, Σ) or R n . Our contribution in this paper is twofold: (1) representing large molecular libraries using molecular building blocks (i.e., fragments, scaffold, linkers/decorations), and (2) learning to navigate latent representations of molecular hypergraphs leveraging transformer networks to operate on molecular building blocks to generate new molecules. This project is distinct from prior molecular generation problems as we focus on enumerability and organization over property prediction. We demonstrate that our building blocks representation provides a natural mechanism to organize large chemical spaces in a statistically meaningful manner. Further, the transformer networks suggest the design (a) Starting at the bottom, a chemical scaffold with three rings is decomposed into two 2-ring scaffolds, which can be decomposed further into 1-ring scaffolds. (b) Remdesivir is decomposed from the molecule, to its scaffold, rings and linkers, and sidechains. of novel molecules that can 'expand' on a given scaffold design, which can be used for subsequent rounds of virtual screening studies. Given the significance and applications of molecular design, several approaches have explored how to organize the chemical design space, including the use of strings (e.g., SMILES), molecular/chemical descriptor data (e.g., Modred features), and molecular graphs for both representing and generating new molecular designs. Each representation presents some opportunities and challenges in capturing the complexity of the chemical landscape and has successfully designed new molecules for drug targets and new catalysts, and other materials. However, no single representation can sufficiently capture the diversity (in chemical species) and the statistical diversity in their representations. For e.g., while SMILES strings provide a convenient means to encode chemical information, two nearly identical molecules can have significantly different SMILES representations, presenting unique challenges when embedding them into a latent manifold. Similar observations can be made for other molecular representations. A reemerging principle in small molecule-based property prediction models is the similar property principle [15] . This principle has been widely applied in the context of determining quantitative structure-activity relationships (QSAR) in medicinal chemistry: how compounds and their activities (against a drug target) can potentially improve (or degrade) based on modifying certain chemical scaffolds (or addition/deletion of R-groups) [16, 17] . Consider the set of drug-like molecules M. M is not directly computable as it is a concept class for molecules. For computation, molecules require a computable representation, and this is the start of the difficulty. Representations are models of molecules which can be identified with a molecule. Graphs are a natural model of molecules, where nodes are atoms and edges are vertices [18] . SMILES are another representation of molecules, which are a breadth-first search over the graph in a particular syntax. SMILES, unlike graphs, are not injective over molecules (if two SMILES strings are not equal, it does not imply the underlying molecules are not equivalent) [19] . There are other representations which are less common such as point clouds, junction trees, or voxelization [20] . We define R X to be a general representation mapping from molecules to some set X from M. Embeddings are distinct from representations. Embeddings are functions which take a representation X to embedding space Y. For instance, molecular fingerprints are an algorithm which takes graphs of molecules to R n by utilizing a hashing function around the nodes or regions of a graph [21] . Node2vec models take graphs to R n . A simple variational autoencoder's encoder can take SMILES to a Gaussian unit ball N(X, Σ). The junction tree variational autoencoder takes a junction tree to a latent unit ball. In the later two examples, the idea of sampling from a normal unit ball is essential for maintaining the density of the sampling space-an important aspect of creating a generative model (see SI section 2 on sampling). Given a decoder, these embedding spaces can be sampled to produce potentially new molecules or molecules through a constrained optimization problem. The two embedding spaces so far have convenient distance metrics, denoted δ Y . A number of papers have focused on generative models for the design of new molecules [22, 18, 23, 24, 25] . These approaches either use a string representation (e.g., SMILES representation mapped onto a molecular graph) or an explicit molecular graph representation (e.g., [26] ) to encode the molecular data into a continuous representation from which new examples can be drawn. While these methods are very successful at certain property predictions and general optimization, they do not solve the enumerability problem. Both R n and N(X, Σ) are continuous and not countable. In particular, every molecule has an open ball around it in embedding space of equivalent points which is a problem for enumerating discrete sets of molecules. In other words, if ϕ −1 is a decoder from an embedding R n → X, and ≡ is an equivalence relation on the representation X, there exists y 1 , y 2 ∈ R n and > 0 such that 0 < δ R n (y 1 , y 2 ) < so In order to structure the embedding space to be conducive for enumeration, we must find an embedding space that is countable and discrete, just as Caley sought out by means of a tree. Molecular scaffolds are well defined through algorithms, decompose well into networks, and offer a general description of global properties (such as orientation in a protein binding region) [9, 27] . Molecular scaffolds represent the core of a molecule, typically defined around the number of rings in the structure. Non-ring sturctures in molecules include linkers and sidechains which get collapsed in this representation to a single scaffold representative. In figure 1, we show a molecular scaffold decomposing into smaller scaffolds. In this way, we can take a graph or SMILES representation of a molecule and map it to this discrete embedding structure. The mapping into the scaffold structure is unique. As other authors rely on decoders to decode the embedding space, we will rely on decoders to sample the scaffold for the variety of molecules a part of it. is inside a single scaffold class. Scaffold classes are related through common substructures, forming a hierarchy of classes. Penimocycline, for example, belongs to a scaffolding class from far Penicillin-g's or Amoxicillin's class, while Pipracil is a direct successor of the Penicillin-g class. The Predecessor function is defined via an algorithm, and the Successor Φ function requires a generative model when working without data (i.e., given a single scaffold you cannot compute its successor unless you understand chemistry, thus have parameters Φ, but you can compute all of it is predecessors recursively without knowing how to generate new compounds). The conceptual machinery for treating chemical space as a hypergraph structured through scaffolds is developed. There is an elegant statement of the principle of fragment-based drug design through the operations among scaffolds. Further, the framework developed provides intuitive concepts for understanding the diversity and size of chemical space explored or discussed by a model or computational research program. As a computational learning problem, we use transformer as seq2seq models to implement large graph navigation in practice. Utilizing the concept of scaffolds developed in section 2, we assume the operation Scaffold as a given oracle such that Scaffold is injective and defined for every molecule. We define S as the set of all scaffolds. A hypergraph is a generalized graph where edges group more than two vertices. A hypergraph is n-regular when every vertex is contained in exactly n edges. Scaffolds as hypergraph edges over molecules form a 1-regular graph, as every molecule belongs to exactly one scaffold class, thus every vertex has degree 1 in the hypergraph. We denote the hypergraph as H = (M, S). Operations on scaffolds. We denote computational operations in Monospace font, and add a subscript Φ to represent parameters which may be required for the operations (i.e. Expand Φ ). We utilize RDKit to compute Scaffold via the MurckoScaffold module [28] . We note a model can be trained for this task; however, given the efficiency of the algorithm it did not seem fruitful at this time. 2. Successor Φ and Predecessor: the successors of a scaffold S 1 are the set of all scaffolds S which contains S 1 as a substructure ( figure 2) . The predecessors of a scaffold S 1 are all scaffolds S which S 1 is a superstructure. In general, there is no algorithm for successor given only a scaffold, as it requires sampling chemical space. However, predecessor has an efficient algorithm with a structure that can always be fragmented into smaller scaffolds without sampling other data. These operations are the atomic building blocks of navigating between scaffold classes (and induces a strict partial ordering (S, ≺)). These operations are from S to S. We also consider the standard graph structure induced by the relation Successor Φ and Predecessor, and denote it S G = (S,Successor Φ ) where Successor Φ can be used to determine the edge relation. This graph can be directed or undirected, but for our case we consider the undirected graph mostly. 3. Union Φ and Intersection: two scaffolds S 1 and S 2 can be combined to form a union. More formally, the union of S 1 and S 2 is the set of scaffolds that contain S where S has S 1 , and S 2 has immediate predecessors. Similarly, the intersection of S 1 and S 2 is simply the maximum common substructure (MCS) of S 1 and S 2 , for which an efficient algorithm exists for small drug-like molecules. [29, 30] . In general, MCS is NP-complete, but there are heuristics for drug-like molecules that provide a rather efficient algorithm [31] . These operations are from S to S. These basic operations can be combined into more complex operations such as Upper cones of scaffold classes are actually a common object of interest for drug discovery. For instance, Penimocycline is in the upper cone of Penicillin-g's scaffold class (see figure 2 ). Successful exploration of upper cones is the theoretical cornerstone of fragment based drug design [32, 33] . Recently, fragment X-ray crystalgraphic screens have been performed on important drug targets such as SARS-CoV-2 proteases in search of an inhibitor [34] . Given a set of fragment hits for a protein target in a binding region,{m i } i∈H , take the scaffold classes of those hit, {S h i } i∈H . The principle of fragment based drug design can be expressed as there exists some index set I * such that whereĤ is a set of scaffold classes,Ĥ is not empty, and some molecule in a scaffold inĤ is a likely candidate. In other words, a set of fragments can be grown to sets of larger druglike molecules, and some intersection of those possible larger molecules will be a hit that is likely a drug lead for this protein target. In an embedding space such as R n , the same principal does not apply, and is dependent on the embedding context (for instance, based on a particular property [35] ). Furthermore, there no guarantees about molecules in an interval between two molecules, whereas the intersection of upper cones, for example, does have such guarantees (if it is not empty). Given there is no algorithm for producing successor scaffolds without relying on sampling chemical space, we treat the problem as a learning problem. We note that we cannot rely on fragments as a vocabulary given this construction as other methods have (for instance, [26] utilized a finite vocabulary containing one member rings, linkers, and sidechains from the dataset). When using such a vocabulary, there are chains of scaffolds that cannot be represented as the Successor Φ function can only sample scaffold classes S for which every one ring member in the LowerCone(S ) is in the finite vocabulary. While the method outlined has no constraints on compounds' synthetic accessibility, it is a necessary and essential aspect of chemical space exploration for drug discovery. To focus on synthetic accessibility while paying attention to maximizing library size, we utilize a dataset from Synthetically Accessible Virtual Inventory (SAVI) [3] . SAVI contains over 1.7 billion reaction products (along with rich reaction and metadata). We utilize only the SMILES of the products. We build two datasets from SAVI. The first utilizes RDKit to determine the scaffold for each of the compounds listed [28] . We utilized a 200M sample from the entire dataset and extended the data by a factor of 5 by randomizing the SMILES both for the target (scaffold) and source (molecule) [36] . A set of 20M molecules with a unique scaffold class are held out as validation data. A second dataset is created by taking a subsample of the prior dataset, 20M, and utilizing the ScaffoldGraph package to decompose each scaffold into a network of scaffolds [27] . We sample edges (representing the successor of two scaffold nodes), resulting in a dataset of five million successor pairs. This dataset is extended to 50M utilizing random smiles sampling. Predecessor data is flipping the columns (sources become targets, and targets become sources) for the successor datasets. On the one hand, Successor Φ and Expand Φ are generative models-given a scaffold, those operators are required to sample the space of successor scaffolds or molecules that have that scaffold. On the other hand, they are seq2seq task, taking one sequence to a different sequence. This combination of wanting a dense sampling strategy combined with seq2seq modeling differs from applications we have found in the literature. Common approaches to generative models have been utilizing VAEs or GANs to train some encoder-decoder model on sample reconstruction error with some regularization [20, 37, 38] . Seq2seq approaches in this space have focused on solving problems with a relatively small optimal solution set such as reaction modeling [39] . With the recent success of transformer models performing well on large datasets and seq2seq problems, we decided to follow the modeling as a seq2seq problem as Schwaller et al. have. We utilize a transformer seq2seq model from the ONMT project [40] . Other works have utilized RNNs, but we utilize a transformer for both the encoder and decoder of the model [41] . Given the goal of not simple generation but rather generalizing a very large hypergraph for which a pure algorithmic solution is intractable, transformer models are a good fit compared to simpler RNN models. Code is compiled into a GitHub repository with scripts for data gathering, data preparation, model training, and sampling. The interface is geared towards developing frontend functions for quick medicinal chemistry questions regarding sampling molecular space. We assess the structure of scaffolding chemical space, focusing on understanding the size of scaffold classes, how many scaffold groups there are in drug-like chemical space, and how they connect. We impose a structure on M by creating scaffold classes S = {S } i∈I s such that every molecule m belongs to one and only one scaffold class, and all classes in {S } i∈I s are disjoint. We also assign a hierarchy to scaffolds based on the number of rings. H n is the set of all scaffold classes with ring size n. H 0 is the smallest hierarchy, which consists of only one scaffold class S 0 , the set of all molecules with no rings (ring-less fragments, linkers, and side-chains). H 1 is the set of all scaffold classes with one ring. The order of H 2 is proportion to |H 1 | choose 2 plus the combination of linkages and sidechain modifications from H 0 . We see growth similar to the partition function in theory. However, in practice, the distribution of molecules in real-world datasets typically follows a normal distribution with the mean around three rings (see figure 4 ). Given this added structure of scaffolds, do scaffolds reduce the search space over molecules by many magnitude orders? If this is the case, we can search through a computable number of scaffolds, and once a few interesting classes are found, we can enumerate the molecules in that set. This strategy does not face the curse of 10 68 drug-like molecules the current unstructured domain M faces. Given a 200M sample from SAVI, we found only 11.4M (5.7%) scaffold classes were needed to cover the entire dataset, and, in practice, there exists a large subset of molecules (165M) with only 685,000 (0.41%) scaffold classes. This reduction via scaffolds implies for a large subset of molecules, there is a reasonable 5 order of magnitude gain in search over scaffolds than pure molecules (from a database or chemical library perspective). We train three operations (Expand Φ , Successor Φ , Predecessor) utilizing three separate models. While there is an algorithm for Predecessor, we can compare it directly to the algorithm performance. Each model was trained for approximately two days on eight GPUs (NVIDIA Tesla V100). Each model was trained for 500,000 steps with a batch size of 8192. Further details of the training procedure can be found in (SI) and on GitHub. 2 To sample Expand Φ and Successor Φ we utilize beam search with a temperature of 1.5, beam size of 5, and randomizing the SMILES input. Samples are then validated utilizing RDKit. In Given the density of some scaffold classes in the data compared to others (figure 4), more advanced sampling methods required for Expand Φ on these classes. For scaffold classes with over 10 6 members in the data (mostly 1-ring and 2-ring common scaffolds), resampling validation data from the model is difficult (table 4. 2). Given the uniqueness of sampling based on a category like scaffolds, rather than pure sampling points in a distribution or R n , comparisons to generative models' reconstruction accuracy are not reasonable. Figure 6 is an example of a series of compounds which belong to a single scaffold class, but are sampled with different sidechains. The variety of sidechains while maintaining the single scaffold core is the basis of a QSAR series. This paper outlined a set of ordered equivalence classes via molecular scaffolds over the drug-like chemical space (forming a 1-regular hypergraph). We utilize seq2seq models to move between scaffolds, or classes of compounds, and between the scaffold hierarchy and the underlying molecules themselves. These operations ultimately form a set of algebraic tools for manipulating and navigating the chemical space. This algebra is expressive-enough to represent algorithms in drug design, such as the principle of fragment-based drug design or similar property principle of molecular scaffolds. This construction over M offers a unique take on the enumerability of the chemical space by collapsing the space into scaffold classes, which can zoomed-in or zoomed-out of. We aim to understand better the distribution of synthetically accessible drug space and its relation to scaffolds as we hope scaffold classes reduce the space's overall size. Future work will unify the algebra into a single model for navigating the space and introduce more concept classes for finer and coarser granularity. We believe that to accelerate exploring the estimated 10 60 drug-like molecules, a navigation strategy besides standard databases and compound enumeration is needed. The art and practice of structure-based drug design: a molecular modeling perspective A machine with chemical intuition silico generation of billions of easily synthesizable compounds through expert-system type rules. Scientific data Enumeration of 166 billion organic small molecules in the chemical universe database gdb-17 Anthropogenic biases in chemical reaction data hinder exploratory inorganic synthesis Ueber die analytischen figuren, welche in der mathematik bäume genannt werden und ihre anwendung auf die theorie chemischer verbindungen On cayley's enumeration of alkanes (or 4-valent trees) Artificial intelligence approach to structure-activity studies. computer automated structure evaluation of biological activity of organic molecules The properties of known drugs. 1. molecular frameworks Evolutionary algorithms in drug design The medicinal chemist's toolbox for late stage functionalization of drug-like molecules Semi-supervised learning with deep generative models Molecular de-novo design through deep Learning the Structure Spanned by Chemical Fragments, Scaffolds and Compounds 7 reinforcement learning Concepts and applications of molecular similarity Molecular Similarity Measures The Ups and Downs of Structure-Activity Landscapes Molecular graph convolutions: moving beyond fingerprints Towards a universal smiles representation-a standard method to generate canonical smiles based on the inchi Deep learning for molecular design-a review of the state of the art A comprehensive comparison of molecular feature representations for use in predictive modeling Convolutional networks on graphs for learning molecular fingerprints Neural message passing for quantum chemistry Automatic chemical design using a data-driven continuous representation of molecules Grammar variational autoencoder Junction tree variational autoencoder for molecular graph generation ScaffoldGraph: an open-source library for the generation and analysis of molecular scaffold networks and scaffold trees Open-source cheminformatics software. GitHub and SourceForge A maximum common substructure-based algorithm for searching and predicting drug-like compounds Computer-aided interpretation of mass spectra. 20. molecular structure comparison program for the identification of maximal common substructures Computers and intractability. A Guide to the High-throughput crystallography: reliable and efficient identification of fragment hits Structural biology in fragment-based drug design Crystallographic and electrophilic fragment screening of the sars-cov-2 main protease Improved chemical prediction from scarce data sets via latent space enrichment Randomized smiles strings improve the quality of molecular generative models Generative recurrent networks for de novo drug design Bidirectional molecule generation with recurrent neural networks Learning the Structure Spanned by Chemical Fragments, Scaffolds and Compounds 8 Molecular transformer: a model for uncertaintycalibrated chemical reaction prediction OpenNMT: Open-source toolkit for neural machine translation Attention is all you need