key: cord-0580782-6ms2cc3k authors: Huang, Kexin; Fu, Tianfan; Gao, Wenhao; Zhao, Yue; Roohani, Yusuf; Leskovec, Jure; Coley, Connor W.; Xiao, Cao; Sun, Jimeng; Zitnik, Marinka title: Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development date: 2021-02-18 journal: nan DOI: nan sha: 54ca116f1e9a45768a3a2c47a4608ff34adefa0c doc_id: 580782 cord_uid: 6ms2cc3k

Therapeutics machine learning is an emerging field with incredible opportunities for innovation and impact. However, advancement in this field requires formulation of meaningful learning tasks and careful curation of datasets. Here, we introduce Therapeutics Data Commons (TDC), the first unifying platform to systematically access and evaluate machine learning across the entire range of therapeutics. To date, TDC includes 66 AI-ready datasets spread across 22 learning tasks and spanning the discovery and development of safe and effective medicines. TDC also provides an ecosystem of tools and community resources, including 33 data functions and types of meaningful data splits, 23 strategies for systematic model evaluation, 17 molecule generation oracles, and 29 public leaderboards. All resources are integrated and accessible via an open Python library. We carry out extensive experiments on selected datasets, demonstrating that even the strongest algorithms fall short of solving key therapeutics challenges, including real dataset distributional shifts, multi-scale modeling of heterogeneous data, and robust generalization to novel data points. We envision that TDC can facilitate algorithmic and scientific advances and considerably accelerate machine-learning model development, validation and transition into biomedical and clinical implementation. TDC is an open-science initiative available at https://tdcommons.ai.

TDC is a platform with AI-ready datasets and learning tasks for therapeutics, spanning the discovery and development of safe and effective medicines.
TDC provides an ecosystem of tools and data functions, including strategies for systematic model evaluation, meaningful data splits, data processors, and molecule generation oracles. All resources are integrated and accessible via a Python package. TDC also provides community resources with extensive documentation and tutorials, and leaderboards for systematic model comparison and evaluation.

Figure : Therapeutics Machine Learning. Therapeutics machine learning offers incredible opportunities for expansion, innovation, and impact. Datasets and benchmarks in TDC provide a systematic model development and evaluation framework. We envision that TDC can considerably accelerate development, validation, and transition of machine learning into production and clinical implementation.

Finally, datasets and benchmarks in TDC lend themselves to the study of the following open questions in machine learning and can serve as a testbed for a variety of algorithmic approaches:
• Low-resource learning: Prevailing methods require abundant label information. However, labeled examples are typically scarce in drug development and discovery, considerably limiting the methods' use for problems that require reasoning about new phenomena, such as novel drugs in development, emerging pathogens, and therapies for rare-disease patients.
• Multi-modal and knowledge graph learning: Objects in TDC have diverse representations and assume various data modalities, including graphs, tensors/grids, sequences, and spatiotemporal objects.
• Distribution shifts: Objects (e.g., compounds, proteins) can change their behavior quickly across biological contexts (e.g., patients, tissues, cells), meaning that models need to accommodate underlying distribution shifts and perform robustly on previously unseen data points.
• Causal inference: TDC contains datasets that quantify the response of patients, molecules, and cells to different kinds of perturbations, such as treatment, CRISPR gene over-expression, and knockdown perturbations. Observing how and when a cellular, molecular, or patient phenotype is altered can provide clues about the underlying mechanisms involved in perturbation and, ultimately, disease. Such datasets represent a natural testbed for causal inference methods.

Facilitating algorithmic and scientific advance in the broad area of therapeutics. We envision TDC to be the meeting point between domain scientists and ML scientists ( Figure ). Domain scientists can pose learning tasks and identify relevant datasets, which are carefully processed, integrated into TDC, and formulated as scientifically valid learning tasks. ML scientists can then rapidly obtain these tasks and ML-ready datasets through the TDC programming framework and use them to design powerful ML methods. Predictions and other outputs produced by ML models can then facilitate algorithmic and scientific advances in therapeutics. To this end, we strive to make datasets and tasks in TDC representative of real-world therapeutics discovery and development. We further provide realistic data splits, evaluation metrics, and performance leaderboards.

Organization of this manuscript. We proceed with a brief review of biomedical and chemical data repositories, machine learning benchmarks and infrastructure (Section ). We then give an overview of TDC (Section ) and describe its tiered structure and modular design (Section ). In Sections -, we provide details for each task in TDC, including the formulation, the level of generalization required for transition into production and clinical implementation, a description of therapeutic products and pipeline, and the broader impact of each task. For each task, we also describe a collection of datasets included in TDC.
Next, in Sections -, we overview TDC's ecosystem of tools, libraries, leaderboards, and community resources. Finally, we conclude with a discussion and directions for future work (Section ).

Tools and community resources. TDC includes numerous data functions that can be readily used with any TDC dataset. To date, TDC's programmatic functionality can be organized into the following categories:
• strategies for model evaluation: TDC implements a series of metrics and performance functions to debug models, evaluate model performance for any task in TDC, and assess whether model predictions generalize to out-of-distribution datasets.
• types of dataset splits: TDC implements data splits that reflect real-world learning settings, including random split, scaffold split, cold-start split, temporal split, and combination split.
• molecule generation oracles: Molecular design tasks require oracle functions to measure the quality of generated entities. TDC implements molecule generation oracles, representing the most comprehensive collection of molecule oracles, each tailored to measure the quality of generated molecules in a specific dimension.
• data processing functions: Datasets cover a range of modalities, each requiring distinct data processing. TDC provides functions for data format conversion, visualization, binarization, data balancing, unit conversion, database querying, molecule filtering, and more.

Leaderboards. TDC provides leaderboards for systematic model evaluation and comparison. For a model to be useful for a particular therapeutic question, it needs to perform well across multiple related datasets and tasks. For this reason, we group individual benchmarks in TDC into meaningful groups, which we refer to as benchmark groups. Datasets and tasks in a benchmark group are carefully selected and centered around a particular therapeutic question.
Dataset splits and evaluation metrics are also carefully selected to indicate challenges of real-world implementation. The current release of TDC has leaderboards ( = + + + ; see Figure ). Section describes a subset of selected leaderboards and presents extensive empirical results.

Next, we describe the modular design and organization of datasets and learning tasks in TDC. TDC has a unique three-tier hierarchical structure, which, to our knowledge, is the first attempt at systematically organizing machine learning for therapeutics ( Figure ). We organize TDC into three distinct problems. For each problem, we provide a collection of learning tasks. Finally, for each task, we provide a series of datasets.

In the first tier, we identify three broad machine learning problems:
• Single-instance prediction single_pred: Predictions about individual biomedical entities.
• Multi-instance prediction multi_pred: Predictions about multiple biomedical entities.
• Generation generation: Generation of biomedical entities with desirable properties.

In the second tier, TDC is organized into learning tasks. TDC currently includes learning tasks, covering a range of therapeutic products. The tasks span small molecules and biologics, including antibodies, peptides, microRNAs, and gene editing. Further, TDC tasks can be mapped to the following drug discovery pipelines:
• Target discovery: Tasks to identify candidate drug targets.
• Activity modeling: Tasks to screen and generate individual or combinatorial candidates with high binding activity towards targets.
• Efficacy and safety: Tasks to optimize therapeutic signatures indicative of drug safety and efficacy.
• Manufacturing: Tasks to synthesize therapeutics.

Finally, in the third tier of TDC, each task is instantiated via multiple datasets. For each dataset, we provide several splits of the dataset into training, validation, and test sets to simulate the type of understanding and generalization needed for transition into production and clinical implementation (e.g., the model's ability to generalize to entirely unseen compounds or to granularly resolve patient response to a polytherapy).

Figure : Tiered design of Therapeutics Data Commons. We organize TDC into three distinct problems. For each problem, we give a collection of learning tasks. Finally, for each task, we provide a collection of datasets. In the first tier, we have three broad machine learning problems: (a) single-instance prediction is concerned with predicting properties of individual entities; (b) multi-instance prediction is concerned with predicting properties of groups of entities; and (c) generation is concerned with the automatic generation of new entities. For each problem, we have a set of learning tasks. For example, the ADME learning task aims to predict experimental properties of individual compounds; it falls under single-instance prediction. Finally, for each task, we have a collection of datasets. For example, TDC.Caco2_Wang is a dataset under the ADME learning task, which, in turn, is under the single-instance prediction problem. This unique three-tier structure is, to the best of our knowledge, the first attempt at systematically organizing therapeutics ML.

Tasks. Table lists learning tasks included in TDC to date. For each task, TDC provides multiple datasets that vary in size between and million data points. We provide the following information for each learning task in TDC:
Definition. Background and a formal definition of the learning task.
Impact. The broader impact of advancing research on the task.
Generalization. Understanding needed for transition into production and clinical implementation.
Product. The type of therapeutic product examined in the task.
Pipeline. The therapeutics discovery and development pipeline the task belongs to.

Table gives an overview of datasets included in TDC to date. Next, we give detailed information on learning tasks in Sections -.
Following the task description, we briefly describe each dataset for the task. For each dataset, we provide a data description and statistics, together with the recommended dataset splits, evaluation metrics, and, in the case of numeric labels, units.

In this section, we describe single-instance learning tasks and the associated datasets in TDC.

single_pred.ADME: ADME Property Prediction

Definition. A small-molecule drug is a chemical that needs to travel from the site of administration (e.g., oral) to the site of action (e.g., a tissue), after which it is broken down and exits the body. To do that safely and efficaciously, the chemical must have favorable absorption, distribution, metabolism, and excretion (ADME) properties. This task aims to accurately predict various kinds of ADME properties given a drug candidate's structural information.

Impact. A poor ADME profile is the most prominent reason for failure in clinical trials (Kennedy ). Thus, early and accurate ADME profiling during the discovery stage is a necessary condition for the successful development of a small-molecule candidate.

Generalization. In real-world discovery, the drug structures of interest evolve over time (Sheridan ). Thus, ADME prediction requires a model to generalize to a set of unseen drugs that are structurally distant from the known drug set. Because time information is unavailable for many datasets, one way to approximate this effect is a scaffold split, which forces the training and test sets to contain structurally distant molecules (Bemis & Murcko ).

Product. Small-molecule.

Pipeline. Efficacy and safety - lead development and optimization.

Datasets for single_pred.ADME

The human colon epithelial cancer cell line, Caco-2, is used as an in vitro model to simulate the human intestinal tissue. The experimentally measured rate at which a drug passes through Caco-2 cells can approximate the rate at which the drug permeates through the human intestinal tissue (Sambuy et al. ).
This dataset contains experimental values of Caco-2 permeability of drugs (Wang, Dong, Deng, Zhu, Wen, Yao, Lu, Wang & Cao ). Suggested data split: scaffold split; Evaluation: MAE; Unit: cm/s.

TDC.HIA_Hou: When a drug is orally administered, it needs to be absorbed from the human gastrointestinal system into the bloodstream. This ability is called human intestinal absorption (HIA), and it is crucial for a drug to be delivered to the target (Wessel et al. ). This dataset contains drugs with the HIA index (Hou et al. ). Suggested data split: scaffold split; Evaluation: AUROC.

TDC.Pgp_Broccatelli: P-glycoprotein (Pgp) is an ABC transporter protein involved in intestinal absorption, drug metabolism, and brain penetration, and its inhibition can seriously alter a drug's bioavailability and safety (Amin ). In addition, inhibitors of Pgp can be used to overcome multidrug resistance (Shen et al. ). This dataset is from Broccatelli et al. ( ) and contains , drugs with their Pgp inhibition activities. Suggested data split: scaffold split; Evaluation: AUROC.

TDC.Bioavailability_Ma: Oral bioavailability measures the extent to which the active ingredient in a drug is absorbed into systemic circulation and becomes available at the site of action (Toutain & Bousquet-Mélou a). This dataset contains drugs with bioavailability activity from Ma et al. ( ). Suggested data split: scaffold split; Evaluation: AUROC.

TDC.Lipophilicity_AstraZeneca: Lipophilicity measures the ability of a drug to dissolve in a lipid (e.g., fats, oils) environment. High lipophilicity often leads to a high rate of metabolism, poor solubility, high turnover, and low absorption (Waring ). This dataset contains , experimental values of lipophilicity from AstraZeneca ( ). We obtained it via MoleculeNet (Wu et al. ). Suggested data split: scaffold split; Evaluation: MAE; Unit: log-ratio.
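The scaffold split recommended for the datasets above can be illustrated with a minimal sketch. It assumes that a scaffold identifier (for example, a Bemis-Murcko scaffold string produced by a cheminformatics toolkit) has already been computed for every molecule; the function name `scaffold_group_split`, the default fractions, and the toy scaffold labels are hypothetical and are not TDC's implementation. The sketch only shows the grouping idea: molecules that share a scaffold are never divided between training, validation, and test sets.

```python
from collections import defaultdict


def scaffold_group_split(scaffolds, frac=(0.7, 0.1, 0.2)):
    """Split molecule indices so that no scaffold is shared across sets.

    `scaffolds` holds one precomputed scaffold identifier per molecule
    (assumed to come from an external cheminformatics toolkit).
    Returns three lists of indices: train, valid, test.
    """
    # Collect the indices of all molecules that share a scaffold.
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Visit large scaffold groups first; ties are broken by first index
    # so that the result is deterministic.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cap = frac[0] * n
    valid_cap = frac[1] * n

    train, valid, test = [], [], []
    for group in ordered:
        # A group is placed as a whole: train if it fits, else valid,
        # else test. Rare scaffolds therefore tend to end up held out.
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Toy example with hypothetical scaffold labels for ten molecules.
toy_scaffolds = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"]
train_idx, valid_idx, test_idx = scaffold_group_split(toy_scaffolds)
```

Because whole groups are assigned at once, the realized set sizes only approximate the requested fractions; this is the price of keeping structurally related molecules together.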
TDC.Solubility_AqSolDB: Aqueous solubility measures a drug's ability to dissolve in water. Poor water solubility can lead to slow drug absorption and inadequate bioavailability, and can even induce toxicity. More than % of new chemical entities are not soluble (Savjani et al. ). This dataset is collected from AqSolDB (Sorkun et al. ), which contains , drugs curated from different publicly available datasets. Suggested data split: scaffold split; Evaluation: MAE; Unit: log mol/L.

As a membrane separating circulating blood and brain extracellular fluid, the blood-brain barrier (BBB) is the protective layer that blocks most foreign drugs. Thus, the ability of a drug to penetrate the barrier and reach its site of action is a crucial challenge in the development of drugs for the central nervous system (Abbott et al. ). This dataset from Martins et al. ( ) contains , drugs with information on their penetration ability. We obtained this dataset from MoleculeNet (Wu et al. ). Suggested data split: scaffold split; Evaluation: AUROC.

The human plasma protein binding rate (PPBR) is expressed as the percentage of a drug bound to plasma proteins in the blood. This rate strongly affects a drug's efficiency of delivery. The less bound a drug is, the more efficiently it can traverse and diffuse to the site of action (Lindup & Orme ). This dataset contains , drugs with experimental PPBRs (AstraZeneca ). Suggested data split: scaffold split; Evaluation: MAE; Unit: % (binding rate).

The volume of distribution at steady state (VDss) measures the degree of a drug's concentration in body tissue compared to its concentration in blood. A higher VDss indicates a higher distribution in the tissue and is usually associated with high lipid solubility and a low plasma protein binding rate (Sjöstrand ). This dataset is curated by Lombardo & Jing ( ) and contains , drugs. Suggested data split: scaffold split; Evaluation: Spearman Coefficient; Unit: L/kg.
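The two regression metrics recommended above, MAE and the Spearman coefficient, answer different questions, which a small standard-library sketch can make concrete. The function names below are illustrative and are not TDC's evaluator API; the numeric values are made up. MAE measures how far predictions are from the measured values in the label's unit, whereas the Spearman coefficient only asks whether compounds are ranked in the right order.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error, in the same unit as the labels."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def average_ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the block of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(y_true, y_pred):
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    rt, rp = average_ranks(y_true), average_ranks(y_pred)
    mt, mp = sum(rt) / len(rt), sum(rp) / len(rp)
    cov = sum((a - mt) * (b - mp) for a, b in zip(rt, rp))
    var_t = sum((a - mt) ** 2 for a in rt)
    var_p = sum((b - mp) ** 2 for b in rp)
    return cov / math.sqrt(var_t * var_p)


# Made-up labels and predictions: the predictions are ten times too
# large, yet they rank the four compounds perfectly.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [10.0, 20.0, 30.0, 40.0]
error = mae(measured, predicted)       # large absolute error
rho = spearman(measured, predicted)    # perfect rank agreement
```

In this toy case the Spearman coefficient is at its maximum while the MAE is large, which is why a rank-based metric suits endpoints where the ordering of candidates matters more than their calibrated values.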
The CYP P450 genes are essential in the breakdown (metabolism) of various molecules and chemicals within cells (McDonnell & Dang ). A drug that inhibits these enzymes impairs the metabolism of itself and of other drugs, which could lead to drug-drug interactions and adverse effects (McDonnell & Dang ). Specifically, the CYP2C19 gene provides instructions for making an enzyme that is found in the endoplasmic reticulum, a cell structure involved in protein processing and transport. This dataset is from Veith et al. ( ), consisting of , drugs with their ability to inhibit CYP2C19. Suggested data split: scaffold split; Evaluation: AUPRC.

The role of the CYP system in metabolism is described under the CYP2C19 inhibitor dataset. CYP2D6 is responsible for the metabolism of around % of clinically used drugs via addition or removal of certain functional groups in the drugs (Teh & Bertilsson ). This dataset is from Veith et al. ( ), consisting of , drugs with their ability to inhibit CYP2D6. Suggested data split: scaffold split; Evaluation: AUPRC.

The role of the CYP system in metabolism is described under the CYP2C19 inhibitor dataset. CYP3A4 oxidizes foreign organic molecules and is responsible for the metabolism of half of all prescribed drugs (Zanger & Schwab ). This dataset is from Veith et al. ( ), consisting of , drugs with their ability to inhibit CYP3A4. Suggested data split: scaffold split; Evaluation: AUPRC.

The role of the CYP system in metabolism is described under the CYP2C19 inhibitor dataset. CYP1A2 is induced by some polycyclic aromatic hydrocarbons (PAHs), and it can metabolize some PAHs to carcinogenic intermediates. It can also metabolize caffeine, aflatoxin B1, and acetaminophen. This dataset is from Veith et al. ( ), consisting of , drugs with their ability to inhibit CYP1A2. Suggested data split: scaffold split; Evaluation: AUPRC.

The role of the CYP system in metabolism is described under the CYP2C19 inhibitor dataset.
Around drugs are metabolized by CYP2C9 enzymes. This dataset is from Veith et al. ( ), consisting of , drugs with their ability to inhibit CYP2C9. Suggested data split: scaffold split; Evaluation: AUPRC.

TDC.Clearance_AZ: Drug clearance is defined as the volume of plasma cleared of a drug over a specified time period, and it measures the rate at which the active drug is removed from the body (Toutain & Bousquet-Mélou b). This dataset is from AstraZeneca ( ) and contains clearance measures from two experiment types, hepatocyte (TDC.Clearance_Hepatocyte_AZ) and microsome (TDC.Clearance_Microsome_AZ). As studies (Di et al. ) have shown that clearance outcomes differ between these two types, we keep them separate. It has , drugs for microsome clearance and , drugs for hepatocyte clearance. Suggested data split: scaffold split; Evaluation: Spearman Coefficient; Unit: µL·min⁻¹·(10⁶ cells)⁻¹ for hepatocyte and mL·min⁻¹·g⁻¹ for microsome.

Definition. The majority of drugs have some degree of toxicity to humans. This learning task aims to accurately predict various types of toxicity of a drug molecule towards humans.

Impact. Toxicity is one of the primary causes of compound attrition. Studies show that approximately % of all toxicity-related attrition occurs preclinically (i.e., in cells or animals), and that these preclinical findings are strongly predictive of toxicities in humans (Kramer et al. ). This suggests that an early and accurate prediction of toxicity can significantly reduce compound attrition and increase the likelihood that a compound reaches the market.

Generalization. Similar to ADME prediction, as the drug structures of interest evolve over time (Sheridan ), toxicity prediction requires a model to generalize to a set of novel drugs with little structural similarity to the existing drug set.

Pipeline. Efficacy and safety - lead development and optimization.

TDC.LD50_Zhu: Acute toxicity LD50 measures the most conservative dose that can lead to lethal adverse effects.
The lower the LD50 dose, the more lethal the drug. This dataset is from Zhu et al. ( ), consisting of , drugs with experimental LD50 values. Suggested data split: scaffold split; Evaluation: MAE; Unit: log(1/(mol/kg)).

The human ether-à-go-go related gene (hERG) is crucial for the coordination of the heart's beating. Thus, if a drug blocks hERG, it could lead to severe adverse effects. This dataset is from Wang, Sun, Liu, Li, Li & Hou ( ), which has drugs and their blocking status.

TDC.DILI: Drug-induced liver injury (DILI) is a fatal liver disease caused by drugs, and it has been the single most frequent cause of safety-related drug marketing withdrawals for the past years (e.g., iproniazid, ticrynafen, benoxaprofen) (Assis & Navarro ). This dataset is aggregated from the U.S. FDA's National Center for Toxicological Research and is collected from Xu et al. ( ). It has drugs with labels about their ability to cause liver injury. Suggested data split: scaffold split; Evaluation: AUROC.

TDC.Skin_Reaction: Exposure of the skin to chemicals can cause reactions, which should be avoided for dermatology therapeutic products. This dataset from Alves et al. ( ) contains drugs with their skin reaction outcome. Suggested data split: scaffold split; Evaluation: AUROC.

A drug is a carcinogen if it can cause cancer in tissues by damaging the genome or cellular metabolic processes. This dataset from Lagunin et al. ( ) contains drugs with their ability to cause cancer. Suggested data split: scaffold split; Evaluation: AUROC.

Tox21 is a data challenge that contains qualitative toxicity measurements for , compounds on different targets, such as nuclear receptors and stress response pathways (Mayr et al. ). The number of drugs differs by assay, usually around , drugs. Suggested data split: scaffold split; Evaluation: AUROC.

Clinical toxicity measures whether a drug has failed clinical trials for toxicity reasons.
It contains , drugs from clinical trials records (Gayvert et al. ). Suggested data split: scaffold split; Evaluation: AUROC.

single_pred.HTS: High-Throughput Screening

Definition. High-throughput screening (HTS) is the rapid automated testing of thousands to millions of samples for biological activity at the model organism, cellular, pathway, or molecular level. The assay readout can vary from target binding affinity to fluorescence microscopy of cells treated with a drug. HTS can be applied to different kinds of therapeutics; however, most available data is from testing of small-molecule libraries. In this task, a machine learning model is asked to predict the experimental assay values given a small-molecule compound structure.

Impact. High-throughput screening is a critical component of small-molecule drug discovery in both industrial and academic research settings. Increasingly complex assays are now being automated to gain biological insights on compound activity at a large scale. However, the time and cost of screening a large library still limit experimental throughput. Machine learning models that can predict experimental outcomes can alleviate these limits and save time and cost by searching a larger chemical space and narrowing it down to a small set of highly likely candidates for further, smaller-scale HTS.

Generalization. The model should be able to generalize over structurally diverse drugs. It is also important for methods to generalize across cell lines. Drug dosage and measurement time points are also very important factors in determining the efficacy of the drug.

Pipeline. Activity - hit identification.

An in vitro screen of the Prestwick chemical library, composed of , approved drugs, in an infected cell-based assay. Given the SMILES string for a drug, the task is to predict its activity against SARS-CoV-2 (Touret et al. , MIT ). Suggested data split: scaffold split; Evaluation: AUPRC.
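Screening data typically contain few active compounds, which is one plausible reason AUPRC is suggested here instead of AUROC. The sketch below contrasts the two metrics using only the standard library. The function names and the toy screen are illustrative and are not TDC's evaluators; AUPRC is approximated by average precision, and the average-precision function assumes that no two compounds share a score.

```python
def average_precision(labels, scores):
    """Average precision, a common estimate of AUPRC.

    Assumes binary labels (1 = active) and no tied scores.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    hits = 0
    total = 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / k  # precision at the rank of each active
    return total / n_pos


def auroc(labels, scores):
    """AUROC as the fraction of (active, inactive) pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a correct pair
    return wins / (len(pos) * len(neg))


# Toy screen: ten compounds, one active, which the model ranks second.
toy_labels = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
toy_scores = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
ap = average_precision(toy_labels, toy_scores)
roc = auroc(toy_labels, toy_scores)
```

In this toy screen a single inactive compound ranked above the only active one halves the average precision, while the AUROC remains close to its maximum. A precision-based metric is therefore the stricter choice when only the top of the ranked list will be tested experimentally.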
TDC.SARSCoV2_3CLPro_Diamond: A large XChem crystallographic fragment screen of drugs against the SARS-CoV-2 main protease at high resolution. Given the SMILES string for a drug, the task is to predict its activity against the SARS-CoV-2 3CL protease (Diamond Light Source , MIT ). Suggested data split: scaffold split; Evaluation: AUPRC.

The HIV dataset consists of , drugs, and the task is to predict their ability to inhibit HIV replication. It was introduced by the Drug Therapeutics Program AIDS Antiviral Screen (NIH , Wu et al. ). Suggested data split: scaffold split; Evaluation: AUPRC.

Definition. The motion of molecules and protein targets can be described accurately with quantum theory, i.e., quantum mechanics (QM). However, ab initio quantum calculation of many-body systems suffers from a computational overhead that is impractical for most applications. Various approximations have been applied to solve for the energy from electronic structure, but all of them trade off accuracy against computational speed. Machine learning models offer hope of breaking this bottleneck by leveraging the knowledge in existing chemical data. This task aims to predict QM results given a drug's structural information.

Impact. A well-trained model can describe the potential energy surface accurately and quickly, so that more accurate and longer simulations of molecular systems are possible. The results of simulation can reveal biological processes at the molecular level and help study the function of protein targets and drug molecules.

Generalization. A machine learning model trained on a set of QM calculations is required to extrapolate to unseen or structurally diverse sets of compounds.

Product. Small-molecule.

Pipeline. Activity - lead development.

Definition. The vast majority of small-molecule drugs are synthesized through chemical reactions. Many factors during a reaction can lead to a suboptimal reactant-to-product conversion rate, i.e., yield.
Formally, yield is defined as the percentage of the reactants successfully converted to the target product. This learning task aims to predict the yield of a given single chemical reaction (Schwaller et al. ).

Impact. To maximize the synthesis efficiency of products of interest, an accurate prediction of the reaction yield can help chemists plan ahead and switch to alternative reaction routes, thereby avoiding the investment of hours and materials in wet-lab experiments and reducing the number of attempts.

Generalization. The models are expected to extrapolate to unseen reactions with diverse chemical structures and reaction types.

Product. Small-molecule.

Pipeline. Manufacturing - synthesis planning.

TDC.USPTO_Yields: The USPTO dataset is derived from the United States Patent and Trademark Office patent database (Lowe ) using a refined extraction pipeline from NextMove software. We selected the subset of USPTO that has a "TextMinedYield" label. It contains , reactions with reactants and products. Suggested data split: random split; Evaluation: MAE; Unit: % (yield rate).

TDC.Buchwald-Hartwig: Ahneman et al. ( ) performed high-throughput experiments on Pd-catalysed Buchwald-Hartwig C-N cross-coupling reactions, measuring the yield of each reaction. This dataset is included because a recent study (Schwaller et al. ) shows that USPTO has limited applicability. It contains , reactions (reactants and products). Suggested data split: random split; Evaluation: MAE; Unit: % (yield rate).

Definition. Antibodies, also known as immunoglobulins, are large, Y-shaped proteins that can identify and neutralize a pathogen's unique molecule, usually called an antigen. They play essential roles in the immune system and are powerful tools in research and diagnostics. A paratope, also called an antigen-binding site, is the region that selectively binds the epitope. Although the hypervariable regions responsible for binding are roughly known, it is still challenging to pinpoint the interacting amino acids.
This task is to predict which amino acids occupy the active positions of an antibody that can bind to the antigen.

Impact. Identifying the amino acids at critical positions can accelerate the engineering of novel antibodies.

Generalization. The models are expected to generalize to unseen antibodies with distinct structures and functions.

Pipeline. Activity, efficacy and safety.

) filtered by quality such as resolution and sequence identity. There are in total antibody chain sequences, covering both heavy and light chains. Suggested data split: random split; Evaluation: Average-AUROC.

Definition. CRISPR-Cas9 is a gene editing technology that allows targeted deletion or modification of specific regions of the DNA within an organism. This is achieved by designing a guide RNA sequence that binds upstream of the target site, which is then cleaved through a Cas9-mediated double-stranded DNA break. The cell responds by employing DNA repair mechanisms (such as non-homologous end joining) that result in heterogeneous outcomes, including gene insertion or deletion mutations (indels) of varying lengths and frequencies. This task aims to predict the repair outcome given a DNA sequence.

Impact. Gene editing offers a powerful new avenue of research for tackling intractable illnesses that are infeasible to treat using conventional approaches. For example, the FDA recently approved engineering of T cells using gene editing to treat patients with acute lymphoblastic leukemia (Lim & June ). However, since many human genetic variants associated with disease arise from insertions and deletions (Landrum ), it is critical to be able to better predict gene editing outcomes to ensure efficacy and avoid unwanted pathogenic mutations.

Generalization. van Overbeek et al. ( ) showed that the distribution of Cas9-mediated editing products at a given target site is reproducible and dependent on local sequence context.
Thus, repair outcomes predicted by well-trained models are expected to generalize across cell lines and reagent delivery methods.

Pipeline. Efficacy and safety.

TDC.Leenay: Primary T cells are a promising cell type for therapeutic genome editing, as they can be engineered efficiently ex vivo and then transferred to patients. This dataset consists of the DNA repair outcomes of CRISPR-Cas9 knockout experiments on primary CD4+ T cells drawn from donors (Leenay et al. ). For each of the , unique genomic locations from genes, the -nucleotide guide sequence is provided along with the -nucleotide PAM sequence. Five repair outcomes are included for prediction: fraction of indel reads with an insertion, average insertion length, average deletion length, indel diversity, and fraction of repair outcomes with a frameshift. Suggested data split: random split; Evaluation: MAE; Units: # for lengths, % for fractions, bits for diversity.

In this section, we describe multi-instance learning tasks and the associated datasets in TDC.

Definition. The activity of a small-molecule drug is measured by its binding affinity with the target protein. Given a new target protein, the very first step is to screen a set of potential compounds to find their activity. The traditional method of gauging affinities is high-throughput screening in wet-lab experiments (Hughes et al. ). However, these experiments are very expensive, which restricts their ability to search over a large set of candidates. The drug-target interaction prediction task aims to predict the interaction activity score in silico given only the accessible compound structural information and protein amino acid sequence.

Impact. Machine learning models that can accurately predict affinities can not only save pharmaceutical research costs by reducing the amount of high-throughput screening, but also enlarge the search space and avoid missing potential candidates.

Generalization.
Models must extrapolate to unseen compounds, unseen proteins, and unseen compound-protein pairs. Models are also expected to perform consistently across a diverse set of disease and target groups.

Pipeline. Activity - hit identification.

TDC.BindingDB: BindingDB is a public, web-accessible database that aggregates drug-target binding affinities from various sources such as patents, journals, and assays (Liu et al.). We partitioned the BindingDB dataset into three sub-datasets, each with different units ( .

Definition. Drug-drug interactions occur when two or more drugs interact with each other. These interactions can result in a range of outcomes, from reducing the efficacy of one or both drugs to dangerous side effects such as increased blood pressure or drowsiness. Polypharmacy side effects are associated with drug pairs (or higher-order drug combinations) and cannot be attributed to either individual drug in the pair. This task is to predict the interaction between two drugs.

Impact. Increasing co-morbidities with age often result in the prescription of multiple drugs simultaneously. Meta-analyses of patient records showed that drug-drug interactions were the cause of admission for prolonged hospital stays in % of the cases (Thomsen et al., Lazarou et al.). Predicting possible drug-drug interactions before the drugs are prescribed is thus an important step in preventing these adverse outcomes. In addition, as the number of pairwise or even higher-order drug combinations is astronomical, wet-lab experiments and real-world evidence are insufficient. Machine learning can provide an alternative way to inform drug interactions.

Generalization. As there is a very large space of possible drug-drug interactions that have not been explored, the model needs to extrapolate from known interactions to new drug combinations that have not been prescribed together in the past. Models should also take dosage into account, as it can have a significant impact on the effect of the drugs.

Pipeline.
Efficacy and safety - adverse event detection.

This dataset is manually sourced from FDA and Health Canada drug labels as well as from the primary literature. Given the SMILES strings of two drugs, the goal is to predict their interaction type. It contains , drug-drug interaction pairs between , drugs and interaction types (Wishart et al.). Suggested data split: random split; Evaluation: Macro-F1, Micro-F1.

This dataset contains , , drug-drug interaction pairs between drugs (Tatonetti et al.). Given the SMILES strings of two drugs, the goal is to predict the side effect caused by their interaction. Suggested data split: random split; Evaluation: Average-AUROC.

Definition. The same drug compound can produce different levels of response in different patients. Designing drugs for an individual or a group with certain characteristics is the central goal of precision medicine. For example, the same anti-cancer drug can elicit different responses in different cancer cell lines (Baptista et al.). This task aims to predict the drug response rate given a drug and a cell line genomics profile.

Impact. The number of combinations of available drugs and cell line genomics profiles is very large, and testing each combination in the wet lab is prohibitively expensive. A machine learning model that accurately predicts a drug's response across cell lines in silico can thus make the combination search feasible and greatly reduce the experimental burden. Fast prediction also allows screening of a large set of drugs, which helps avoid missing potent drugs.

Generalization. A model trained on existing drug-cell line pairs should predict accurately on new sets of drugs and cell lines. This requires the model to learn the underlying biochemical knowledge instead of memorizing the training pairs.

Pipeline. Activity.

(Yang et al.). We include two versions of GDSC; the second uses improved experimental procedures.
The first dataset (TDC.GDSC1) contains , measurements across cancer cells and drugs. The second dataset (TDC.GDSC2) contains , pairs, cell lines, and drugs. Suggested data split: random split; Evaluation: MAE; Unit: µM.

Definition. Synergy is a dimensionless measure of the deviation of an observed drug combination response from the effect expected under non-interaction. Synergy can be calculated using different models, such as the Bliss model, Highest Single Agent (HSA), the Loewe additivity model, and Zero Interaction Potency (ZIP). Another relevant metric is CSS, which measures drug combination sensitivity and is derived from the relative IC50 values of the compounds and the area under their dose-response curves.

Impact. Drug combination therapy offers enormous potential for expanding the use of existing drugs and improving their efficacy. For instance, the simultaneous modulation of multiple targets can address common mechanisms of drug resistance in the treatment of cancers. However, experimentally exploring the entire space of possible drug combinations is not feasible. Computational models that predict the therapeutic potential of drug combinations can thus be immensely valuable in guiding this exploration.

Generalization. Model predictions should adapt to varying underlying biology, as captured through different cell lines drawn from multiple tissues of origin. Dosage is also an important factor that can affect model generalizability.

Pipeline. Activity.

This dataset contains the summarized results of drug combination screening studies for the NCI-60 cancer cell lines (excluding the MDA-N cell line). A total of drugs are tested across cell lines, resulting in a total of , unique drug combination-cell line pairs. For each of the combination drugs, its canonical SMILES string is queried from PubChem (Zagidullin et al.).
For each cell line, the following features are downloaded from NCI's CellMiner interface: , gene features capturing transcript expression levels averaged from five microarray platforms, microRNA expression features and proteomic features that capture the abundance levels of a subset of proteins (Reinhold et al.). The labels included are CSS and four different synergy scores. Suggested data split: drug combination split; Evaluation: MAE; Unit: dimensionless.

A large-scale oncology screen produced by Merck & Co., where each sample consists of two compounds and a cell line. The dataset covers distinct combinations, each tested against human cancer cell lines derived from different tissue types. Pairwise combinations were constructed from diverse anticancer drugs ( experimental and approved). The synergy score is calculated from Loewe additivity values using the batch processing mode of Combenefit. The genomic features are from the ArrayExpress database (accession number: E-MTAB-), and were quantile-normalized and summarized by Preuer, Lewis, Hochreiter, Bender, Bulusu & Klambauer using a factor analysis algorithm for robust microarray summarization (FARMS; Hochreiter et al.). Suggested data split: drug combination split; Evaluation: MAE; Unit: dimensionless.

Definition. In the human body, T cells monitor the existing peptides and trigger an immune response if a peptide is foreign. For the T cells to decide whether a peptide is foreign, the peptide must be bound to a major histocompatibility complex (MHC) molecule. Therefore, predicting peptide-MHC binding affinity is pivotal for determining immunogenicity. There are two classes of MHC molecules: MHC Class I and MHC Class II. They are closely related in overall structure but differ in their subunit composition. This task is to predict the binding affinity between the peptide and the pseudo sequence in contact with the peptide, which represents the MHC molecule.
Impact. Identifying peptides that can bind to MHC allows us to engineer peptide-based therapeutics such as vaccines and cancer-specific peptides.

Generalization. The models are expected to generalize to unseen peptide-MHC pairs.

Pipeline. Activity - peptide design.

multi_pred.AntibodyAff: Antibody-Antigen Binding Affinity Prediction

Definition. Antibodies recognize pathogen antigens and destroy them. Their activity is measured by their binding affinities. This task is to predict the affinity from the amino acid sequences of both the antigen and the antibody.

Impact. Compared to small-molecule drugs, antibodies have numerous desirable properties, such as minimal adverse effects, and can bind to many "undruggable" targets thanks to different biochemical mechanisms. In addition, a reliable affinity predictor can help accelerate antibody development by reducing the amount of wet-lab experiments.

Generalization. The models are expected to extrapolate to unseen classes of antigen and antibody pairs.

Pipeline. Activity.

Definition. MicroRNAs (miRNAs) are small noncoding RNAs that play an important role in regulating biological processes such as cell proliferation and cell differentiation (Chen et al.). They usually function to downregulate gene targets. This task is to predict the interaction activity between a miRNA and its gene target.

Impact. Accurately predicting unknown interactions between miRNAs and targets can lead to a more complete understanding of disease mechanisms and could yield potential disease target biomarkers. Such predictions can also help identify miRNA hits for miRNA therapeutic candidates (Hanna et al.).

Generalization. The model needs to learn the biochemistry of miRNAs and target proteins so that it can extrapolate to novel miRNAs and targets in various disease groups and tissues.

Product. Small-molecule, miRNA therapeutic.

Pipeline. Basic biomedical research, target discovery, activity.
TDC.miRTarBase: miRTarBase is a large public database of miRNA-target interactions (MTIs) that have been validated experimentally, collected by manually surveying the literature on functional studies of miRNAs (Chou et al.). It contains , MTI pairs with , miRNAs and , targets. We use miRBase (Kozomara et al.) to obtain the mature miRNA sequence as the feature representation for miRNAs. Suggested data split: random split; Evaluation: AUROC.

Definition. During a chemical reaction, a catalyst increases the rate of the reaction. Catalysts are not consumed in the catalyzed reaction and can act repeatedly. This learning task aims to predict the catalyst for a reaction given both the reactant molecules and the product molecules (Zahrt et al.).

Impact. Conventionally, chemists design and synthesize catalysts by trial and error guided by chemical intuition, which is usually time-consuming and costly. Machine learning models can automate and accelerate the process, help understand the catalytic mechanism, and provide insight into novel catalyst design (Zahrt et al., Coley et al.).

Generalization. In real-world discovery, as discussed, the molecular structures in reactions of interest evolve over time (Sheridan). We expect models to generalize to unseen molecules and reactions.

Pipeline. Manufacturing - synthesis planning.

TDC.USPTO_Catalyst: The USPTO dataset is derived from the United States Patent and Trademark Office patent database (Lowe) using a refined extraction pipeline from NextMove software. TDC selects the most common catalysts that occur more than times. It contains , reactions with reaction types, , reactants and , products with common catalyst types. Suggested data split: random split; Evaluation: Micro-F1, Macro-F1.

In this section, we describe generative learning tasks and the associated datasets in TDC.

Definition. Molecule generation aims to generate diverse, novel molecules that have desirable chemical properties (Gómez-Bombarelli et al. , Kusner et al. , Polykovskiy et al.
, Brown et al.). These properties are measured by oracle functions. A machine learning model first learns the molecular characteristics from a large set of molecules, each of which is evaluated through the oracles. Then, novel candidates can be obtained from the learned distribution.

Impact. As the entire chemical space is far too large to screen for each target, high-throughput screening is restricted to existing molecule libraries. Many novel drug candidates are thus usually omitted. A machine learning model that can generate novel molecules satisfying pre-defined optimal properties can circumvent this problem and yield novel classes of candidates.

Generalization. The generated molecules have to attain superior properties across a range of structurally diverse drugs. In addition, the generated molecules have to satisfy other basic requirements, such as synthesizability and low off-target effects.

Pipeline. Efficacy and safety - lead development and optimization, activity - hit identification.

(Polykovskiy et al.). Within this benchmark, MOSES provides a cleaned dataset of molecules that are ideal for optimization. It is processed from the ZINC Clean Leads dataset (Sterling & Irwin). It contains , , molecules.

ZINC is a free database of commercially available compounds for virtual screening. TDC uses a version from the original Mol-VAE paper (Gómez-Bombarelli et al.), which randomly extracted a set of , molecules from the version of ZINC (Irwin et al.).

ChEMBL is a manually curated database of bioactive molecules with drug-like properties (Mendez et al., Davies et al.). It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs. It contains , , molecules.

Definition. Retrosynthesis is the process of finding a set of reactants that can synthesize a target molecule, i.e., the product, which is a fundamental task in drug manufacturing (Liu et al., Zheng et al.).
The target is recursively transformed into simpler precursor molecules until commercially available "starting" molecules are identified. Each data sample has exactly one product molecule, whereas the reactants can be one or more molecules. Retrosynthesis prediction can be seen as the reverse process of reaction outcome prediction.

Impact. Retrosynthesis planning helps chemists design synthetic routes to target molecules. Computational retrosynthetic analysis tools can greatly assist chemists in designing synthetic routes to novel molecules, and machine learning based methods can significantly reduce the time and cost involved.

Generalization. The model is expected to accurately generate reactant sets for novel drug candidates whose structures are distinct from the training set, across reaction types with varying reaction conditions.

Pipeline. Manufacturing - synthesis planning.

TDC.USPTO-50K: USPTO (United States Patent and Trademark Office) 50K consists of 50K extracted atom-mapped reactions with reaction types (Schneider et al.). It contains , reactions. Suggested data split: random split; Evaluation: Top-K accuracy.

TDC.USPTO: The USPTO dataset is derived from the United States Patent and Trademark Office patent database (Lowe) using a refined extraction pipeline from NextMove software. It contains , , reactions. Suggested data split: random split; Evaluation: Top-K accuracy.

Definition. Reaction outcome prediction is to predict the reaction products given a set of reactants (Jin et al.). It can be seen as the reverse process of retrosynthesis prediction, as described above.

Impact. Predicting the products of a chemical reaction is a fundamental problem in organic chemistry and is quite challenging for many complex organic reactions. Conventional empirical methods that rely on experimentation require intensive manual labor by an experienced chemist and are time-consuming and expensive.
Reaction outcome prediction aims at automating this process.

Generalization. The model is expected to accurately generate products for novel sets of reactants across reaction types with varying reaction conditions.

Pipeline. Manufacturing - synthesis planning.

TDC.USPTO: The USPTO dataset is derived from the United States Patent and Trademark Office patent database (Lowe) using a refined extraction pipeline from NextMove software. It contains , , reactions. Suggested data split: random split; Evaluation: Top-K accuracy.

TDC implements a comprehensive suite of auxiliary functions frequently used in therapeutics ML. This functionality is wrapped in an easy-to-use interface. Broadly, we provide functions for a) evaluating model performance, b) generating realistic dataset splits, c) constructing oracle generators for molecules, and d) processing, formatting, and mapping of datasets. Next, we describe these functions; detailed documentation and usage examples can be found at https://tdcommons.ai.

To evaluate the predictive performance of ML models built on the TDC datasets, we provide model evaluators. The evaluators implement established performance measures and additional metrics used in biology and chemistry.

• Regression: TDC includes common regression metrics, including the mean squared error (MSE), mean absolute error (MAE), coefficient of determination (R²), Pearson's correlation (PCC), and Spearman's correlation (Spearman's ρ).
• Binary Classification: TDC includes common metrics, including the area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), accuracy, precision, recall, precision at recall of K (PR@K), and recall at precision of K (RP@K).
• Multi-Class and Multi-label Classification: TDC includes Micro-F1, Macro-F1, and Cohen's Kappa.
• Token-Level Classification conducts binary classification for each token in a sequence.
TDC provides Avg-AUROC, which calculates the AUROC score between the sequence of 0/1 true labels and the sequence of predicted labels for every instance. Then, it averages the AUROC scores across all instances.
• Molecule Generation Metrics evaluate distributional properties of generated molecules. TDC supports the following metrics:
- Diversity of a set of molecules is defined as the average pairwise Tanimoto distance between the Morgan fingerprints of the molecules (Benhenda).
- KL divergence (Kullback-Leibler divergence) is computed between the probability distribution of a particular physicochemical descriptor on the training set and the probability distribution of the same descriptor on the set of generated molecules (Brown et al.). Models that capture the distribution of molecules in the training set achieve a small KL divergence score. To increase the diversity of generated molecules, we want high KL divergence scores.
- FCD Score (Fréchet ChemNet Distance) first takes the means and covariances of the activations of the penultimate layer of ChemNet, as calculated for the reference set and for the set of generated molecules (Brown et al., Preuer, Renz, Unterthiner, Hochreiter & Klambauer). The FCD score is then calculated as the pairwise Fréchet distance between the reference set and the set of generated molecules. Similar molecular distributions are characterized by low FCD values.
- Novelty is the fraction of generated molecules that are not present in the training set (Polykovskiy et al.).
- Validity is calculated using RDKit's molecular structure parser, which checks atoms' valency and the consistency of bonds in aromatic rings (Polykovskiy et al.).
- Uniqueness measures how often a model generates duplicate molecules (Polykovskiy et al.). When that happens often, the uniqueness score is low.

A data split specifies a partitioning of the dataset into training, validation and test sets to train, tune and evaluate ML models.
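As an illustration of three of the evaluators listed above (Avg-AUROC, novelty, and uniqueness), the following is a minimal pure-Python sketch and not the TDC implementation. It assumes that AUROC is computed with the pairwise (Mann-Whitney) formulation, that an instance with only one label class raises an error (the text does not say how such instances are handled), that uniqueness is the fraction of distinct molecules, and that SMILES strings have already been canonicalized so that string equality means molecule equality. The function names are hypothetical.

```python
from typing import Iterable, List, Sequence


def auroc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        # Assumption: undefined AUROC is treated as an error in this sketch.
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def avg_auroc(true_seqs: Sequence[Sequence[int]],
              score_seqs: Sequence[Sequence[float]]) -> float:
    """Token-level Avg-AUROC: one AUROC per sequence, then the mean."""
    if len(true_seqs) != len(score_seqs):
        raise ValueError("need one score sequence per label sequence")
    per_instance = [auroc(t, s) for t, s in zip(true_seqs, score_seqs)]
    return sum(per_instance) / len(per_instance)


def uniqueness(generated: List[str]) -> float:
    """Fraction of distinct molecules among the generated ones (assumed definition)."""
    return len(set(generated)) / len(generated)


def novelty(generated: List[str], training: Iterable[str]) -> float:
    """Fraction of generated molecules that do not appear in the training set."""
    train_set = set(training)
    return sum(1 for g in generated if g not in train_set) / len(generated)


if __name__ == "__main__":
    labels = [[0, 1, 1, 0], [1, 0, 0]]
    scores = [[0.1, 0.9, 0.8, 0.2], [0.3, 0.6, 0.1]]
    print(avg_auroc(labels, scores))             # 0.75
    gen = ["CCO", "CCO", "CCN", "c1ccccc1"]
    print(uniqueness(gen), novelty(gen, ["CCO"]))  # 0.75 0.5
```

The pairwise AUROC is quadratic in sequence length, which is acceptable for short token sequences; a rank-based formulation would be preferable for long ones.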
To date, TDC provides the following types of data splits:

• Random Splits represent the simplest strategy and can be used with any dataset. The random split selects data instances at random and partitions them into train, validation, and test sets.
• Scaffold Splits partition molecules into bins based on their Murcko scaffolds (Wu et al., Yang et al.). These bins are then assigned to construct structurally diverse train, validation, and test sets. The scaffold split is more challenging than the random split and is also more realistic.
• Cold-Start Splits are implemented for multi-instance prediction problems (e.g., DTI, GDA, DrugRes, and MTI tasks, which involve predicting properties of heterogeneous tuples consisting of objects of different types, such as proteins and drugs). The cold-start split first splits the dataset into train, validation, and test sets on one entity type (e.g., drugs) and then moves all pairs associated with a given entity into that entity's set to produce the final split.
• Combinatorial Splits are used for combinatorial and polytherapy tasks. This split produces disjoint sets of drug combinations in the train, validation, and test sets so that the generalizability of model predictions to unseen drug combinations can be tested.

Molecule generation aims to produce novel molecules with desired properties. The extent to which the generated molecules have properties of interest is quantified by a variety of scoring functions, referred to as oracles. To date, TDC provides a wrapper to easily access and process oracles. Specifically, we include popular oracles from the GuacaMol benchmark (Brown et al.), including rediscovery, similarity, median, isomers, scaffold hops, and others. We also include heuristic oracles, including the synthetic accessibility (SA) score (Ertl & Schuffenhauer), quantitative estimate of drug-likeness (QED) (Bickerton et al.), and penalized LogP (Landrum).
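The cold-start split described in the list of data splits above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not TDC's code: the function name `cold_start_split`, the tuple layout `(drug_id, target_id, label)`, and the default fractions and seed are all hypothetical choices made for the example.

```python
import random
from typing import Dict, List, Sequence, Tuple

# Hypothetical record layout for a multi-instance task such as DTI.
Pair = Tuple[str, str, float]  # (drug_id, target_id, label)


def cold_start_split(pairs: Sequence[Pair],
                     entity_index: int = 0,
                     frac: Sequence[float] = (0.7, 0.1, 0.2),
                     seed: int = 42) -> Dict[str, List[Pair]]:
    """Split pairs so that each entity of one type appears in exactly one set.

    Step 1: partition the unique entities (e.g., drugs) into train/valid/test.
    Step 2: send every pair to the set that owns its entity.
    """
    entities = sorted({p[entity_index] for p in pairs})  # sorted for reproducibility
    rng = random.Random(seed)
    rng.shuffle(entities)

    n_train = round(frac[0] * len(entities))
    n_valid = round(frac[1] * len(entities))
    owner = {}
    for i, entity in enumerate(entities):
        if i < n_train:
            owner[entity] = "train"
        elif i < n_train + n_valid:
            owner[entity] = "valid"
        else:
            owner[entity] = "test"

    split: Dict[str, List[Pair]] = {"train": [], "valid": [], "test": []}
    for p in pairs:
        split[owner[p[entity_index]]].append(p)
    return split


if __name__ == "__main__":
    toy = [(f"drug{i}", f"target{j}", 0.0) for i in range(10) for j in range(3)]
    parts = cold_start_split(toy)
    print({name: len(rows) for name, rows in parts.items()})
```

Because the fractions apply to entities rather than pairs, the resulting set sizes only approximate the requested proportions when entities have unequal numbers of pairs. Setting `entity_index=1` would hold out targets instead of drugs.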
A major limitation of current de novo molecule generation evaluation is that it relies on the overly simplistic oracles mentioned above. As such, the oracles are either too easy to optimize or can produce unrealistic molecules. This issue was pointed out by Coley et al., who found that current evaluations for generative models do not reflect the complexity of real discovery problems. For this reason, TDC collects novel oracles that are more appropriate for realistic de novo molecule generation. Next, we describe the details.

• Docking Score: Docking is a theoretical evaluation of affinity (i.e., the free energy change of the binding process) between a small molecule and a target (Kitchen et al.). A docking evaluation usually includes conformational sampling of the ligand and calculation of the change in free energy. A molecule with higher affinity usually has a higher potential to possess higher bioactivity. Recently, Cieplinski et al. showed the importance of docking in molecule generation. For this reason, TDC includes a meta oracle for molecular docking, for which we adopted a Python wrapper from pyscreener (Graff et al.) to allow easy access to various docking software, including AutoDock Vina (Trott & Olson), smina (Koes et al.), Quick Vina (Alhossary et al.), PSOVina (Ng et al.), and DOCK (Allen et al.).
• ASKCOS: Gao & Coley found that surrogate scoring models cannot sufficiently determine how difficult a compound is to synthesize. Following this observation, we provide a score derived from the analysis of the full retrosynthetic pathway. To this end, TDC leverages ASKCOS (Coley et al.), an open-source framework that integrates efforts to generalize known chemistry to new substrates by applying retrosynthetic transformations, identifying suitable reaction conditions, and evaluating which reactions are likely to be successful. The data-driven models are trained with the USPTO and Reaxys databases.
• Molecule.one: The Molecule.one API estimates the synthetic accessibility (Liu et al.) of a molecule based on a number of factors, including the number of steps in the predicted synthetic pathway (Sacha et al.) and the cost of the starting materials. Currently, the API token can be requested from the Molecule.one website and is provided on a one-to-one basis for research use. We are working with Molecule.one to provide more open access from within TDC in the near future.
• IBM RXN: IBM RXN Chemistry is an AI platform that integrates forward reaction prediction and retrosynthetic analysis. The backend of the IBM RXN retrosynthetic analysis is a molecular transformer model (Schwaller et al.). The model was trained using the USPTO and Pistachio databases. Because of the licensing of the retrosynthetic analysis software, TDC requires the API token as input to the oracle function, along with the input drug SMILES strings.
• GSK3β: Glycogen synthase kinase 3 beta (GSK3β) is an enzyme in humans that is encoded by the GSK3β gene. Abnormal regulation and expression of GSK3β is associated with an increased susceptibility to bipolar disorder. The oracle is a random forest classifier that uses ECFP fingerprints and is trained on the ExCAPE-DB dataset (Sun et al., Jin et al.).
• JNK3: c-Jun N-terminal kinase 3 (JNK3) belongs to the mitogen-activated protein kinase family. These kinases are responsive to stress stimuli, such as cytokines, ultraviolet irradiation, heat shock, and osmotic shock. The oracle is a random forest classifier that uses ECFP fingerprints and is trained on the ExCAPE-DB dataset (Sun et al., Jin et al.).
• DRD2: DRD2 is the dopamine type 2 receptor. The oracle was constructed by Olivecrona et al. using a support vector machine classifier with a Gaussian kernel and ECFP fingerprints on the ExCAPE-DB dataset (Sun et al.).
Finally, TDC supports several utility functions for data processing, such as visualization of label distribution, data binarization, conversion of label units, summary of data statistics, data balancing, graph transformations, negative sampling, and database queries.

Biochemical entities can be represented in various machine learning formats. One of the challenges that hinders machine learning researchers with limited biomedical training is converting between these formats. TDC provides a MolConvert class that enables format transformation in a few lines of code. Specifically, for 2D molecules, it takes in SMILES or SELFIES (Krenn et al.) and transforms them into molecular graph objects for the Deep Graph Library and the PyTorch Geometric library, and into various molecular features such as ECFP, MACCS, Daylight, RDKit2D, Morgan and PubChem. For 3D molecules, it takes in XYZ or SDF files and transforms them into 3D molecular graph objects, Coulomb matrices, and any of the 2D formats. New formats for more entities will also be included in the future.

TDC has a flexible ecosystem of tools, libraries, and community resources that lets researchers push the state of the art in ML and go from model building and training to deployment much more easily. To boost the accessibility of the project, TDC can be installed through the Python Package Index (PyPI) via:

    pip install PyTDC

TDC provides a collection of workflows with intuitive, high-level APIs for both beginners and experts to create machine learning models in Python. Building on the modularized "Problem-Learning Task-Data Set" structure (see Section ) in TDC, we provide a three-layer API to access any learning task and dataset. This hierarchical API design allows us to easily incorporate new tasks and datasets. Suppose you want to retrieve the dataset "DILI" to study the learning task "Tox", which belongs to the class of problems "single_pred".
To obtain the dataset and its associated data split, use the following:

    from tdc.single_pred import Tox
    data = Tox(name = 'DILI')
    df = data.get_data()

The user only needs to specify these three variables; TDC then automatically retrieves the processed, machine learning-ready dataset from the TDC server and generates a data object, which contains numerous utility functions that can be applied directly to the dataset. For example, to get the training, validation, and test splits, type the following:

    from tdc.single_pred import Tox
    data = Tox(name = 'DILI')
    # 70% train, 10% validation, 20% test, with a fixed random seed
    split = data.get_split(method = 'random', seed = 42, frac = [0.7, 0.1, 0.2])

For other data functions, TDC provides one-liners. For example, to access the "MSE" evaluator:

    from tdc import Evaluator
    evaluator = Evaluator(name = 'MSE')
    # y_true and y_pred hold the ground-truth labels and the model predictions
    score = evaluator(y_true, y_pred)

To access any of the oracles currently implemented in TDC, specify the oracle name to obtain the oracle function and provide SMILES strings as inputs. Further, TDC allows users to access each dataset in a benchmark group (see Section ), for example the "ADMET_Group".

Motivation. A small-molecule drug needs to travel from the site of administration (e.g., oral) to the site of action (e.g., a tissue) and then decompose and exit the body. Therefore, the chemical is required to have numerous ideal absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties (Van De Waterbeemd & Gifford). Thus, early and accurate ADMET profiling during the discovery stage is an essential condition for the successful development of a small-molecule candidate. Accurate ML models that can predict various ADMET endpoints are thus highly sought after.

Experimental setup. We leverage ADMET datasets included in TDC - the largest public ADMET benchmark. The included endpoints are widely used in pharmaceutical companies, such as metabolism with various CYP enzymes, half-life, clearance, and off-target effects.
In real-world discovery, the drug structures of interest evolve. Thus, ADMET prediction requires a model to generalize to a set of unseen drugs that are structurally distant from the known drug set. We adopt the scaffold split to simulate this effect. Data are split into : : train:validation:test, where the train and validation sets are shuffled five times to create five random runs. For binary classification, AUROC is used for balanced data and AUPRC when positives are fewer than negatives. For regression, MAE is used, with Spearman correlation for benchmarks where the trend is more important than the absolute error.

Baselines. The focus in this task is representation learning. We include (1) a multi-layer perceptron (MLP) with an expert-curated fingerprint (Morgan fingerprint (Rogers & Hahn) with , bits) or descriptor (RDKit2D (Landrum), -dim); (2) a convolutional neural network (CNN) on SMILES strings, which applies 1D convolution over a string representation of the molecule (Huang, Fu, Glass, Zitnik, Xiao & Sun); (3) state-of-the-art (SOTA) ML models based on graph neural networks (GNNs) operating on molecular 2D graphs, including neural fingerprint (NeuralFP) (Duvenaud et al.), graph convolutional network (GCN) (Kipf & Welling), and attentive fingerprint (AttentiveFP) (Xiong et al.), three powerful GNN models. In addition, Hu, Liu, Gomes, Zitnik, Liang, Pande & Leskovec recently adapted pretraining strategies to molecular graphs; we include two of them, attribute masking (AttMasking) and context prediction (ContextPred). Methods follow the default hyperparameters described in the original papers.

Results. Results are shown in Table . Overall, we find that pretraining GIN (Graph Isomorphism Network) (Xu et al.
) with context prediction achieves the best performance in endpoints and attribute masking in endpoints, with combined for the pretraining strategies, especially in CYP enzyme predictions. The expert-curated descriptor RDKit2D also achieves the best results on five endpoints, while the SMILES-based CNN is the best performer on one. Our systematic evaluation yields three key findings. First, the SOTA ML models do not work consistently well for these novel, realistic endpoints. In some cases, methods based on learned features are worse than the efficient domain features. This gap highlights the necessity of realistic benchmarking. Second, performance varies across feature types for different endpoints. For example, in TDC.CYP A -S, the SMILES-based CNN is . %-. % better than the graph-based methods. We suspect that this is because different feature types carry different signals (e.g., a GNN focuses on local aggregation of substructures, whereas descriptors are global biochemical features). Thus, future integration of these signals could potentially improve performance. Third, the best-performing methods use pretraining strategies, highlighting an exciting avenue for applying recent advances in self-supervised learning to the biomedical setting.

[Table: results on the ADMET benchmark group. Each row holds eight "value ± deviation" entries; the numeric entries are not recoverable. Recoverable rows (dataset, direction, metric): TDC.CYP A -I (↑, AUPRC); TDC.CYP D -S (↑, AUPRC); TDC.CYP A -S (↑, AUROC); TDC.CYP C -S (↑, AUPRC); TDC.Half_Life (↑, Spearman); TDC.CL-Micro (↑, Spearman); TDC.CL-Hepa (↑, Spearman); TDC.hERG (↑, AUROC); TDC.AMES (↑, AUROC); TDC.DILI (↑, AUROC); TDC.LD (↓, MAE); plus two AUPRC (↑) rows whose dataset names are missing.]

Motivation. Drug-target interactions (DTI) characterize the binding of compounds to disease targets. Identifying high-affinity compounds is the first crucial step in drug discovery. Recent ML models have shown strong performance in DTI prediction (Huang, Fu, Glass, Zitnik, Xiao & Sun), but they adopt random dataset splits in which test sets contain unseen compound-target pairs while both the compounds and the targets have been seen during training. However, pharmaceutical companies develop compound screening campaigns for novel targets or screen a novel class of compounds for known targets. These novel compounds and targets shift over the years. Thus, a DTI ML model must achieve consistent performance under the subtle domain shifts along the temporal dimension. Recently, numerous domain generalization methods have been developed in the context of images and languages (Koh et al.) but rarely in the biomedical space.

Experimental setup. In this benchmark, we use DTIs in TDC.BindingDB that have patent information. Specifically, we define each domain as the set of DTIs patented in a specific year. We test various domain generalization methods to predict out-of-distribution DTIs in - after training on - DTIs, simulating the realistic scenario. Note that time information for specific targets and compounds is usually private data. Thus, we use the patent year of the DTI as a reasonable proxy to simulate this realistic challenge. We use the popular deep learning based DTI model DeepDTA (Öztürk et al.) as the backbone of all domain generalization algorithms. The evaluation metric is the Pearson correlation coefficient (PCC). Validation set selection is crucial for a fair comparison of domain generalization methods.
Following the "Training-domain validation set" strategy of Gulrajani & Lopez-Paz ( ), we randomly hold out % of the - DTIs as the validation set and use them to compute in-distribution performance, since they follow the same distribution as the training set. The - DTIs are used only during testing, which we call out-of-distribution.

Baselines. ERM (Empirical Risk Minimization) (Vapnik ) is the standard training strategy, where errors across all domains and data are minimized. We then include several types of SOTA domain generalization algorithms: MMD (Maximum Mean Discrepancy) (Li et al. ) optimizes feature similarity across domains as measured by maximum mean discrepancy; CORAL (Correlation Alignment) (Sun & Saenko ) matches the mean and covariance of features across domains; IRM (Invariant Risk Minimization) (Ahuja et al. ) obtains features for which a linear classifier is optimal across domains; GroupDRO (distributionally robust neural networks for group shifts) (Sagawa et al. ) optimizes ERM and adjusts the weights of domains with larger errors; MTL (Marginal Transfer Learning) (Blanchard et al. ) concatenates the original features with an augmented vector derived from the marginal distribution of feature vectors, which in practice is the mean of the feature embedding; and ANDMask (Parascandolo et al. ) masks gradients whose signs are inconsistent across domains for the corresponding weights. Most of these methods were developed for classification, so we modify the objective functions for regression and keep the rest unchanged. Methods follow the default hyperparameters described in the paper.

Results. Results are shown in Figure . First, we observe that in-distribution performance reaches . PCC and is very stable across the years, suggesting high predictive power of ML models in the unrealistic but widely adopted ML setting. However, out-of-distribution performance degrades significantly, from . % to . % across methods, suggesting that domain shift exists and that this realistic constraint breaks the usual training strategies.
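As a rough illustration of the setup described above, the sketch below groups DTI records into per-year domains, holds out a random fraction of the training-year records as an in-distribution validation set, and reports PCC per domain. It is a minimal sketch under stated assumptions, not the TDC implementation: the function names (`temporal_domain_split`, `pcc_by_domain`), the record fields (`year`, `affinity`), and the `predict` callable are all hypothetical, and the year ranges and validation fraction are left as arguments because the source does not preserve them.

```python
import math
import random
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient (PCC) of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def temporal_domain_split(records, train_years, test_years, val_frac, seed=0):
    """Hypothetical helper: split records by patent year.

    Records from `train_years` are shuffled and a `val_frac` share is held out
    as the in-distribution validation set; records from `test_years` form the
    out-of-distribution test set and are never used for training.
    """
    rng = random.Random(seed)
    in_dist = [r for r in records if r["year"] in train_years]
    ood_test = [r for r in records if r["year"] in test_years]
    rng.shuffle(in_dist)
    n_val = int(round(val_frac * len(in_dist)))
    return in_dist[n_val:], in_dist[:n_val], ood_test


def pcc_by_domain(records, predict):
    """Hypothetical helper: PCC between labels and predictions, per year."""
    by_year = defaultdict(list)
    for r in records:
        by_year[r["year"]].append(r)
    return {
        year: pearson([r["affinity"] for r in rs], [predict(r) for r in rs])
        for year, rs in sorted(by_year.items())
    }
```

Calling `pcc_by_domain` separately on the validation records and on the out-of-distribution records gives the two sets of per-year scores that a heatmap like the one in the figure would compare.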
Second, although the best-performing methods are MMD and CORAL, the standard training strategy performs similarly to current ML SOTA domain generalization algorithms. This agrees with the systematic study by Gulrajani & Lopez-Paz ( ) and highlights a demand for robust domain generalization methods specialized for biomedical problems.

Figure : Heatmap visualization of domain generalization method performance on each domain in the TDC DTI-DG benchmark using TDC.BindingDB. We observe a significant gap between in-distribution and out-of-distribution performance, which highlights the demand for algorithmic innovation.

Motivation. AI-assisted drug design aims to generate novel molecular structures with desired pharmaceutical properties. Recent progress in generative modeling has shown promising results in this area. However, current experiments focus on optimizing simple heuristic oracles, such as QED (quantitative estimate of drug-likeness) and LogP (octanol-water partition coefficient) (Jin et al. , You et al. , Zhou et al. ), whereas an experimental evaluation, such as a bioassay or a high-fidelity simulation, is far more costly and therefore requires a more data-efficient strategy. Further, because generative models can explore chemical space beyond a predefined library, the generated molecules might be valid but not synthesizable (Gao & Coley ). We therefore use docking simulation (Cieplinski et al. , Steinmann & Jensen ) as an oracle and build benchmark groups around it. Docking evaluates the affinity between a ligand (a small-molecule drug) and a target (a protein involved in the disease) and is widely used in drug discovery practice (Lyu et al. ). In addition to the objective function value, we add a quality filter and a synthetic accessibility score to evaluate generation quality within a limited number of oracle calls.

Experimental setup.
We use the TDC.ZINC dataset as the molecule library and the TDC.Docking oracle function as the docking score evaluator against the target DRD , a crucial target for neurological diseases such as tremor and schizophrenia. To imitate a low-data scenario, we limit the number of available oracle calls to four levels: , , , . In addition to typical oracle scores, we investigate additional metrics to evaluate the quality of generated molecules, including ( ) Top /Top /Top : average docking score of the top- / / generated molecules for a given target; ( ) Diversity: average pairwise Tanimoto distance of Morgan fingerprints for the Top generated molecules; ( ) Novelty: fraction of generated molecules not present in the training set; ( ) m : synthesizability score of molecules obtained via the molecule.one retrosynthesis model (Sacha et al. ); ( ) %pass: fraction of generated molecules that pass the a priori defined filters; ( ) Top %pass: the lowest docking score among molecules that pass the filter. Each model is run three times with different random seeds.

Baselines. We compare domain SOTA methods, including Screening (Lyu et al. ) (simulated as random sampling) and Graph-GA (graph-based genetic algorithm) (Jensen ), with ML SOTA methods, including string-based LSTM (Segler et al. ), GCPN (Graph Convolutional Policy Network) (You et al. ), MolDQN (Deep Q-Network) (Zhou et al. ), and MARS (Markov molecular Sampling) (Xie et al. ). We also include best-in-data, which chooses the molecules with the highest docking score from the ZINC K database, as a reference. Methods follow the default hyperparameters described in the paper.

Results. Results are shown in Table . Overall, we observe that almost no model performs well under a limited oracle setting. The majority of the methods cannot surpass the best-in-data docking scores under , , , allowable oracle calls. In the , oracle calls setting, Graph-GA (-. ) and LSTM (-. ) finally surpass the best-in-data result.
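Several of the evaluation metrics listed in the experimental setup can be sketched in a few lines. The version below is illustrative only and rests on assumptions not stated in the source: fingerprints are represented as Python sets of on-bit indices rather than computed with a cheminformatics toolkit, molecules are compared for novelty by exact string identity (for example canonical SMILES), and lower docking scores are treated as better, in line with the "lowest docking score" wording above. The function names are hypothetical.

```python
from itertools import combinations


def top_k_mean(scores, k):
    """Mean of the k best docking scores, assuming lower scores are better."""
    best = sorted(scores)[:k]
    return sum(best) / len(best)


def tanimoto_distance(fp_a, fp_b):
    """1 - |A & B| / |A | B| for fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(fp_a & fp_b) / union


def diversity(fingerprints):
    """Average pairwise Tanimoto distance; needs at least two fingerprints."""
    pairs = list(combinations(fingerprints, 2))
    return sum(tanimoto_distance(a, b) for a, b in pairs) / len(pairs)


def novelty(generated, training):
    """Fraction of generated molecules that do not appear in the training set."""
    seen = set(training)
    return sum(1 for mol in generated if mol not in seen) / len(generated)
```

In practice the fingerprint sets would come from a toolkit such as RDKit, and the training set would be the molecule library used to fit the generative model.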
Graph-GA dominates the leaderboard with learnable parameters in terms of optimization ability, while a simple SMILES LSTM ranks behind it. The SOTA ML models that reported excellent performance with unlimited calls to trivial oracles cannot beat virtual screening when allowed fewer than , oracle calls. This result questions the utility of current ML SOTA methods and calls for the ML molecular generation community to shift its focus toward realistic constraints during evaluation. As for synthesizability, as the number of allowable oracle calls increases, a larger fraction of the generated molecules have undesired structures despite the increasing affinity. We observe a monotonic increase in the m score of the best-performing Graph-GA method as more oracle calls are allowed. In the , calls category, only . % -. % of the generated molecules pass the molecule filters, and among the passing molecules, the best docking score drops significantly compared with before filtering. By contrast, LSTM maintains relatively good generation quality in all categories, showing that ML generative models have an advantage in learning the distribution of the training set and producing "normal" molecules. Recent synthesizability-constrained generation (Korovina et al. , Gottipati et al. , Bradshaw et al. ) is also a promising approach to this problem. We expect to see more ML models explicitly considering synthesizability.

Therapeutics machine learning is an emerging field with many hard algorithmic challenges and applications with immense opportunities for expansion, innovation, and impact. To this end, Therapeutics Data Commons (TDC) is a platform of AI-ready datasets and learning tasks for drug discovery and development. Curated datasets, strategies for systematic model development and evaluation, and an ecosystem of tools, leaderboards, and community resources in TDC serve as a meeting point for domain and machine learning scientists.
We envision that TDC can considerably accelerate machine learning model development, validation, and transition into production and clinical implementation. To facilitate algorithmic and scientific innovation in therapeutics, we will support the continued development of TDC to provide AI-ready datasets and enhance outreach to build an inclusive research community:
• New Learning Tasks and Datasets: We are actively working to include new learning tasks and datasets and to keep abreast of the state of the art. We are currently working on tasks related to emerging therapeutic products, including antibody-drug conjugates (ADCs) and proteolysis targeting chimeras (PROTACs), and new pipelines, including clinical trial design, drug delivery, and postmarketing safety.
• New ML Tools: We plan to implement additional data functions and provide additional tools, libraries, and community resources.
• New Leaderboards and Competitions: We plan to design new leaderboards for tasks that are of interest to the therapeutics community and have great potential to benefit from advanced machine learning.
Lastly, TDC is an open science initiative. We welcome contributions from the research community.

Structure and function of the blood-brain barrier Personalized medicine and the power of electronic health records Large-scale analysis of disease pathways in the human interactome Predicting reaction performance in c-n cross-coupling using machine learning Invariant risk minimization games Fast, accurate, and reliable molecular docking with QuickVina DOCK : Impact of new features and current docking performance Predicting chemically-induced skin reactions.
part I: QSAR models of skin sensitization and their application to identify potentially hazardous compounds P-glycoprotein inhibition for optimal drug delivery Human drug hepatotoxicity: a contemporary clinical perspective Experimental in vitro dmpk and physicochemical data on a set of publicly disclosed compounds Deep learning for drug response prediction in cancer The properties of known drugs Basic principles of pharmacokinetics ChemGAN challenge for drug discovery: can AI reproduce natural chemical diversity? The protein data bank Quantifying the chemical beauty of drugs BIOVIA pipeline pilot Domain generalization by marginal transfer learning million druglike small molecules for virtual screening in the chemical universe database GDB-' Barking up the right tree: an approach to search over molecule synthesis dags A novel approach for predicting p-glycoprotein (abcb ) inhibition using molecular interaction fields GuacaMol: benchmarking models for de novo molecular design Selecting relevant descriptors for classification by bayesian estimates: a comparison with decision trees and support vector machines approaches for disparate data sets The role of microRNA-and microRNA-in skeletal muscle proliferation and differentiation Predicting antibody developability from sequence using machine learning miRTarBase update : a resource for experimentally validated microRNA-target interactions We should at least be able to design molecules that dock well Autonomous discovery in the chemical sciences part II: Outlook A robotic platform for flow synthesis of organic compounds informed by ai planning ChEMBL web services: streamlining access to drug discovery data and utilities Comprehensive analysis of kinase inhibitor selectivity ImageNet: A large-scale hierarchical image database Mechanistic insights from comparing intrinsic clearance values between human liver microsomes and hepatocytes to guide drug design SAbDab: the structural antibody database Convolutional networks on 
graphs for learning molecular fingerprints Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning The synthesizability of molecules proposed by generative models Deep learning in protein structural modeling and design A data-driven approach to predicting successes and failures of clinical trials Automatic chemical design using a data-driven continuous representation of molecules Learning to navigate the synthetically accessible chemical space using reinforcement learning Accelerating high-throughput virtual screening through molecular pool-based active learning In search of lost domain generalization Network medicine framework for identifying drug repurposing opportunities for COVID Chemml: A machine learning and informatics program package for the analysis, mining, and modeling of chemical and materials data The potential for microRNA therapeutics and clinical research A new summarization method for affymetrix probe level data Adme evaluation in drug discovery. . prediction of oral absorption by correlation and classification Open Graph Benchmark: Datasets for machine learning on graphs Strategies for pre-training graph neural networks DeepPurpose: A deep learning library for drug-target interaction prediction SkipGNN: predicting molecular interactions with skip-graph networks Principles of early drug discovery Zinc: a free tool to discover chemistry for biology A graph-based genetic algorithm and generative model/monte carlo tree search for the exploration of chemical space Improved methods for predicting peptide binding affinity to MHC class II molecules BepiPred-. 
: improving sequence-based B-cell epitope prediction using conformational epitopes Multi-objective molecule generation using interpretable substructures Predicting organic reaction outcomes with weisfeiler-lehman network Learning multimodal graph-to-graph translation for molecular optimization High accuracy protein structure prediction using deep learning' , Fourteenth Critical Assessment of Techniques for Protein Structure Prediction Integrative omics for health and disease' Managing the drug discovery/development interface Semi-supervised classification with graph convolutional networks Docking and scoring in virtual screening for drug discovery: methods and applications Lessons learned in empirical scoring with smina from the CSAR benchmarking exercise Wilds: A benchmark of in-the-wild distribution shifts Chembo: Bayesian optimization of small organic molecules with synthesizable recommendations OpenChem: A deep learning toolkit for computational chemistry and drug design miRBase: from microRNA sequences to function The application of discovery toxicology and pathology towards the design of safer pharmaceutical lead candidates Self-referencing embedded strings (SELFIES): A % robust molecular string representation Grammar variational autoencoder Computer-aided prediction of rodent carcinogenicity by PASS and CISOC-PSCT RDKit: A software suite for cheminformatics, computational chemistry, and predictive modeling Developability index: a rapid in silico tool for the screening of antibody aggregation propensity Incidence of adverse drug reactions in hospitalized patients: a meta-analysis of prospective studies Large dataset enables prediction of repair after CRISPR-Cas editing in primary T cells Domain generalization with adversarial feature learning Parapred: antibody paratope prediction using convolutional and recurrent neural networks The principles of engineering immune cells to treat cancer Clinical pharmacology: plasma protein binding of drugs Retrosynthetic 
reaction prediction using neural sequence-to-sequence models RetroGNN: Approximating retrosynthesis by graph neural networks for de novo drug design BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities In silico prediction of volume of distribution in humans. extensive data set and the exploration of linear and nonlinear methods coupled with molecular interaction fields descriptors Chemical reactions from us patents ( -sep ). figshare A reference map of the human binary protein interactome Ultra-large library docking for discovering new chemotypes Prediction models of human plasma protein binding rate and oral bioavailability derived by using ga-cg-svm method A bayesian approach to in silico blood-brain barrier penetration modeling DeepTox: toxicity prediction using deep learning Basic review of the cytochrome p system ChEMBL: towards direct deposition of bioassay data Machine learning of molecular electronic properties in chemical compound space PSOVina: The hybrid particle swarm optimization algorithm for protein-ligand docking NetMHCpan-. 
: improved prediction of binding to MHC class I molecules integrating information from multiple receptor and peptide length datasets Trend analysis of a database of intravenous pharmacokinetic parameters in humans for drug compounds Molecular de-novo design through deep reinforcement learning Deepdta: deep drug-target binding affinity prediction Learning explanations that are hard to vary The DisGeNET knowledge platform for disease genomics: update Molecular sets (MOSES): a benchmarking platform for molecular generation models DeepSynergy: predicting anti-cancer drug synergy with deep learning Fréchet chemnet distance: a metric for generative models for molecules in drug discovery Drug repurposing: progress, challenges and recommendations Quantum chemistry structures and properties of kilo molecules Electronic spectra from TDDFT and machine learning in chemical space Deep learning for the life sciences: applying deep learning to genomics, microscopy, drug discovery, and more Evaluating protein transfer learning with tape Five computational developability guidelines for therapeutic antibody profiling CellMiner: a web-based suite of genomic and pharmacologic tools to explore transcript and drug patterns in the nci-cell line set Extended-connectivity fingerprints Enumeration of billion organic small molecules in the chemical universe database GDB-' Molecule edit graph attention network: Modeling chemical reactions as sequences of graph edits Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization The Caco-cell line as a model of the intestinal barrier: influence of cell and culture-related factors on Caco-cell functional characteristics Drug solubility: importance and enhancement techniques Development of a novel fingerprint for chemical reactions and its application to large-scale reaction classification and similarity Molecular transformer: A model for uncertainty-calibrated chemical reaction 
prediction Prediction of chemical reaction yields using deep learning Generating focused molecule libraries for drug discovery with recurrent neural networks' Inhibition of Wnt/β-catenin signaling downregulates P-glycoprotein and reverses multi-drug resistance of cholangiocarcinoma Time-split cross-validation as a method for estimating the goodness of prospective prediction Volume and distribution of blood and their significance in regulating the circulation Aqsoldb, a curated reference set of aqueous solubility and d descriptors for a diverse set of compounds Using a genetic algorithm to find molecules with good docking scores A deep learning approach to antibiotic discovery Deep coral: Correlation alignment for deep domain adaptation ExCAPE-DB: an integrated large scale dataset facilitating big data analysis in chemogenomics STRING v : protein-protein interaction networks Making sense of large-scale kinase inhibitor bioactivity data sets: a comparative and integrative analysis Data-driven prediction of drug effects and interactions Pharmacogenomics of CYP D : molecular genetics, interethnic differences and clinical importance Systematic review of the incidence and characteristics of preventable adverse drug events in ambulatory care In vitro screening of a FDA approved chemical library reveals potential inhibitors of SARS-CoVreplication Bioavailability and its assessment Plasma clearance AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading THPdb: database of FDA-approved peptide and protein therapeutics ADMET in silico modelling: towards prediction paradise?' 
DNA repair profiling reveals nonrandom outcomes at Cas -mediated breaks An overview of statistical learning theory Comprehensive characterization of cytochrome p isozyme selectivity across chemical libraries The immune epitope database (IEDB): update' SuperGLUE: A stickier benchmark for general-purpose language understanding systems Adme properties evaluation in drug discovery: prediction of caco-cell permeability using a combination of nsga-ii and boosting Admet evaluation in drug discovery. . predicting herg blockers by combining multiple pharmacophores and machine learning approaches Therapeutic target database : enriched resource for facilitating research and early development of targeted therapeutics Lipophilicity in drug discovery Prediction of human intestinal absorption of drug compounds from molecular structure DrugBank . : a major update to the DrugBank database for Moleculenet: a benchmark for molecular machine learning MARS: Markov molecular sampling for multi-objective drug discovery Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism In silico prediction of chemical ames mutagenicity How powerful are graph neural networks? 
Deep learning for drug-induced liver injury Analyzing learned molecular representations for property prediction Genomics of drug sensitivity in cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells Graph convolutional policy network for goal-directed molecular graph generation DrugComb: an integrative cancer drug combination data portal Prediction of higher-selectivity catalysts by computer-driven workflow and machine learning Cytochrome P enzymes in drug metabolism: regulation of gene expression, enzyme activities, and impact of genetic variation Predicting retrosynthetic reactions using self-corrected transformer neural networks' Optimization of molecules via deep reinforcement learning Quantitative structure-activity relationship modeling of rat acute toxicity by oral exposure Modeling polypharmacy side effects with graph convolutional networks Gene prioritization by compressive data fusion and chaining BioSNAP Datasets: Stanford biomedical network dataset collection