Accurate prediction of molecular targets using a self-supervised image representation learning framework

Authors: Xiangxiang Zeng, Hongxin Xiang, Linhui Yu, Jianmin Wang, Kenli Li, Ruth Nussinov, Feixiong Cheng
Date: 2022-04-07 | Journal: Res Sq | DOI: 10.21203/rs.3.rs-1477870/v1

Abstract: The clinical efficacy and safety of a drug are determined by its molecular targets in the human proteome. However, proteome-wide evaluation of all compounds in humans, or even in animal models, is challenging. In this study, we present an unsupervised pre-training deep learning framework, termed ImageMol, trained on 8.5 million unlabeled drug-like molecules, to predict the molecular targets of candidate compounds. The ImageMol framework is designed to pretrain chemical representations from unlabeled molecular images based on the local- and global-structural characteristics of molecules at the pixel level. We demonstrate the high performance of ImageMol in the evaluation of molecular properties (i.e., drug metabolism, brain penetration and toxicity) and molecular target profiles (i.e., human immunodeficiency virus) across 10 benchmark datasets. ImageMol shows high accuracy in identifying anti-SARS-CoV-2 molecules across 13 high-throughput experimental datasets from the National Center for Advancing Translational Sciences (NCATS), and we re-prioritized candidate clinical 3CL inhibitors for potential treatment of COVID-19. In summary, ImageMol is an active self-supervised image processing-based strategy that offers a powerful toolbox for computational drug discovery in a variety of human diseases, including COVID-19.

Despite recent advances in biomedical research and technologies, drug discovery and development remains a challenging multidimensional task requiring optimization of the vital properties of candidate compounds, including pharmacokinetics, efficacy and safety [1, 2].
It was estimated that pharmaceutical companies spent $2.6 billion in 2015, up from $802 million in 2003, to bring a drug to approval by the U.S. Food and Drug Administration (FDA) [3]. The increasing cost of drug development results from the lack of efficacy in randomized controlled trials and the unknown pharmacokinetics and safety profiles of candidate compounds [4-6]. Traditional experimental approaches are unfeasible for proteome-wide evaluation of the molecular targets of all candidate compounds in humans, or even in animal models. Computational approaches and technologies have been considered a promising solution [7, 8], as they can significantly reduce costs and time across the entire drug discovery and development pipeline. The rise of advanced Artificial Intelligence (AI) technologies [9, 10] has motivated their application to drug design [11-13] and target identification [14-16]. One of the fundamental challenges is how to learn molecular representations from chemical structures [17]. Previous molecular representations were based on hand-crafted features, such as fingerprint-based features [16, 18], physicochemical descriptors and pharmacophore-based features [19, 20]. However, traditional molecular representation methods, including sequence-based [21, 22] and graph-based [23, 24] approaches, rely on a large amount of domain knowledge, and their accuracy in extracting informative vectors that describe the molecular identities and biological characteristics of molecules is limited. Recent advances in unsupervised learning in computer vision [25, 26] suggest that it is possible to apply unsupervised image-based pre-training models to computational drug discovery. In this study, we present an unsupervised molecular image pre-training framework (termed ImageMol) with chemical awareness for learning molecular structure from large-scale molecular images.
ImageMol combines an image processing framework with comprehensive molecular chemistry knowledge to extract fine pixel-level molecular features in a visual computing manner. Compared with state-of-the-art methods, ImageMol offers two significant improvements: (1) it utilizes molecular images as the feature representation of compounds, with high accuracy and low computing cost; and (2) it exploits an unsupervised pre-training framework to capture the structural information of molecular images from 8.5 million drug-like compounds with diverse biological activities at the human proteome scale (Figure 1). We demonstrated the high accuracy of ImageMol in a variety of drug discovery tasks. Via ImageMol, we identified anti-SARS-CoV-2 molecules across 13 high-throughput experimental datasets from the National Center for Advancing Translational Sciences (NCATS). In summary, ImageMol provides a powerful pre-training deep learning framework for computational drug discovery.

Here, we developed a pre-training deep learning framework, ImageMol, for accurate prediction of molecular targets. ImageMol was pre-trained on 8,506,205 molecular images from two large drug-like databases (ChEMBL [27] and ZINC [28]), and we assembled five pretext tasks to extract biologically relevant structural information. The framework has three components: 1) a molecular encoder designed to extract latent features from 8.5 million molecular images (Fig. 1a); 2) five pre-training strategies (Supplementary Figures 1-5) utilized to optimize the latent representation of the molecular encoder by considering chemical knowledge and structural information from molecular images (Fig. 1b); and 3) a pretrained molecular encoder that is fine-tuned on downstream tasks to further improve model performance (Fig. 1c). In addition, two pretext tasks (the multi-granularity chemical clusters classification task and the molecular rationality discrimination task; cf.
Methods) are further designed to ensure that ImageMol properly captures meaningful chemical information from images (Supplementary Figures 1-2). We next evaluated the performance of ImageMol in a variety of drug discovery tasks, including evaluation of drug metabolism, brain penetration, toxicity profiles, and molecular target profiles across the human immunodeficiency virus (HIV), SARS-CoV-2, and Alzheimer's disease.

We first evaluated the performance of ImageMol using four types of benchmark datasets (AUC = 0.809). In the stratified split, the proportion of each class in the training, validation, and test sets is the same as in the original dataset. In the scaffold split, the datasets are divided according to molecular substructure: the substructures in the training, validation, and test sets are disjoint, making this split ideal for testing the robustness and generalizability of in silico models. For a fair comparison in Fig. 2b, we used the same experimental setup as Chemception [29], a state-of-the-art convolutional neural network (CNN) framework. ImageMol achieves higher AUC values on HIV (AUC = 0.821) and Tox21 (AUC = 0.824), suggesting that ImageMol can capture more biologically relevant information from molecular images than the CNN. We further evaluated the performance of ImageMol in predicting drug metabolism across five major metabolizing enzymes: CYP1A2, CYP2C9, CYP2C19, CYP2D6 and CYP3A4 (cf. Methods). Figure 2c shows that ImageMol achieves higher AUC values (ranging from 0.802 to 0.892) in the prediction of inhibitors vs. non-inhibitors across the five major drug-metabolizing enzymes compared with two state-of-the-art molecular image-based representation models: ADMET-CNN [30] and QSAR-CNN [31]. Additional detailed comparisons are provided in Supplementary Figures 6-7. We further compared the performance of ImageMol with three types of state-of-the-art molecular representation models: 1) fingerprint-based, 2) sequence-based, and 3) graph-based models.
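The stratified split described above can be sketched as follows. This is a minimal illustration, not the paper's code: a scaffold split would additionally require a chemistry toolkit such as RDKit to compute molecular scaffolds, so only the stratified case is shown, and function and variable names are illustrative.

```python
import numpy as np

def stratified_split(labels, frac_train=0.8, frac_valid=0.1, seed=0):
    """Split indices so each class keeps the same proportion in every subset."""
    rng = np.random.default_rng(seed)
    train, valid, test = [], [], []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        n_tr = int(frac_train * len(idx))
        n_va = int(frac_valid * len(idx))
        train.extend(idx[:n_tr])
        valid.extend(idx[n_tr:n_tr + n_va])
        test.extend(idx[n_tr + n_va:])
    return np.array(train), np.array(valid), np.array(test)

# toy labels: 80 inactive (0), 20 active (1)
labels = np.array([0] * 80 + [1] * 20)
tr, va, te = stratified_split(labels)
print(labels[tr].mean(), labels[va].mean(), labels[te].mean())  # ~0.2 in each subset
```

Each subset preserves the 20% positive rate of the original dataset, which is exactly the property the stratified split guarantees and the scaffold split deliberately gives up.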
As shown in Fig. 2d, ImageMol outperforms two sequence-based models (SMILES Transformer [22] and the Recurrent Neural Network-based Sequence-to-Sequence model (RNNS2S) [32]) across all four benchmark biomedical datasets with a stratified split. ImageMol also performs better (Fig. 2e) than three sequence-based models (ChemBERTa [21], SMILES Transformer and Mol2Vec [33]) and two graph-based models (Jure's GNN [23] and N-GRAM [34]) under a scaffold split. In addition, we found that ImageMol achieved higher AUC values (Fig. 2f) than traditional MACCS-based and FP4-based methods [35] across multiple machine learning algorithms, including support vector machine, decision tree, k-nearest neighbors, naive Bayes (NB), and their ensemble models [35] (Supplementary Table 3).

The ongoing global COVID-19 pandemic, caused by the SARS-CoV-2 virus, has led to more than 1.1 billion confirmed cases and over 6 million deaths worldwide as of March 15, 2022. There is a critical, time-sensitive need to develop effective anti-viral treatment strategies for the COVID-19 pandemic [6, 36]. We therefore tested the ability of ImageMol to identify potential anti-SARS-CoV-2 treatments across a variety of SARS-CoV-2 biological assays, including viral replication, viral entry, counterscreen, in vitro infectivity, and live virus infectivity assays [20].
In total, we evaluated ImageMol across 13 SARS-CoV-2 datasets, including 3C-like (3CL) protease enzymatic activity, angiotensin-converting enzyme 2 (ACE2) enzymatic activity, human embryonic kidney 293 (HEK293) cell line toxicity, human fibroblast toxicity (Hcytox), Middle East respiratory syndrome pseudotyped particle entry (MERS-PPE) and its Huh7 toxicity counterscreen (MERS-PPE_cs), SARS-CoV PPE (CoV-PPE) and its Vero E6 toxicity counterscreen (CoV-PPE_cs), SARS-CoV-2 cytopathic effect (CPE) and its host toxicity counterscreen (Cytotox), Spike-ACE2 protein-protein interaction (AlphaLISA) and its TruHit counterscreen, and transmembrane protease serine 2 (TMPRSS2) enzymatic activity (Supplementary Table 6). Across the 13 SARS-CoV-2 targets, ImageMol achieves high AUC values ranging from 72.0% to 82.6% (Fig. 3a). To test whether ImageMol captures biologically relevant features, we used the global average pooling (GAP) layer of ImageMol to extract the latent features of each dataset and used t-SNE to visualize them. Fig. 3b reveals that the latent features identified by ImageMol cluster well according to whether compounds are active or inactive anti-SARS-CoV-2 agents across 8 targets or endpoints. These observations show that ImageMol can accurately extract discriminative antiviral features from molecular images for downstream tasks. We further compared ImageMol with both deep learning and machine learning frameworks: 1) a graph neural network (GNN) with a series of pre-training strategies (termed Jure's GNN [23]), and 2) REDIAL-2020 [20], a suite of machine learning models for estimating small-molecule activities in a range of SARS-CoV-2-related assays. We found that ImageMol significantly outperforms Jure's GNN models across all 13 SARS-CoV-2 targets (Fig. 3a and Supplementary Table 7). For instance, ImageMol elevates the AUC by over 12% (AUC = 0.824) compared to Jure's GNN model (AUC = 0.704) in the prediction of 3CL protease inhibitors.
We further evaluated the area under the precision-recall curve (AUPR), a metric that is highly sensitive to imbalance between positive and negative labels. Compared to Jure's GNN models, the AUPR improvement of ImageMol ranges from 3.0% to 29.1%, with an average performance advantage of 8.5% across the 13 SARS-CoV-2 targets, in particular for 3CL protease inhibitors (29.1% AUPR improvement) and ACE2 enzymatic activity (26.9% AUPR improvement). To compare with REDIAL-2020 [20], we used the same experimental settings and performance evaluation metrics, including accuracy, sensitivity, precision, F1 (the harmonic mean of sensitivity and precision) and AUC. We found that ImageMol outperformed REDIAL-2020 as well (Supplementary Table 8). In summary, these comprehensive evaluations reveal the high accuracy of ImageMol in identifying anti-SARS-CoV-2 molecules across diverse viral targets and phenotypic assays. Furthermore, ImageMol is more capable on datasets with an extreme imbalance of positive and negative samples than traditional deep learning pre-trained models [23] or machine learning approaches [20].

We next turned to identifying potential anti-SARS-CoV-2 inhibitors, using the 3CL protease as a prototypical example, as it has been shown to be a promising target for therapeutic development in treating COVID-19 [37, 38]. We focused on 2,501 U.S. FDA-approved drugs from DrugBank [39] to identify ImageMol-predicted 3CL protease inhibitors as repurposable drugs for COVID-19, using a drug repurposing strategy [36]. Via the molecular image representation of the 3CL protease inhibitor vs. non-inhibitor dataset under the ImageMol framework, we found that 3CL inhibitors and non-inhibitors are well separated in a t-distributed Stochastic Neighbor Embedding (t-SNE) plot (Fig. 3c). Molecules with a half-maximal activity concentration (AC50) less than 10 μM were defined as inhibitors; otherwise, they were defined as non-inhibitors.
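AUPR can be computed as average precision; the toy example below illustrates why it is far more sensitive than AUC to rare positives. This is a standard textbook sketch, not the paper's evaluation code.

```python
import numpy as np

def average_precision(y_true, y_score):
    """AUPR as average precision: precision at each positive, weighted by the recall step."""
    order = np.argsort(-np.asarray(y_score))
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)
    precision = tp / np.arange(1, len(y) + 1)
    recall = tp / y.sum()
    steps = np.diff(np.concatenate(([0.0], recall)))   # recall increments
    return float(np.sum(precision * steps))

# 1 positive among 10: a perfect ranker vs. one that ranks the positive last
y = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(average_precision(y, np.linspace(1, 0.1, 10)))   # 1.0
print(average_precision(y, np.linspace(0.1, 1, 10)))   # 0.1
```

With a single positive among ten compounds, AUPR collapses from 1.0 to 0.1 when the positive is ranked last, which is why it is the metric of choice for the highly imbalanced SARS-CoV-2 assay datasets.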
We computed the probability of each drug in DrugBank being inferred as a 3CL protease inhibitor (Supplementary Table 9) and visualized the overall probability distribution (Supplementary Figure 8). We found that 12 of the top 20 drugs (60%) have been validated (by cell assays, clinical trials, etc.) as potential SARS-CoV-2 inhibitors (Supplementary Table 9), among which 3 drugs have been further verified as potential 3CL protease inhibitors by biological experiments (Fig. 3d). To test the generalization ability of ImageMol, we used 10 experimentally reported 3CL protease inhibitors as an external validation set (Supplementary Table 10). ImageMol identified 6 of the 10 known 3CL protease inhibitors (a 60% success rate, Fig. 3e), suggesting high generalizability in anti-SARS-CoV-2 drug discovery. We further used the HEK293 assay to predict anti-SARS-CoV-2 repurposable drugs. We collected experimental evidence for the top 20 drugs as potential SARS-CoV-2 inhibitors (Supplementary Table 11) and found that 13 of the 20 drugs (65%) have been validated as potential inhibitors for the treatment of SARS-CoV-2 by different experimental assays (such as in vitro cellular assays and clinical trials; Supplementary Table 11). Meanwhile, 122 drugs have previously been identified to block SARS-CoV-2 infection [40]. From these, we selected 70 small molecules that overlap with DrugBank to evaluate the performance of the HEK293 model. ImageMol successfully predicted 47 of the 70 (a 67.1% success rate, Supplementary Table 12), suggesting high generalizability of ImageMol for inferring potential candidate drugs in the HEK293 assay as well.

We next used t-SNE to visualize the molecular representations from different models to test the biological interpretability of ImageMol. We used the clusters identified by the multi-granularity chemical clusters classification (MG3C) task (cf. Methods) to split the molecular structures.
We randomly selected 10% of the clusters obtained from MG3C and sampled 1,000 molecules from each cluster. We performed three comparisons for each molecule: a) MACCS fingerprints with 166-dimensional (166D) features, b) ImageMol without pre-training, with 512D features, and c) pre-trained ImageMol with 512D features. We found that ImageMol distinguishes molecular structures very well (Supplementary Figure 9b). ImageMol can capture prior knowledge of chemical information from molecular image representations, including =O bonds, -OH groups, -NH3 groups and benzene rings (Fig. 4a). We further used the Davies-Bouldin (DB) index [34] to quantitatively evaluate the clustering results; a smaller DB index indicates better performance. We found that ImageMol (DB index = 1.98) was better than the MACCS fingerprint (DB index = 2.13); furthermore, pre-training significantly improves the molecular representation (the non-pretrained model has a DB index of 18.48). Gradient-weighted Class Activation Mapping (Grad-CAM) [41] is a commonly used convolutional neural network (CNN) visualization method [42, 43]. Figures 4b and 4c illustrate 12 example molecules with Grad-CAM visualizations from ImageMol (cf. Supplementary Figures 10 and 11). ImageMol accurately attends to global (Fig. 4b) and local (Fig. 4c) structural information simultaneously. In addition, we computed the proportion of blank area relative to the entire molecular image across all 13 SARS-CoV-2 datasets (Supplementary Table 13). We found an average sparsity (the proportion of blank area in an image) of 94.9% across the entire dataset, suggesting that image-based models could easily be misled into using blank areas of the image for meaningless inferences [31]. Figure 4d shows that ImageMol primarily pays attention to the middle area of the image during prediction. Thus, ImageMol indeed makes predictions based on molecular structures rather than on meaningless blank areas.
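Grad-CAM itself reduces to a few lines: the channel weights are the spatially averaged gradients of the class score, and the heatmap is the ReLU of the weighted sum of the feature maps. A sketch on toy arrays follows; in real use, the activations and gradients would come from the last convolutional layer of ImageMol's ResNet18 encoder rather than from a random generator.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap: ReLU of the gradient-weighted sum of feature maps.

    activations, gradients: arrays of shape (K, H, W) taken from the last conv layer.
    """
    weights = gradients.mean(axis=(1, 2))                 # GAP over spatial dims -> (K,)
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)
    if cam.max() > 0:
        cam = cam / cam.max()                             # normalize to [0, 1]
    return cam

rng = np.random.default_rng(0)
A = rng.random((8, 7, 7))   # toy activations of 8 feature maps
G = rng.random((8, 7, 7))   # toy gradients of the class score w.r.t. the maps
heat = grad_cam(A, G)
print(heat.shape)           # (7, 7)
```

Upsampling the 7 × 7 heatmap back to the input resolution and overlaying it on the molecular image yields visualizations like those in Figs. 4b-4d, making it possible to check whether attention falls on molecular structure or on blank background.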
We further calculated coarse-grained and fine-grained hit rates (Supplementary Figure 12). The coarse-grained hit rate shows that ImageMol can utilize the molecular structures of all images for inference, with a ratio of 100%, compared to 90.7% for the QSAR-CNN models [31]. The fine-grained hit rate shows that ImageMol can leverage almost all structural information in molecular images for inference, with a ratio of over 99%, reflecting its ability to capture the global information of molecules. In summary, ImageMol captures the biologically relevant chemical information of molecular images at both local and global levels of structural information, outperforming existing state-of-the-art deep learning approaches (Fig. 4).

The robustness of a model to hyperparameter tuning is important because different parameter initializations can affect model performance [44]. Here, we explored the impact of the pre-training strategies on the hyperparameter tuning of ImageMol. As shown in Supplementary Tables 4-5, ImageMol is more robust than ImageMol_NonPretrained, with an average performance variance of 1.2% versus 2.4%. Therefore, the pre-training strategies improve the robustness of ImageMol to initialization parameters. To explore the impact of pre-training at different data scales, we pre-trained ImageMol with 0 million (no pre-training), 0.2 million, 0.6 million, 1 million, and 8.5 million drug-like compounds, respectively, and then evaluated the resulting models. We found that the average ROC-AUC performance of 0 million (75.7%), 0.2 million (76.9%), 0.6 million (81.6%), 1 million (83.8%) and 8.5 million (85.9%) increased by 1.2% to 10.2% as the pre-training data size increased. Thus, ImageMol can be further improved as more drug-like molecules are used for pre-training.
We further investigated the impact of the different pretext tasks: multi-granularity chemical clusters classification (MG3C), jigsaw puzzle prediction (JPP), and MASK-based contrastive learning (MCL) (cf. Methods). We found that each pretext task improves the mean AUC value of ImageMol by 0.7% to 4.9%: no pretext task (75.7%), JPP (78.8%), MG3C (80.6%) and MCL (76.4%) (Supplementary Figure 14). The best performance was achieved by assembling all three pretext tasks for pre-training (AUC = 85.9%, Supplementary Figure 14). In summary, the pretext tasks integrated into the ImageMol framework synergistically improve performance, and the models can be improved further by hyperparameter tuning and by pre-training on larger drug-like chemical datasets in the future.

We presented a self-supervised image processing-based pre-training deep learning framework that combines molecular images and unsupervised learning to learn molecular representations. We demonstrated the high accuracy of ImageMol across multiple benchmark biomedical datasets in a variety of drug discovery tasks (Figs. 2 and 3). In particular, we identified candidate anti-SARS-CoV-2 agents, which were validated by ongoing clinical and experimental data across 13 biological anti-SARS-CoV-2 assays. If broadly applied, our pre-training deep learning framework will offer a powerful tool for rapid drug discovery and development for emerging diseases, including the COVID-19 pandemic and future pandemics. We highlight several improvements of ImageMol over other state-of-the-art methods. First, ImageMol achieved high performance across diverse drug discovery tasks, including drug-like property assessment (brain permeability, drug metabolism and toxicity) and molecular target prediction across diverse targets, such as Alzheimer's disease (i.e., BACE) and emerging infectious diseases caused by HIV and the SARS-CoV-2 virus.
Furthermore, ImageMol outperforms state-of-the-art methods, including traditional deep learning and machine learning models (Fig. 2a-2c). Second, we showed that our self-supervised image-based representation approach also outperformed traditional fingerprint-based and graph-based representation methods (Fig. 2d-2f). Finally, ImageMol has better interpretability and is more intuitive in identifying biologically relevant chemical structures or substructures for molecular properties and target binding (Figs. 4a-4c). Via ablation analysis, we showed that the pre-training process using 8.5 million drug-like molecules significantly improved model performance compared to models without pre-training. Thus, integrating additional chemical knowledge (such as atomic properties and 3D structural information) into each image or pixel area may further improve the performance of ImageMol. We found that the five pre-training tasks are well compatible and jointly improve model performance. We acknowledge several limitations of the current study. Although we mitigated the effects of different representations of molecular images through data augmentation, perturbed views (i.e., rotation and scaling) of the input images may still affect the prediction results of ImageMol. We did not optimize for the sparsity of molecular images, which may affect the latent features extracted by the model. It is challenging to explicitly define the chemical properties of atoms and bonds compared to graph-based methods [23, 34], which inevitably leads to insufficient chemical information. Several potential directions may improve ImageMol further: (1) integrating larger-scale biomedical data and larger-capacity models (such as ViT [45]) for molecular images will be a focus of future work; (2) multi-view learning that jointly uses images and other representations (e.g.
SMILES and graphs) is an important research direction; and (3) introducing more chemical knowledge (including atomic properties, 3D information, etc.) into each image or pixel area is also worth studying. In summary, ImageMol is an active self-supervised image processing-based strategy that offers a powerful toolbox for computational drug discovery in a variety of human diseases, including COVID-19.

Pre-training aims to make a model learn how to extract expressive representations by training on large-scale unlabeled datasets; the well-pretrained model is then applied to related downstream tasks and fine-tuned to improve their performance. Pre-training requires defining several effective, task-related pretext tasks. In this paper, the core of our pre-training strategy is the visual representation of molecules under three principles: consistency, relevance, and rationality. These principles lead ImageMol to capture meaningful chemical knowledge and structural information from molecular images. Specifically, consistency means that the semantic information of the same chemical structure (such as -OH, =O, or benzene) is consistent across different images. Relevance means that different augmentations of the same image (such as mask and shuffle) are related in the feature space; for example, the feature distribution of the masked image should be close to that of the original image. Rationality means that the molecular structure must conform to chemical common sense; the model needs to recognize the rationality of a molecule in order to promote its understanding of molecular structure. Unlike graph-based and SMILES-based pre-training methods (which consider only consistency or only relevance), ImageMol is the first molecular image-based pre-training framework, and it considers multiple principles comprehensively through five effective pretext tasks.
Considering that the semantic information of the same chemical structure should be consistent across different images, we propose the Multi-Granularity Chemical Clusters Classification (MG3C) task (Supplementary Figure 1), which discovers semantic consistency by predicting the chemical structure of the molecule. Briefly, multi-granularity clustering is first used to assign multiple clusters of different granularities to each chemical structural fingerprint. Then, each cluster is assigned as a pseudo-label to the corresponding molecule, so each molecule has multiple pseudo-labels of different granularities. Finally, the molecular encoder is employed to extract the latent features of the molecular images, and a structural classifier is used to classify the pseudo-labels. Specifically, we employed the MACCS keys as the molecular fingerprint descriptor, a 166-length sequence composed of 0s and 1s. These molecular fingerprint sequences can be used as a basis for clustering: the closer the distance between molecular fingerprints, the more likely they are to be clustered together. Finally, we use K-means [46] with different numbers of clusters to generate the multi-granularity pseudo-labels.

Recently, the performance gap between unsupervised pre-training and supervised learning in computer vision has narrowed, notably owing to the achievements of contrastive learning methods [25, 26]. However, these methods typically rely on a large number of explicit pairwise feature comparisons, which is computationally challenging [48]. Furthermore, to maximize the feature extraction ability of the pre-training model, contrastive learning must select good feature pairs, which incurs a huge cost in computing resources. Therefore, to save computing resources and mine the fine-grained information in molecular images, we introduce a simple contrastive learning method for molecular images, namely MASK-based contrastive learning (Supplementary Figure 4).
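The multi-granularity pseudo-label construction for MG3C described above can be sketched as follows. This is a toy illustration: random bit vectors stand in for real MACCS fingerprints, the cluster counts are illustrative choices, and a minimal Lloyd's k-means replaces a library implementation.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal Lloyd's k-means; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # squared distances
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

# toy stand-ins for 166-bit MACCS fingerprints
rng = np.random.default_rng(1)
fps = rng.integers(0, 2, size=(200, 166))

# one pseudo-label per molecule per clustering granularity (cluster counts are illustrative)
granularities = [2, 5, 10]
pseudo_labels = {k: kmeans(fps, k) for k in granularities}
for k, lab in pseudo_labels.items():
    print(k, len(np.unique(lab)))
```

Each molecule ends up with one pseudo-label per granularity; during pre-training, a structural classifier on top of the encoder is then trained to predict these labels from the molecular image.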
We first use a 16 × 16 square area to randomly mask the molecular images (Supplementary Figure 16); the masked image is denoted x̃_i. Then, the masked molecular image x̃_i and the unmasked molecular image x_i are simultaneously input into the molecular encoder f(·) to extract the latent features f(x̃_i) and f(x_i). Finally, a cost function ℒ_MCL is introduced to enforce consistency between the latent features extracted from the molecular image before and after masking, formalized as:

ℒ_MCL = ‖f(x̃_i) − f(x_i)‖₂,

where ‖f(x̃_i) − f(x_i)‖₂ denotes the Euclidean distance between f(x̃_i) and f(x_i).

Inspired by human understanding of the world, we propose the rationality principle, which means that the structural information described by a molecular image must conform to chemical common sense. We rearranged the original images to construct irrational molecular images and designed two pre-training tasks around them. The disordered images are viewed as irrational samples x̂_i. Subsequently, the original ordered image x_i and the shuffled image x̂_i are forward-propagated through the molecular encoder to extract the latent features f(x_i) and f(x̂_i), and these features are input into a rationality classifier to obtain the probability p(·) that a sample is rational. We define the cost function of the molecular rationality discrimination (MRD) task, ℒ_MRD, used to update ResNet18, as:

ℒ_MRD = −log p(f(x_i)) − log(1 − p(f(x̂_i))),

where the first and second terms represent the binary classification loss of the rational image and the irrational image, respectively; θ_MRD denotes the parameters of the rationality classifier, and the real label y_i^MRD consists of 0 (irrational) and 1 (rational). Compared with MRD, JPP provides a more fine-grained prediction to discover the invariance and regularity of molecular images (Supplementary Figure 3), and is widely used in computer vision [49].
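The MASK-based contrastive objective ℒ_MCL described above can be sketched in a few lines. This is a toy illustration: channel-wise average pooling stands in for the ResNet18 encoder, and zero-filling the masked square is an assumption about how the mask is applied.

```python
import numpy as np

def encode(img):
    """Stand-in for the molecular encoder f(.): channel-wise global average pooling."""
    return img.mean(axis=(1, 2))

def mcl_loss(img, rng, patch=16):
    """L_MCL: Euclidean distance between features of the image before and after
    masking a random 16x16 square."""
    c, h, w = img.shape
    y, x = rng.integers(0, h - patch), rng.integers(0, w - patch)
    masked = img.copy()
    masked[:, y:y + patch, x:x + patch] = 0.0   # zero-fill mask (assumption)
    return float(np.linalg.norm(encode(masked) - encode(img)))

rng = np.random.default_rng(0)
img = rng.random((3, 224, 224))                 # toy molecular image
loss = mcl_loss(img, rng)
print(loss >= 0.0)                              # True; identical features give 0
```

Minimizing this distance pushes the encoder to produce the same representation whether or not a small region is hidden, which is the relevance principle in its simplest form.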
Solving jigsaw puzzles on molecular images can help the model pay attention to more global structural information and learn concepts of spatial rationality, improving the generalization of the pre-training model. In this task, using the maximal Hamming distance algorithm of [50], we assign an index (ranging from 0 to 99) to each permutation of patch numbers, which is used as the classification label. The cost function of jigsaw puzzle prediction (JPP), ℒ_JPP, is the sum of the permutation-classification losses of the original ordered image and the shuffled image, where θ_JPP denotes the parameters of the jigsaw classifier.

In pre-training, we used two large-scale datasets (ZINC and ChEMBL) and ResNet18 as our molecular encoder. After applying data augmentations to obtain the molecular images x_i, we forward them to the ResNet18 model to extract the latent features f(x_i). These latent features are then used by the five pretext tasks to compute the total cost function ℒ_total, defined as the sum of the five pretext-task losses. After completing pre-training, we fine-tune the pre-trained ResNet18 on downstream tasks. Clearly, model performance could be further improved by designing a more complex fine-tuning procedure for the pre-trained model; however, fine-tuning is not the research focus of this paper, so we only use a simple and common fine-tuning method to adapt the model to different downstream tasks. In detail, we only add an additional fully connected layer. Since the data in the downstream tasks suffer from class imbalance, we also added class weights to the cross-entropy loss. We compared ImageMol with sequence-based pre-training methods (ChemBERTa [21], SMILES Transformer [22], RNNS2S [32] and Mol2Vec [33]), graph-based pre-training methods (Jure's GNN [23] and N-GRAM [34]) and a molecular image-based method (Chemception [29]). These recently proposed methods show competitive results and superior performance on molecular property prediction tasks.
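One simple form of the class-weighted cross-entropy used in fine-tuning can be sketched as follows. The inverse-frequency weighting is an assumption for illustration; the text above does not specify the exact weighting scheme.

```python
import numpy as np

def weighted_cross_entropy(logits, y, class_weights):
    """Cross-entropy where each sample is scaled by the weight of its true class,
    so errors on the rare class cost more."""
    z = logits - logits.max(axis=1, keepdims=True)                # stable softmax
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    w = class_weights[y]                                          # weight of each true class
    return float(-(w * log_p[np.arange(len(y)), y]).mean())

# toy binary task: 90 negatives, 10 positives -> upweight the rare class
y = np.array([0] * 90 + [1] * 10)
logits = np.zeros((100, 2))                                       # uninformative predictions
weights = len(y) / (2 * np.bincount(y))                           # inverse-frequency weights
loss = weighted_cross_entropy(logits, y, weights)
print(round(loss, 4))                                             # 0.6931 (= ln 2)
```

With inverse-frequency weights, a misclassified positive contributes nine times more to the gradient than a misclassified negative, counteracting the heavy class imbalance of the downstream assay datasets.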
Therefore, we selected these representative methods for comparison. Among the sequence-based pre-training methods, ChemBERTa is based on RoBERTa [52] with 12 attention heads and 6 layers and is pre-trained on 77M unique SMILES sequences from PubChem [53]; the SMILES Transformer builds an encoder-decoder network with 4 transformer [54] blocks, pre-trained with 861,000 unlabeled SMILES sequences randomly sampled from ChEMBL24 [27]; RNNS2S is based on sequence-to-sequence learning with GRU [55] cells and an attention mechanism, pre-trained using 334,092 valid molecular SMILES sequences from the LogP and PM2-full datasets; and Mol2Vec learns vector representations of molecular substructures that point in similar directions for chemically related substructures, pre-trained on 19.9 million compounds. Among the graph-based pre-training methods, Jure's GNN is a cutting-edge node-level self-supervised pre-training method that first transforms 2M SMILES sequences sampled from the ZINC15 database [28] into graph structures and uses different pre-training strategies to train Graph Isomorphism Networks (GINs) [56]; the N-GRAM method introduces the N-gram graph and learns a compact representation for each graph during pre-training.

Experimental setting. To compare ImageMol with Jure's GNN, we reproduced Jure's GNN using the public source code provided by the authors to extract molecular features, adding a fully connected layer for fine-tuning on downstream tasks. We uniformly split these datasets into an 80% training set and a 20% test set, and report AUC and AUPR results on the test set. We also compared our method with REDIAL-2020. For a fair comparison with REDIAL-2020, we used the same experimental configuration; see [20] for the detailed experimental settings.
Note that REDIAL-2020 provides a new data preprocessing method and divides the data into training, validation and test sets, so we directly use these divided datasets to perform our evaluation process (Supplementary Table 13).

The datasets used in this project can be found at the following links: blood-brain barrier penetration (BBBP): http://deepchem.io.s3-website-us-west-

Performance evaluation of ImageMol using the benchmark datasets. The performance was evaluated in a variety of drug discovery tasks, including molecular properties (i.e., drug's metabolism, toxicity, brain penetration) and molecular target profiles (i.e., human immunodeficiency virus (HIV) and beta-secretase (BACE)). The x-axis and y-axis represent False Positive Rate (FPR) and True Positive Rate (TPR) in a-c, respectively. (a) Receiver operating characteristic (ROC) curves of ImageMol across five datasets (blood-brain barrier penetration (BBBP), toxicology in the 21st century (Tox21), clinical trial toxicity (ClinTox), HIV and BACE) with stratified split and scaffold split. (b) ROC curves of Chemception [29] and ImageMol on the Tox21 and HIV datasets with the same experimental setup as Chemception, a classical convolutional neural network (CNN) for predicting molecular images. (c) ROC curves of ADMET-CNN [30], QSAR-CNN [31] and ImageMol on five CYP isoforms validation sets (PubChem Data Set II). ADMET-CNN and QSAR-CNN are the latest molecular image-based drug discovery models. (d) The ROC-AUC (%) performance of SMILES-based methods (SMILES Transformer [22], Recurrent Neural Network-based Sequence-to-Sequence (RNNS2S) [32]) and ImageMol on four datasets (BBBP, Tox21, HIV and BACE) with stratified split. (e) The ROC-AUC (%) performance of SMILES-based

Supplementary file associated with this preprint: SupplementaryInformationFeixiong.Cheng.pdf

References
1. Automating drug discovery
2. Challenges and recent progress in drug discovery for tropical diseases
3. The $2.6 billion pill — methodologic and policy considerations
4. The failure to fail smartly
5. The latest on drug failure and approval rates
6. Artificial intelligence in COVID-19 drug repurposing. The Lancet Digital Health
7. Towards the online computer-aided design of catalytic pockets
8. Computer-aided synthesis of dapsone-phytochemical conjugates against dapsone-resistant Mycobacterium leprae
9. Context encoders: Feature learning by inpainting
10. Deep learning for tomographic image reconstruction
11. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism
12. Molecular image-based convolutional neural network for the prediction of ADMET properties. Chemometrics and Intelligent Laboratory Systems
13. MoleculeNet: a benchmark for molecular machine learning
14. Compound-protein interaction prediction with end-to-end learning of neural networks for graphs and sequences
15. Predicting drug-protein interaction using quasi-visual question answering system
16. Graph neural representation learning for compound-protein interaction
17. Learn molecular representations from large-scale unlabeled molecules for drug discovery
18. DeepConv-DTI: Prediction of drug-target interactions via deep learning with convolution on protein sequences
19. Pharmacophore-based models for therapeutic drugs against phosphorylated tau in Alzheimer's disease. Drug Discovery Today
20. A machine learning platform to estimate anti-SARS-CoV-2 activities
21. ChemBERTa: Large-scale self-supervised pretraining for molecular property prediction
22. SMILES transformer: Pre-trained molecular fingerprint for low data drug discovery
23. Strategies for pre-training graph neural networks
24. Self-supervised graph transformer on large-scale molecular data. 2020
25. A simple framework for contrastive learning of visual representations
26. Momentum contrast for unsupervised visual representation learning
27. The ChEMBL database in 2017
28. ZINC 15: ligand discovery for everyone
29. Chemception: a deep neural network with minimal chemistry knowledge matches the performance of expert-developed QSAR/QSPR models
30. Molecular image-based convolutional neural network for the prediction of ADMET properties. Chemometrics and Intelligent Laboratory Systems
31. Molecular image-convolutional neural network (CNN) assisted QSAR models for predicting contaminant reactivity toward OH radicals: Transfer learning, data augmentation and model interpretation
32. Seq2seq fingerprint: An unsupervised deep molecular embedding for drug discovery
33. Mol2vec: unsupervised machine learning approach with chemical intuition
34. N-gram graph: Simple unsupervised representation for graphs, with applications to molecules
35. Classification of cytochrome P450 inhibitors and noninhibitors using combined classifiers
36. Deep learning for drug repurposing: Methods, databases, and applications. Wiley Interdisciplinary Reviews: Computational Molecular Science
37. Identification of SARS-CoV-2 3CL protease inhibitors by a quantitative high-throughput screening
38. Preclinical characterization of an intravenous coronavirus 3CL protease inhibitor for the potential treatment of COVID-19
39. DrugBank 5.0: a major update to the DrugBank database for 2018
40. Pyrimidine inhibitors synergize with nucleoside analogues to block SARS-CoV-2
41. Grad-CAM: Visual explanations from deep networks via gradient-based localization
42. Automated detection of COVID-19 cases using deep neural networks with X-ray images
43. JCS: An explainable COVID-19 diagnosis system by joint classification and segmentation
44. On the importance of initialization and momentum in deep learning
45. An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations
46. Billion-scale similarity search with GPUs
47. Deep residual learning for image recognition
48. Unsupervised learning of visual features by contrasting cluster assignments
49. Domain generalization by solving jigsaw puzzles
50. Unsupervised learning of visual representations by solving jigsaw puzzles
51. MoleculeNet: a benchmark for molecular machine learning
52. A robustly optimized BERT pretraining approach
53. PubChem 2019 update: improved access to chemical data
54. Attention is all you need. Advances in Neural Information Processing Systems
55. On the properties of neural machine translation: Encoder-decoder approaches
56. How powerful are graph neural networks
57. Densely connected convolutional networks