key: cord-0446696-2aqh9sqf authors: Mukherjee, Shrimon; Ghosh, Madhusudan; Basuchowdhuri, Partha title: Deep Graph Convolutional Network and LSTM based approach for predicting drug-target binding affinity date: 2022-01-18 journal: nan DOI: nan sha: e2875945543a5833552e5236b4448257b272ca74 doc_id: 446696 cord_uid: 2aqh9sqf

* Indian Association for the Cultivation of Science. † Equal contribution to this work.

Development of new drugs is an expensive and time-consuming process. Due to the world-wide SARS-CoV-2 outbreak, it is essential that new drugs for SARS-CoV-2 are developed as soon as possible. Drug repurposing techniques can reduce the time span needed to develop new drugs by probing the list of existing FDA-approved drugs and their properties to reuse them for combating the new disease. We propose a novel architecture, DeepGLSTM, a Graph Convolutional Network and LSTM based method that predicts binding affinity values between FDA-approved drugs and the viral proteins of SARS-CoV-2. Our proposed model has been trained on the Davis, KIBA (Kinase Inhibitor Bioactivity), DTC (Drug Target Commons), Metz, ToxCast and STITCH datasets. We use our novel architecture to predict a Combined Score (calculated using the Davis and KIBA scores) of 2,304 FDA-approved drugs against 5 viral proteins. On the basis of the Combined Score, we prepare a list of the top-18 drugs with the highest binding affinity for 5 viral proteins present in SARS-CoV-2. Subsequently, this list may be used for the creation of new useful drugs.

The discovery of drugs, by the traditional approach, is time-consuming and expensive. Generally, it costs billions of US dollars and takes about 10-15 years for a drug to be accepted or rejected by the FDA (US Food and Drug Administration) [1, 2]. The traditional approach of drug discovery goes through several phases of trials [1]. To reduce time and cost, we use already approved drugs and identify their use in combating SARS-CoV-2. This technique of discovering new applications of drugs is known as drug repurposing or repositioning [3]. In the traditional approach, a high-throughput screening experiment examines bioactivities between drugs and proteins, which makes it a costly and time-consuming process [4, 5]. This is an infeasible task, as there are millions of drug-like compounds [6] and hundreds of potential targets. As an example, 500 protein kinases [7] are responsible for the modification of about 30% of human proteins [8].

Proteins play an important role in drug-target interaction prediction. Proteins are the most important components for most of the functions within and outside human cells. The three-dimensional structure of a protein and its spatial orientation determine the role of a protein. Therefore, structural changes in a protein can significantly alter its functionality [52]. In drug discovery, many drugs are designed to bind specific proteins. Any structural change of a drug may thus alter its ability to bind with the proteins. As a result, an important problem in computational drug discovery is to predict whether a drug can bind with a specific protein or not. This problem is popularly known as drug-target interaction prediction, and has become a topic of significant research interest in recent years. We use drug-target interaction prediction in order to reuse FDA-approved drugs on new target proteins in the repurposing or repositioning [3] technique of drug discovery.
In DTI (drug-target interaction) prediction [9], the goal is to predict the binding affinity between FDA-approved drugs and new target proteins. Machine learning methods may be used for predicting the drug-target binding affinity. Drug-target interaction is measured in terms of binding affinity [9]. Greater interaction between a drug-target pair indicates a stronger readout for binding affinity. It is evaluated in terms of the inhibition constant (K_i), the dissociation constant (K_d), changes in free energy measures (ΔG, ΔH), the half-maximal inhibitory concentration (IC50) [10], the half-maximal activity concentration (AC50) [47], the KIBA score [20] and SCORES (used in the STITCH database) [46]. In [20], KIBA scores were calculated to optimize the consistency between K_i, K_d and IC50 by making use of the available statistical information. Here, we introduce C_b, a Combined Score (combining both pK_d and KIBA scores, as mentioned in Eqn. 3.13), to predict binding affinity between FDA-approved drugs and new targets for SARS-CoV-2.

Many statistical and machine learning models [11, 12, 13] are used in the repurposing or repositioning approach to drug discovery, i.e., to predict the DTI of FDA-approved drugs on new targets. In SimBoost [14], features are constructed from drugs, targets and drug-target pairs. These features are supplied into a supervised learning method named gradient boosting regression trees, derived from the gradient boosting machine learning model. In the KronRLS [15, 16] approach, kernels for drugs and targets are built from their molecular descriptors and fed into a regularized least squares regression (RLS) model to predict the binding affinity. In recent years, DTI prediction has shifted from machine learning models to deep learning models [9]. Deep learning models predict better than the above-mentioned machine learning models. Some of the deep learning models proposed earlier for drug repurposing are DeepDTA [9], WideDTA [17], MT-DTI [28], DeepCPI [26], GANsDTA [24], AttentionDTA [25], 1-D CNN [31], DeepGS [27] and GraphDTA [18]. In DeepDTA [9], drugs (provided in SMILES notation) and targets are represented as sequences of characters. In WideDTA [17], drugs and proteins are represented as words instead of characters as in DeepDTA [9]. Similarly, other deep learning based models such as DeepGS [27], DeepCPI [26], MT-DTI [28], AttentionDTA [25] and 1-D CNN [31] are also used for drug-target interaction (DTI) prediction.

Earlier Graph Neural Network (GNN) based works [27, 18] did not capture proper topological information of the required graph (i.e., generated from the SMILES notation of a drug), since they provided only a simple adjacency representation as input to the GNN module. Moreover, [27, 18] used a 1-D CNN to encode protein sequence information. It is well-known from the existing literature that LSTM performs better than 1-D CNN at capturing sequential information [39] and therefore has been used in our model. Our proposed model DeepGLSTM introduces a novel Graph Convolution Network (GCN) block that passes the drug compound information using a power graph representation to capture graph topological information, achieving state-of-the-art (SOTA) results in the neural drug repurposing domain. Our work makes the following contributions:

• DeepGLSTM introduces a Graph Convolution Network (GCN) block to process the drug compounds using a power graph representation and a bidirectional LSTM (Bi-LSTM) layer to process the protein sequence.
The experimental results show that our model outperforms the previous state-of-the-art results for the affinity prediction task.

• We train our model on multiple benchmark datasets: Davis [19], KIBA [20], DTC [44], Metz [43], ToxCast [45] and STITCH [46]. We also apply our proposed model to predict the affinity score (C_b score, Eqn. 3.13) between FDA-approved drugs and the 5 viral proteins [32, 33, 34, 35, 36] of SARS-CoV-2.

Drug-target affinity predictions are presently performed in two ways. The first one simply uses traditional machine learning based methods [11, 12, 13]. The second one consists of the adoption of deep learning based methods [9]. Nowadays, prediction of drug-target binding affinity has shifted from traditional machine learning models to deep learning models, and most of the recent works are based on neural networks, i.e., on deep learning based approaches. [26] used an end-to-end deep learning approach for predicting drug-target affinity. Similarly, [9] introduced a CNN based approach to predict affinity between drugs and targets, and this trend continued in [17] and [31]. Similarly, [28] gives a self-attention based approach for predicting affinity. [24] proposed a generative adversarial network (GAN) to learn useful patterns within labeled and unlabeled sequences and takes advantage of convolutional regression to predict affinity. In this method, two partial GANs, one for feature extraction from the raw protein sequences and another for the SMILES strings, were used with a CNN based regression network for the purpose of affinity prediction. However, [24] did not discuss how the GAN would perform when trained on small datasets. Again, [25] used a CNN architecture followed by attention based mechanisms to determine the significance of the SMILES sequences of the drugs for interactions with proteins, thereby using it for affinity score prediction. Nevertheless, all of these methods convert drug compounds into corresponding string representations, which are not ideal for representing drug molecules. Using such strings may lead to exclusion of the topological information of the drug molecules. As a result, it decreases the performance of the above-discussed methods and their power to predict drug-target affinity. [50] used both LSTM and CNN for predicting the drug-target binding affinity score. [51] introduced a similarity-based method using CNN to perform the required task. [48] used the DTC as well as the Metz dataset to perform the required task. [47] used the ToxCast dataset to predict the drug-target binding affinity score.

In contrast to the aforementioned models, [18] used topological information of drug molecules via graph neural networks. In this method, chemical structures of drugs (provided in SMILES notation) are represented as graphs, where the nodes are the atoms, the edges represent interactions between atoms, and the proteins are represented as strings of characters. Specifically, graph convolutional networks (GCN), graph attention networks (GAT) and graph isomorphism networks (GIN) were used to predict drug-target affinity. Similarly, [27] combined a graph representation of drug molecules with a sequence representation of proteins to predict drug-target affinity. We compare our results with KronRLS [15, 16], SimBoost [14], DeepDTA [9], MT-DTI [28], DeepCPI [26], WideDTA [17], GANsDTA [24], AttentionDTA [25], 1D-CNN [31], DeepGS [27] and GraphDTA [18].

In this section, we present our novel architecture DeepGLSTM to predict the required affinity score for drug-target binding. The architecture of our proposed model is shown in Fig. 1. DeepGLSTM has two functional modules.
The first one helps to capture the topological information from the drug molecules. The second one captures the sequential information from the targeted protein structures.

In most of the previous works, drugs and proteins have been encoded using one-hot representations. Later, GraphDTA [18] and DeepGS [27] used a graph representation to embed the drug compounds and a one-hot representation for the protein structures. Experimental results show that providing the graph representation as input improves the model performance and produces better results. We follow this technique to embed the drug compound topological information and feed it as input. Similarly, protein sequences could also be represented as graphs. However, doing so is more difficult because the tertiary structure of a protein is not always available in a reliable form. So, for protein sequences we use the popular one-hot encoding representation [18].

We represent each drug compound by its corresponding SMILES (Simplified Molecular Input Line Entry System) [29] notation. From this notation, each compound can be viewed as a graph of the interactions between atoms (i.e., the nodes). To describe a node feature, we use a set of atomic features adapted from DeepChem [30]. Here, each node is represented as a multi-dimensional binary feature vector. The feature vector of every node encodes five pieces of information (i.e., the atom symbol, the number of adjacent atoms, the number of adjacent hydrogens, the implicit valence of the atom and whether the atom is in an aromatic structure) [30]. We then use the open-source chemical informatics software RDKit [22] to convert the SMILES data to its corresponding molecular graph representation (i.e., an adjacency representation A ∈ R^(N×N)), which helps us to extract the required atomic features.

The protein sequence is a string of ASCII characters, which represent amino acids. We initialize the protein sequence embedding layer by mapping every amino acid to an English letter and, subsequently, marking the amino acid with the sequential index of that letter in the English alphabet. We pass the embedding representation (c ∈ R^(d_p), where d_p is the dimension of the protein sequence embeddings) to a Bi-LSTM layer, which captures the dependencies between the characters in a sequence p = [c_1, c_2, ..., c_n] of length n. We get the output representation h_t ∈ R^(2d_l), where d_l is the number of output units used in each LSTM cell. We compute Eqn. 3.1 for the execution of the LSTM.
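To make the input preparation described above concrete, the following is a minimal, illustrative sketch (not the authors' code) of turning a SMILES string into an adjacency matrix with per-atom features using RDKit, and of mapping a protein sequence to integer indices before the embedding layer. The reduced four-feature atom descriptor and the amino-acid vocabulary shown here are simplifying assumptions; the paper uses the full DeepChem-style featurization (78 binary features per node) and a fixed protein length of 1000.

```python
import numpy as np
from rdkit import Chem

def smiles_to_graph(smiles):
    """Adjacency matrix A (N x N) and a small per-atom feature matrix X.

    Illustrative only: the paper uses the full DeepChem-style featurization
    (78 binary features per atom); here we keep a reduced numeric subset.
    """
    mol = Chem.MolFromSmiles(smiles)
    A = Chem.GetAdjacencyMatrix(mol).astype(float)            # bonds as edges
    X = np.array([[atom.GetDegree(),                          # number of adjacent atoms
                   atom.GetTotalNumHs(),                      # adjacent hydrogens
                   atom.GetImplicitValence(),                 # implicit valence
                   float(atom.GetIsAromatic())]               # aromaticity flag
                  for atom in mol.GetAtoms()])
    return A, X

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWYXBZUO"                     # assumed vocabulary

def encode_protein(sequence, max_len=1000):
    """Map each amino-acid character to an integer index, then pad/truncate to max_len."""
    index = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}   # 0 is reserved for padding
    ids = [index.get(ch, 0) for ch in sequence[:max_len]]
    return np.array(ids + [0] * (max_len - len(ids)))

# Example usage
A, X = smiles_to_graph("CC(=O)Oc1ccccc1C(=O)O")               # aspirin
p = encode_protein("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```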
To predict the affinity score of a drug-protein interaction, understanding the interaction of each node with its neighboring nodes is very important. Generally, the adjacency representation of a graph holds the connectedness relationship of a node with the other nodes connected to it by an edge. So, to represent graph-level features for every drug compound, incorporating multi-hop connectedness relationships between the nodes present in a graph may turn out to be important. For this reason, we introduce the idea of using GCNs in our architecture. It is well-known that even a randomly initialized GCN is strong enough to produce important feature representations, which can capture the connectedness relationship between the graph nodes. We introduce three blocks consisting of GCNs. In the first block, we stack three GCN layers. In the second block, we stack two GCN layers, and in the final block we use only one GCN layer.

For every drug compound, we compute a simple propagation rule, as mentioned in [23], by taking the adjacency representation A ∈ R^(N×N) generated by the RDKit tool and a node feature matrix X ∈ R^(N×C) (where C is the number of features per node) as inputs to the first block. To overcome the degree normalization problem of the adjacency representation, we compute the normalized adjacency representation A_norm as in Eqn. 3.2, where D ∈ R^(N×N) is the degree matrix of A. Eqn. 3.3 makes the first block workable, where W is the trainable weight, H^0_e = X, H^i_e is the i-th layer output representation and σ is a non-linear activation function. Thus, the i-th layer of the GCN module produces a global representation H^i_e ∈ R^(N×M) for A.

In a graph, every node v is connected by an edge to every node u that belongs to its neighborhood Γ(v). Every node w in Γ(u) that is not present in Γ(v) (and is not v itself) has a shortest-path distance of 2 from v. In other words, such nodes are two hops away from v. If v is connected to all such nodes that are two hops away from it, we get a graph that can be represented using A^2. Such a graph can be referred to as a power graph of power 2. Similarly, to further increase the local reachability of v, the exponent of the power graph may be increased. Note that increasing the exponent of the power graph would typically lead to much denser graphs due to the increased reachability.

In the second GCN block, we take the squared representation A^2 ∈ R^(N×N) of A and the same node feature representation X as input. By a similar argument, we compute the normalized adjacency representation A^2_norm as in Eqn. 3.4, where D is the degree matrix of A^2, and an analogous propagation rule makes the second block workable, producing the corresponding representation for A^2. Finally, the GCN layer of the third block takes the cubic representation A^3 ∈ R^(N×N) of A and the same feature matrix X as input. By a similar argument, we compute the normalized adjacency representation A^3_norm as in Eqn. 3.6, where D is the degree matrix of A^3. Similarly, Eqn. 3.7 makes the third GCN block workable, with the same parameters as discussed above. This block produces the required global representation H^i_e ∈ R^(N×M) for A^3.

We concatenate the output representations of the three blocks as shown in Eqn. 3.8 to get the final graph-level representation H_e for every drug compound. Subsequently, we pass H_e into a global max-pool layer followed by some fully connected layers to get a compact representation, which makes the computation fast and efficient. Then we concatenate the output representations of the Bi-LSTM and the fully connected layers. Finally, we feed the concatenated representation into some fully connected layers followed by the output layer to predict the affinity score.
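The propagation in the three blocks follows the standard GCN rule of [23]. The numpy sketch below illustrates how the power-graph inputs A, A^2 and A^3 could be normalized and propagated; it is a simplified illustration rather than the authors' implementation. In particular, adding self-loops before normalization, binarizing the matrix powers, and the intermediate layer widths are assumptions on our part (only the input size of 78 and the block output sizes of 312, 156 and 78 are taken from the paper).

```python
import numpy as np

def normalize_adj(A):
    # Symmetric normalization D^(-1/2) (A + I) D^(-1/2).
    # Adding self-loops follows [23]; Eqn. 3.2 is assumed to take a similar form.
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_layer(A_norm, H, W):
    # One propagation step H' = sigma(A_norm H W), with ReLU as sigma.
    return np.maximum(A_norm @ H @ W, 0.0)

def power_graph_blocks(A, X, block_weights):
    """block_weights: three lists of weight matrices (3-, 2- and 1-layer blocks),
    driven by A, A^2 and A^3 respectively. Returns the pooled drug embedding."""
    outputs = []
    for power, weights in zip((1, 2, 3), block_weights):
        A_p = (np.linalg.matrix_power(A, power) > 0).astype(float)  # assumed binarized power graph
        A_norm = normalize_adj(A_p)
        H = X
        for W in weights:
            H = gcn_layer(A_norm, H, W)
        outputs.append(H)
    H_e = np.concatenate(outputs, axis=1)   # concatenation of the three blocks (Eqn. 3.8)
    return H_e.max(axis=0)                  # global max-pool over the nodes

# Example with random weights; intermediate widths are illustrative, while the
# input size (78) and the block output sizes (312, 156, 78) follow the paper.
rng = np.random.default_rng(0)
N = 20
A = (rng.random((N, N)) < 0.2).astype(float)
A = np.triu(A, 1); A = A + A.T              # symmetric, loop-free toy adjacency
X = rng.random((N, 78))
weights = ([rng.standard_normal((78, 156)), rng.standard_normal((156, 156)), rng.standard_normal((156, 312))],
           [rng.standard_normal((78, 156)), rng.standard_normal((156, 156))],
           [rng.standard_normal((78, 78))])
drug_embedding = power_graph_blocks(A, X, weights)   # shape: (312 + 156 + 78,)
```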
We consider the following models as baselines: the traditional machine learning based methods KronRLS [15, 16] and SimBoost [14], and the deep learning based models DeepDTA [9], WideDTA [17], GANsDTA [24], AttentionDTA [25], DeepCPI [26], MT-DTI [28], GraphDTA [18] and DeepGS [27]. We have used the Davis [19], KIBA [20], DTC [44], Metz [43], ToxCast [45] and STITCH* [46] datasets for our experiments, as summarized in Table 1. In our experiments, we transformed the K_d values of the Davis dataset into log space (pK_d), similar to [14], by using Eqn. 3.9. We also divide the SCORES metric of the STITCH dataset by 100 to make the training process simpler. In our experiments, we fixed the length of the protein sequence (n) to 1000 for all the datasets.

The dimension (2d_l) of the hidden vector representation of the Bi-LSTM layer is 128. We also fixed the number of input node features (C) of the drug compounds to 78. We set the dimension (M) of the final hidden vector representation of the first GCN block to 312. Similarly, we set the output dimensions of the remaining two GCN blocks to 156 and 78, respectively. Throughout our experiments, we used a dropout probability (p) of 0.2 and the Adam [49] optimizer with a learning rate of 0.0005. Since this is a regression problem, we use the mean squared error (MSE) as the loss function to train our model. For all the experiments, we train our model for 1000 epochs. We use a batch size (B) of 512 for KIBA, DTC, Metz, ToxCast and STITCH, and a batch size of 128 for the Davis dataset. We followed the parameter settings of GraphDTA [18] to carry out the baseline experiments on the mentioned datasets (cf. Section 3.3). The reported results of the baseline models in Table 2 were taken from the previous literature. Note that the baseline results of Table 3 were obtained by carrying out the required experiments using the source code given in the corresponding work [18]. We used Google Colab Pro Plus to carry out the experiments.

* Open-source code is available at https://github.com/MLlab4CS/DeepGLSTM.git

We use MSE, the concordance index (CI) and the r^2_m index to evaluate the performance of our model. We perform the required computations for MSE (where P_i is the predicted binding affinity value and Y_i is the original binding affinity value), CI and the r^2_m index using Eqns. 3.10, 3.11 (where Z is a normalization constant that equals the number of drug-target pairs with different binding affinity values) and 3.12, respectively. CI [9] measures whether the ranks of the predicted binding affinity scores and the ground-truth scores agree. More specifically, when δ_i > δ_j, a positive score is given iff the predicted score b_i for the higher affinity δ_i is greater than the predicted score b_j for the lower affinity δ_j. Here, h(x) is a step function. The metric r^2_m [27] is used to evaluate the external prediction performance of QSAR (Quantitative Structure-Activity Relationship) models. A model's predictions are considered acceptable iff r^2_m ≥ 0.5. In Eqn. 3.12, r^2 and r^2_0 represent the squared correlation coefficient values between the ground-truth and the predicted values with and without intercept, respectively. The metric C_b, the Combined Score (as mentioned in Eqn. 3.13), is used to obtain the top-18 (cf. supplementary material) FDA-approved drugs for the 5 viral proteins of SARS-CoV-2. Since the Davis and KIBA datasets are widely used in the literature, we use both of these datasets to compute C_b.
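For reference, the standard forms of these quantities, as defined in the cited works [9, 14, 27] and which Eqns. 3.9-3.12 are presumed to follow (with K_d given in nM), can be written as:

```latex
% Standard forms assumed for Eqns. 3.9--3.12 (log-space K_d transform, MSE, CI and r_m^2).
pK_d = -\log_{10}\!\left(\frac{K_d}{10^{9}}\right)

\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(P_i - Y_i\bigr)^{2}

\mathrm{CI} = \frac{1}{Z}\sum_{\delta_i > \delta_j} h\bigl(b_i - b_j\bigr),
\qquad
h(x) = \begin{cases} 1, & x > 0 \\ 0.5, & x = 0 \\ 0, & x < 0 \end{cases}

r_m^{2} = r^{2}\Bigl(1 - \sqrt{\bigl|\,r^{2} - r_0^{2}\,\bigr|}\Bigr)
```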
Table 3: Performance comparison between DeepGLSTM and GraphDTA on DTC (pK_i), Metz (pK_i), ToxCast (pK_i) and STITCH (SCORES).

Fig. 2 shows the predicted value (p) versus measured value (m) plots on all the datasets. If the predicted value (p) closely resembles the measured value (m), a model may be considered good and, therefore, the output values should be close to the red line (p = m) shown in the figures. From Fig. 2, we can see that our model performs well on Davis, KIBA, DTC and Metz. For the ToxCast and STITCH datasets, our model shows comparable performance to other state-of-the-art methods. Therefore, we can conclude that the performance of our model is uniformly good over most of the datasets available in the drug repurposing domain.

We compare the performance of our model with that of the baseline models mentioned in Section 3.2. The comparative evaluation results are presented in Table 2 and Table 3. Our model outperforms all the baseline models in terms of CI on the Davis, KIBA, DTC, Metz and ToxCast datasets, in terms of MSE on the KIBA, DTC, Metz, ToxCast and STITCH datasets, and in terms of r^2_m on the KIBA, DTC and STITCH datasets. Additionally, our model produces relatively comparable results in terms of MSE on the Davis dataset and in terms of r^2_m on the Davis, Metz and ToxCast datasets.

To conclude in a more consolidated way, we use our model to predict drug-target affinity in terms of C_b (cf. Eqn. 3.13) for 2,304 FDA-approved drugs against 5 SARS-CoV-2 viral proteins [37, 38, 40, 41, 42]. As a result, the model yields the required C_b scores for 11,520 drug-target interactions. By definition, a higher C_b value indicates a higher binding affinity between a drug and its corresponding target, and vice versa [14]. So, we first sort the drugs in descending order of their C_b values predicted by our model. Then we take the top three drugs that bind with receptors and the top three drugs that bind with non-receptors. We do this for five highly researched SARS-CoV-2 viral proteins (Glycoprotein S aka Spike protein, RNA-dependent RNA Polymerase aka RdRp, Helicase, Papain-like Proteinase and Open Reading Frame 3a aka ORF3a) and combine them into a set consisting of 18 drugs. The prediction statistics for the selected SARS-CoV-2 viral proteins, in terms of all the binding affinity metrics used, have been provided in the supplementary material. Among the drugs that bind with receptors, the ones with the highest C_b values were Isoproterenol, Medrysone, Lindane, Mecamylamine, Oxybenzone, Morphine, Estriol, L-Menthol, Ergocalciferol and Methoxyphenamine. Among the drugs that bind with non-receptors, the ones with the highest C_b values were Lactulose, Sorbitol, Migalastat, Guaiacol, Cianidanol, Mannitol, Arbutin and Benzoic. Therefore, from the list of drugs with the highest binding affinity scores for the viral proteins, researchers may select drugs to conduct further experiments to understand their usage as prospective drugs to treat SARS-CoV-2 patients. As future work, the receptor based drugs with high binding affinity may be tested for their interactions with important SARS-CoV-2 receptors. Similarly, the non-receptor based drugs with high binding affinity may be tested for their interactions with important enzymes.
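As an illustration of the ranking and selection procedure described above, the short sketch below (not taken from the paper's codebase) extracts the top receptor-binding and non-receptor-binding drugs per viral protein from a table of predicted C_b scores; the file name and the columns 'drug', 'protein', 'binds_receptor' and 'cb_score' are hypothetical.

```python
import pandas as pd

# Hypothetical table of the 11,520 predicted C_b scores:
# columns 'drug', 'protein', 'binds_receptor' (bool), 'cb_score'.
preds = pd.read_csv("cb_predictions.csv")

top = (preds.sort_values("cb_score", ascending=False)       # higher C_b = stronger predicted binding
            .groupby(["protein", "binds_receptor"], sort=False)
            .head(3))                                        # top-3 receptor and top-3 non-receptor drugs per protein

shortlist = sorted(top["drug"].unique())                     # combined set of candidate drugs
print(len(shortlist), shortlist)
```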
We carried out an ablation study on the Davis dataset [19] for our DeepGLSTM model. Initially, we considered the GraphDTA [18] architecture as our baseline model. GraphDTA uses a block similar to the first GCN block of DeepGLSTM for the drug compounds and a 1-D CNN for the protein sequence information. However, it is well-known that LSTM performs better than 1-D CNN at capturing sequential representations [39]. We therefore replaced the 1-D CNN with a Bi-LSTM, keeping the remaining parts of the GraphDTA architecture the same. The performance evaluation of this experiment is reported in Table 4. It shows that the Bi-LSTM is much more effective than the 1-D CNN. Additionally, to show the effectiveness of the different components of our DeepGLSTM model, we carried out ablation study experiments by incrementally introducing the model components. The results of these ablation study experiments are reported in Table 5, which clearly shows the contribution of each component in our model architecture. The results in Table 5 suggest that the third block of GCN with Bi-LSTM contributes the most to the model performance. In most cases, the graph structure of a drug does not contain many shortest paths that exceed a distance of 3. Therefore, we can conclude that increasing the exponent of the power graph beyond 3 may not lead to any significant change in reachability for most of the nodes and thereby does not contribute significantly towards the effectiveness of the model. The results of Table 5 also indicate that an increase in the exponent of the power graph input, used in the GCN blocks, beyond a certain value may not contribute towards increasing the effectiveness of the model.

To show the effectiveness of the power graph representation in our model, we carried out three additional experiments with DeepGLSTM. First, we input the adjacency matrix A of the drug molecules into the first GCN block of DeepGLSTM and remove the remaining two GCN blocks. Second, we pass only the squared representation of A (i.e., A^2) as input into the second GCN block of DeepGLSTM and remove the remaining two GCN blocks. Finally, we take the cubic representation of A (i.e., A^3) as input into the third GCN block of DeepGLSTM and remove the first two GCN blocks. Thus, using the aforementioned settings, we train three different models and note their corresponding MSEs. The results of these ablation study experiments are reported in Table 6.

Table 6: Effectiveness of using the power graph (on the Davis dataset (pK_d)).

The results show that the second GCN block with input A^2 and the third GCN block with input A^3 contribute substantially to achieving comparatively low MSE scores. In this way, we introduce DeepGLSTM (i.e., the combination of the above three GCN blocks with three different types of adjacency input representations) to achieve substantially low MSE scores.

In this paper, we present a novel architecture for the task of predicting binding affinity values between FDA-approved drugs and viral proteins. Our model uses three blocks of GCN for learning the topological information from the drug molecules. Our Bi-LSTM learns the representation of the protein sequences. We carried out our experiments on the Davis, KIBA, DTC, Metz, ToxCast and STITCH datasets. We also use our model for predicting binding affinity values between 2,304 FDA-approved drugs and 5 viral proteins of SARS-CoV-2. Our model produces state-of-the-art results on the Davis, KIBA, DTC, Metz, ToxCast and STITCH datasets. We have also reported the drugs that show significantly high binding affinity with some of the most studied viral proteins of SARS-CoV-2.

We thank Siddhartha S Jana for his helpful comments.

References

Drug repositioning: identifying and developing new uses for existing drugs
Pharmacogenetics in drug discovery and development: a translational perspective
Overcoming drug development bottlenecks with repurposing: old drugs learn new tricks. Nature Medicine
Protein kinases - the major drug targets of the twenty-first century?
Protein kinase inhibitors: insights into drug design from structure
Frequent substructure-based approaches for classifying chemical compounds
The protein kinase complement of the human genome
Maximizing diversity from a kinase screen: identification of novel and selective pan-Trk inhibitors for chronic pain
DeepDTA: deep drug-target binding affinity prediction
A deep learning-based method for drug-target interaction prediction based on long short-term memory neural network
A machine learning-based method to improve docking scoring functions and its application to drug repurposing
Drug discovery in the age of systems biology: the rise of computational approaches for data integration
The Drug Repurposing Hub: a next-generation drug library and information resource
SimBoost: a read-across approach for predicting drug-target binding affinities using gradient boosting machines
Computational-experimental approach to drug-target interaction mapping: a case study on kinase inhibitors
Learning with multiple pairwise kernels for drug bioactivity prediction
WideDTA: prediction of drug-target binding affinity
Predicting drug-target binding affinity with graph neural networks
Comprehensive analysis of kinase inhibitor selectivity
Making sense of large-scale kinase inhibitor bioactivity data sets: a comparative and integrative analysis
A deep learning-based framework for drug-target interaction prediction
Semi-Supervised Classification with Graph Convolutional Networks
GANsDTA: predicting drug-target binding affinity using GANs
AttentionDTA: prediction of drug-target binding affinity using attention model
Compound-protein interaction prediction with end-to-end learning of neural networks for graphs and sequences
Deep representation learning of graphs and sequences for drug-target binding affinity prediction
Self-attention based molecule representation for predicting drug-target interaction
SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules
Deep learning for the life sciences: applying deep learning to genomics, microscopy, drug discovery, and more
Deep Learning-Based Potential Ligand Prediction Framework for COVID-19 with Drug-Target Interaction Model. Cognitive Computation
Addendum: A pneumonia outbreak associated with a new coronavirus of probable bat origin
Protein structure and sequence reanalysis of 2019-nCoV genome refutes snakes as its intermediate host and the unique similarity between its spike protein insertions and HIV-1
Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein
Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation
Structure of the RNA-dependent RNA polymerase from COVID-19 virus
Remdesivir and SARS-CoV-2: Structural requirements at both nsp12 RdRp and nsp14 Exonuclease active-sites
Structure of Mpro from SARS-CoV-2 and discovery of its inhibitors
Comparative study of CNN and RNN for natural language processing
Coronavirus main proteinase (3CLpro) structure: basis for design of anti-SARS drugs
Structure of papain-like protease from SARS-CoV-2 and its complexes with non-covalent inhibitors
Characterization of spike glycoprotein of SARS-CoV-2 on virus entry and its immune cross-reactivity with SARS-CoV
Navigating the kinome
Drug target commons: a community effort to build a consensus knowledge base for drug-target interactions
STITCH: interaction networks of chemicals and proteins
A deep learning-based framework for drug-target interaction prediction
Kipred: a predictive model for estimating ligand-kinase inhibitor constant (pKi)
A method for stochastic optimization
DeepCDA: deep cross-domain compound-protein affinity prediction through LSTM and convolutional neural networks
Prediction of drug-target binding affinity using similarity-based convolutional neural network
The shape and structure of proteins