key: cord-0326897-hf5c9ffc
authors: Verma, Niraj; Qu, Xingming; Trozzi, Francesco; Elsaied, Mohamed; Karki, Nischal; Tao, Yunwen; Zoltowski, Brian; Larson, Eric C.; Kraka, Elfi
title: SSnet: A Deep Learning Approach for Protein-Ligand Interaction Prediction
date: 2020-09-20
journal: bioRxiv
DOI: 10.1101/2019.12.20.884841
sha: a6310d27a05de6b90763205b2ab808cdb7ebfba4
doc_id: 326897
cord_uid: hf5c9ffc

Computational prediction of Protein-Ligand Interaction (PLI) is an important step in the modern drug discovery pipeline, as it mitigates the cost, time, and resources required to screen novel therapeutics. Deep Neural Networks (DNN) have recently shown excellent performance in PLI prediction. However, this performance depends strongly on the protein and ligand features supplied to the DNN model, and in current models it is not trivial to decipher how protein features determine the underlying principles that govern PLI. In this work, we developed a DNN framework named SSnet that utilizes secondary structure information of proteins, extracted as the curvature and torsion of the protein backbone, to predict PLI. We demonstrate the performance of SSnet by comparing it against a variety of currently popular machine and non-machine learning models using various metrics. We visualize the intermediate layers of SSnet to show a potential latent space for proteins, in particular to extract structural elements in a protein that the model finds influential for ligand binding, which is one of the key features of SSnet. We observed in our study that SSnet learns information about locations in a protein where a ligand can bind, including binding sites, allosteric sites, and cryptic sites, regardless of the conformation used. We further observed that SSnet is not biased toward any specific molecular interaction and extracts the protein fold information critical for PLI prediction. Our work forms an important gateway to the general exploration of secondary structure based deep learning, which is not confined to protein-ligand interactions, and as such will have a large impact on protein research.

Introduction

Diverse biological processes are dictated by ligand-induced conformational changes in target proteins. Modern medicine has harnessed the ability to control protein structure and function through the introduction of small molecules as therapeutic interventions to diseases. Despite the importance of Protein-Ligand Interactions (PLI) in medicine and biology, and keen insight into the multitude of factors governing ligand recognition, including hydrogen bonding,1,2 π-interactions,3 and hydrophobicity,4 the development of robust predictive PLI models and their validation in drug discovery remain challenging. Reliance on experimental methods to identify and confirm PLIs is time-consuming and expensive. In contrast, computational methods can save time and resources by filtering large compound libraries to identify smaller subsets of ligands that are likely to bind to the protein of interest.
In this manner, reliable PLI prediction algorithms can significantly accelerate the discovery of new treatments, eliminate toxic drug candidates, and efficiently guide medicinal chemistry.5 Currently, Virtual Screening (VS) is commonly used in academia and industry as a predictive method for determining PLI. Broadly, VS can be divided into two major categories: Ligand Based Virtual Screening (LBVS) and Structure Based Virtual Screening (SBVS).6 LBVS applies sets of known ligands to a target of interest and is therefore limited in its ability to find novel chemotypes. In contrast, SBVS uses the 3D structure of a given target to screen libraries, thereby improving its utility in identifying novel therapeutics.7 Over the last few decades, many classical techniques such as force-field, empirical, and knowledge-based5 PLI predictions have been developed, with limited success. Often these methods show low performance and, in some cases, even discrepancies when compared with experimental bioactivities.8 Even successful methods are often limited by the requirement of high-resolution protein structures with detailed information about the binding pocket.9 Machine learning (ML) and deep learning (DL) approaches have recently received attention in this field. Various reviews summarize the application of ML/DL in drug design and discovery.10-14 Machine learning based PLI prediction has been developed from a chemogenomic perspective.15 Several strategies have been explored: Euclidean distances over common features derived by mapping ligands and proteins; Wallach et al.18 used a 3D grid for proteins along with 3D convolutional networks; Tsubaki et al.19 used a combination of a convolutional network for proteins and a graph network for ligands; Li et al.20 used Bayesian additive regression trees to predict PLI; and lastly, Lee et al.21 applied deep learning with convolutional neural networks on protein sequences. While these methods provide novel insights for PLI, they do not provide a solid framework for direct application in drug discovery. End-to-end learning, a powerful ML/DL technique, has gained interest in recent years, since once the model is trained the user is only required to provide standard protein and ligand descriptors as input.22 The end-to-end learning technique involves i) embedding inputs to lower dimensions, ii) formulating various neural networks depending on the data available, and iii) using backpropagation over the whole architecture to minimize the loss and update the weights. An example of an end-to-end learning model that has achieved a high level of accuracy in PLI prediction is GNN-CNN.19 Currently, protein structure-based ML/DL methods for PLI prediction achieve low accuracy as they suffer from three major limitations: i) the absence of high-resolution structures of protein-ligand pairs for training; ii) the 3D grid for the target can form a huge and sparse matrix, which hinders the ability of ML/DL models to learn and predict PLI; and iii) the techniques are sensitive to the method employed to represent the ligand structure; diverse methods have been reported23-26 and selection of the optimal ligand representation can be challenging. Strategies to overcome these limitations have largely focused on developing new methodologies to reduce target and compound structure space to 1D representations, thereby providing a dense framework for ML/DL to operate on a small number of input features.
Reduction of 3D protein structure information into 1D allows the machine to efficiently learn the 3D spatial features of the protein secondary structure that are required for ligand interaction. This information, combined with the way a convolutional network considers the input space, makes the model unbiased towards the protein conformation, so that it is not limited by the existence of high-resolution protein-ligand structures adequate for PLI prediction. This feature addresses a major drawback of most virtual screening methods, since only a limited portion of a protein's conformational space can be crystallized. Herein, we outline a new machine learning based algorithm termed SSnet for the prediction of PLIs. SSnet uses a 1D representation based on the curvature and torsion of the protein backbone. Mathematically, curvature and torsion are sufficient to reasonably represent the 3D structure of a protein27 and therefore contain compact information about its function and fold. Further, curvature and torsion are sensitive to slight changes in the secondary structure, which are a consequence of all atom interactions, including side chains. These characteristics position SSnet to outperform existing methods of PLI prediction and to rapidly curate large molecular libraries into a subset of likely high-affinity interactors. As outlined below, corresponding analyses are carried out to show the robustness and versatility of our new model. We analyzed the model using Grad-CAM to visualize heatmaps of the neural network activations and the input features that maximally excite them.28 The input features can then be used to highlight, on the protein 3D structure, the residues that maximally influenced the predicted score. In the methods and rationale section, we demonstrate how the secondary structure of proteins is used in the ML/DL framework. We discuss the representation of ligands following the introduction of the SSnet model, along with possible evaluation criteria and their merits and demerits. We also discuss the datasets used in this work to validate and train SSnet. In the results and discussion section, we first select the ligand descriptor to be used in SSnet. We then validate two trained models: SSnet:DUD-E, trained on DUD-E,29 and SSnet:BDB, trained on BindingDB30 for general application. The applicability of SSnet as a VS tool is demonstrated using the enrichment factor. We further show the applicability of SSnet for virtual screening tasks through its high AUCROC and EF scores while maintaining a lack of conformational bias, which allows it to find ligands for cryptic proteins, and we visualize the important residues considered by SSnet. In the conclusion, we outline a way to incorporate SSnet into the drug-design workflow and provide a future perspective.

A protein can be represented by the α carbons (CA atoms) of the backbone, as they define a unique and recognizable structure, especially for protein categorization.41 In fact, a significant amount of information about the protein is embedded in secondary structure elements such as helices, β sheets, hairpins, coils, and turns. Therefore, utilization of these parameters should retain adequate information to train an ML/DL approach.

Figure 1: a) The tangent vector t, normal vector n, and binormal vector b of the Frenet frame at points P1 and P2, respectively, for a curve r(s). b) Representation of a protein backbone in terms of scalar curvature κ and torsion τ.
The ideal helix, turn, and non-ideal helix are shown in orange, cyan, and magenta, respectively. The curvature and torsion pattern captures the secondary structure of the protein.

The secondary structure information can be retrieved from a smooth curve generated by a cubic spline fit of the CA atoms. Figure 1a shows the arc length s, scalar curvature κ, and scalar torsion τ, which define the 3D curve r(s). The scalar curvature κ is expressed as a function of the arc length s as

κ(s) = |r''(s)| (1)

and the scalar torsion as

τ(s) = ⟨r'(s), r''(s), r'''(s)⟩ / |r''(s)|² (2)

where |·| is the norm and ⟨·,·,·⟩ is the vector triple product. A protein can then be represented by considering the curvature and torsion at the anchor points (locations of CA atoms), forming a 1D vector with twice the number of amino acids.27,39 Figure 1b shows the decomposition of a protein found in Conus villepinii (PDB ID 6EFE) into scalar curvature κ and torsion τ, respectively. Residues 5 through 10 show a near-ideal α-helical secondary structure, which is represented as an oscillation of τ with smooth κ. Similarly, the turn (residues 15 to 17) and a non-ideal α helix (residues 20 to 25) are captured in the decomposition plot via unique patterns. Because the curvature and torsion information of the secondary structure of proteins is encoded as patterns, ML techniques may be powerful tools to predict PLI through efficiently learned representations of these patterns. More specifically, we hypothesize that convolutional filters of varying sizes may be excellent pattern-matching tools for discerning structure from these decomposition plots. More analysis on the protein representation is provided in the subsection SSnet model.

Figure 2: The SSnet architecture. Each branch goes through further convolutions with the same window size (red, orange, green, and light blue boxes). A global max pooling layer is implemented to obtain the protein vector. The ligand vector is fed directly to the network. Each double-arrow line implies a fully connected dense layer. The number inside a box represents the dimension of the corresponding vector. In the case of GNN, the ligand vector is replaced by a graph neural network as implemented by Tsubaki et al.19

SSnet model

Figure 2 shows the SSnet model developed in this work. Here we provide a general overview of the network; more details about its specific design and operation are given later in this section. As denoted in the upper left branch of Figure 2, after conversion into the Frenet-Serret frame and the calculation of curvature κ and torsion τ, the κ and τ data (i.e., the decomposition data) are fed into the neural network. We denote this input as a 2D matrix X(0) (the 1D vector reshaped so that curvature occupies one row and torsion the other), where each column represents a unique residue and the rows correspond to curvature and torsion. The first layer is a branched convolution with varying window sizes. That is, each branch is a convolution with a filter of a different length. We perform this operation so that patterns of varying lengths in the decomposition plot can be recognized by the neural network. Each branch is then fed to more convolutions of the same window size. This allows the network to recognize more intricate patterns in X(0) that might be difficult to recognize with a single convolution. The outputs of these convolutional branches are concatenated, pooled over the length of the sequence, and fed to a fully connected dense layer.
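To make the backbone featurization described above concrete, the following is a minimal sketch of how curvature and torsion could be extracted from CA coordinates with a cubic spline fit. This is not the authors' implementation: the function name, the use of scipy, and the general-parameterization Frenet formulas (rather than an explicit arc-length reparameterization) are illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def backbone_curvature_torsion(ca_coords):
    """Curvature and torsion at each CA anchor point of a protein backbone.

    ca_coords: (n_residues, 3) array of alpha-carbon coordinates.
    Returns two length-n_residues arrays (kappa, tau).
    """
    t = np.arange(len(ca_coords), dtype=float)
    spline = CubicSpline(t, ca_coords, axis=0)        # smooth curve through the CA atoms
    r1, r2, r3 = spline(t, 1), spline(t, 2), spline(t, 3)   # r'(t), r''(t), r'''(t)
    cross = np.cross(r1, r2)
    speed = np.linalg.norm(r1, axis=1)
    # General-parameterization Frenet formulas (reduce to Eqs. 1-2 for arc length).
    kappa = np.linalg.norm(cross, axis=1) / np.clip(speed ** 3, 1e-8, None)
    tau = np.einsum("ij,ij->i", cross, r3) / np.clip(
        np.linalg.norm(cross, axis=1) ** 2, 1e-8, None)
    return kappa, tau

# The 1D protein descriptor is the interleaved (kappa, tau) values at the anchor
# points, i.e. a vector of length 2 * n_residues (the two rows of the matrix X(0)).
```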
The upper right branch of Figure 2 shows the ligand vector, which is generated and fed to a fully connected dense layer. The output of such a layer is typically referred to as an embedding. Intuitively, these embeddings are reduced-dimensionality representations of the protein and the ligand. The outputs of the protein embedding and the ligand embedding are then concatenated and fed to further dense layers to predict the PLI.

The convolutional network in this research applies filter functions over the protein matrix X(0). To define the convolution operation more intuitively, we first define a reshaping operation

c_i = flatten( X(0)[:, i:i+K] )

where the flattening operation reshapes the entries of X(0) from column indices i to i+K into a column vector c_i. This process is also referred to as vectorization. The size of the filter is then of length K. We define the convolution operation as

X(1)[:, i] = f( W(0)_conv c_i + b(0)_conv )

where f is a function known as the rectified linear unit (ReLU), W(0)_conv is the weight matrix, and b(0)_conv is the bias vector. This operation fills in the columns of the output of the convolution, X(1) (also called the activation or feature map). Each row of W(0)_conv is considered a different filter, and each row of X(1) is the convolutional output of one of these filters. These convolutions can be repeated such that the n-th activation is computed as

X(n)[:, i] = f( W(n-1)_conv c_i^(n-1) + b(n-1)_conv )

where c_i^(n-1) is the flattened window of X(n-1) starting at column i. In our SSnet model we use four different branches with filter sizes of κ = 5, 10, 15, and 30. For each branch we apply a maximum operation along the rows of X(N)_κ. This is typically referred to as a Global Max Pooling layer in neural networks and is repeated R times, once for each row of X(N)_κ:

d_κ[r] = max over columns j of X(N)_κ[r, j]

where d_κ is a length-R column vector regardless of the number of columns in the latent space X(N)_κ. This maximum operation, while important, has the effect of eliminating much of the information in the latent space. To better understand the latent space, we can further process X(N)_κ to understand how samples are distributed. For example, a simple operation is to define another column vector v that denotes the total variation of each row of the latent space:

v[r] = Σ_j | X(N)[r, j+1] − X(N)[r, j] |

The concatenation of the vectors d and v helps elucidate how the samples are distributed in the latent space. As such, we can use these concatenated vectors as the input to a fully connected dense layer that learns to interpret the latent space. This output is referred to as the embedding of the protein, y_prot, and is computed as

y_prot = f( W_prot [d; v] + b_prot )

where W_prot is the learned weight matrix and b_prot is the bias vector of the fully connected network. The method described above is similar to a technique recently used in speaker verification systems, where the window sizes need to be dynamic because the length of the audio snippet is unknown.43,44 In speech systems, the latent space is collapsed via mean and standard deviation operations, and the embeddings produced by these operations are typically referred to as D-Vectors43 or X-Vectors44. In proteins we face a similar problem, as the length of the decomposition sequence needed to capture the active site(s) of a protein is dynamic and of unknown size. By including window sizes of 5, 10, 15, and 20 (the number of residues considered at a time), we ensure that the network is able to extract patterns of different sizes from backbones of varying length.
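The branched protein network described above can be summarized in a short sketch. The following Keras code is illustrative only: the number of filters, embedding dimensions, maximum sequence length, and ligand-vector size are assumptions, and the input is laid out as (residues, 2 channels) because Keras Conv1D expects channels last; the published SSnet dimensions may differ.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_ssnet_like(max_len=1500, lig_dim=2048, filters=128):
    # Protein branch: curvature and torsion per residue as two input channels.
    prot_in = layers.Input(shape=(max_len, 2), name="curvature_torsion")
    branch_outputs = []
    for window in (5, 10, 15, 30):                      # one branch per window size
        x = layers.Conv1D(filters, window, padding="same", activation="relu")(prot_in)
        x = layers.Conv1D(filters, window, padding="same", activation="relu")(x)
        pooled = layers.GlobalMaxPooling1D()(x)         # the vector d for this branch
        total_var = layers.Lambda(                      # the total-variation vector v
            lambda t: tf.reduce_sum(tf.abs(t[:, 1:, :] - t[:, :-1, :]), axis=1))(x)
        branch_outputs += [pooled, total_var]
    prot_emb = layers.Dense(256, activation="relu")(layers.Concatenate()(branch_outputs))

    # Ligand branch: a precomputed descriptor vector fed to a dense layer.
    lig_in = layers.Input(shape=(lig_dim,), name="ligand_vector")
    lig_emb = layers.Dense(256, activation="relu")(lig_in)

    # Joint head: concatenated embeddings -> dense layers -> sigmoid binding score.
    h = layers.Concatenate()([prot_emb, lig_emb])
    h = layers.Dense(128, activation="relu")(h)
    out = layers.Dense(1, activation="sigmoid", name="binding")(h)

    model = Model([prot_in, lig_in], out)
    model.compile(optimizer="adam", loss="binary_crossentropy")  # loss L(W) defined below
    return model
```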
After embedding the protein and the ligand, we concatenate the two vectors and feed them into the final neural network branch, resulting in a prediction of binding, ŷ, which is expected to be closer to "0" for ligands and proteins that do not bind and closer to "1" for proteins and ligands that do bind. This final branch consists of two layers:

y(1) = f( W(1) [y_prot; y_lig] + b(1) )
ŷ = σ( W(2) y(1) + b(2) )

where y_lig is the ligand embedding and σ refers to a sigmoid function that maps the output to [0, 1]. If we denote the ground truth binding as y, which is either 0 or 1, and denote all the parameters inside the network as W, then the loss function for the SSnet model can be defined as the binary cross entropy, which is computed as

L(W) = −(1/M) Σ_{m=1}^{M} [ y_m log ŷ_m + (1 − y_m) log(1 − ŷ_m) ]

where M is the number of samples in the dataset. By optimizing this loss function, the neural network can learn to extract, from the protein and ligand inputs, meaningful features that are relevant to predicting binding.

A neural network generally exhibits a large number of weights to be optimized so that complex information can be learned; however, some of this information could be irrelevant to a prediction task. For example, consider the task of identifying whether a certain image contains a horse. If all horse images also contain date information and images without a horse do not, the machine will quickly learn to detect the date rather than the target object (a horse in this case). Therefore, it is essential to verify what a neural network considers "influential" for classification after training. Selvaraju et al.28 proposed a Gradient-weighted Class Activation Mapping (Grad-CAM) based method to generate a heatmap that shows the important points in the feature data for a particular class of prediction. That is, this method uses activations inside the neural network to understand which portions of an image are most influential for a given classification. In the context of protein structures, this method can help elucidate which portions of the decomposition plot are most important for a given classification. These influential patterns in the decomposition plot can then be mapped to specific substructures in the protein. Grad-CAM is computed by taking the gradient weight α_k for each channel in a convolutional layer,

α_k = (1/Z) Σ_j ∂ŷ / ∂X(N)[k, j]

where k is the row (channel) in the final convolutional layer, Z is a normalization term, X(N) is the activation of the final convolutional layer, and ŷ is the final layer output. The heatmap S is then computed as the weighted sum of the final layer activations:

S[j] = Σ_k α_k X(N)[k, j]

This heatmap S specifies the important portions of the input sequence that are most responsible for a particular class activation. For each convolutional branch, we can apply this procedure to understand which portions of the input decomposition sequence contribute the most, according to each filter size K = 5, 10, 15, 20. In this way, we can map the most influential portions onto locations on the backbone of the protein. To the best of our knowledge, this procedure has never been applied to protein (or ligand) structures, because Grad-CAM has rarely been applied outside of image processing.

The evaluation criteria for PLI are generally the area under the curve of the receiver operating characteristic (AUCROC),45 the Boltzmann-Enhanced Discrimination of the ROC (BEDROC), and the enrichment factor (EF),46 defined as

EF_X% = ( n_actives,X% / N_X% ) / ( n_actives,total / N_total )

where N_X% is the number of ligands in the top X% of the ranked ligands, n_actives,X% is the number of actives among them, and N_total and n_actives,total are the corresponding counts for the whole library. EF thus estimates how many more actives are found within the early recognition threshold than expected from a random distribution.
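For illustration, the EF at a given fraction can be computed directly from predicted scores and binary activity labels; this is a minimal sketch with an assumed function name, not taken from the SSnet codebase.

```python
import numpy as np

def enrichment_factor(scores, labels, top_frac=0.01):
    """EF at the given fraction: the hit rate among the top-scored ligands divided
    by the hit rate of the whole library (labels: 1 for actives, 0 for inactives)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    n_top = max(1, int(round(top_frac * len(scores))))
    top_idx = np.argsort(scores)[::-1][:n_top]   # indices of the highest-scoring ligands
    return labels[top_idx].mean() / labels.mean()

# Example: EF 1% for a screen, e.g. enrichment_factor(pred_scores, true_labels, 0.01);
# a value of 1 corresponds to random ranking, larger values indicate enrichment.
```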
Most state-of-the-art models for PLI prediction use the human and C. elegans datasets created by Liu et al.51 The positive PLIs for these datasets are taken from DrugBank 4.152 and Matador.53 The negative PLIs were generated by using ligands of proteins that are dissimilar to the target in query. The human dataset contains 852 unique proteins with at least one positive or negative PLI instance; 1052 unique compounds that bind to these target proteins (one-to-one, one-to-many, and many-to-many) account for 3369 positive interactions. Similarly, the C. elegans dataset contains 1434 and 2504 unique proteins and compounds, respectively, for a total of 4000 positive interactions. The experimental setting suggested by Tabei and Yamanishi54 was used, with positive-to-negative interaction ratios of 1:1, 1:3, and 1:5 for training. Five-fold cross-validation was performed for evaluation. Although the human and C. elegans datasets provide good benchmarking against other ML approaches for PLI prediction, they do not contain enough PLI instances for use in real applications (Figure S8). Computationally generated decoys, such as those in DUD-E, face the problem of false negatives, and therefore an experimental dataset could be more reliable for SSnet. We considered the BindingDB (BDB) dataset,30 which is a public, web-accessible database of measured experimental binding affinities containing around 1.3 million data records. We created a database by considering the following properties for each data entry:

1. The target has a PDB ID cross-referenced as a 3D structure. The first annotated structure is taken as the reference PDB file.
2. The ligand has a SMILES representation in the entry.
3. The record has an IC50 value (a measure of binding strength) that is either less than x (active) or greater than y (inactive).

The values assessed for x were 10 nM, 25 nM, and 100 nM, while the values assessed for y were x nM or 10,000 nM. A preliminary analysis showed that x = 100 nM and y = 10,000 nM provide the best balance between AUCROC and EF 1% for PLI prediction; this dataset, and the model trained on it, were therefore termed SSnet:BDB (Figure S11). SSnet can also function as a guide to couple ML-based PLI prediction to downstream analysis using traditional docking-based approaches. Grad-CAM analysis reveals that SSnet can accurately identify regulatory binding sites within protein targets of interest. Importantly, the ability of SSnet to identify these sites is independent of protein conformation, and it is able to identify cryptic and allosteric sites without prior information about their regulatory roles. In this manner we demonstrate that SSnet mitigates many of the limitations of alternative predictive PLI approaches, while retaining high accuracy and speed.

A key bottleneck in the development of PLI prediction methods is the selection of the optimal representation of ligand structure. Convolutional neural networks (CNN) have to update a large number of weights and therefore require a large number of data instances (unique PLIs). However, in the human and C. elegans datasets the number of instances is insufficient, causing SSnet to overfit (Figure S3). To overcome this problem, we omitted the convolution layers and directly fed the proteins' curvature and torsion to the fully connected dense layer, making the protein input similar to the ligand vector shown in Figure 2. This helps reduce the number of weights to be optimized and decreases the chance of overfitting. These measures were unnecessary for DUD-E, since it contains sufficient data instances for ML to learn (Figure S8).
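As a concrete possibility for the ligand vector fed to the dense layer in Figure 2, a fixed-length fingerprint such as an extended-connectivity (Morgan) fingerprint can be generated with RDKit. This sketch is illustrative; the descriptor actually selected for SSnet (see the results and discussion section) may differ, and the function name and bit-vector size are assumptions.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def ligand_vector(smiles, n_bits=2048, radius=2):
    """Extended-connectivity (Morgan) fingerprint as a fixed-length ligand descriptor."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.float32)
    DataStructs.ConvertToNumpyArray(fp, arr)   # copy the bits into the numpy vector
    return arr

# Example: vec = ligand_vector("CC(=O)Oc1ccccc1C(=O)O")  # aspirin -> 2048-dim vector
```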
This procedure helps mitigate any bias that SSnet might have towards a subset of inactives. On the DUD-E test set, SSnet:DUD-E performs best with an average AUCROC of 0.97, closely followed by GNN-CNN with 0.96 (Table S6). On the MUV dataset, Figure S9 shows that SSnet:DUD-E performs equivalently to or slightly better than random chance, which is on par with or better than the other methods shown. Thus, further discussion of SSnet performance on MUV does not provide any valuable insight.

BDB is a database of PLIs with reported experimental values in terms of IC50, EC50, Ki, or Kd. As an entirely experimental PLI database, BDB can be used to cross-validate the models. Figure 5b shows the average AUCROC on the test set of BindingDB. The learning curve of loss over epochs is shown in Figure S7b. SSnet:BDB, when tested on the BDB test set, has the highest average AUCROC of 0.91, followed by 0.85 for GNN-CNN:BDB (Table S8). These scores are based on a test set drawn from within BDB, and thus an independent validation is required to comment on the generalizability of the model. We used the DUD-E dataset for this purpose. To eliminate potential bias from protein similarity between the DUD-E test set and the SSnet:BDB training set, we removed all targets with sequence similarity greater than 75%. SSnet:BDB tested on DUD-E has an average AUCROC of 0.81, while GNN-CNN:BDB has 0.79 (Table S6). SSnet:BDB still performs poorly on the MUV dataset, equivalent to random chance (Figure S9). As SSnet:BDB is based on experimental data, we compared our models with the traditional methods on DUD-E targets. Figure 6a shows the AUCROC of SSnet:BDB compared to the best AUCROC from smina, vina, and edock, three freely available traditional virtual screening and docking methods. SSnet consistently has a high AUCROC compared to the best scores of smina, vina, and edock.49 Figure 6b shows that SSnet:BDB, when tested on the DUD-E dataset, performs better than random chance (AUCROC greater than 0.5) 94% of the time, with a mean AUCROC of 0.78. Furthermore, the targets being compared were subject to the filtering described above. Table 3 shows the AUCROC of various methods for 21 DUD-E targets.50 We observe large variations across the different methods for the targets tested on the DUD-E dataset. No single method performed best for all targets.67-69 SSnet gives an overall superior performance, with the best AUCROC score for 8 targets. SSnet:BDB has the highest average AUCROC (0.79) among the methods shown, followed by FRED and HYBRID, both at 0.78. It is important to note that HYBRID requires and utilizes prior knowledge of the structure of a ligand bound to the target site, which strongly limits the applicability of this method. The consensus scoring schemes compared were as follows:

• Boosting consensus score (BCS) is a gradient-boosted decision tree framework trained on binary labels (actives and decoys), with an individual decision tree model for each target, where the input is composed of the docking scores obtained from each docking method. For each target, the docking scores of the other off-targets were also provided to boost the model performance.
• Mean-variance consensus (MVC) is a parameterized function based on a Gaussian distribution of the scores.
• Mean, median (Med), maximum (Max), and minimum (Min) are the statistics obtained from the normalized scores across the docking methods.

Consensus scoring does increase the performance for each target. However, SSnet outperforms or is equivalent to the best consensus-based scoring methods for 7 targets in terms of AUCROC.
We note that the resources and time required for consensus scoring are significant. SSnet therefore serves as a balance between accuracy and the resources/time required. As described in the evaluation criteria section, AUCROC is not necessarily the best metric for evaluating PLI prediction. While a high AUCROC demonstrates the ability of a model to discern true positives from false positives, this is only useful when selecting drugs with SSnet scores greater than 0.5. This might not be feasible when selecting drugs from extremely large databases like ZINC, which contains ≈1.5 billion ligands. For screening such large databases, the enrichment factor (EF) should be employed. In fact, as EF reflects the likelihood of finding true actives in the top-scoring subset of the database, it provides a more reliable metric for picking high-scoring ligands. Furthermore, many studies have shown that ranking based on either metric alone is not a sufficient indication of performance.70-72

SSnet Enriches the Top Scoring Ligands with True Actives

As mentioned, EF provides a more stable metric for PLI prediction when filtering a large dataset to enrich a limited subset of ligands with the most likely actives. For this reason, in this section we evaluate and compare the performance of SSnet on the basis of EF. Does SSnet perform reliably as a virtual screening tool, in terms of enrichment factor, compared to other high-performing algorithms (Table S7)? As an external dataset validation for the DUD-E trained model, the BDB test set was used. MUV represents a gold standard for independent validation of PLI prediction; however, for the reasons outlined above, MUV is an extremely challenging dataset. Similar to the MUV results described in the section above, we observe that all methods perform poorly on MUV, with SSnet:BDB having a mean EF 1% of 1.48 (Tables S3-S5). Since testing SSnet:DUD-E on a DUD-E generated test set should be considered bad practice due to the bias outlined, we benchmarked SSnet:BDB. Figure 7a shows the EF 1% of SSnet:BDB compared with the best performance achieved via vina, smina, or edock for each target.49 SSnet:BDB outperforms the best score of the traditional VS approaches for 74% of the targets considered. Figure 7b shows that for 90% of the targets, SSnet achieved a better outcome than random sampling of the ligands (EF score higher than 1). The average EF 1% was 15 ± 11. As described in the Benchmarking SSnet section, SSnet:BDB was also compared with the methods taken from Ericksen et al.,50 as shown in Table 5. SSnet:BDB has the best EF 1% for 4 out of 21 targets. The mean EF 1% is 16 with a standard deviation of 10. The SSnet:BDB mean score is similar to most of the methods shown; only HYBRID and FRED were found to deliver superior performance.

[Table 5, flattened during extraction: per-target EF 1% values for the compared methods on DUD-E targets (rows including ESR1, ESR2, ACE, GLCM, HDAC8, HIVINT, PDE5A, PTN1, ADA17, FA10, HIVPR, MMP13, TRY1, plus mean and standard deviation); the column assignments are not recoverable here.]
[Second flattened table, likely the consensus-scoring EF 1% comparison: per-target values for targets including CDK2, PLK1, SRC, FABP4, ESR1, ESR2, ACE, GLCM, HDAC8, HIVINT, PDE5A, PTN1, ADA17, FA10, HIVPR, MMP13, TRY1, plus mean and standard deviation; the column assignments are not recoverable here.]

Overall, SSnet is thus reliable both at identifying true actives and at enriching the top-scored ligands with true actives. This allows SSnet to be deployed quickly in any drug-discovery workflow.

Latent space for proteins

The validation of SSnet demonstrates its reliability. Furthermore, we see that SSnet learns beyond the biases that plague many ML approaches to PLI. Thus, understanding the underlying latent space that SSnet learns for proteins is of particular interest.

Grad-CAM highlights the important residues for ligand recognition. In all cases the ligand forms several types of PLIs, some of which were analyzed as shown in Table 7. Analyzing the structures highlighted by the Grad-CAM analysis, we observed that SSnet considers a weighted probability density of the binding sites present in a protein (Figure S10). It is important to note that the analyzed proteins include allosteric sites, an example of which is shown in Figure 8. The protein Prolyl-tRNA Synthetase from Plasmodium falciparum is in complex with glyburide. Hewitt et al.73 showed that glyburide binds to the allosteric site of Prolyl-tRNA Synthetase. SSnet is able to highlight the region of the protein where glyburide binds, which is not the known orthosteric binding site but an allosteric one. This information can be used by researchers to define a bounding box for any downstream docking application.

The binding of a ligand perturbs the secondary structure and can cause significant differences from the original unbound protein structure. We therefore investigated the ability of SSnet to predict ligand binding based on an unbound protein structure, focusing on two key questions:

• Can SSnet predict the same results using an unbound protein structure or a different conformation of the same protein?
• Can SSnet detect cryptic sites based on unbound protein structures?

To address the first question, Table 8 shows the results of binding prediction when different protein conformations of 9 randomly selected targets from the DUD-E test set were used. Each target was screened against 45,609 randomly selected ligands from the DUD-E dataset. The first and second columns denote the PDB IDs of the protein-ligand complex (PLC) and of a different conformation (DC) of the same protein, respectively. The presence of a ligand changes the secondary structure of the protein, and therefore we observe root-mean-squared distances (RMSD) ranging from 0.175 to 0.666 between the members of a PLC-DC pair. The prediction results for each PLC-DC pair in predicting actives and inactives are almost identical, with a maximum error of 0.03%. To analyze this further, we looked into the probability scores obtained for each ligand. Figure 9 shows the correlation of the SSnet scores for the PLC and DC structures, which suggests that SSnet is able to predict similar results for a given protein regardless of its specific conformation.

Some proteins have binding sites that are not easily detectable. These proteins, termed cryptic proteins, have binding sites that are present in protein-ligand complex crystal structures but not necessarily in the apo protein crystal structures.75
The change in conformation upon ligand binding is a dynamic phenomenon and has been widely reported in the literature.76

Figure 10: Heatmaps generated from unbound proteins for cryptic sites. The heatmap is a rainbow mapping with violet as the lowest and red as the highest value. The grey color shows the bound protein with its ligand.

Figure 10 shows cryptic sites for 3 different proteins taken from the CryptoSite set.77 Both the bound (proteins with heatmap) and unbound (grey) proteins are shown. The results of the Grad-CAM analysis of these proteins are highlighted, with blue having the lowest and red the highest influence on the PLI prediction. Furthermore, we notice that the SSnet score was not as strongly influenced by the residues as was the case for the proteins shown in Figure 8. This is visually observed in the higher abundance of intermediate colors (yellow-green) in the cryptic sites (Figure 10), compared to mostly red or blue in Figure 8. Figure 10a shows the unbound-bound pair of a cAMP-dependent protein kinase. The unbound structure of this protein (PDB ID 2GFC) has an activation loop that protrudes into the active site, occluding the binding pocket. SSnet is able to predict that the ligand will bind strongly to this protein and highlights a location close to the actual binding site on the unbound structure, even though in the latter the binding site is occluded. Retrieving such information is of critical importance, as these sites are practically impossible to detect using classical VS methods, which rely on the particular protein structure used for the calculation. Figure 10b shows the bound-unbound pair for the tyrosine kinase domain of the hepatocyte growth factor receptor C-MET from Homo sapiens (PDB IDs 3F82 and 1R1W, respectively). As the ligand binds at flexible regions in the protein, we observe large conformational changes that result in the formation of a pocket for ligand binding. This example shows that the predicted binding is a consequence of each individual chain, considered independently. Thus, the applicability of SSnet can be expanded to predicting PLIs that involve multiple protein chains.

Grad-CAM analysis, as well as the conformation independence of the PLI prediction, shows that SSnet learns crucial details of the input features required for predicting PLI. The torsion and curvature of the protein structure effectively describe the features required for PLI while remaining compact, as they are a vectorized format of the complex protein structure. The ability of SSnet to predict PLIs regardless of the crystal conformation showcases the versatile nature of the model. Grad-CAM analysis of the PLI prediction enhances the result of the screening, as it can be quickly paired with pre-existing docking tools that require a rigid bounding box for accurate posing and subsequent docking.

Conclusion

The study of PLI is an important field for progress in the pharmaceutical industry and potentially extends to many biological applications. The limitations of the existing tools for predicting PLI, however, have stalled progress in these fields. PLI computations suffer from long compute times, as accurate PLI prediction requires accounting for a large number of physicochemical properties. Furthermore, biases arise in ML-based PLI prediction tools due to imbalances in the representation of these physicochemical properties in the training dataset. SSnet outperforms several notable ML algorithms in terms of AUCROC when trained and tested on the human and C. elegans datasets.
Furthermore, SSnet outperforms the state-of-the-art ML approach GNN-CNN in terms of AUCROC and EF 1% for models trained on both the DUD-E and BDB datasets. This comparison holds true even when both models were tested on independent test sets. Moreover, SSnet:BDB performs better than, or at least equivalently to, the classical methods in terms of both AUCROC and EF, while being orders of magnitude faster than any of the traditional VS approaches. The SSnet model utilizes secondary structure information of the protein and, since it only processes CA atoms, it does not necessarily require high-resolution structural information. The analysis of bound-unbound proteins shows that SSnet predicts similar results even with different conformations of the protein, including cryptic sites. SSnet requires only a single conformation to predict whether a ligand is active, even if the protein-ligand complex has a different tertiary structure. Grad-CAM analysis not only addressed the validity of SSnet learning appropriate details from the input features, but also provides the user with an intuitive visualization of potential binding sites for PLI. SSnet can be coupled with traditional VS/docking algorithms as a pre-screen to filter ligands. Moreover, Grad-CAM analysis showed that SSnet is able to provide accurate predictions of ligand binding sites: active, allosteric, and cryptic sites. As most VS/docking algorithms require prior knowledge of the binding site, this information can be used to trim the ligand search space and determine the box placement. Such information is not retrievable by most other VS methods for PLI prediction. These features allow SSnet to be seamlessly integrated into existing VS workflows, where it can be used to cull large databases to a small size and to determine the bounding box for subsequent docking algorithms. The top-scoring docked poses can then be used directly in an experimental setup or for further analysis of the PLI using techniques like molecular dynamics. Our study suggests that end-to-end learning models based on the secondary structure of proteins have great potential in bioinformatics, not confined to protein-ligand prediction, and can be extended to various biological problems such as protein-protein, protein-DNA, and protein-RNA interactions. Inspired by the t-SNE results for the last layer of the protein embedding, we propose a possible latent space for proteins that encodes important information about protein bioactivity; further exploration could yield a metric to compare proteins based on their bioactivity. We leave these explorations of both the SSnet model and the underlying latent space for future work. For reproducibility and validation, all scripts developed in this work are provided through GitHub (https://github.com/ekraka/SSnet).

References

Regulation of Protein-Ligand Binding Affinity by Hydrogen Bond Pairing
Cation−π Interactions in Protein-ligand Binding: Theory and Data-mining reveal Different Roles for Lysine and Arginine
Optimized Hydrophobic Interactions and Hydrogen Bonding at the Target-Ligand Interface Leads the Pathways of Drug-Designing
Docking and Scoring in Virtual Screening for Drug Discovery: Methods and Applications
Computational Methods in Drug Discovery
Structure-based Virtual Screening: An Overview. Drug Discovery Today
Software for Molecular Docking: A Review
FRED and HYBRID docking performance on standardized datasets
Recent applications of deep learning and machine intelligence on in silico drug discovery: methods, tools and databases
From Machine Learning to Deep Learning: Advances in Scoring Functions for Protein-ligand Docking
Machine Learning and Artificial Neural Network Accelerated Computational Discoveries in Materials Science
Making Machine Learning a Useful Tool in the Accelerated Discovery of Transition Metal Complexes
Chemogenomics: An Emerging Strategy for Rapid Target and Drug Discovery
Protein-ligand Interaction Prediction: An Improved Chemogenomics Approach
Prediction of Drug-target Interaction Networks from the Integration of Chemical and Genomic Spaces
A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery
Compound-protein Interaction Prediction with End-to-end Learning of Neural Networks for Graphs and Sequences
DeepConv-DTI: Prediction of Drug-target Interactions via Deep Learning with Convolution on Protein Sequences
Exploiting Machine Learning for End-to-end Drug Discovery and Development
Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules
Open-source Platform to Benchmark Fingerprints for Ligand-based Virtual Screening
Extended-Connectivity Fingerprints
The Graph Neural Network Model
Description and Recognition of Regular and Distorted Secondary Structures in Proteins using the Automated Protein Structure Analysis Method
Visual Explanations from Deep Networks via Gradient-based Localization
Directory of Useful Decoys, Enhanced (DUD-E): Better Ligands and Decoys for Better Benchmarking
BindingDB in 2015: A Public Database for Medicinal Chemistry, Computational Chemistry and Systems Pharmacology
A Structural Classification of Proteins Database for the Investigation of Sequences and Structures
CATH - A Hierarchic Classification of Protein Domain Structures
Identification of Homology in Protein Structure Classification
Dictionary of Protein Secondary Structure: Pattern Recognition of Hydrogen-bonded and Geometrical Features
Knowledge-based Protein Secondary Structure Assignment
Identification of Structural Motifs from Protein Coordinate Data: Secondary Structure and First-level Supersecondary Structure
Protein Secondary Structure Assignment Revisited: A Detailed Analysis of Different Assignment Methods
A Consensus View of Fold Space: Combining SCOP, CATH, and the Dali Domain Dictionary
Exploring the Mechanism of Catalysis with the Unified Reaction Valley Approach (URVA) - A Review
Stereochemical Criteria for Polypeptides and Proteins. V. Conformation of a System of three linked Peptide Units
RDKit: Open-source Cheminformatics
Deep Neural Networks for Small Footprint Text-dependent Speaker Verification
X-vectors: Robust DNN Embeddings for Speaker Recognition
Virtual Screening Workflow Development Guided by the "Receiver Operating Characteristic" Curve Approach. Application to High-Throughput Docking on Metabotropic Glutamate Receptor Subtype 4
Predictiveness Curves in Virtual Screening
Improved Scoring of Ligand-Protein Interactions Using OWFEG Free Energy Grids
Evaluating Virtual Screening Methods: Good and Bad Metrics for the "Early Recognition" Problem
Evaluation of consensus scoring methods for AutoDock Vina, smina and idock
Machine Learning Consensus Scoring Improves Performance Across Targets in Structure-Based Virtual Screening
Improving Compound-protein Interaction Prediction by Building up Highly Credible Negative Samples
DrugBank: A Knowledgebase for Drugs, Drug Actions and Drug Targets
SuperTarget and Matador: Resources for Exploring Drug-target Relationships
Scalable Prediction of Compound-protein Interactions using Minwise Hashing
Protein-Ligand Scoring with Convolutional Neural Networks
Efficient Estimation of Word Representations in Vector Space
Supervised Prediction of Drug-target Interactions Using Bipartite Local Models
Gaussian Interaction Profile Kernels for Predicting Drug-target Interaction
Predicting Drug-target Interactions from Chemical and Genomic Kernels using Bayesian Matrix Factorization
AutoDock Vina: Improving the Speed and Accuracy of Docking with a New Scoring Function, Efficient Optimization, and Multithreading
Lessons Learned in Empirical Scoring with Smina from the CSAR 2011 Benchmarking Exercise
Predicting or Pretending: Artificial Intelligence for Protein-Ligand Interactions Lack of Sufficiently Large and Unbiased Datasets
Need of Bias Control: Evaluating Chemical Data for Machine Learning in Structure-Based Virtual Screening
Comparison of Ligand- and Structure-Based Virtual Screening on the DUD Data Set
Molecular graph convolutions: moving beyond fingerprints
Comprehensive Comparison of Ligand-Based Virtual Screening Tools Against the DUD Data set Reveals Limitations of Current 3D Methods
Humblet, C. Comparison of Several Molecular Docking Programs: Pose Prediction and Virtual Screening Accuracy
Docking Screens: Right for the Right Reasons?
Recognizing Pitfalls in Virtual Screening: A Critical Review
Recommendations for evaluation of computational methods
What do we know and when do we know it?
Enyedy, I. The statistics of virtual screening and lead optimization
Biochemical and Structural Characterization of Selective Allosteric Inhibitors of the Plasmodium falciparum Drug Target, Prolyl-tRNA-synthetase
The Catalytic Site Atlas 2.0: cataloging catalytic sites and residues identified in enzymes
Expanding the Druggable Proteome by Characterization and Prediction of Cryptic Binding Sites
The Structural Basis of Allosteric Regulation in Proteins
Exploring the Structural Origins of Cryptic Sites on Proteins

Acknowledgments

The authors thank SMU for generous supercomputer resources. The authors would also like to thank Saeedi Mohammadi for the discussion of SSnet and for providing helpful tips. This work was financially supported by National Science Foundation Grant CHE 1464906.