EMBER: Multi-label prediction of kinase-substrate phosphorylation events through deep learning EMBER: Multi-label prediction of kinase-substrate phosphorylation events through deep learning Kathryn E. Kirchoff 1 and Shawn M. Gomez 2,3, 1 Department of Computer Science, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA 2 Department of Pharmacology, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA 3 Joint Department of Biomedical Engineering at the University of North Carolina at Chapel Hill and North Carolina State University, Chapel Hill, NC 27599, USA Abstract Kinase-catalyzed phosphorylation of proteins forms the back- bone of signal transduction within the cell, enabling the coor- dination of numerous processes such as the cell cycle, apop- tosis, and differentiation. While on the order of 105 phos- phorylation events have been described, we know the specific kinase performing these functions for less than 5% of cases. The ability to predict which kinases initiate specific individual phosphorylation events has the potential to greatly enhance the design of downstream experimental studies, while simultane- ously creating a preliminary map of the broader phosphoryla- tion network that controls cellular signaling. To this end, we de- scribe EMBER, a deep learning method that integrates kinase- phylogeny information and motif-dissimilarity information into a multi-label classification model for the prediction of kinase- motif phosphorylation events. Unlike previous deep learning methods that perform single-label classification, we restate the task of kinase-motif phosphorylation prediction as a multi-label problem, allowing us to train a single unified model rather than a separate model for each of the 134 kinase families. We utilize a Siamese network to generate novel vector representations, or an embedding, of motif sequences, and we compare our novel em- bedding to a previously proposed peptide embedding. Our mo- tif vector representations are used, along with one-hot encoded motif sequences, as input to a classification network while also leveraging kinase phylogenetic relationships into our model via a kinase phylogeny-weighted loss function. Results suggest that this approach holds significant promise for improving our map of phosphorylation relations that underlie kinome signaling. Availability: https://github.com/gomezlab/EMBER Correspondence: smgomez@unc.edu Introduction Phosphorylation is the most abundant post-translational mod- ification of protein structure, affecting from one to two-thirds of eukaryotic proteins. In humans, the number of kinases catalyzing this reaction hints at its importance, with kinases being one of the largest gene families with roughly 520 mem- bers distributed among 134 families (1–3). During phospho- rylation, a kinase facilitates the addition of a phosphate group at serine, threonine, tyrosine, or histidine residues; though other sites exist. Phosphorylation of a substrate at any of these residues occurs within the context of specific consen- sus phosphorylation sequences, which we refer to here as “motifs”. Additional substrate binding sequences within the kinase or substrate, as well as protein scaffolds that facili- tate structural orientation and downstream catalysis of the re- action, modify the efficacy of motif phosphorylation. Typi- cally, the net effect of kinase phosphorylation is to switch the downstream target into an “on” or “off” state, enabling the transmission of information throughout the cell. Kinase ac- tivity touches nearly all aspects of cellular behavior, and the alteration of kinase behavior underlies many diseases while simultaneously establishing the basis for therapeutic inter- ventions (4–11). Although the importance of phosphorylation in cell informa- tion processing and its dysregulation as a driver of disease is well-recognized, the map of kinase-motif phosphorylation in- teractions is mostly unknown. So, while upwards of 100,000 motifs are known to be phosphorylated, less than 5% of these have an associated kinase identified as the catalyzing agent (12). This knowledge gap provides a considerable impetus for the development of methods aimed at predicting kinase- motif phosphorylation events that, at a minimum, could help focus experimental efforts. As a result, a number of computational tools have been devel- oped, spanning a myriad of methodological approaches in- cluding random forests (13), support vector machines (14), logistic regression (15), and Bayesian decision theory (16). Advances in deep learning have similarly spawned new ap- proaches, with two methods recently described. The first, MusiteDeep, utilizes a convolutional neural network (CNN) with attention to generate single predictions (17). The sec- ond deep learning method, DeepPhos, exploits densely con- nected CNN (DC-CNN) blocks for its predictions (18). Both of these approaches train individual models for each kinase family, requiring a separate model for each of the 134 ki- nase families. In addition to the practical challenge of train- ing many individual models, a further disadvantage of these two deep learning approaches is the potential lost opportunity gains from transfer learning, as models trained independently do not directly incorporate knowledge of motif phosphoryla- tion by kinases from different kinase families. Here, we describe, EMBER (Embedding-based multi-label prediction of phosphorylation events), a deep learning ap- proach for predicting multi-label kinase-motif phosphoryla- tion relationships. In our approach, we utilize a Siamese neu- Kirchoff et al. | bioR‰iv | February 10, 2021 | 1–10 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2020.02.04.934216doi: bioRxiv preprint https://doi.org/10.1101/2020.02.04.934216 http://creativecommons.org/licenses/by/4.0/ ral network, modified for our multi-label prediction task, to generate a high-dimensional embedding of motif vectors. We further utilize one-hot encoded motif sequences. These two representations are leveraged together as a dual input into our classifier, improving learning and prediction. We also find that our Siamese embedding generally outperforms a previ- ously proposed protein embedding, ProtVec, which is trained on significantly more data (19). We further integrate infor- mation regarding evolutionary relationships between kinases into our classification network loss function, informing pre- dictions in light of the sparsity associated with these data, and we find that this information improves prediction accuracy. As EMBER utilizes transfer learning across families, we ex- pect that model accuracy will improve more so than other deep learning approaches as more data describing kinase- substrate relationships are collected. Together, these results suggest that EMBER holds significant promise for improving our map of phosphorylation relationships that underlie the ki- nome and broader cellular signaling. Methods Kinase-motif interaction data. As documented kinase- motif interactions are sparse in relation to the total number of known phosphorylation events, we attempted to maximize the number of examples of such interactions for training. To do this, we integrated multiple datasets describing kinase- motif relationships across multiple vertebrate species. Our data was sourced from PhosphoSitePlus, PhosphoNetworks, and Phospho.ELM, all of which are collections of annotated and experimentally verified kinase-motif relationships (20– 22). From these data sources, non-redundant kinase-motif relationships were extracted and integrated into a single set of interactions. We used the standard single-letter amino acid code for representation of amino acids, with an additional ’X’ symbol to represent an ambiguous amino acid. We defined our motifs as peptides composed of a central phosphorylat- able amino acid — either serine (S), threonine (T), or tyrosine (Y) — flanked by seven amino acids on either side. There- fore, each motif is a 15-amino acid peptide or “15-mer”. As a phosphorylatable amino acid may not have seven flanking amino acids to either side if it is located near the end of a substrate sequence, we used ‘-’ to represent the absence of an amino acid in order to maintain a consistent motif length of 15 amino acids across all instances. Deep learning models are known to generally require large amounts of examples per class in order to achieve adequate performance. Our original dataset was considerably imbal- anced in that all positive labels (verified kinase-motif inter- actions) had a very low positive-to-negative label ratio. For example, the TLK kinase family only has nine positive la- bels (verified TLK-motif interactions) and more than 10,000 negative labels (lack of evidence for a TLK-motif interac- tion). To maximize our ability to learn from our data, we utilized only kinases that had a relatively large number of ex- perimentally validated motif interactions, reducing the num- ber of kinase-motif relationships to be used as input for our model. This filtering also served to considerably mitigate Table 1. Summary of our kinase-motif phosphorylation dataset. Shown are the number of kinases per family along with the number of motifs phosphorylated by each kinase family in the training and test sets. Family Kinases Training motifs Testing motifs Akt 3 382 63 CDK 21 752 116 CK2 2 775 97 MAPK 14 1275 187 PIKK 7 497 63 PKA 5 1235 231 PKC 10 1497 251 Src 11 869 99 the label imbalances in our data. From the 7531 remain- ing motifs, we set aside 853 motifs for the independent test set, leaving 6678 for the training set. Then we removed any sequences from the training set that met a 60% similarity threshold with any sequence in the test set, based on Ham- ming distance scores. This process removed 229 motifs from the training set. Kinase labels were then grouped into re- spective kinase families contingent on data collected from the RegPhos (1) database, resulting in eight kinase families. Our resulting data set is comprised of 7302 phosphorylatable mo- tifs and their reaction-associated kinase families (Table 1). Furthermore, our data are multi-label in that a single motif may be phosphorylated by multiple kinases, including those from other families, resulting in a data point with potentially multiple positive labels. Motif embeddings. ProtVec embedding. We chose to investigate two methods to achieve our motif embedding. First, we considered ProtVec, a learned embedding of amino acids, originally intended for protein function classification (19). ProtVec is the result of a Word2Vec algorithm trained on a corpus of 546,790 se- quences obtained from Swiss-Prot, which were broken up into 3 amino acid-long subsequences, or "3-grams". As a result of this approach, ProtVec provides a 100-dimensional distributed representation, analogous to a natural language "word embedding", that establishes coordinates for each pos- sible amino acid 3-gram. This results in a 9048 ◊ 100 matrix of coordinates, one 100-dimensional coordinate for each 3- gram. In a preliminary investigation, we found that averaging the ProtVec coordinates resulted in a higher-quality embed- ding compared to the original ProtVec coordinates. Compar- isons between the two embeddings are provided in Supple- mental Material. We averaged the embedding coordinates, per amino acid, in the following fashion: We define T = [AAA,ALA,LAA, ...,unknown], the vector of 9048 amino acid 3-grams provided by the authors of ProtVec. We also define A = [A,L,S, ...,-], the alphabet comprising the 22 amino acid symbols. We equate “-” to the “unknown” character defined by ProtVec. Then, we compute the matrix of averaged ProtVec coordinates, C(avg), which will be 22 ◊ 100 dimensions: 2 | bioR‰iv Kirchoff et al. | EMBER: kinase-substrate multi-label prediction .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2020.02.04.934216doi: bioRxiv preprint https://doi.org/10.1101/2020.02.04.934216 http://creativecommons.org/licenses/by/4.0/ C(avg) = S WWWWW U c0,0 c0,1 c0,2 . . . c0,99 c1,0 c1,1 c1,2 . . . c1,99 c2,2 c2,1 c2,2 . . . c2,99 ... . . . ... c21,0 c21,1 c21,2 . . . c21,99 T XXXXX V (1) We solve for each element of C(avg) based on the values of C(raw), the original (9048 x 100) ProtVec matrix: c (avg) ij = 1 |Qi| ÿ kœQi c (raw) kj (2) where c(avg)ij belongs to C (avg), c(raw)ij belongs to C (raw), and Qi = {q : Ai œ Tq } (3) Note that the original ProtVec matrix was 9048 ◊ 100 dimen- sions, thus each j corresponds to the index of one of the 100 original ProtVec dimensions along the second tensor dimen- sion. Siamese embedding. We aimed to produce a final model, composed of an embedding technique and a classification method, that was specific to our motif dataset. To this end, we implemented a Siamese network to provide a novel learned representation of our motifs (Figure 1). The Siamese net- work is composed of two identical "twin" networks, deemed as such due to their identical hyperparameters as well as their identical learned weights and biases (23). During training, each twin network receives a separate motif sequence that is represented as a one-hot encoding, denoted either as a or b in Figure 1. Motifs are processed through the network until reaching the final fully-connected layers, ha and hb, which provide the resultant embeddings for the original motif se- quences. Next, the layers are joined by calculating the pair- wise Euclidean distance, Dw , between ha and hb. Dw can be interpreted as the overall dissimilarity between the origi- nal motif sequences, a and b. The loss function operates on the final layer, striving to embed relatively more similar data points closer to each other, and relatively more different data points farther away from each other. In this way, the network amplifies the similarities and differences between motifs, and it translates such relationships into a semantically meaningful vector representation for each motif in the embedding space. We utilized a contrastive loss as described in Hadsell et al. (24), but we sought to modify the function to account for the multi-label aspect of our task. The canonical Siamese loss between a pair of samples, a and b, is defined as L(a, b, Y ) = (1 ≠ Y ) 1 2 (Dw)2 + (Y ) 1 2 [max(0, m ≠ Dw)]2, (4) where Dw is the Euclidean distance between the outputs of the embedding layer, m is the margin which is a hyperparam- eter defined prior to training, and Y œ {0, 1}. The value of Y is determined by the label of each data point in the pair. If a Fig. 1. Siamese network architecture, composed of twin convolutional neural net- works (CNNs). The twin networks are joined at the final layer. a and b represent a pair of motifs from the training set, while ha and hb represent the respective hidden layers output by either CNN. The difference between the hidden layers is calculated to obtain the distance layer, Dw . Dw is input into the loss along with Y , a variable indicating the dissimilarity, regarding kinase interactions, between a and b. After training is complete, the "twin" architecture is no longer necessary; each motif is input into a single twin and the output of the embedding layer gives the resultant representation of the given motif. pair of samples has identical labels, they are declared “same” (Y = 0). Conversely, if a pair of samples has different la- bels, they are declared “different” (Y = 1). This definition relies on the assumption that each sample may only have one true label. To adapt the original Siamese loss to account for the multi-label aspect of our task, we replaced the discrete variable Y with a continuous variable, namely, the Jaccard distance between kinase-label set pairs. Thus, our modified loss function is defined as LJ (a, b, Y ) = (1≠Ja,b) 1 2 (Dw)2 +(Ja,b) 1 2 [max(0, m≠Dw)]2, (5) where Ja,b is shorthand for J (Ka, Kb), which is the Jaccard distance between the kinase-label set Ka and the kinase-label set Kb, associated with motif sample a and motif sample b, respectively. Formally, J (Ka, Kb) = 1 ≠ |Ka fl Kb| |Ka fi Kb| (6) and consequently, 0 Æ J (Ka, Kb) Æ 1. (7) In this way we have defined a continuous metric by which to compare a pair of motifs, rather than the usual “0” or “1” distinction. The Siamese network was trained for 10,000 iterations on the training set, precluding the data points in the independent test set. When composing a mini-batch, we alternated between "similar" and "dissimilar" motif pairs during training. Simi- lar pairs were defined as motifs whose J (Ka, Kb) > 0.5, and dissimilar pairs were defined as motifs whose J (Ka, Kb) Æ 0.5. After training, we must produce the final embedding space to be used in training of our subsequent classification Kirchoff et al. | EMBER: kinase-substrate multi-label prediction bioR‰iv | 3 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2020.02.04.934216doi: bioRxiv preprint https://doi.org/10.1101/2020.02.04.934216 http://creativecommons.org/licenses/by/4.0/ Fig. 2. EMBER model architecture. Here, the previously-trained Siamese network is colored pink, and the classifier architecture is colored orange. The 15 amino acid-length motif, a, is converted into a one-hot encoded matrix, V . The one-hot encoded matrix is then fed into a single twin from the Siamese network. The 100-d embedding, e, is output by the Siamese network. Here, we reduce e to a 2-dimensional space for illustrative purposes using UMAP. Then, e is fed into a multilayer perceptron (MLP) alongside V , which is fed into a convolutional neural network (CNN). Then, the last layers of the separate networks are concatenated, followed by a series of fully-connected layers. The final output is a vector, k, of length eight, where each value corresponds to the probability that the motif a was phosphorylated by one of the kinase families indicated in k. network. To obtain the final embedding, we input each motif into a single arbitrary twin of the original network (because both twins learn the same weights and biases), producing a high-dimensional (100-dimensional) vector representation of the original motif sequence. The resultant motif embedding effected by the single Siamese twin is further discussed in the Results section. We used k-nearest neighbors (k-NN) classi- fication on each family to quantitatively compare the predic- tive capabilities of ProtVec and Siamese embeddings in the coordinate-only space. For our k-NN computation, we used a k of 85. Predictive model framework. EMBER architecture. An overview of the architecture of EM- BER is shown in Figure 2. EMBER takes as input raw motif sequences and the coordinates of each respective motif in the embedding space. We use one-hot encoded motifs as the sec- ond type of input into our model. Each motif sequence is represented by a 15 ◊ 22 matrix. In addition, we utilize the embedding provided by our Siamese network, which creates a latent space of dimensions m ◊ 90 where m is the number of motifs. The inputs into our classifier network, one-hot sequences and embeddings, are fed through a convolutional neural network (CNN) and a multilayer perceptron (MLP), respectively. The outputs of the two networks are then concatenated, and the concatenated layer is fed through a series of fully-connected layers (a MLP), followed by a sigmoid activation function. We performed 5-fold cross validation to assess the accuracy of our model when trained on different training-validation folds. We averaged the performance on the independent test set across the five folds to compute our final performance on the classification task. Evaluation metrics. In order to quantify the performance of our models, we computed the area under the receiver oper- ating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). These metrics were eval- uated per kinase family. We also show the micro-average and macro-average for both AUROC and AUPRC. We define � = {⁄j : j = 0, ..., q} as the set of all labels. The micro- average, Emicro, aggregates the label-wise contributions of each class: Emicro = E( qÿ ⁄=0 tp⁄, qÿ ⁄=0 tn⁄, qÿ ⁄=0 f p⁄, qÿ ⁄=0 f n⁄), (8) where E is an evaluation metric, in our case, either AUROC or AUPRC. Alternatively, the macro-average, Emacro, takes into account the score for each respective class and averages those scores together, thus treating all classes equally: Emacro = 1 q qÿ ⁄=0 E(tp⁄, tn⁄, f p⁄, f n⁄), (9) where E is once again an evaluation metric, in our case, ei- ther AUROC or AUPRC. Both the Emacro and the Emicro are calculated based on tp⁄, tn⁄, f p⁄, and f n⁄, which are, respectively, the number of true positives, the number of true negatives, the number of false positives, and the number of false negatives of label ⁄. 4 | bioR‰iv Kirchoff et al. | EMBER: kinase-substrate multi-label prediction .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2020.02.04.934216doi: bioRxiv preprint https://doi.org/10.1101/2020.02.04.934216 http://creativecommons.org/licenses/by/4.0/ Kinase phylogenetic distances. We sought to leverage the phylogenetic relationships between kinases to improve predictions of kinase-motif interactions. Specifically, we considered the dissimilarity of a pair of kinase families in conjunction with the dissimilarity of the two respective groups of motifs that either kinase family phosphorylates (i.e., “kinase-family dissimilarity” vs. “motif-group dissim- ilarity”). Note that the terms “distance” and “dissimilarity” are interchangeable. As the phylogenetic distances given by Manning et al. (2) do not provide distances between typical and atypical kinase families, we established a proxy phylo- genetic distance that allows us to define distances between these two families. We define this proxy phylogenetic dis- tance through the Levenshtein edit distance, Lev(ka, kb), be- tween kinase-domain sequences. Kinase-domain sequences are the specific subsequences of kinases that are directly in- volved in phosphorylation. These kinase-domain sequences were obtained from an online source provided by Manning et al. (2). Distances between kinase domain sequences was calculated by performing local alignment, utilizing the BLO- SUM62 substitution matrix to weight indels and substitu- tions. To calculate overall kinase-family dissimilarity, we took the average of the Levenshtein edit distances between each kinase domain pair, per family, d(fa, fb) = q kaœfa q kbœfb Lev(ka, kb) |fa| · |fb| (10) where d(fa, fb) is the dissimilarity metric (distance) between kinase family a and kinase family b. ka is the kinase-domain sequence of a kinase belonging to family a, kb is the kinase- domain sequence of a kinase belonging to family b, and the Levenshtein distance between kinase domain ka and kinase domain kb is determined by Lev(ka, kb). This formula was applied per kinase family pair and stored in an a ◊ b kinase- family dissimilarity matrix. We will refer to this proxy metric for evolutionary dissimilarity between kinase families as the “phylogenetic distance” between kinase families. Kinase-family dissimilarity vs. motif-group dissimilarity. For our (kinase-family dissimilarity)-(motif-group dissimilarity) correlation, we defined motif-group dissimilarity in the same manner as kinase-family dissimilarity, finding the Leven- shtein distance between motifs based on local alignment us- ing BLOSUM62. Then, we sought to find the correlation between kinase-family dissimilarity and motif-group dissim- ilarity. Therefore, calculation of motif-group dissimilarity, per kinase family pair, was defined identically as in Equation 10, but based on the motifs specific to each kinase family, resulting in an a ◊ b motif-group dissimilarity matrix. Kinase phylogenetic loss. To leverage evolutionary rela- tionships between kinase families into our predictions, we weighted the original binary cross entropy (BCE) loss by a kinase phylogenetic metric. Specifically, our weighted BCE loss per minibatch is defined as: Fig. 3. Heatmap matrix depicting pairwise kinase-domain distances. Levenshtein distances were normalized, with the yellow end of the color bar representing far dis- tances (less similar) and the pink end representing close distances (more similar). P BCE(ŷ, y) = ≠ 1 n nÿ i P Ti yi log(ŷi), (11) where n is the size of the mini batch, yi is the one-hot actual label vector for sample i, ŷi is the predicted label vector for sample i, and Pi is the phylogenetic weight vector for sample i given by Pi = # w0,i, ..., w|K|,i $T , (12) with wk,i being the average phylogenetic weight scalar of label k for sample i: wk,i = 1 |Li| ÿ jœLi Fk,j , (13) and Fk,j is the vector of family weights of label k. Finally, Li is the set of indices corresponding to positive labels for sample i Li = {i œ [0, ..., m ≠ 1] : yi = 1} , (14) where m is the length of the one-hot true label vector for sample i. Results Correlation between kinase phylogenetic dissimilarity and phosphorylated motif dissimilarity. We sought to il- luminate the relationship between kinase-family dissimilar- ity and phosphorylated motif-group dissimilarity described in the Methods section. That is, we wanted to determine if “similar” kinases tend to phosphorylate “similar” motifs based on some quantitative metric. To this end, we calcu- lated the correlation between average kinase-family dissimi- larities and motif-group dissimilarities based on normalized pairwise alignment scores. From this, we found a Pearson correlation of 0.667, indicating a moderate positive relation- ship between kinase dissimilarity and that of their respective Kirchoff et al. | EMBER: kinase-substrate multi-label prediction bioR‰iv | 5 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2020.02.04.934216doi: bioRxiv preprint https://doi.org/10.1101/2020.02.04.934216 http://creativecommons.org/licenses/by/4.0/ phosphorylated motifs. While moderate, this correlation be- tween kinase dissimilarity and motif dissimilarity suggests a potential signal in the phylogenetic relationships that could be leveraged to improve predictions. Using our normalized distances as a proxy for phylogenetic distance (see Methods), the dissimilarity between kinases is displayed as a heatmap in Figure 3. The Akt and PKC fam- ily have the greatest similarity (lowest dissimilarity) of all pairwise comparisons, with PKA-Akt and MAPK-CDK fol- lowing as the next most similar family pairs. Together, these results provide motivation to incorporate both motif dissim- ilarity and kinase relatedness into the predictive model, as achieved through our custom phylogenetic loss function de- scribed in Methods. The effects of this approach are de- scribed later in Results. Motif embedding via Siamese network. We sought to develop a novel learned representation of motifs using a Siamese neural network. Siamese networks were first in- troduced in the early 1990s as a method to solve signature verification, posed as an image-to-image matching problem (23). Siamese networks perform metric learning by exploit- ing the dissimilarity between a pair of data points. Training a Siamese network effects a function with the goal of produc- ing a meaningful embedding, capturing semantic similarity in the form of a distance metric. We hypothesized that incor- porating high-dimensional vector representations of motifs (i.e., an embedding) into the input of a classification network would provide more predictive power than methods that do not utilize such information. In our Siamese model, we opted to use convolutional layers as described in Methods. We per- formed k-NN on both the ProtVec and Siamese embeddings of motifs and found that the Siamese embedding produced better predictions, on average, than the ProtVec embedding (see Table 2). More specifically, the Siamese embedding resulted in a macro-average AUROC of 0.903 compared to ProtVec’s 0.898 and a micro-average AUROC of 0.924 com- pared to ProtVec’s 0.902. Likewise, the Siamese embedding had better AUPRC, with a macro-average AUPRC of 0.692 compared to ProtVec’s 0.670 and a micro-average AUPRC of 0.747 compared to ProtVec’s 0.643. Furthermore, we cal- culated the silhouette scores of both embeddings and found our Siamese embedding to have a significantly better mean silhouette score of 0.114 compared to ProtVec’s 0.005. We performed dimensionality reduction for visualization of the Siamese embeddings using uniform manifold approxi- mation and projection (UMAP) (25). For our UMAP im- plementation, we used 200 neighbors, a minimum distance of 0.1, and Euclidean distance for our metric. The resulting 2-dimensional UMAP motif embeddings derived from the Siamese network are shown in Figure 4. As can be seen, the motifs phosphorylated by a given kinase family have a dis- tinctive distribution in the embedding space, with some distri- butions being highly unique, and with some significant over- lap between certain families. More specifically, our Siamese embedding shows that motifs phosphorylated by either PKC, PKA, or Akt appear to occupy a similar latent space. Sim- ilarly, motifs phosphorylated by either CDK or MAPK also Fig. 4. Siamese embedding of motifs. Each point represents one of the 7302 mo- tifs, and each panel displays kinase family-specific phosphorylation patterns. Each colored point corresponds to a motif in the test set phosphorylated by a member of the specified kinase family. Highlighted points are slightly enlarged in size to enhance readability. 6 | bioR‰iv Kirchoff et al. | EMBER: kinase-substrate multi-label prediction .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2020.02.04.934216doi: bioRxiv preprint https://doi.org/10.1101/2020.02.04.934216 http://creativecommons.org/licenses/by/4.0/ Table 2. Area under the receiver operating characteristic curve (AUROC) and area under the precision recall curve (AUPRC) scores on independent test set prediction, given by k -NN performed on the ProtVec and Siamese embedding. Precision Recall Family ProtVec Siamese ProtVec Siamese Akt 0.908 0.897 0.462 0.513 CDK 0.889 0.892 0.511 0.538 CK2 0.906 0.893 0.665 0.714 MAPK 0.907 0.908 0.739 0.720 PIKK 0.845 0.900 0.579 0.663 PKA 0.865 0.852 0.716 0.659 PKC 0.865 0.885 0.697 0.741 Src 0.998 0.995 0.993 0.991 macro-average 0.898 0.903 0.670 0.692 micro-average 0.902 0.924 0.643 0.747 occupy a similar space. These observations mirror the phylo- genetic relationships shown in Figure 3, where the MAPK and CDK families have a relatively short mean evolution- ary distance between them, and the PKC-PKA distance, even shorter still. In addition to these overlapping families, we also observe that Src-phosphorylated motifs form a distinct cluster. This is likely driven by the fact that Src is the only tyrosine ki- nase family among the eight kinase families we investigated, with its motifs invariably having a tyrosine (Y) at the eighth position in the 15-amino acid sequence, compared to the other 7 families whose motifs have either a serine (S) or a (T) in this position. This effects a significant sequence dis- crepancy between Src-phosphorylated motifs and remaining motifs. The fact that Src-phosphorylated motifs cluster so precisely serves as a sanity check that our Siamese embed- ding is capturing sequence (dis)similarity information despite being trained through comparison of kinase-motif phospho- rylation events in lieu of motif sequence comparisons. We note that the embedding produced by our Siamese network is quite qualitatively similar to the ProtVec embedding in terms of these kinase-label clusters indicated in the UMAP projec- tions. The UMAP projections of the ProtVec embeddings are included in Supplementary Material. Prediction of phosphorylation events. Following train- ing of EMBER on both motif sequences and motif vector representations as input, we conducted an ablation test in which we removed the motif vector representation (or coordi- nate) input along with its respective MLP; this was achieved by applying a dropout rate of 1.00 on the final layer of the coordinate-associated MLP. This ablation test allowed us to observe how our novel motif sequence-coordinate model compares to a canonical deep learning model whose input consists solely of one-hot encoded motif sequences (such as in the methods utilized by Wang et al. (17) and Luo et al. (18)). We also compared EMBER trained with the standard BCE loss to EMBER trained with our kinase phylogenetic loss. All predictive models, as described in Table 3, were trained on identical training-validation splits and evaluated on the same independent test set. Fig. 5. Confusion matrix for EMBER predictions on the test set. The numbers inside each box represent the raw number of predictions per box. The color scale is based on the ratio of predictions (in the corresponding box) to total predictions, per label. A lighter color corresponds to a larger ratio of predictions to total predictions. Comparisons between the predictive capability of the mod- els described here are quantified by AUROC and AUPRC, and these metrics are presented for each of the three mod- els in Table 3. As indicated by Table 3, EMBER, utilizing both sequence and coordinate information, outperforms the canonical sequence model in both AUROC and AUPRC. In addition, integration of phylogenetic information into the loss provides a generally small but consistent additional boost in performance, showing the best overall results out of the three models for AUROC and AUPRC. Individual performance metric curves for each kinase label, produced by EMBER trained via the phylogenetic loss, are shown in Figure 6. A confusion matrix providing greater detail and illustrating the relative effectiveness of our model for prediction of differ- ent kinase families is shown in Figure 5. In order to compute the confusion matrix, we set a prediction threshold of 0.5, declaring any prediction above 0.5 as "positive" and any pre- diction equal to or less than 0.5 as "negative". As indicated by the confusion matrix, the model often confounds motifs that are phosphorylated by closely related kinase families, for ex- ample, MAPK and CDK. This is presumably due to the close phylogenetic relationship between MAPK and CDK, as in- dicated by their relatively low phylogenetic distance of 0.75 (Figure 3). Furthermore, our Siamese network embeds mo- tifs of these respective families into the same relative space, as shown in Figure 4, further illustrating the confounding na- ture of these motifs. A similar trend is found for motifs phos- phorylated by PKC, PKA, and Akt. This trio is also shown to be closely related as indicated by the correlations in Figure 3 and the embeddings in Figure 4. Comparison to existing methods. We sought to compare EMBER’s performance to the two existing deep learning methods, MusiteDeep and DeepPhos, which adopt single- Kirchoff et al. | EMBER: kinase-substrate multi-label prediction bioR‰iv | 7 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2020.02.04.934216doi: bioRxiv preprint https://doi.org/10.1101/2020.02.04.934216 http://creativecommons.org/licenses/by/4.0/ label models. However, this is not a straight-forward com- parison because EMBER was trained on sequences 15 amino acids in length while MusiteDeep and DeepPhos were trained on sequences of 33 and 51 amino acids in length, respec- tively. Thus, we must elongate our 15-mers to lengths of 33 and 51 in order for MusiteDeep and DeepPhos to accept those sequences into their architectures as input. To accom- plish this, we queried the Uniprot database to find complete protein sequences of which our test set motifs were subse- quences. For instances in which a motif was a subsequence of multiple proteins we chose a protein at random from the set. By referencing the original complete protein sequence we were able to elongate our motifs by adding nine (in the case of MusiteDeep) or 18 (in the case of DeepPhos) amino acids to each flank of the original 15-mer motif. This re- sulted in a test set of 33-mers for MusiteDeep and 51-mers for DeepPhos. We note that of the eight kinase families for which our model produces predictions, DeepPhos has functioning models for only four of the families (CDK, CK2, MAPK, and PKC), and MusiteDeep has models for only five of the families (CDK, CK2, MAPK, PKA, and PKC). We show AUROC and AUPRC results per kinase label from each of the three meth- ods in Figure 6. EMBER outperforms MusiteDeep and Deep- Phos on all four averaged metrics, indicating that our multi- label approach may be better equipped to solve the problem of kinase-motif prediction compared to the single-label ap- proaches. Discussion Illuminating the map of kinase-substrate interactions has the potential to enhance our understanding of basic cellular sig- naling as well as drive health applications, for example, by facilitating the development of novel kinase inhibitor-based therapies that disrupt kinase signaling pathways. Here, we have presented a deep learning-based approach that aims to predict which substrates are likely to be phosphorylated by a specific kinase family. In particular, our multi-label ap- proach establishes a unified model that utilizes all available kinase-motif data to learn broader structures within the data and improve predictions across all kinase families in tandem. This approach avoids challenges in hyperparameter tuning in- herent in the development of an individual model for each kinase. We believe that this approach will enable continuing improvement in predictions, as newly generated data describ- ing any kinase-motif phosphorylation event can assist in im- proving predictions for all kinases. That is, a kinase-motif interaction discovered for PKA will improve the predictions not just for PKA, but also for Akt, PKC, MAPK, etc. through the transfer learning capabilities inherent in our multi-label model. We showed that incorporation of a learned vector repre- sentation of motifs, namely the motifs’ coordinates in the Siamese embedding space, serves to improve performance over a model that utilizes only one-hot encoded motif se- quences as input. Not only did the Siamese embedding im- prove prediction of phosphorylation events through a neu- Fig. 6. AUROC and AUPRC results achieved on the independent test set by Deep- Phos, MusiteDeep, and EMBER. The AUROC and AUPRC of each kinase family label is shown in the respective legends. 8 | bioR‰iv Kirchoff et al. | EMBER: kinase-substrate multi-label prediction .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2020.02.04.934216doi: bioRxiv preprint https://doi.org/10.1101/2020.02.04.934216 http://creativecommons.org/licenses/by/4.0/ Table 3. AUROC and AUPRC results achieved on the independent test set across deep learning classification models. The AUROC and AUPRC are presented per kinase family for each model. From left to right, we include results for the ablated sequence-only CNN, EMBER trained using a canonical BCE loss, and EMBER trained using the kinase phylogeny-weighted loss as described in Methods. AUROC AUPRC Family Seq-CNN EMBER (BCE) EMBER (PBCE) Seq-CNN EMBER (BCE) EMBER (PBCE) Akt 0.844 0.865 0.889 0.377 0.379 0.483 CDK 0.882 0.891 0.902 0.552 0.582 0.600 CK2 0.902 0.915 0.923 0.681 0.750 0.755 MAPK 0.898 0.908 0.907 0.704 0.729 0.730 PIKK 0.850 0.864 0.889 0.534 0.557 0.610 PKA 0.838 0.856 0.867 0.657 0.689 0.718 PKC 0.857 0.877 0.888 0.704 0.732 0.763 Src 0.997 0.995 0.996 0.993 0.993 0.994 macro-average 0.884 0.897 0.908 0.650 0.676 0.706 micro-average 0.909 0.921 0.928 0.715 0.745 0.765 ral network architecture, but it also outperformed ProtVec, a previously developed embedding, in a coordinate-based k- NN task. This improvement over ProtVec was in spite of the fact that the Siamese network utilized less than 7,000 training sequences of 15 amino acids in length compared to ProtVec’s 500,000 sequences of approximately 300 amino acids in average length. The Siamese embedding was further generated through direct comparison of kinase-motif phos- phorylation events rather than simply the sequence-derived data used by ProtVec. Furthermore, ProtVec is a generalized protein embedding while the Siamese embedding described here has the potential to be customized. For example, the use of the Jaccard distance in the Siamese loss allows the network to be trained on any number of multi-label datasets such acetylation, methylation, and carbonylation reactions. We also found that there is a small though meaningful rela- tionship between the evolutionary distance between kinases and the motifs they phosphorylate, supporting the concept that closely related kinases will tend to phosphorylate similar motifs. When encoded in the form of our phylogenetic loss function, this relationship was able to slightly improve pre- diction accuracies. Together, these results suggest that EM- BER holds significant promise towards the task of illuminat- ing the currently unknown relationships between kinases and the substrates they act on. ACKNOWLEDGEMENTS We would like to acknowledge members of the GomezLab for helpful comments and feedback. This work was supported by grants through the National Institutes of Health (Grant #s CA177993, CA233811, CA238475, DK116204). Bibliography 1. Tzong-Yi Lee, Justin Bo-Kai Hsu, Wen-Chi Chang, and Hsien-Da Huang. RegPhos: a system to explore the protein kinase-substrate phosphorylation network in humans. Nucleic Acids Res., 39(Database issue):D777–87, January 2011. 2. G Manning, D B Whyte, R Martinez, T Hunter, and S Sudarsanam. The protein kinase complement of the human genome. Science, 298(5600):1912–1934, December 2002. 3. Panayotis Vlastaridis, Pelagia Kyriakidou, Anargyros Chaliotis, Yves Van de Peer, Stephen G Oliver, and Grigoris D Amoutzias. Estimating the total number of phosphopro- teins and phosphorylation sites in eukaryotic proteomes. Gigascience, 6(2):1–11, February 2017. 4. Leah J Wilson, Adam Linley, Dean E Hammond, Fiona E Hood, Judy M Coulson, David J MacEwan, Sarah J Ross, Joseph R Slupsky, Paul D Smith, Patrick A Eyers, and Ian A Prior. New perspectives, opportunities, and challenges in exploring the human protein kinome. Cancer Res., December 2017. 5. G L Johnson and Razvan Lapadat. Mitogen-Activated protein kinase pathways mediated by ERK , JNK , and p38 protein kinases. Science, 298(5600):1911, 2002. 6. Gayathri K Perera, Chrysanthi Ainali, Ekaterina Semenova, Christian Hundhausen, Guillermo Barinaga, Deepika Kassen, Andrew E Williams, Muddassar M Mirza, Mercedesz Balazs, Xiaoting Wang, Robert Sanchez Rodriguez, Andrej Alendar, Jonathan Barker, Sophia Tsoka, Wenjun Ouyang, and Frank O Nestle. Integrative biology approach iden- tifies cytokine targeting strategies for psoriasis. Sci. Transl. Med., 6(223):223ra22, February 2014. 7. Nicole Tegtmeyer, Matthias Neddermann, Carmen Isabell Asche, and Steffen Backert. Sub- version of host kinases: a key network in cellular signaling hijacked by helicobacter pylori CagA. Mol. Microbiol., May 2017. 8. Amandine Charras, Pinelopi Arvaniti, Christelle Le Dantec, Marina I Arleevskaya, Kaliopi Zachou, George N Dalekos, Anne Bordon, and Yves Renaudineau. JAK inhibitors suppress innate epigenetic reprogramming: a promise for patients with sjögren’s syndrome. Clin. Rev. Allergy Immunol., June 2019. 9. Alessia Alunno, Ivan Padjen, Antonis Fanouriakis, and Dimitrios T Boumpas. Pathogenic and therapeutic relevance of JAK/STAT signaling in systemic lupus erythematosus: Integra- tion of distinct inflammatory pathways and the prospect of their inhibition with an oral agent. Cells, 8(8), August 2019. 10. Ya Nan Deng, Joseph A Bellanti, and Song Guo Zheng. Essential kinases and transcrip- tional regulators and their roles in autoimmunity. Biomolecules, 9(4), April 2019. 11. Kyla A L Collins, Timothy J Stuhlmiller, Jon S Zawistowski, Michael P East, Trang T Pham, Claire R Hall, Daniel R Goulet, Samantha M Bevill, Steven P Angus, Sara H Velarde, Noah Sciaky, Tudor I Oprea, Lee M Graves, Gary L Johnson, and Shawn M Gomez. Proteomic analysis defines kinase taxonomies specific for subtypes of breast cancer. Oncotarget, 9 (21):15480–15497, March 2018. 12. Elise J Needham, Benjamin L Parker, Timur Burykin, David E James, and Sean J Humphrey. Illuminating the dark phosphoproteome. Sci. Signal., 12(565), January 2019. 13. Wenwen Fan, Xiaoyi Xu, Yi Shen, Huanqing Feng, Ao Li, and Minghui Wang. Prediction of protein kinase-specific phosphorylation sites in hierarchical structure using functional infor- mation and random forest. Amino Acids, 46(4):1069–1078, April 2014. 14. Shu-Yun Huang, Shao-Ping Shi, Jian-Ding Qiu, and Ming-Chu Liu. Using support vector machines to identify protein phosphorylation sites in viruses. J. Mol. Graph. Model., 56: 84–90, March 2015. 15. Fuyi Li, Chen Li, Tatiana T Marquez-Lago, André Leier, Tatsuya Akutsu, Anthony W Purcell, A Ian Smith, Trevor Lithgow, Roger J Daly, Jiangning Song, and Kuo-Chen Chou. Quokka: a comprehensive tool for rapid and accurate prediction of kinase family-specific phosphory- lation sites in the human proteome. Bioinformatics, 34(24):4223–4231, December 2018. 16. Yu Xue, Ao Li, Lirong Wang, Huanqing Feng, and Xuebiao Yao. PPSP: prediction of PK- specific phosphorylation site with bayesian decision theory. BMC Bioinformatics, 7:163, March 2006. 17. Duolin Wang, Shuai Zeng, Chunhui Xu, Wangren Qiu, Yanchun Liang, Trupti Joshi, and Dong Xu. MusiteDeep: a deep-learning framework for general and kinase-specific phos- phorylation site prediction. Bioinformatics, 33(24):3909–3916, December 2017. 18. Fenglin Luo, Minghui Wang, Yu Liu, Xing-Ming Zhao, and Ao Li. DeepPhos: prediction of protein phosphorylation sites with deep learning. Bioinformatics, 35(16):2766–2773, August 2019. 19. Ehsaneddin Asgari and Mohammad R K Mofrad. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS One, 10(11):e0141287, November 2015. 20. Peter V Hornbeck, Jon M Kornhauser, Sasha Tkachev, Bin Zhang, Elzbieta Skrzypek, Beth Murray, Vaughan Latham, and Michael Sullivan. PhosphoSitePlus: a comprehen- sive resource for investigating the structure and function of experimentally determined post- translational modifications in man and mouse. Nucleic Acids Res., 40(Database issue): D261–70, January 2012. 21. Jianfei Hu, Hee-Sool Rho, Robert H Newman, Jin Zhang, Heng Zhu, and Jiang Qian. Phos- phoNetworks: a database for human phosphorylation networks. Bioinformatics, 30(1):141– 142, January 2014. 22. Holger Dinkel, Claudia Chica, Allegra Via, Cathryn M Gould, Lars J Jensen, Toby J Gibson, and Francesca Diella. Phospho.ELM: a database of phosphorylation sites–update 2011. Nucleic Acids Res., 39(Database issue):D261–7, January 2011. 23. Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signa- ture verification using a “siamese” time delay neural network. In J D Cowan, G Tesauro, and J Alspector, editors, Advances in Neural Information Processing Systems 6, pages 737–744. Morgan-Kaufmann, 1994. 24. R Hadsell, S Chopra, and Y LeCun. Dimensionality reduction by learning an invariant map- ping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recog- nition (CVPR’06), volume 2, pages 1735–1742, June 2006. Kirchoff et al. | EMBER: kinase-substrate multi-label prediction bioR‰iv | 9 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2020.02.04.934216doi: bioRxiv preprint https://doi.org/10.1101/2020.02.04.934216 http://creativecommons.org/licenses/by/4.0/ 25. Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. February 2018. 10 | bioR‰iv Kirchoff et al. | EMBER: kinase-substrate multi-label prediction .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2020.02.04.934216doi: bioRxiv preprint https://doi.org/10.1101/2020.02.04.934216 http://creativecommons.org/licenses/by/4.0/ S e e a Ma e a EMBER: Multi-label prediction of kinase-substrate phosphor lation events through deep learning 1. Metrics Definitions of metrics that characteri e the area under the receiver operator curve (AUROC) and the precision-recall curve (AUPRC) as described in the main manuscript: The AUROC is the integral of the receiver operator curve, which is found b plotting the true positive rate (TPR) against the false positive rate (FPR) at various decision thresholds. The TPR is defined as TP / (TP + FN) where TP are the true positive predictions and FN are the false negative predictions. The FPR is defined as FP / (FP + TN) where FP are the false positive predictions and TN are the true negative predictions. The AUPRC is the integral of the receiver operator curve, which is found b plotting the precision against the recall (i.e. TPR) at various decision thresholds. Precision is defined as: TP / (TP + FP) where TP are the true positive predictions and FP are the false positive predictions. 2. ProtVec embedding figures. Here, we show a qualitative comparison, via a UMAP reduction, between the original ProtVec embedding and the averaged ProtVec embedding. Original​: .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2020.02.04.934216doi: bioRxiv preprint https://doi.org/10.1101/2020.02.04.934216 http://creativecommons.org/licenses/by/4.0/ Averaged​: 3. ProtVec embedding kNN results. In the table below we show the AUROC and AUPRC results of the kNN classification task on the original ProtVec embedding and the averaged ProtVec embedding. For our kNN calculation, we used k = 85. AUROC AUPRC Ki a e igi al P Vec a e aged P Vec igi al P Vec a e aged PV Ak 0.832 0.908 0.421 0.462 CDK 0.857 0.889 0.410 0.511 CK2 0.872 0.906 0.590 0.665 MAPK 0.888 0.907 0.688 0.739 PIKK 0.810 0.845 0.396 0.579 PKA 0.816 0.865 0.625 0.716 PKC 0.830 0.865 0.638 0.697 S c 0.979 0.998 0.906 0.993 ac 0.861 0.898 0.584 0.670 ic 0.851 0.902 0.551 0.643 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2020.02.04.934216doi: bioRxiv preprint https://doi.org/10.1101/2020.02.04.934216 http://creativecommons.org/licenses/by/4.0/ 4. Hard are Training and testing of EMBER occurred on a Linux s stem with the following configuration: - ​Pop!_OS Linux 20.04 - Intel Xeon E5-2620 v4 with 32 cores @ 3.0 GH - 128 GB Ram - Nvidia Titan Xp On this s stem, the Siamese network took around 12 minutes to train, and the classification network took around 6 minutes to train. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2020.02.04.934216doi: bioRxiv preprint https://doi.org/10.1101/2020.02.04.934216 http://creativecommons.org/licenses/by/4.0/