key: cord-0258882-37dd3it1
authors: Littmann, Maria; Heinzinger, Michael; Dallago, Christian; Olenyi, Tobias; Rost, Burkhard
title: Embeddings from deep learning transfer GO annotations beyond homology
date: 2020-09-06
journal: bioRxiv
DOI: 10.1101/2020.09.04.282814
sha: ed6c0bf0d6c5da602331dbbf0a0125486f1a6bac
doc_id: 258882
cord_uid: 37dd3it1

Abstract: Knowing protein function is crucial to advance molecular and medical biology, yet experimental function annotations through the Gene Ontology (GO) exist for fewer than 0.5% of all known proteins. Computational methods bridge this sequence-annotation gap typically through homology-based annotation transfer by identifying sequence-similar proteins with known function or through prediction methods using evolutionary information. Here, we proposed predicting GO terms through annotation transfer based on proximity of proteins in SeqVec embedding space rather than in sequence space. These embeddings originated from deep learned language models (LMs) for protein sequences (SeqVec) transferring the knowledge gained from predicting the next amino acid in 250 million protein sequences. Replicating the conditions of CAFA3, our method reached an Fmax of 37±2%, 50±3%, and 57±2% for BPO, MFO, and CCO, respectively. This was numerically close to the top ten methods that had participated in CAFA3. Restricting the annotation transfer to proteins with <20% pairwise sequence identity to the query, performance dropped (Fmax BPO 33±2%, MFO 43±3%, CCO 53±2%); this still outperformed naïve sequence-based transfer. Preliminary results from CAFA4 appeared to confirm these findings. Overall, this new method may help, in particular, to annotate novel proteins from smaller families or proteins with intrinsically disordered regions.

[...] ProtBert 35). Although SeqVec has never explicitly encountered GO terms during training, we hypothesized SeqVec embeddings to implicitly encode information relevant for the transfer of annotations, i.e.
capturing aspects of protein function.

Simple embedding-based transfer almost as good as CAFA3 top-10. First, we predicted GO terms for all 3,328 CAFA3 targets using the Gene Ontology Annotation (GOA) dataset GOA2017 (Methods), removed all entries identical to CAFA3 targets (PIDE=100%; set: GOA2017-100), and transferred the annotations of the closest hit (k=1) in this set to the query. When applying the NK evaluation mode (no knowledge available for the query; Methods/CAFA3), the embedding-based transfer reached Fmax scores of 37±2% for BPO (precision=39±2%, recall=36±2%), 50±3% for MFO (precision=54±3%, recall=47±3%), and 57±2% for CCO (precision=61±3%, recall=54±3%; Table 1, Fig. 1, Fig. S1 in the Supporting Online Material, SOM). Errors were estimated through the 95% confidence intervals (±1.96 stderr). In the sense that the database with annotations to transfer (GOA2017) had been available before the CAFA3 submission deadline (February 2017), our predictions were directly comparable to CAFA3 31. This embedding-based annotation transfer clearly outperformed the two CAFA3 baselines (Fig. 1: the simple BLAST for sequence-based annotation transfer, and the Naïve method assigning GO terms statistically based on database frequencies, here GOA2017); it would have performed close to the top ten CAFA3 competitors (in particular for BPO: Fig. 1) had the method competed at CAFA3.

Fig. 1 caption (fragment): [...] methods that - in contrast to our method - did compete at CAFA3, and to two background approaches, "BLAST" (homology-based inference) and "Naïve" (assignment of terms based on term frequency; lighter bars). The results shown held for the NK evaluation mode (no knowledge), i.e. only using proteins that were novel in the sense that they had no prior annotations. If we had developed our method before CAFA3, it would have almost reached the tenth place for MFO and CCO and ranked even slightly better for BPO. Error bars (for our method) marked the 95% confidence intervals.
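The transfer scheme above (copy the GO terms of the nearest neighbor(s) in embedding space, with a distance-derived reliability index) can be sketched as follows. This is a minimal illustration, not the authors' code: embeddings are assumed precomputed, and the distance-to-reliability scaling 0.5/(0.5+d) is an assumption chosen only to be consistent with the text (identical embeddings, d=0, give RI=1 for k=l=1); the paper's exact Eqn. 5 is not reproduced in this excerpt.

```python
import numpy as np

def transfer_annotations(query_emb, db_embs, db_terms, k=1):
    """Transfer GO terms from the k nearest database proteins in embedding space.

    query_emb : (d,) embedding of the query protein (1024-d for SeqVec)
    db_embs   : (N, d) embeddings of annotated database proteins
    db_terms  : list of N sets of GO terms (annotations to transfer)
    Returns {go_term: reliability index} with values in (0, 1].
    """
    # Euclidean distance between query and every database protein (cf. Eqn. 4).
    dists = np.linalg.norm(db_embs - query_emb, axis=1)
    nearest = np.argsort(dists)[:k]
    predictions = {}
    for term in set().union(*(db_terms[i] for i in nearest)):
        # "l": the neighbors among the k hits annotated with this term.
        hits = [i for i in nearest if term in db_terms[i]]
        # Distance-scaled vote (illustrative): d=0 contributes 1, larger d less.
        predictions[term] = sum(0.5 / (0.5 + dists[i]) for i in hits) / k
    return predictions
```

For k=1 this reduces to copying all terms of the closest hit; for k>1, terms supported by several close neighbors (larger l) receive a higher reliability, matching the behaviour described for Eqn. 5.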
Table 1 caption (fragment): [...] GOA2017X (2017), and GOA2020-100 (2020) used for annotation transfer (note: the notation '-100' implies that any entry in the dataset with PIDE=100% to any CAFA3 protein had been removed). All values were compiled for picking the single top hit (k=1) and using the CAFA3 targets from the NK and full evaluation modes 31. For all simple annotation transfers (embedding- and sequence-based), performance was higher for the more recent data sets (GOA2020 vs. GOA2017). Error estimates are given as 95% confidence intervals. Fmax values were computed using the CAFA3 tool 31,33.

Including more neighbors (k>1) only slightly affected Fmax (Table S2; all average Fmax values for k=2 to k=10 remained within the 95% confidence interval of the value for k=1). When taking all predictions into account independent of a threshold in prediction strength, referred to as the reliability index (RI, Methods; i.e. any copied annotation is considered even if the confidence is low), the number of predicted GO terms increased with higher k (Table S3). The average number of terms annotated for each protein in GOA2017 was already as high as 37 for BPO, 14 for MFO, and 9 for CCO. When including all predictions independent of their strength (RI), our method predicted more terms for CCO and BPO than expected from this distribution, even for k=1. In contrast, for MFO, 11.7 terms were predicted on average, slightly below the expected average, at least for k=1 (explosion of terms for k>1: Table S3). While precision dropped with adding terms, recall increased (Table S3). Fmax was higher for proteins with some prior annotations (LK evaluation mode) than for novel proteins without any annotations (NK evaluation mode; Table 1); the same was true for the CAFA3 top-10, for which the Fmax scores increased even more than for our method for BPO and MFO, and less for CCO (Fig. 1, Fig. S2). [...] transfer (BLAST) were within the 95% confidence interval (Fig. 2).
This clearly showed that our approach benefited from information available through embeddings but not available from sequence, and that at least some protein pairs close in embedding space yet distant in sequence space might function alike.

When applying this method to new queries, annotations will be transferred from the latest GOA. We used GOA2020-100 (from 01/2020, removing the CAFA3 targets) to assess how the improvement of annotations from 2017 to 2020 influenced annotation transfer (Table 1). On GOA2020-100, SeqVec embedding-based transfer achieved Fmax scores of 50±2% (precision=50±3%, recall=50±3%), 60±3% (precision=52±3%, recall=71±3%), and 65±2% (precision=57±3%, recall=75±3%) for BPO, MFO, and CCO, respectively, in the NK evaluation mode (Table 1). This constituted a substantial increase over GOA2017-100 (Table 1). We submitted predictions for MFO for the CAFA4 targets. In the first preliminary evaluation published during ISMB2020 41, our method achieved 9th place, in line with the post facto CAFA3 results presented here.

The large performance boost between GOA2017 and GOA2020 indicated the addition of many new and relevant GO annotations within three years. However, for increasingly diverged pairs (Q,T), we observed a much larger drop in Fmax than for GOA2017 (Fig. 2, Fig. S3). In the extreme, [...] helpful new experiments simply refined previous computational predictions. Running BLAST against GOA2020-100 for sequence-based transfer (choosing the hit with the highest PIDE) showed that sequence-based transfer also profited from improved annotations (difference in Fmax values for BLAST in Table 1). However, while Fmax scores for embedding-based transfer increased the most for BPO, those for sequence-based transfer increased most for MFO. Embedding-based transfer still outperformed BLAST for the GOA2020-100 set (Fig. S3C).
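The sequence-based baseline used for comparison (transfer from the BLAST hit with the highest PIDE) can be sketched as below. This is a sketch under assumptions, not the CAFA baseline implementation itself: BLAST is assumed to have been run separately with tabular output (-outfmt 6, whose first three columns are query id, subject id, and percent identity), and db_terms is a hypothetical lookup from subject id to its GOA terms.

```python
def blast_transfer(blast_tabular_lines, db_terms):
    """Sequence-based annotation transfer: for each query, copy the GO terms
    of the BLAST hit with the highest percent pairwise sequence identity (PIDE).

    blast_tabular_lines : iterable of BLAST -outfmt 6 lines
    db_terms            : dict mapping subject id -> set of GO terms
    Returns {query_id: (best_hit_id, pide, transferred_terms)}.
    """
    best = {}
    for line in blast_tabular_lines:
        qseqid, sseqid, pident = line.rstrip("\n").split("\t")[:3]
        pide = float(pident)
        # Keep only hits with known annotations; retain the highest-PIDE hit.
        if sseqid in db_terms and (qseqid not in best or pide > best[qseqid][1]):
            best[qseqid] = (sseqid, pide, db_terms[sseqid])
    return best
```

Restricting the transfer to more sequence-distant pairs, as done for Fig. 2, would amount to an additional filter on pide before a hit is accepted.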
Even when constraining annotation transfers to more sequence-distant pairs (lower PIDE), our method outperformed BLAST against GOA2020-100 in terms of Fmax, at least for BPO and for higher levels of PIDE in MFO/CCO (Fig. S3C). [...] better predictions than relatively high embedding RIs (Fig. S4). We tried to benefit from these observations through simple combinations. Firstly, we considered all terms predicted by embeddings from either SeqVec or ProtBert. Secondly, reliability scores were combined, leading to higher reliability for terms predicted by both approaches than for terms predicted by only one. Neither of those two approaches improved performance (Table S4, method SeqVec/ProtBert). Further simple combinations of embedding and sequence also did not improve performance (Table S4, method SeqVec/ProtBert/BLAST). Finding out whether or not more advanced combinations could help was beyond the scope of this work. (Table S4 summarized all predicted and annotated GO leaf terms; the corresponding names can be found in the additional files predictions_$emb_$ont.txt.)

Step 2: new predictions. Since the GO term predictions matched well-characterized [...]

Conclusions

We introduced a new concept for the prediction of GO terms, namely annotation transfer based on the similarity of embeddings obtained from deep learning language models (LMs). This approach essentially replaced sequence information by complex embeddings that capture some non-local information; it resulted in a very simple novel prediction method complementing homology-based inference. Despite its simplicity, this new method would have reached the top ten had it participated at CAFA3 (Fig. 1); it outperformed homology-based inference ("BLAST") by statistically significant margins, with Fmax differences (Fmax(embedding)-Fmax(sequence)) of +11±2% for BPO, +8±3% for MFO, and +11±2% for CCO (Table 1, Fig. 1).
Embedding-based transfer remained above the average for sequence-based transfer even for protein pairs with PIDE<20% (Fig. 2). In other words, embedding similarity still worked for proteins that diverged beyond recognition in pairwise alignments. [...] One reason for this was that the correlation between embeddings and sequence remained limited (Table 2).

SeqVec was trained on UniRef50 (UniProt 5 clustered at 50% PIDE). As no labeled data was used (self-supervised training), the embeddings could not capture any explicit information such as GO terms. Thus, SeqVec does not need to be retrained for subsequent prediction tasks using the embeddings as input. After pre-training, the hidden states of the model were used to extract features. For each protein, the hidden states of the forward and backward passes of the first LSTM layer were extracted and concatenated into a matrix of size L*1024 for a protein with L residues. A fixed-size representation was then derived by averaging over the length dimension, resulting in a vector of size 1024 for each protein (Fig. S6).

To evaluate the effect of using different LMs to generate the embeddings, we also used ProtBert 35. For comparison to methods that contributed to CAFA3 31, we added another dataset of 1,097 [...] (more details about the dataset are given in the original CAFA3 publication 31). In order to expand the comparison of the transfer based on sequence and embedding similarity, we also reduced redundancy through applying CD-HIT and PSI-CD-HIT 55,56 to the GOA2020 and GOA2017 sets against the evaluation set at thresholds θ of PIDE=100, 90, 80, 70, 60, 50, 40, 30, and 20% (Table S1 in the Supporting Online Material (SOM) gives more details about these nine subsets).

We evaluated our method against the two baseline methods used at CAFA3, namely Naïve and BLAST, as well as against CAFA3's top ten 31. We computed standard performance measures. [...] The Fmax value denoted the maximum F1 score achievable for any threshold in reliability (RI, Eqn. 5).
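The Fmax evaluation just defined (maximum F1 over all reliability thresholds, with protein-centric averaging) can be sketched as follows. This is a simplified stand-in for the official CAFA3 tool, assuming predictions are given as {protein: {term: RI}} and ignoring CAFA details such as propagation of terms over the ontology.

```python
def fmax(predictions, truth, n_steps=101):
    """Maximum F1 over RI thresholds (simplified CAFA-style Fmax).

    predictions : {protein: {go_term: reliability in [0, 1]}}
    truth       : {protein: set of annotated go_terms}
    Precision is averaged over proteins with >=1 prediction above the
    threshold; recall is averaged over all evaluated proteins.
    """
    best = 0.0
    for step in range(n_steps):
        t = step / (n_steps - 1)
        precisions, recalls = [], []
        for prot, annotated in truth.items():
            predicted = {g for g, ri in predictions.get(prot, {}).items() if ri >= t}
            tp = len(predicted & annotated)
            if predicted:
                precisions.append(tp / len(predicted))
            recalls.append(tp / len(annotated) if annotated else 0.0)
        if precisions:
            pr = sum(precisions) / len(precisions)
            rc = sum(recalls) / len(recalls)
            if pr + rc > 0:
                best = max(best, 2 * pr * rc / (pr + rc))
    return best
```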
This implies that the assessment fixes the optimal threshold rather than the method providing it. [...] The RI was computed with k as the overall number of hits/neighbors, l as the number of hits annotated with the GO term p, and the distance d(q, h) between query and hit calculated according to Eqn. 4. Proteins represented by an embedding identical to that of the query protein (d=0) led to RI=1. Since the RI also takes into account how many proteins l in a list of k hits are annotated with a certain term p (Eqn. 5), predicted terms annotated to more proteins (larger l) have a higher RI than terms annotated to fewer proteins (smaller l).

References
- Metabolism of ketonic acids in animal tissues
- Yeast chromosome III: new gene functions
- Correlation of the amino acid composition of a protein to its structural and biological characteristics
- Prediction of protein function from sequence properties: discriminant analysis of a data base
- Prediction of structural and functional features of protein and nucleic acid sequences by artificial neural networks
- DeepGOPlus: improved protein function prediction from sequence
- Evolutionary processes and evolutionary noise at the molecular level
- Atlas of protein sequence and structure
- PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization
- LocTree3 prediction of localization
- Sequence conserved for sub-cellular localization
- ProNA2020 predicts protein-DNA, protein-RNA, and protein-protein binding proteins and residues from sequence
- Computational prediction shines light on type III secretion origins
- The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens
- Community-Wide Evaluation of Computational Function Prediction
- An expanded evaluation of protein function prediction methods shows an improvement in accuracy
- Deep contextualized word representations
- Character-Aware Neural Language Models (arXiv preprint)
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (arXiv)
- Neural Machine Translation by Jointly Learning to Align and Translate (arXiv)
- The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology
- The GOA database: Gene Ontology annotation updates for 2015