title: Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies
authors: Ruffolo, Jeffrey A.; Chu, Lee-Shin; Mahajan, Sai Pooja; Gray, Jeffrey J.
date: 2022-04-21
journal: bioRxiv
DOI: 10.1101/2022.04.20.488972

Antibodies have the capacity to bind a diverse set of antigens, and they have become critical therapeutics and diagnostic molecules. The binding of antibodies is facilitated by a set of six hypervariable loops that are diversified through genetic recombination and mutation. Even with recent advances, accurate structural prediction of these loops remains a challenge. Here, we present IgFold, a fast deep learning method for antibody structure prediction. IgFold consists of a pre-trained language model trained on 558M natural antibody sequences followed by graph networks that directly predict backbone atom coordinates. IgFold predicts structures of similar or better quality than alternative methods (including AlphaFold) in significantly less time (under one minute). Accurate structure prediction on this timescale makes possible avenues of investigation that were previously infeasible. As a demonstration of IgFold's capabilities, we predicted structures for 105K paired antibody sequences, expanding the observed antibody structural space by over 40-fold.

The limited number of experimentally determined antibody structures (thousands (30)) presents a difficulty in training an effective antibody structure predictor. In the absence of structural data, self-supervised language models provide a powerful framework for extracting patterns from the significantly greater number (billions (31)) of natural antibody sequences identified by immune repertoire sequencing studies. For this work, we used AntiBERTy (21), a transformer language model pre-trained on 558M natural antibody sequences, to generate embeddings for structure prediction.
Similar to the role played by alignments of evolutionarily related sequences for general protein structure prediction (32), embeddings from AntiBERTy act as a contextual representation that places individual sequences within the broader antibody space.

Prior work has demonstrated that protein language models can learn structural features from sequence pre-training alone (17, 33). To investigate whether sequence embeddings from AntiBERTy contained nascent structural features, we generated embeddings for the set of 3,467 paired antibody sequences with experimentally determined structures in the PDB. For each sequence, we extracted the portions of the embedding corresponding to the six CDR loops and averaged to obtain fixed-size CDR loop representations (one per loop). We then collected the embeddings for each CDR loop across all sequences and visualized them using two-dimensional t-SNE (Figure S1). To determine whether the CDR loop representations encoded structural features, we labeled each point according to its canonical structural cluster. For CDR H3, which lacks canonical clusters, we instead labeled by loop length. For the five CDR loops that adopt canonical folds, we observed clear organization within the embedded space. For the CDR H3 loop, we found that the embedding space did not separate into natural clusters, but was rather organized roughly in accordance with loop length. These results suggest that AntiBERTy has learned to encode CDR loop structural features through sequence pre-training alone.

Coordinate prediction from sequence embeddings. To predict 3D atomic coordinates from sequence embeddings, we adopt a graphical representation of antibody structure, with each residue as a node and information passing between all pairs of residues (Figure 1). The nodes are initialized using the final hidden layer embeddings from AntiBERTy.
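The pooling step that produces the fixed-size CDR loop representations can be sketched in NumPy. This is a minimal illustration: the array shapes and the CDR index ranges below are hypothetical placeholders, not the paper's actual numbering scheme.

```python
import numpy as np

def pool_cdr_embeddings(embeddings, cdr_ranges):
    """Average the per-residue embedding rows of each CDR loop into a
    fixed-size vector (one per loop), regardless of loop length.

    embeddings : (L, d) array of per-residue language-model embeddings
    cdr_ranges : dict mapping loop name -> (start, end) residue indices
                 (end exclusive); indices here are illustrative only
    """
    return {
        loop: embeddings[start:end].mean(axis=0)
        for loop, (start, end) in cdr_ranges.items()
    }

# Toy example: a 120-residue heavy chain with hypothetical CDR positions.
rng = np.random.default_rng(0)
emb = rng.normal(size=(120, 8))
ranges = {"H1": (26, 33), "H2": (51, 58), "H3": (96, 110)}
pooled = pool_cdr_embeddings(emb, ranges)
# Each loop is now a single 8-dimensional vector, suitable as one point
# in a t-SNE projection across many sequences.
```

Collecting one such vector per loop per sequence yields the point clouds that were projected with two-dimensional t-SNE in the analysis above.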
To initialize the edges, we collect the full set of inter-residue attention matrices from each layer of AntiBERTy. These attention matrices are a useful source of edge information, as they encode the residue-residue information pathways learned by the pre-trained model. For paired antibodies, we concatenate the sequence embeddings from each chain and initialize inter-chain edges to zero. We do not explicitly provide a chain break delimiter, as the pre-trained language model already includes a positional embedding for each sequence. The structure prediction model begins with a series of four graph transformer (34) layers interleaved with edge updates via the triangle multiplicative layer proposed for AlphaFold (10).

Following the initial graph transformer layers, we incorporate structural template information into the nascent representation using invariant point attention (IPA) (10). In contrast to the application of IPA for the AlphaFold structure module, we fix the template coordinates and use IPA as a form of structure-aware self-attention. This enables the model to incorporate the local structural environment into the sequence representation directly from the 3D coordinates, rather than switching to an inter-residue representation (e.g., distance or contact matrices). We use three IPA layers to incorporate template information. Rather than search for structural templates for training, we generate template-like structures by corruption of the true label structures. Specifically, for 50% of training examples, we randomly select one to six consecutive segments of twenty residues and move the atomic coordinates to the origin. The remaining residues are provided to the model as a template. The deleted segments of residues are hidden from the IPA attention, so that the model only incorporates structural information from residues with meaningful coordinates.
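The template-corruption scheme described above can be sketched as follows. This is a simplified NumPy version under stated assumptions: segment placement is uniform, overlapping segments are not prevented, and the returned boolean mask stands in for the mechanism that hides deleted residues from the IPA attention.

```python
import numpy as np

def corrupt_template(coords, rng, n_seg_range=(1, 6), seg_len=20):
    """Create a template-like structure by deleting random segments.

    Randomly chooses one to six consecutive segments of `seg_len`
    residues, moves their atomic coordinates to the origin, and returns
    a boolean mask marking residues that keep meaningful coordinates
    (deleted residues are hidden from the IPA attention via this mask).
    """
    L = coords.shape[0]
    template = coords.copy()
    known = np.ones(L, dtype=bool)
    n_segments = int(rng.integers(n_seg_range[0], n_seg_range[1] + 1))
    for _ in range(n_segments):
        start = int(rng.integers(0, max(L - seg_len, 1)))
        template[start:start + seg_len] = 0.0   # move segment to origin
        known[start:start + seg_len] = False    # hide from IPA attention
    return template, known

# Toy backbone: 230 residues with coordinates away from the origin.
rng = np.random.default_rng(1)
coords = rng.normal(size=(230, 3)) + 5.0
template, known = corrupt_template(coords, rng)
```

In training, this corruption would be applied to 50% of examples, with the remaining examples receiving no template at all.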
Finally, we use another set of IPA layers to predict the final 3D antibody structure. Here, we employ a strategy similar to the AlphaFold structure module (10) and train a series of three IPA layers to translate and rotate each residue from an initialized position at the origin to the final predicted position. We depart slightly from the AlphaFold implementation and learn separate weights for each IPA layer, as well as allow gradient propagation through the rotations. To train the model for structure prediction, we minimize the mean-squared error between the predicted coordinates and the experimental structure after Kabsch alignment. In practice, we observe that the first IPA layer is sufficient to learn the global arrangement of residues (albeit in a compact form), while the second and third layers function to produce the properly scaled structure with correct bond lengths and angles (Figure S2).

Per-residue error prediction. Simultaneously with structure prediction training, we additionally train the model to estimate the error in its own predictions. For error estimation, we use two IPA layers that operate similarly to the template incorporation layers (i.e., without coordinate updates). The error estimation layers take as input the final predicted structure, as well as a separate set of node and edge features derived from the initial AntiBERTy features. We stop gradient propagation through the error estimation layers into the predicted structure to prevent the model from optimizing for accurately estimated, but highly erroneous, structures. For each residue, the error estimation layers are trained to predict the deviation of the Cα atom from the experimental structure after a Kabsch alignment of the beta barrel residues. We use a different alignment for error estimation than for structure prediction to more closely mirror the conventional antibody modeling evaluation metrics.
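The structure loss (mean-squared coordinate error after Kabsch alignment) can be sketched in NumPy. This is a minimal sketch operating on (N, 3) coordinate arrays, not the paper's implementation, which operates on full backbone frames inside the training loop.

```python
import numpy as np

def kabsch_align(P, Q):
    """Optimally rotate and translate point set P (N, 3) onto Q (N, 3)
    using the Kabsch algorithm; returns the aligned copy of P."""
    Pc, Qc = P - P.mean(axis=0), Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(U @ Vt))       # guard against reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    return Pc @ R + Q.mean(axis=0)

def coord_mse(pred, true):
    """Mean-squared coordinate error after Kabsch alignment, the form
    of the structure-prediction training loss."""
    aligned = kabsch_align(pred, true)
    return float(((aligned - true) ** 2).mean())

# Sanity check: a rigidly transformed copy aligns back with ~zero loss.
rng = np.random.default_rng(2)
P = rng.normal(size=(30, 3))
R0, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(R0) < 0:
    R0[:, 0] *= -1                           # keep R0 a proper rotation
moved = P @ R0 + np.array([1.0, -2.0, 3.0])
mse = coord_mse(moved, P)
```

Because the loss is computed after optimal superposition, the model is penalized only for internal-coordinate errors, not for the arbitrary global pose of its prediction.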
The model is trained to minimize the L1 norm of the predicted Cα deviation minus the true deviation.

Structure dataset augmentation with AlphaFold. We sought to train the model on as many immunoglobulin structures as possible. From the Structural Antibody Database (SAbDab) (30), we obtained 4,275 structures consisting of

and deep learning methods for antibody structure prediction. Although previous work has demonstrated significant improvements by deep learning over grafting-based methods, we continue to benchmark against grafting to track its performance as increasingly many antibody structures become available. For each benchmark target, we predicted structures using ABodyBuilder (37), DeepAb (14), ABlooper (15), and AlphaFold-Multimer (13). Of these methods, ABodyBuilder utilizes a grafting-based algorithm for structure prediction, and the remaining methods use some form of deep learning. DeepAb and ABlooper are both trained specifically for paired antibody structure prediction and have previously reported comparable performance. AlphaFold-Multimer has demonstrated state-of-the-art performance for protein complex prediction; however, its performance on antibody structures specifically remains to be evaluated.

The performance of each method was assessed by measuring the backbone heavy-atom RMSD between the predicted and experimentally determined structures for the framework residues and each CDR loop. All RMSD values are measured after alignment of the framework residues. In general, we observed state-of-the-art performance for all of the deep learning methods, while grafting performance continued to lag behind (Figure 2A, Table 1). On average, all methods predicted both the heavy and light chain framework structures with high accuracy (0.43-0.54 Å and 0.38-0.45 Å, respectively).
Similarly, for the CDR1 and CDR2 loops, all deep learning methods produced sub-angstrom predictions on average, with the grafting-based ABodyBuilder performing marginally worse. The largest improvement in prediction accuracy by deep learning methods is observed for the CDR3 loops.

We also considered the predicted orientation between the heavy and light chains, which is an important determinant of the overall binding surface (8, 9). Accuracy of the inter-chain orientation was evaluated by measuring the deviation from native of the inter-chain packing angle, inter-domain distance, heavy-opening angle, and light-opening angle. Each of these orientational coordinates is rescaled by dividing by its respective standard deviation (calculated over the set of experimentally determined antibody structures), and the rescaled deviations are summed to obtain an orientational coordinate distance (OCD) (9). We found that, in general, deep learning methods produced Fv structures with OCD values below four, indicating that the predicted structures are typically within one standard deviation of the native structures.

Given the comparable aggregate performance of the deep learning methods, we further investigated the similarity between the structures predicted by each method. For each pair of methods, we measured the RMSD of framework and CDR loop residues, as well as the OCD, between the predicted structures for each benchmark target (Figure S8). We additionally plotted the distribution of structural similarities between IgFold and the alternative methods (Figure S9). We found that the framework structures (and their relative orientations) predicted by IgFold resembled those of DeepAb and AlphaFold-Multimer, but were less similar to those of ABodyBuilder and ABlooper. This is expected, given that ABlooper frameworks are based on ABodyBuilder grafts, while the frameworks from the remaining methods are entirely learned (and tend to be more accurate).
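The OCD metric described above reduces to a simple computation. The sketch below uses hypothetical coordinate values and standard deviations purely for illustration; the real reference standard deviations come from the set of experimentally determined antibody structures.

```python
import numpy as np

def ocd(pred, native, stds):
    """Orientational coordinate distance (OCD).

    pred, native : the four orientational coordinates (inter-chain
        packing angle, inter-domain distance, heavy-opening angle,
        light-opening angle) for predicted and native structures.
    stds : standard deviation of each coordinate over a reference set
        of experimentally determined antibody structures.
    Each absolute deviation is rescaled by its standard deviation, and
    the four rescaled deviations are summed.
    """
    pred, native, stds = (np.asarray(x, dtype=float)
                          for x in (pred, native, stds))
    return float(np.sum(np.abs(pred - native) / stds))

# Hypothetical values: each coordinate off by exactly one standard
# deviation gives OCD = 4, the threshold discussed in the text.
score = ocd([43.0, 16.5, 72.0, 99.0],
            [45.0, 16.0, 71.0, 101.0],
            [2.0, 0.5, 1.0, 2.0])
```

This makes explicit why OCD below four corresponds to being within roughly one standard deviation of native on each coordinate.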
Interestingly, we also observed that the CDR1 and CDR2 loops from IgFold, DeepAb, and AlphaFold-Multimer were quite similar on average. It is unclear why CDR loop structures from ABlooper, which was trained on a dataset similar to that of DeepAb and predicts CDR loops in an end-to-end manner like IgFold, tend to be dissimilar to those of DeepAb and IgFold. This may be due to framework inaccuracies degrading the quality of CDR loop structures.

Although the performance of the deep learning methods for antibody structure prediction is largely comparable, the speed of prediction is not. Grafting-based methods, such as ABodyBuilder, tend to be much faster than deep learning methods (if a suitable template can be found). For the present benchmark, ABodyBuilder was able to predict structures in seconds for 65 of 67 targets, only twice resorting to a time-consuming CDR H3 loop building procedure. However, as reported above, this speed is obtained at the expense of accuracy. DeepAb and ABlooper, which are more accurate and trained specifically for antibodies, require more time to predict full-atom structures (up to one minute for ABlooper and ten minutes for DeepAb). AlphaFold-Multimer, trained for general protein structure prediction from multiple sequence alignments, requires approximately one hour to predict full-atom structures. IgFold prediction speed is comparable to ABlooper, and it is able to predict full-atom structures in less than a minute.

to those of ABlooper, perhaps reflective of both models' training for end-to-end coordinate prediction, but less similar than those of DeepAb. The dissimilarity of predictions between IgFold and AlphaFold-Multimer is surprising, given the extensive use target 7AQZ (to be published, Figure 2F). This nanobody features a 15-residue CDR3 loop that adopts the "stretched- Figure S7G).
In the majority of such cases, AlphaFold predicts the correct conformation, yielding the lower average CDR3 RMSD. However, the distinct conformations from both methods may be useful for producing an ensemble of structures for some applications. In the second example, we compared the structures predicted by both methods for the benchmark target 7AQY (to be published, Figure 2G). This

Determining whether a given structural prediction is reliable is critical for effective incorporation of antibody structure prediction into workflows. During training, we task IgFold with predicting the deviation of each residue's Cα atom from the native structure (under alignment of the beta barrel residues). We then use this predicted deviation as a per-residue error estimate to assess the expected accuracy of different structural regions.

To assess the utility of IgFold's error predictions for identifying inaccurate CDR loops, we compared the average predicted error for each CDR loop to the RMSD between the predicted loop and the native structure for the paired Fv and nanobody benchmarks. For five of the six paired Fv CDR loops, we observed significant correlations between the predicted error and the loop RMSDs from native (Figure S10). For CDR L2 loops, no significant correlations were observed; however, given the relatively high accuracy of CDR L2 loop predictions, there was little error to detect. For nanobodies, we observed significant correlations between the predicted error and RMSD for all of the CDR loops (Figure S11). For the challenging-to-predict, conformationally diverse CDR3 loops, we observed significant correlations for both (Figure 3F).

Antibody engineering campaigns often deviate significantly from the space of natural antibody sequences (45). Predicting structures for such heavily engineered sequences is challenging, particularly for models trained primarily on natural antibody structural data (such as IgFold).
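The per-residue error target and the calibration check described above can be sketched in NumPy. This is a minimal sketch: the framework mask, array shapes, and toy values are illustrative, and the real pipeline trains an IPA-based error head rather than computing the target at evaluation time.

```python
import numpy as np

def per_residue_ca_deviation(pred_ca, true_ca, framework_mask):
    """Per-residue error target: deviation of each predicted Cα from
    the experimental structure after Kabsch alignment of only the
    framework (beta-barrel) residues."""
    P, Q = pred_ca[framework_mask], true_ca[framework_mask]
    Pm, Qm = P.mean(axis=0), Q.mean(axis=0)
    U, _, Vt = np.linalg.svd((P - Pm).T @ (Q - Qm))
    d = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    aligned = (pred_ca - Pm) @ R + Qm    # apply alignment to all residues
    return np.linalg.norm(aligned - true_ca, axis=-1)

def error_head_loss(pred_dev, true_dev):
    """L1 norm of the predicted deviation minus the true deviation."""
    return float(np.abs(np.asarray(pred_dev) - np.asarray(true_dev)).mean())

def calibration(pred_loop_error, loop_rmsd):
    """Pearson correlation between mean predicted loop error and loop
    RMSD from native, across benchmark targets."""
    return float(np.corrcoef(pred_loop_error, loop_rmsd)[0, 1])

# A perfect prediction has zero deviation everywhere and zero loss.
rng = np.random.default_rng(3)
ca = rng.normal(size=(50, 3))
fw = np.zeros(50, dtype=bool)
fw[:30] = True                           # hypothetical framework residues
dev = per_residue_ca_deviation(ca, ca, fw)
```

Aligning on the framework only (rather than all residues) mirrors the conventional antibody evaluation protocol, so the error head's targets match how CDR RMSDs are reported.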
To investigate whether IgFold's error estimations can identify likely mistakes in such sequences, we predicted the structure of an anti-HLA (human leukocyte antigen) antibody with a sequence-randomized CDR H1 loop (46) (Figure 3C). As expected, there is significant error in the predicted CDR H1 loop structure. However, the erroneous structure is accompanied by a high error estimate, revealing that the predicted conformation is likely to be incorrect. This suggests that the RMSD predictions from IgFold are well-calibrated to unnatural antibody sequences and should be informative for a broad range of antibody structure predictions.

Template data is successfully incorporated into predictions. For many antibody engineering workflows, incorporating partial structural information into predictions is useful for improving the quality of structure models. We simulated IgFold's behavior in this scenario by predicting structures for the paired antibody and nanobody benchmark targets while providing the coordinates of all non-H3 residues as templates. In general, we found that IgFold was able to incorporate the template data into its predictions, with the average RMSD for all templated CDR loops being significantly reduced (IgFold[Fv-H3]: Figure 4A, IgFold[Fv-CDR3]: Figure 4E). To illustrate the effectiveness of structural data incorporation, we identified a paired antibody benchmark target with challenging-to-predict non-H3 CDR loops that were corrected by inclusion of templates. We consider the benchmark target 7AJ6 (to be published), for which IgFold inaccurately predicted the H2 and L1 loops (1.27 Å and 2.01 Å RMSD, respectively). We found that the model correctly incorporates the template data for both loops (Figure 4B), reducing the H2 and L1 loop RMSDs to 0.73 Å and 0.70 Å, respectively.
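One way to restrict the template layers to residues with known coordinates is a pairwise mask, as the Methods describe for the template IPA layers. A minimal sketch, with hypothetical H3 indices:

```python
import numpy as np

def template_attention_mask(has_coords):
    """Pairwise mask for template IPA layers: attention between
    residues i and j is allowed only when both residues have known
    template coordinates."""
    has_coords = np.asarray(has_coords, dtype=bool)
    return np.outer(has_coords, has_coords)

# Templating everything except a CDR H3 span (indices hypothetical).
known = np.ones(120, dtype=bool)
known[96:110] = False            # H3 loop left for the model to predict
mask = template_attention_mask(known)
```

Under this mask, the untemplated H3 residues neither attend to nor receive attention from template coordinates, so their structure must be inferred from the learned representation.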
Having demonstrated successful incorporation of structural data into predictions using templates, we next investigated whether the templated structural data also improves prediction of the untemplated CDR H3 loop. For several paired benchmark targets, we observe notable improvements in predicted CDR H3 loop quality (Figure 4C). In one such case, for benchmark target 7RDL, inclusion of non-H3 structural data reduces the RMSD of the CDR H3 loop from 5.45 Å to 2.86 Å (Figure 4D). For nanobodies, we observe fewer cases with substantial improvement to CDR3 loop predictions given template data (Figure 4F). In only one case, benchmark target 7CZ0, do we see a meaningful improvement in RMSD (from 2.03 Å to 1.05 Å). For this target, the improvement in CDR3 accuracy is due to correction of C-terminal residues that anchor the end of the loop to the framework (Figure 4G).

We additionally experimented with providing the entire crystal structure to IgFold as template information. In this scenario, IgFold successfully incorporates the structural information of all CDR loops (including H3) into its predictions (IgFold[Fv]: Figure 4A, Figure 4E). Although this approach is of little practical value for structure prediction (as the correct structure is already known), it may be a useful approach for instilling structural information into pre-trained embeddings, which are valuable for other antibody learning tasks.

Although our analysis of these structures was limited, we are optimistic that this large dataset will be useful for future studies and model development.

where $W_{c,v} \in \mathbb{R}^{d_{\text{node}} \times d_{\text{gt-head}}}$ is a learnable parameter for the value transformation of the $c$-th attention head. In the above, $\|$ denotes the concatenation operation over the outputs of the $C$ attention heads. Following the original GT, we use a gated residual connection to combine the updated node embedding with the previous node embedding, where $W_g \in \mathbb{R}^{3 d_{\text{node}} \times 1}$ is a learnable parameter that controls the strength of the gating function.
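The gated residual connection can be sketched as follows. This is a plausible reading under stated assumptions: the gate input is taken to be the concatenation [h_prev, h_new, h_prev - h_new], which is consistent with the $3 d_{\text{node}}$ input rows of $W_g$, but the exact ordering and interpolation direction are assumptions rather than details given in the text above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_residual(h_prev, h_new, W_g):
    """Gated residual connection combining the updated node embedding
    h_new with the previous embedding h_prev. A per-node scalar gate
    beta is computed from the concatenation [h_prev, h_new,
    h_prev - h_new] (hence the 3*d_node input rows of W_g) and
    interpolates between the two embeddings."""
    feats = np.concatenate([h_prev, h_new, h_prev - h_new], axis=-1)
    beta = sigmoid(feats @ W_g)      # shape (L, 1): one gate per node
    return beta * h_prev + (1.0 - beta) * h_new

# With zero gate weights, beta = 0.5 and the output is the average.
rng = np.random.default_rng(4)
h_prev = rng.normal(size=(5, 4))
h_new = rng.normal(size=(5, 4))
out = gated_residual(h_prev, h_new, np.zeros((12, 1)))
```

The gate lets the network fall back to the previous embedding when an update is unhelpful, stabilizing the stack of graph transformer layers.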
Edge updates via triangular multiplicative operations. Inter-residue edge embeddings are updated using the efficient triangular multiplicative operation proposed for AlphaFold (10). Following AlphaFold, we first calculate updates using the "outgoing" triangle edges, then the "incoming" triangle edges. We calculate the outgoing edge transformations, where $W_{a,v}, W_{b,v} \in \mathbb{R}^{d_{\text{edge}} \times 2 d_{\text{edge}}}$ are learnable parameters for the transformations of the "left" and "right" edges of each triangle, and $W_{a,g}, W_{b,g} \in \mathbb{R}^{d_{\text{edge}} \times 2 d_{\text{edge}}}$ are learnable parameters for their respective gating functions. Here $W^{\text{out}}_{c,v} \in \mathbb{R}^{2 d_{\text{edge}} \times d_{\text{edge}}}$ and $W^{\text{out}}_{c,g} \in \mathbb{R}^{d_{\text{edge}} \times d_{\text{edge}}}$ are learnable parameters for the value and gating transformations, respectively, for the outgoing triangle update to edge $e_{ij}$. After applying the outgoing triangle update, we calculate the incoming triangle update similarly, where $W^{\text{in}}_{c,v} \in \mathbb{R}^{2 d_{\text{edge}} \times d_{\text{edge}}}$ and $W^{\text{in}}_{c,g} \in \mathbb{R}^{d_{\text{edge}} \times d_{\text{edge}}}$ are learnable parameters for the value and gating transformations, respectively, for the incoming triangle update to edge $e_{ij}$. Note that $a_{ij}$ and $b_{ij}$ are calculated using separate sets of learnable parameters for the outgoing and incoming triangle updates.

implementation. Because our objective is to incorporate known structural data into the embedding, we omit the translational and rotational updates used in the AlphaFold structure module. We incorporate partial structure information by masking the attention between residue pairs that do not both have known coordinates. As a result, when no template information is provided, the node embeddings are updated only using the transition layers.

Structure realization via invariant point attention. The processed node and edge embeddings are passed to a block of three IPA layers to predict the residue atomic coordinates.
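The outgoing triangular multiplicative update can be sketched as follows. This is an illustrative AlphaFold-style sketch: layer norms are omitted, the hidden width `h` and weight shapes are simplified relative to the $d_{\text{edge}} \times 2 d_{\text{edge}}$ parameterization given above, and the incoming update would mirror it with the einsum contracted over edges $(k, i)$ and $(k, j)$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def triangle_multiplicative_outgoing(z, Wa_v, Wa_g, Wb_v, Wb_g, Wo_v, Wo_g):
    """'Outgoing' triangular multiplicative edge update, AlphaFold-style.

    z : (L, L, d_edge) edge tensor. Gated 'left' (a) and 'right' (b)
    projections are combined over the third vertex k of each triangle,
    then gated and projected back to d_edge. Layer norms omitted."""
    a = sigmoid(z @ Wa_g) * (z @ Wa_v)     # left edges,  (L, L, h)
    b = sigmoid(z @ Wb_g) * (z @ Wb_v)     # right edges, (L, L, h)
    o = np.einsum("ikh,jkh->ijh", a, b)    # combine edges (i,k) and (j,k)
    return sigmoid(z @ Wo_g) * (o @ Wo_v)  # gate, project back to d_edge

# Shape check on a toy edge tensor (L = 6 residues, d_edge = 4, h = 8).
rng = np.random.default_rng(5)
L, d, h = 6, 4, 8
z = rng.normal(size=(L, L, d))
Wa_v = rng.normal(size=(d, h)) * 0.1
Wa_g = rng.normal(size=(d, h)) * 0.1
Wb_v = rng.normal(size=(d, h)) * 0.1
Wb_g = rng.normal(size=(d, h)) * 0.1
Wo_v = rng.normal(size=(h, d)) * 0.1
Wo_g = rng.normal(size=(d, d)) * 0.1
z_new = triangle_multiplicative_outgoing(z, Wa_v, Wa_g, Wb_v, Wb_g, Wo_v, Wo_g)
```

The einsum is the key step: every edge $(i, j)$ aggregates information from the two "outgoing" edges $(i, k)$ and $(j, k)$ of each triangle through vertex $k$.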
Following the structure module of AlphaFold, we adopt a "residue gas" representation, in which each residue is represented by an independent coordinate frame. The coordinate frame for each residue is defined by four atoms (N, Cα, C, and Cβ) placed with ideal bond lengths and angles. We initialize the structure with all residue frames having Cα at the origin and task the model with

References cited in the text:
The promise and challenge of high-throughput sequencing of the antibody repertoire
Phenotypic determinism and stochasticity in antibody repertoires of clonally expanded plasma cells. bioRxiv
Monoclonal antibodies isolated without screening by analyzing the variable-gene repertoire of plasma cells
RosettaAntibodyDesign (RAbD): A general framework for computational antibody design
PyIgClassify: a database of antibody CDR structural classifications
Second antibody modeling assessment
Geometric potentials from deep learning improve prediction of CDR H3 loop structures
ABangle: characterising the VH-VL orientation in antibodies
Improved prediction of antibody VL-VH orientation
Highly accurate protein structure prediction with AlphaFold
Accurate prediction of protein structures and interactions using a three-track neural network
ColabFold: making protein folding accessible to all. bioRxiv
Antibody structure prediction using interpretable deep learning
ABlooper: Fast accurate antibody CDR loop structure prediction with accuracy estimation. bioRxiv
Improved antibody structure prediction by deep learning of side chain conformations
Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences
ProtTrans: towards cracking the language of life's code through self-supervised deep learning and high performance computing