key: cord-344105-9bw9rm6e
authors: Teraguchi, Shunsuke; Saputri, Dianita S.; Anais Llamas-Covarrubias, Mara; Davila, Ana; Diez, Diego; Aybars Nazlica, Sedat; Rozewicki, John; Ismanto, Hendra S.; Wilamowski, Jan; Xie, Jiaqi; Xu, Zichang; de Jesus Loza-Lopez, Martin; van Eerden, Floris J.; Li, Songling; Standley, Daron M.
title: Methods for sequence and structural analysis of B and T cell receptor repertoires
date: 2020-07-17
journal: Comput Struct Biotechnol J
DOI: 10.1016/j.csbj.2020.07.008
sha: 
doc_id: 344105
cord_uid: 9bw9rm6e

B cell receptors (BCRs) and T cell receptors (TCRs) make up an essential network of defense molecules that, collectively, can distinguish self from non-self and facilitate destruction of antigen-bearing cells such as pathogens or tumors. The analysis of BCR and TCR repertoires plays an important role in both basic immunology and biotechnology. Because the repertoires are highly diverse, specialized software methods are needed to extract meaningful information from BCR and TCR sequence data. Here, we review recent developments in bioinformatics tools for analysis of BCR and TCR repertoires, with an emphasis on those that incorporate structural features. After describing recent sequencing technologies for immune receptor repertoires, we survey structural modeling methods for BCRs and TCRs, along with methods for clustering such models. We review downstream analyses, including BCR and TCR epitope prediction, antibody-antigen docking and TCR-peptide-MHC modeling. We also briefly discuss molecular dynamics in this context.

B cell receptors (BCRs) and T cell receptors (TCRs) are key molecules in the adaptive immune response that provide protection against perturbations, both from the outside (e.g. pathogens) and from within (e.g. mutated or misfolded proteins). Together, BCRs and TCRs constitute a unique class of proteins whose coding sequences are arranged combinatorially in a cell-autonomous manner known as V(D)J recombination. In V(D)J recombination within a given cell, variable (V), diversity (D), and joining (J) segments are selected randomly from among many variants and joined to make the V (variable) region of a full-length receptor. In addition to V(D)J recombination, BCRs can also undergo subsequent somatic hypermutation (SHM) and clonal selection upon antigen encounter, collectively referred to as "affinity maturation". At the cell population level, these processes create a functionally diverse and dynamic set (repertoire) of B and T cells.
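The combinatorial nature of V(D)J recombination can be illustrated with a toy simulation. The sketch below is only an illustration under stated assumptions: the segment sequences, the `recombine` function and the `max_insert` parameter are hypothetical, real germline segments are much longer, and nucleotide deletion at the junctions is not modeled.

```python
import random

# Hypothetical toy gene segments; real germline segments are far longer
# and would be taken from a reference database.
V_SEGMENTS = ["TGTGCCAGC", "TGTGCCAGT", "TGTGCAAGC"]
D_SEGMENTS = ["GGGACA", "GGGGGC"]
J_SEGMENTS = ["GAGCAGTTC", "GAACAGTAC"]


def recombine(rng, max_insert=3):
    """Pick one V, D and J segment at random and join them, adding up to
    max_insert random nucleotides at each of the two junctions."""
    v = rng.choice(V_SEGMENTS)
    d = rng.choice(D_SEGMENTS)
    j = rng.choice(J_SEGMENTS)
    n1 = "".join(rng.choice("ACGT") for _ in range(rng.randint(0, max_insert)))
    n2 = "".join(rng.choice("ACGT") for _ in range(rng.randint(0, max_insert)))
    return v + n1 + d + n2 + j


rng = random.Random(0)  # fixed seed so the toy repertoire is reproducible
repertoire = [recombine(rng) for _ in range(1000)]
n_unique = len(set(repertoire))
print(f"{n_unique} distinct sequences among {len(repertoire)} simulated cells")
```

Even with only 3 x 2 x 2 = 12 segment combinations, the junctional insertions produce many more distinct sequences, which mirrors why the junction-containing CDR3 is the most diverse region.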

The number of possible different BCR or TCR sequence combinations is extremely high, with theoretical estimates in the 10^12-10^18 range [1]. However, the observed populations of receptor sequences in a given individual follow a power law, where most sequences appear only at very low frequency and a minority of sequences appear at higher frequencies (see, for example, [2] for a recent discussion).

For both BCRs and TCRs, V regions consist of two polypeptide chains, referred to as "light" (BCRs) or "alpha" (TCRs) and "heavy" (BCRs) or "beta" (TCRs). TCRs are composed of a single set of alpha and beta chains, while BCRs contain two sets of light and heavy chains [1]. For simplicity, in this review, we focus on a single pair of (light-heavy or alpha-beta) chains.

Both BCRs and TCRs adopt the immunoglobulin-like fold, in which the canonical antigen binding site is composed of three loops, called "complementarity-determining regions" (CDRs), in each receptor chain. The V(D)J recombination junction, in which random nucleotides may be inserted during recombination, is located in the third CDR (CDR3). As a result, CDR3 is the most diverse of the three CDRs [1]. Much effort has been spent on CDR3 modeling, in particular for soluble BCRs (antibodies).

BCRs interact directly with antigens, and we refer to interface residues as "paratope" on the BCR side and "epitope" on the antigen side (Figure 1A). TCRs, on the other hand, interact with antigen-derived peptide fragments, which are presented by major histocompatibility complex (MHC) proteins (Figure 1B). Here, "epitope" generally refers to the antigen-derived peptide and not the MHC contacting residues.

Figure 1 (caption fragment). TCR contacting residues are shown as sticks. B, TCR-peptide-MHC complex for a viral peptide TAX and class I HLA A-0201 (PDB identifier 1BD2). The epitope is shown as red spheres and contacting MHC residues are shown as green spheres, while paratope residues are shown as sticks.

Each human carries up to six class I MHC molecules and up to eight class II molecules.

There are thousands of MHC variants (alleles) in the human population, which can differ in their peptide specificity [1]. Peptide-MHC binding affinity shapes the TCR repertoire, and the particular set of MHC alleles carried by an individual becomes a source of TCR repertoire diversity, affecting susceptibility to particular diseases (reviewed in [3]).

Since BCR maturation requires co-stimulation from activated helper T cells [4], the BCR and TCR repertoires are not completely independent.

Both BCR and TCR sequences can be captured by current sequencing technologies.

Moreover, molecule and cell barcoding technologies are an area of intense research and development. Emerging sequencing and barcoding methods are thus expected to revolutionize our understanding of immune repertoires. As just one example, the number of paired (alpha-beta) TCR sequences for which the peptide-MHC is known has grown by two orders of magnitude in the last two years [5] , indicating a need for computational tools that can keep pace with this growth.

In this review, after briefly reviewing recent technologies for repertoire sequencing, we explore tools for interpreting BCR and TCR sequences in terms of their structures and targeted antigens. In this context, we cover structural modeling, epitope prediction, molecular docking, and molecular dynamics. Integration of such tools, along with growth in sequence and associated experimental data, will allow us to more fully describe the immune status of an individual in health and disease.

Very early approaches to characterizing immune repertoires were limited to estimating the length of the CDR3 loops [6]. Current methods, relying on high-throughput sequencing (HTS) technology, can be used for comprehensive quantification of full-length TCR and BCR V region sequences [7, 8]. Though a comprehensive survey of the existing technologies for repertoire sequencing analysis is beyond the scope of this review, HTS is the main source of data for subsequent structural analysis. Therefore, we briefly describe the basic information contained in bulk and single-cell RNA-based repertoire sequencing (Fig. 2). In bulk sequencing, receptor pairing information is lost, but higher coverage tends to be achieved. In single-cell sequencing, pairing information is preserved, but sample preparation and sequencing costs currently tend to be higher than in bulk sequencing.

Early development of HTS repertoire analysis was based on bulk sequencing (i.e. sequencing many cells without preserving their identities). In this approach, the information of light/heavy or alpha/beta pairs is lost. Thus, bulk sequence analysis tends to focus on a single (typically the heavy/beta) chain.

Repertoire sequencing typically uses TCR/BCR enrichment followed by PCR amplification to increase sensitivity and reduce sequencing cost. Since a 100 bp fragment is enough to resolve CDR3, short-read sequencing is often used. The choice of sequencing technology can have an important impact on quality, since the types and rates of errors differ between platforms. Among the preferred platforms are Illumina MiSeq (long reads) and HiSeq (short reads targeting CDR3).

One of the sources of low-quality repertoire data is a byproduct of PCR amplification.

Without other information, we cannot distinguish between true nucleotide sequence differences and PCR errors. As a result, PCR errors cause the appearance of spurious sequences, in particular those derived from dominant, highly abundant sequences/clonotypes. The use of Unique Molecular Identifier (UMI) sequences enables correction of PCR amplification biases and quantification of the number of receptor molecules expressed. Thus, technologies that incorporate UMIs have a distinct advantage.
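The idea behind UMI-based correction can be sketched in a few lines. The example below is a minimal illustration, not the procedure of any specific pipeline: it assumes that reads sharing a UMI derive from one original molecule, that those reads have equal length, and that a column-wise majority vote recovers the true sequence. The function names and the toy reads are hypothetical.

```python
from collections import Counter, defaultdict


def consensus(seqs):
    """Column-wise majority vote over equal-length reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def collapse_by_umi(reads):
    """reads: iterable of (umi, sequence) pairs.
    Returns {consensus_sequence: number_of_distinct_UMIs}."""
    by_umi = defaultdict(list)
    for umi, seq in reads:
        by_umi[umi].append(seq)
    counts = Counter()
    for seqs in by_umi.values():
        counts[consensus(seqs)] += 1  # one count per original molecule
    return dict(counts)


# Toy reads: the third read of UMI "AAC" carries a single PCR error.
reads = [
    ("AAC", "TGTGCCAGC"), ("AAC", "TGTGCCAGC"), ("AAC", "TGTGCTAGC"),
    ("GTT", "TGTGCCAGC"),
    ("CCA", "TGTGCAAGT"), ("CCA", "TGTGCAAGT"),
]
molecule_counts = collapse_by_umi(reads)
print(molecule_counts)
```

In this toy input the erroneous read does not create a spurious clonotype, and abundance is reported as the number of distinct UMIs rather than the number of reads, so amplification bias is removed from the counts.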

To date, several pipelines can be used to extract repertoire information from bulk HTS data. These tools generally map sequencing reads to TCR/BCR reference sequences.

Then, contigs (the continuous sequences assembled from the mapped reads) can be annotated by V(D)J gene usage and by CDR3 (and, optionally, CDR1 and CDR2) amino acid sequences [9, 10]. IMGT/HighV-QUEST (International Immunogenetics Information System V-Query and Standardization) [11, 12] uses pairwise alignment and sequence comparison to experimental data to align sequencing reads. IgBLAST [13] utilizes the BLAST algorithm [14] as its search engine. MiXCR [15] is an efficient pipeline equipped with a fast aligner. It can be used for reconstructing TCR/BCR sequences from generic RNA-seq data without PCR amplification of TCRs/BCRs [16]. A detailed assessment of these three tools can be found in [17]. The Immcantation framework [18, 19] and TRUST (TCR repertoire utilities for solid tissue) [20] can also be used for the same purpose, among many other available tools not covered here.

Though single-chain information alone is usually not enough to explain the binding of the receptor to the target epitope, several methods are applicable to bulk sequencing data. For example, diversity analysis of the repertoire sequences can be used to estimate the clonal diversity of the immune repertoire of each individual, as well as the overlap among the repertoires of several individuals. This can currently be performed using conventional ecology measures [21-23] or repertoire-specific estimators [24-26]. Also, by analyzing repertoire data from many individuals together with additional information, such as Human Leukocyte Antigen (HLA) allele profiles or disease status, one can associate each TCR with particular labels with the help of statistical hypothesis testing [27, 28]. Repertoire data also carry information about the underlying V(D)J recombination process. Thus, generative models of V(D)J recombination have been developed from repertoire sequences and, in turn, used to analyze repertoire sequence data [29-34].

Table 1 lists some, but not all, of the tools used for these sequence analyses.
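As an illustration of the conventional ecology measures mentioned above, the following sketch computes Shannon entropy and the Simpson index for one repertoire, and the Jaccard overlap between two repertoires. The choice of these three measures, the function names and the clonotype counts are assumptions made for the example; they are not taken from the cited tools.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy (natural log) of clonotype frequencies."""
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def simpson_index(counts):
    """Probability that two draws with replacement hit the same clonotype."""
    total = sum(counts.values())
    return sum((n / total) ** 2 for n in counts.values())


def jaccard_overlap(a, b):
    """Shared clonotypes divided by all clonotypes seen in either repertoire."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


# Made-up CDR3 clonotype counts for two donors.
donor1 = Counter({"CASSLGQGYEQYF": 50, "CASSPGTGELFF": 30, "CASRDRGNTIYF": 20})
donor2 = Counter({"CASSLGQGYEQYF": 5, "CASSQETQYF": 5})

print(shannon_entropy(donor1), simpson_index(donor1))
print(jaccard_overlap(donor1, donor2))
```

A lower Simpson index (or higher entropy) indicates a more even, more diverse repertoire, whereas a repertoire dominated by a few expanded clones scores in the opposite direction.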

There have also been exciting developments in the application of HTS technology to the experimental discovery of epitopes. In Libra-seq (Linking B cell receptor to antigen specificity through sequencing) [42], the 10x Genomics platform was used to barcode not only BCR sequences but also antigen proteins. By sorting the antigen-bound B cells and then performing single-cell sequencing, antigen-specific BCRs can be identified from the antigen barcodes. Similarly, by using barcoded peptide-MHC complexes, HTS allows us to generate a large reference dataset of TCR-epitope pairs [43]. Kula et al. [44] developed T-Scan, a high-throughput method that identifies functional antigen targets of T cells; subsequent next-generation sequencing enabled T-Scan to discover CMV antigens as well as the targets of self-reactive TCRs. Gee et al. [45] used yeast-display libraries of pMHCs and screened for antigens of orphan T cell receptors on tumor-infiltrating lymphocytes. Kobayashi et al. [46] developed a cloning and expression system called hTEC10 (human TCR efficient cloning system within 10 days) that can be used to rapidly determine the antigen specificity of TCRs. They successfully applied their system to determine the peptide specificity and cytotoxic activity of TCRs from EBV infection and cancer.

In spite of advances in experimental determination of receptor-antigen interactions, most high-throughput experiments lack residue-level resolution. X-ray crystallography and single-particle cryo-electron microscopy (cryo-EM), on the other hand, provide such high-resolution information, but are not suitable for high-throughput analysis.

Computational modeling of TCRs and BCRs is now routine and can be performed in a high-throughput manner. Building 3D models of receptors is also the first step in structure-based analysis of receptor-antigen interactions. For 3D structural modeling, TCR or BCR V regions are generally divided into "frameworks" and the three CDRs (Fig. 3). Each framework is a double layer of beta sheets that contains the beginning and end of each CDR loop. There are other loops in V regions, but the CDRs are important because of their high sequence diversity and because they form a continuous surface that constitutes the main antigen binding interface. Of the CDRs, CDR3 is the most diverse in terms of both sequence and structure.

CDR3 modeling has been tackled by a wide range of approaches [47]. Software for CDR3 modeling (Table 2) spans the range from simple sequence alignment methods [48] to fragment assembly [49], molecular dynamics (MD) [50] and robotics-based loop closure algorithms [51]. In the most recent antibody modeling assessment (AMA-II) [52], the lowest heavy-chain CDR3 (CDRH3) errors were obtained by our own group using a combination of MD, fragment assembly and manual selection [53]. Based on an internal assessment of our AMA-II results, we developed a purely fragment assembly-based tool, Kotai Antibody Builder [54]. We more recently introduced Repertoire Builder, which exceeded Kotai Antibody Builder in terms of accuracy, with a factor of 100 improvement in speed [55]. In the same time frame, several new tools, including ABodyBuilder [56], TCRModel [57], and PigsPro (Prediction of immunoglobulin structure v2) [58], have been introduced, which show advancement over previously published methods. Because of its high accuracy and ability to scale with the number of input sequences, we will briefly outline the Repertoire Builder approach.
In order to improve speed and reduce noise, one aim of Repertoire Builder was to remove 3D structure from the key decision-making steps: sampling and scoring. Working in three dimensions is computationally expensive and also messy, as protein structure files can contain a plethora of sources of noise. As an alternative, we derived feature vectors from pairwise query-template alignments and trained a machine learning model to recognize the good alignments. Feature vectors currently consist of BLOSUM62 matrix elements or gaps for each aligned residue pair and cover the entire V region. The inclusion of residues outside of the CDR region was intended to take the environment of the CDR into account in the choice of template. We note that scoring at the alignment level is not unique to Repertoire Builder; all of the methods do this. What is novel here is the alignment-derived feature vectors. Another trick used by Repertoire Builder was to store templates in the form of structure-aware multiple sequence alignments (MSAs), which can be readily computed using our MAFFT-DASH (Multiple Alignment using Fast Fourier Transform-Database of Aligned Structural Homologs) pipeline and which have been shown to be significantly more accurate than sequence-based MSAs [59]. The query sequence can be added to a stored template MSA efficiently using MAFFT's fragment-adding option, which preserves the relationships between the templates in the stored MSAs [60]. Templates in MSAs are grouped by their CDR lengths. Thus, there is a different template MSA stored for each CDR-length combination. The advantage of using MAFFT-DASH in this manner is primarily a combination of speed and MSA accuracy.

We have not assessed whether the use of alternative alignment strategies results in a degradation of model quality. The current Repertoire Builder can model 10^4 paired or unpaired sequences in approximately 30 min, which makes it practically useful for the high-throughput sequencing data discussed above. To our knowledge, Repertoire Builder is the only server that allows multiple BCR or TCR sequences to be input at one time.
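The alignment-derived feature vectors described above can be sketched as follows. This is a simplified illustration rather than the Repertoire Builder implementation: `toy_score` is a hypothetical identity-based stand-in for a BLOSUM62 lookup, and `gap_value` is an assumed placeholder for how gap columns are encoded.

```python
GAP = "-"


def toy_score(a, b):
    """Stand-in for a BLOSUM62 lookup: +1 for identity, -1 otherwise."""
    return 1 if a == b else -1


def alignment_features(query_aln, template_aln, score=toy_score, gap_value=-4):
    """Turn one pairwise alignment (two equal-length gapped strings) into a
    feature vector with one entry per alignment column."""
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned strings must have equal length")
    features = []
    for q, t in zip(query_aln, template_aln):
        if q == GAP or t == GAP:
            features.append(gap_value)  # assumed encoding for gap columns
        else:
            features.append(score(q, t))
    return features


vec = alignment_features("CAR-DYW", "CAKGDYW")
print(vec)
```

Because templates are grouped by CDR length, alignments within one group share the same number of columns, so vectors of this kind have a fixed length and can be passed directly to a standard classifier.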

As genomic data continue to grow, methods for clustering nucleotide or amino acid sequences will play a major role in sequence and structural analysis. Since generic sequence clustering methods (e.g. [61, 62]) are beyond the scope of this review, here we focus on methods specific to immune receptors. A common goal when studying immune repertoires is to understand common features of receptors that are shared by a group of donors of interest (Fig. 4). The implication here is that receptors targeting the same antigen and epitope will be more common in the donors of interest than in a control group. This is a very general notion that can be applied to either BCRs or TCRs and approached in a variety of ways. Given the broad diversity of immune repertoires, their uneven population distributions, and the relatively low overlap of exactly matching sequences among subjects, this task is a significant challenge. To address these issues, several clustering strategies have been developed recently. Below, we review some representative examples, including our own efforts.

Based on the observation that there are specific positions in TCR CDR3 regions that contact antigen peptides and that the presence of particular sequence motifs can define TCR clusters, Glanville et al. developed the GLIPH (grouping of lymphocyte interactions by paratope hotspots) algorithm [63, 64]. This algorithm clusters TCRs based on local sequence motifs, as well as on other parameters such as global CDR3 similarity, V gene usage, CDR3 length, MHC profile of donor(s) and clone size. GLIPH identifies motifs that are enriched in a given dataset relative to a control group, with the goal of producing groups of TCRs targeting the same peptide-MHC (pMHC). By using this approach, the authors were able to design synthetic antigen-specific TCRs based on these groups and confirm their specificity experimentally.
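The notion of a motif being enriched relative to a control group can be made concrete with a small example. The sketch below is not the GLIPH algorithm itself: it only computes a pseudocount-smoothed fold enrichment of k-mers, and the function names, the thresholds and the CDR3 sequences are hypothetical.

```python
from collections import Counter


def kmer_counts(cdr3s, k=3):
    """For each k-mer, count how many CDR3 sequences contain it."""
    counts = Counter()
    for seq in cdr3s:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def enriched_motifs(sample, control, k=3, min_fold=2.0, pseudocount=1.0):
    """Return {motif: fold} for k-mers whose smoothed frequency in the
    sample is at least min_fold times that in the control."""
    s, c = kmer_counts(sample, k), kmer_counts(control, k)
    result = {}
    for motif, n in s.items():
        f_sample = (n + pseudocount) / (len(sample) + pseudocount)
        f_control = (c[motif] + pseudocount) / (len(control) + pseudocount)
        fold = f_sample / f_control
        if fold >= min_fold:
            result[motif] = fold
    return result


# Made-up CDR3 sequences; "RGD" is planted in every sample sequence.
sample = ["CASSRGDYEQYF", "CASGRGDTEAFF", "CAWRGDNQPQHF"]
control = ["CASSLAPGELFF", "CASSQETQYF", "CASSPTGNTIYF"]
motifs = enriched_motifs(sample, control)
print(sorted(motifs.items(), key=lambda kv: -kv[1])[:3])
```

Germline-encoded k-mers such as "CAS" occur in both sets and are therefore not reported, which is the reason for comparing against a control repertoire rather than counting motifs in the sample alone.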

In a similar study, Dash et al. [65] developed TCRdist, a tool that estimates the similarity of two TCR sequences by computing a weighted Hamming distance between the concatenated amino acid sequences of the CDR loops of each TCR. TCRdist assigns a higher weight (3x) to the CDR3 regions. Clusters of highly similar antigen-specific TCRs can be built, and new TCRs of unknown specificity can be assigned to an antigen-specific cluster based on similarity, allowing for the prediction of antigen specificity.
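A weighted Hamming distance of this kind can be written compactly. The sketch below follows only the description given here and should not be read as the published TCRdist implementation: it assumes unit mismatch penalties, equal loop lengths in the two receptors, and a single chain; the function name, dictionary keys and example sequences are hypothetical.

```python
def weighted_hamming(tcr_a, tcr_b, cdr3_weight=3):
    """Mismatch count over CDR1, CDR2 and CDR3, with CDR3 mismatches
    multiplied by cdr3_weight. Each TCR is a dict with keys
    'cdr1', 'cdr2' and 'cdr3' holding amino acid strings."""
    total = 0
    for loop in ("cdr1", "cdr2", "cdr3"):
        a, b = tcr_a[loop], tcr_b[loop]
        if len(a) != len(b):
            raise ValueError(f"{loop} lengths differ; gaps are not handled here")
        mismatches = sum(x != y for x, y in zip(a, b))
        total += mismatches * (cdr3_weight if loop == "cdr3" else 1)
    return total


# Made-up beta-chain loops differing at one CDR1 and one CDR3 position.
tcr1 = {"cdr1": "SGHVS", "cdr2": "FQNEAQ", "cdr3": "CASSLGQGYEQYF"}
tcr2 = {"cdr1": "SGHTS", "cdr2": "FQNEAQ", "cdr3": "CASSLGTGYEQYF"}
print(weighted_hamming(tcr1, tcr2))  # 1 (CDR1) + 3 * 1 (CDR3) = 4
```

Because the function depends only on the two input receptors, the same pair always receives the same distance regardless of which other sequences are in the dataset.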

Additionally, a diversity score (TCRdiv) that robustly calculates the diversity of epitope-specific repertoires by considering both TCR similarity and exact identity in a generalized Simpson's diversity index was developed. TCRdist has recently been used to identify clonal expansion of M. tuberculosis-specific TCRs in a South African cohort, where it was able to accurately classify active tuberculosis patients [66].

Though they share the same goal, the focus of these two tools is slightly different. The GLIPH algorithm assumes that the input data are enriched in TCRs targeting a restricted set of epitopes, and tries to cluster these enriched TCRs using common motifs in the dataset. With this approach, it also avoids direct comparison of all pairs of sequences, which is computationally expensive. Thus, GLIPH is suitable for large repertoire analyses of particular disease cohorts. On the other hand, TCRdist is based on direct comparison of each pair of TCRs using a "universal" measure of TCR similarity, and it is thus currently difficult to apply the method to datasets larger than approximately 10^4 sequences.

However, an advantage of TCRdist is that the calculated distance between a pair of TCRs is always the same, regardless of other factors. Such a "universal" definition of TCR similarity/difference is of use when assumptions about shared antigen/epitope cannot be made.

Structural studies of antibodies targeting antigens specific to HIV [67] , influenza [68] and more recently SARS-CoV-2 [69] have demonstrated that antibodies produced in unrelated donors targeting common antigens and epitopes can share sequence and structural features. We note here that, since B cells can undergo affinity-driven maturation, such receptors need not derive from a similar common clone. Recently, the SAAB+ tool was developed to characterize structural properties of CDRs from differentiated B cells [70] . It is likely that more tools trained to identify "convergence" of functionally related antibodies will appear in the future as more sequence data from donors with shared BCR epitopes become available.

To this end, we recently developed InterClone, a method to cluster BCR sequences that are likely to share epitopes [71]. InterClone is based on a comparison of sequence and structural features of pairs of BCRs using a machine learning-based classifier that was trained on known antigen-BCR structures. Like TCRdist, InterClone assigns a "universal" similarity score to each BCR pair. Hierarchical clustering is then used to group sequences of high similarity. As such, InterClone can be used without requiring sequences to be enriched in a particular BCR motif. A sensitivity of 61.9% and specificity of 99.7% were obtained when InterClone was applied to an independent set of anti-HIV antibody sequences [71]. A more robust and computationally efficient version of InterClone that works for both BCRs and TCRs and can perform high-throughput analysis of up to 10^5 sequences is currently being developed.
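Grouping receptors from a pairwise similarity score can be sketched as below. The example is illustrative only and does not reproduce InterClone: the linkage criterion is not specified in the text, so single linkage with a fixed cutoff is assumed here (equivalent to taking connected components of the thresholded similarity graph), and the item labels, scores and `threshold` value are hypothetical.

```python
def single_linkage_clusters(items, similarity, threshold):
    """Group items so that any pair with similarity >= threshold ends up in
    the same cluster (a single-linkage dendrogram cut at that threshold)."""
    parent = {x: x for x in items}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if similarity(a, b) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for x in items:
        clusters.setdefault(find(x), []).append(x)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical pairwise scores from a classifier; unlisted pairs score 0.
scores = {("A", "B"): 0.9, ("B", "C"): 0.8, ("A", "C"): 0.2, ("C", "D"): 0.1}


def lookup(a, b):
    return scores.get((a, b), scores.get((b, a), 0.0))


clusters = single_linkage_clusters(["A", "B", "C", "D"], lookup, threshold=0.5)
print(clusters)
```

Note that A and C are grouped together through B even though their direct score is low; other linkage criteria (complete or average) would behave differently in such cases.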

In addition to the above clustering methods, networks that describe antibody repertoire architecture can be used to compare repertoires. Miho and colleagues [72] developed a platform that builds similarity networks of hundreds of thousands of antibody sequences from both humans and mice. Using this approach, the authors detected global patterns in antibody repertoire architectures that were highly reproducible in different subjects, and tended to converge despite independent VDJ recombination. Furthermore, these repertoire architectures were robust to clonal deletion of private clones.

TCRs recognize short peptides presented on class I or II MHC complexes. The ability to predict epitope(s) from TCR sequence and MHC allele would be highly valuable in elucidating disease etiology, monitoring the immune system, developing diagnostic assays and designing vaccines. Traditionally, epitope identification has been carried out experimentally [73], which is both costly and time-consuming. There is therefore great interest in methods that can accelerate this process computationally.

To this end, Fischer et al. [74] developed a deep learning approach on TCR CDR3 regions to predict the antigen specificity of single T cells. Jokinen et al. [75] developed TCRGP to predict whether TCRs recognize certain epitopes using a novel Gaussian process (GP).

Their method uses CDR sequences from TCR alpha and beta and learns which CDR recognizes different epitopes. The tool was applied to identify T cells specific to HBV.

NetTCR by Jurtz et al. [43] utilized convolutional networks for sequence-based prediction of TCR-pMHC specificity. NetTCR uses the recent explosion of next-generation sequencing data to train a sequence-based predictor. Ogishi et al. [76] computationally defined immunogenicity scores through sequence-level simulation of the interaction between pMHC complexes and public TCR repertoires. Though their focus is more on the immunogenicity of peptides presented on MHC molecules, they also observed a correlation between individual TCR-pMHC affinities and the features important for the immunogenicity score. Gielis et al. [77] applied random forest-based classifiers for epitope-specific TCRs to repertoire-level analysis. Their models successfully detected the increase of epitope-specific TCRs upon vaccination in two Yellow Fever vaccination studies. The works by Chain and co-workers [78, 79] also addressed related questions. In [78], the authors constructed a classifier to distinguish the TCR beta sequences in expanded repertoires of ovalbumin-stimulated mice from those of controls. Their classifier was based on the frequencies of amino acid triplets in CDR3, and their choice of machine learning algorithm, LPBoost (linear programming boosting), allowed them to identify the responsible motifs in CDR3.

Unlike BCRs, which can be expressed as soluble antibodies, TCRs remain attached to the cell surface. This, along with their weaker binding affinities to pMHC complexes, has made experimental structural analysis more difficult than for BCRs. Nevertheless, from the known crystal structures of TCR-pMHC complexes, we can see that the range of docking modes is highly restricted, as expected from the similarity of MHCs within a given class (Fig. 5). As a result of this restriction, we and others [80] have approached the problem using structural templates for TCR-pMHC docking. There are currently few methods for modeling TCR-pMHC complexes. To our knowledge, there are two public servers for this purpose: our own ImmuneScape [81] and the Lymphocyte Receptor Automated Modeling (LYRA)-based [82] TCRpMHCmodels [83]. Both of these approaches are "template-based" in the sense that existing structures, rather than stochastic conformational sampling, are used as templates for each of the key modeling steps: TCR, pMHC and TCR-pMHC orientation. They are also both "bottom-up" in the sense that models for TCR and pMHC are built and then combined to form the TCR-pMHC complex. One possible conceptual difference is that, in ImmuneScape, CDRs are modeled after the TCR and pMHC templates are combined in order to take the pMHC into account. It will be interesting to compare the two approaches in more detail.

TCRpMHCmodels compared favorably to an earlier rigid docking-based approach, TCRFlexDock, which suggests that care must be taken in sampling TCR-pMHC orientations beyond that which is observed in typical crystal structures.

Several computational methods are available to predict BCR epitopes and paratopes. Of the two problems, paratope prediction is much easier, as paratopes tend to correspond to CDR residues, while epitopes can be anywhere on an antigen. This is illustrated in the case of anti-influenza hemagglutinin (HA) antibodies (Fig. 6); a superimposition of all known anti-HA antibodies leaves very little untargeted surface area.

Paratope prediction methods include the Paratome algorithm [84], which is based on structural consensus between BCRs and uses features from sequence or structure; Prediction of Antibody Contacts, or ProABC [85], which applies a random forest learning technique and is based on sequence; Parapred [86], which uses a deep learning architecture to extract patterns from variable region sequences; AntibodyInterfacePrediction [87], which uses a support vector machine (SVM); and AG-Fast-Parapred [88], which is based on deep neural networks and utilizes antigen sequence information to predict the paratope. AG-Fast-Parapred reported improved accuracy over existing methods; however, at the time of this writing, the tool is not available to the public.

With regard to epitope prediction, there are many tools available. The Antibody i-Patch [89] algorithm introduces a likelihood score for residue contact as constraints on the local docking to generate the paratope prediction, and thus requires the structure of the antigen-antibody complex. Additionally, epitope predictors have evolved to be specific to the cognate antibody. Previously, methods were built to predict linear epitopes, which are contiguous polypeptide chains; an example is LBtope (Linear B-Cell epitope prediction server) [90], whose algorithm uses experimentally verified B-cell epitopes and an SVM to discriminate epitopes from background. However, the majority of epitopes are non-continuous surface residues characterized by structure as well as sequence. Several methods are available to treat such conformational epitopes. SEPIa [91] uses a combination of two classifiers (naive Bayesian and random forest) applied to the antigen sequence. BepiPred-2.0 [92] uses the random forest algorithm to predict epitopes from primary sequence only. Glep [93], a recent method based on subgraph clustering for the prediction of separated and overlapping epitopes, achieved an F-score of 0.579 for single epitopes.

Recently, there has been a realization that epitope prediction without reference to a particular antibody is an ill-posed problem, and methods of "antibody-specific epitope prediction" have been introduced [94]. There are currently few options for antibody-specific epitope prediction. The PEASE (Predicting Epitopes using Antibody Sequence) [95] method applies machine learning to predict true contacts of antibody-antigen residue pairs, providing candidate epitope patches. EpiPred [96] identifies the epitope region by rescoring antibody-antigen global docking with an algorithm based on geometric matching of antigen-antibody interfaces and asymmetric potentials. MAbTope [97] predicts epitope residues based on consensus epitopes shared by top-ranked docking poses; the success of this approach depends on the quality of the docking. Although there is a clear awareness of the importance of antibody information in epitope prediction, the traditional antigen-centric methods cannot easily be extended to include such information. This is partially because of the increase in the number of degrees of freedom when antibody-antigen interactions are considered.

The most direct means of tackling antibody-antigen interactions is through protein docking, a technique that requires structure information of antibody and antigen. This introduces 6 additional degrees of freedom for rigid docking and a host of other issues due to the complexity and inherent uncertainty of protein structural information.
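The six rigid-body degrees of freedom correspond to three rotation angles and three translation components. The sketch below applies such a transformation to a set of coordinates; it is a generic illustration, not part of any docking program, and the angle convention (rotation about x, then y, then z), the function names and the coordinates are assumptions.

```python
import math


def rotation_matrix(alpha, beta, gamma):
    """R = Rz(gamma) * Ry(beta) * Rx(alpha), angles in radians."""
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    cg, sg = math.cos(gamma), math.sin(gamma)
    return [
        [cg * cb, cg * sb * sa - sg * ca, cg * sb * ca + sg * sa],
        [sg * cb, sg * sb * sa + cg * ca, sg * sb * ca - cg * sa],
        [-sb, cb * sa, cb * ca],
    ]


def rigid_transform(coords, angles, translation):
    """Rotate each (x, y, z) point by the three angles, then translate it."""
    r = rotation_matrix(*angles)
    return [
        tuple(
            r[i][0] * x + r[i][1] * y + r[i][2] * z + translation[i]
            for i in range(3)
        )
        for x, y, z in coords
    ]


# Two toy "atoms": rotate 90 degrees about z and shift 5 units along z.
ligand = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pose = rigid_transform(ligand, (0.0, 0.0, math.pi / 2), (0.0, 0.0, 5.0))
print(pose)
```

A rigid docking search scores many such poses of one partner relative to the other; internal distances within the moved molecule are unchanged, which is why flexibility and modeling errors in the CDRs have to be handled separately.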

Nevertheless, protein docking is a mature field and steady progress has been made in this area. Generally speaking, docking methods can be classified into four categories: fast Fourier transform (FFT) correlation; Monte Carlo (MC) simulated annealing; geometric hashing; and flexible docking [98]. In Table 3, we give a representative list of molecular docking tools or web servers that can be applied to antibody-antigen docking. Of these, ClusPro [99], PatchDock [100], FRODOCK [101] and SnugDock [102] provide [103] and another three representative tools (ClusPro, LightDock [104] and ZDOCK [105]) to systematically analyze 16 antibody-antigen complexes from the well-studied ZDOCK protein-protein interaction benchmark (version 5.0) [106]. The results were evaluated using criteria established by the Critical Assessment of PRedicted Interactions (CAPRI) community, where models are classified into four categories: Incorrect, Acceptable, Medium, or High quality [107]. It was demonstrated that information-driven docking, even using noisy predictions of epitope and paratope, could significantly improve performance over all four algorithms [108]. Notably, HADDOCK was capable of providing high quality models for all 16 entries based on CAPRI criteria in this test.

However, this study did not evaluate the tolerance of the docking methods to typical BCR modeling errors.

As with all protein docking from homology models, the success of docking antibody models depends heavily on the quality of the starting structures [109] . Structural uncertainties in the binding regions can occur either from flexibility or modeling errors.

Moreover, the regions of greatest uncertainty tend to be the CDRs (especially CDRH3), which are highly likely to form part of the paratope [110]. These issues can be addressed to some extent by the use of epitope and paratope predictions. However, few antibody docking methods have been rigorously tested using a large benchmark of realistic models.

The bottom line is that structure-based prediction of antibody-antigen interactions from sequence involves a number of interrelated tasks: receptor and antigen model building, initial epitope and paratope prediction, docking, scoring and refinement. The combination of so many critical steps results in complexity, both in terms of software integration and in parameter optimization. Fortunately, the emergence of larger and better BCR sequence datasets will be a motivation to develop well-integrated structure prediction pipelines.

In this review, we have focused primarily on high-throughput structure-based methods that can be applied to BCR or TCR repertoires. As is clear from the previous section, combining software methods that work well in isolation introduces complexity. Such complexity arises from conceptual considerations (e.g. parameter optimization) and technical issues (e.g. code interoperability). In this regard, molecular dynamics (MD) is conceptually simple: it applies Newtonian mechanics to molecular systems. The force fields describing the interatomic interactions can be taken as given and generally do not have to be optimized. Therefore, even though MD is not a high-throughput method, it can be used to independently confirm BCR- or TCR-specific calculations.
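The statement that MD applies Newtonian mechanics can be illustrated with a toy integrator. The sketch below uses the velocity Verlet scheme, a standard choice in MD codes, on a single particle in a one-dimensional harmonic well. The harmonic force stands in for a real force field, and all names and parameter values are illustrative assumptions rather than settings from any study discussed here.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate m * a = F(x) for one particle in 1D with velocity Verlet."""
    a = force(x) / mass
    trajectory = [(x, v)]
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt   # position update
        a_new = force(x) / mass              # force at the new position
        v = v + 0.5 * (a + a_new) * dt       # velocity update with averaged acceleration
        a = a_new
        trajectory.append((x, v))
    return trajectory


def harmonic_force(x, k=1.0):
    """Toy stand-in for a force field: a spring with constant k."""
    return -k * x


def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy of the toy system."""
    return 0.5 * mass * v * v + 0.5 * k * x * x


# Reduced units; values chosen only for illustration.
traj = velocity_verlet(x=1.0, v=0.0, force=harmonic_force, mass=1.0, dt=0.01, n_steps=1000)
drift = abs(total_energy(*traj[-1]) - total_energy(*traj[0]))
print("energy drift:", drift)
```

A production MD engine differs mainly in scale: the force function sums bonded and non-bonded terms over all atoms, and thermostats, barostats and constraints are added, but the update loop has the same form.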

As with all proteins, the dynamics of BCRs and TCRs are intimately tied to their functions. Most studies of TCR dynamics consider only the receptor bound to a pMHC. In contrast, Dominguez and Knapp compared the dynamics of pMHC-bound and free TCRs. Apart from expected results, such as increased flexibility and an increased solvent-accessible surface of the CDRs in the free TCR, they also found differences in the hydrogen bond network of CDR3α between the free and the pMHC-bound TCR [113]. A study combining steered molecular dynamics and single-molecule biophysical experiments [114] examined the formation of catch bonds between the pMHC and the TCR. Catch bonds are a special type of bond whose lifetime increases when more force is applied. This study suggests that catch bond formation is influenced by conformational changes in the pMHC.
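The counterintuitive force dependence of a catch bond can be illustrated with a generic phenomenological model. The sketch below contrasts a Bell-type slip bond with a two-pathway catch-slip model in which one dissociation pathway is suppressed by force and another is accelerated. This is not the model used in the cited study, and every rate constant and distance below is a hypothetical value chosen only to show the qualitative shape.

```python
import math

KBT_PN_NM = 4.11  # thermal energy near room temperature, in pN * nm


def slip_lifetime(force, k0, x_s):
    """Bell-type slip bond: the off-rate grows exponentially with force (pN)."""
    return 1.0 / (k0 * math.exp(force * x_s / KBT_PN_NM))


def catch_slip_lifetime(force, k_c, x_c, k_s, x_s):
    """Two-pathway model: a catch pathway slowed by force plus a slip pathway sped up by force."""
    k_off = (k_c * math.exp(-force * x_c / KBT_PN_NM)
             + k_s * math.exp(force * x_s / KBT_PN_NM))
    return 1.0 / k_off


# Hypothetical parameters: rates in 1/s, distances in nm.
params = dict(k_c=10.0, x_c=1.0, k_s=0.1, x_s=0.3)
for f in (0.0, 10.0, 20.0, 40.0):
    print(f, "pN  catch-slip lifetime (s):", catch_slip_lifetime(f, **params))
```

With these illustrative parameters the lifetime first rises with force and then falls once the slip pathway dominates, whereas the pure slip bond only shortens under load.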

A downside of molecular dynamics simulations is their high computational cost. Fodor et al. were able to distill conformational data from pMHC class I X-ray structures using ensemble refinement, a refinement technique that yields dynamic information without the need for more computationally intensive molecular dynamics simulations [115]. Another way to reduce the computational requirements is to use coarse-grained simulations, in which atoms are grouped together into beads. Coarse graining allows the study of much larger systems on longer time scales. Friess et al. modeled the transmembrane domains of the immunoglobulin M (IgM) B cell receptor, which had previously been unresolved, and subsequently used coarse-grained simulations to study their aggregation behavior and association with lipid rafts [116].
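The idea of grouping atoms into beads can be sketched as a simple mapping step. The minimal Python example below places each bead at the mass-weighted center of its atoms. The atom-to-bead assignment, the masses and the coordinates are made up for illustration; real coarse-grained force fields define their own mapping rules and bead interaction parameters, which are not covered here.

```python
def coarse_grain(atoms, bead_of_atom):
    """Map atoms to beads located at the center of mass of each group.

    atoms        : list of (mass, (x, y, z)) tuples
    bead_of_atom : bead label for each atom, in the same order as atoms
    Returns a dict: bead label -> (total mass, (x, y, z)).
    """
    sums = {}
    for (mass, (x, y, z)), bead in zip(atoms, bead_of_atom):
        m, sx, sy, sz = sums.get(bead, (0.0, 0.0, 0.0, 0.0))
        # Accumulate total mass and mass-weighted coordinates per bead.
        sums[bead] = (m + mass, sx + mass * x, sy + mass * y, sz + mass * z)
    return {bead: (m, (sx / m, sy / m, sz / m))
            for bead, (m, sx, sy, sz) in sums.items()}


# Four hypothetical atoms mapped onto two beads.
atoms = [
    (12.0, (0.0, 0.0, 0.0)),
    (12.0, (2.0, 0.0, 0.0)),
    (16.0, (5.0, 0.0, 0.0)),
    (1.0, (5.0, 1.0, 0.0)),
]
beads = coarse_grain(atoms, ["B1", "B1", "B2", "B2"])
print(beads)
```

Reducing four particles to two already halves the number of interaction sites; in practice the larger gain comes from the smoother bead potentials, which tolerate longer time steps.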

Recent advances in sequencing technology enable the study of immune responses in unprecedented breadth and depth. As discussed above, the emerging data has spawned the development of a wide range of modeling methods that are applicable to B cells, T cells or both. Current challenges include the integration of data and methodologies. For example, sequence and structural information can, in principle, be combined to yield more accurate descriptions of receptors sharing antigen and epitope specificity. Structural modeling is still not in the mainstream of repertoire analysis; nevertheless, 3D modeling methods present a straightforward direction to encompass "shared features" of functionally related receptors in different donors.

In the context of repertoire analysis, we are often interested in the target antigens and epitopes; however, the scale of publicly available data on targeted antigens and epitopes is currently smaller than that of BCR/TCR sequences, and vastly smaller than the actual BCR-antigen or TCR-peptide-MHC interactome. As barcoding methods evolve to include antigens themselves [42], there may soon be new and valuable data available to train methods for functional classification of BCRs and TCRs.

At the point where we are asking not only what is targeted but also why or why not, structural modeling is likely to play a critical role in our understanding of BCR and TCR molecular recognition. As a case in point, at the time of this writing, we are in the midst of the COVID-19 pandemic. This is an example where the target antigens, along with their structures, are largely known, and understanding host immune responses to these antigens is of vital importance in the development of diagnostics, biomarkers, vaccines and therapeutics [117]. Structural similarities among neutralizing antibodies targeting SARS-CoV-2 [69], or between those targeting SARS-CoV-1 and SARS-CoV-2 [118], have been noted. With such high stakes driving research and development, integration of emerging technologies in the repertoire analysis domain, including structural analysis, is expected.

As the saying goes, "necessity is the mother of invention," and the need for understanding human immune repertoires has never been greater.

We would like to thank all members of the Systems Immunology Lab for helpful 

Janeway's immunobiology

How many different clonotypes do immune repertoires contain? Current Opinion in Systems Biology

Structural determinants of T-cell receptor bias in immunity

Cytokine-secreting follicular T cells shape the antibody repertoire

VDJdb in 2019: database extension, new analysis infrastructure and a T-cell receptor motif compendium

Statistical analysis of CDR3 length distributions for the assessment of T and B cell repertoire biases

Characterizing immune repertoires by high 12

IMGT/HighV-QUEST paradigm for T cell receptor IMGT clonotype diversity and next generation repertoire immunoprofiling

IgBLAST: an immunoglobulin variable domain sequence analysis tool

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs

MiXCR: software for comprehensive adaptive immunity profiling

Antigen receptor repertoire profiling from RNA-seq data

Benchmarking immunoinformatic tools for the analysis of antibody repertoire sequences

pRESTO: a toolkit for processing high-throughput sequencing raw reads of lymphocyte receptor repertoires

Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data

Ultrasensitive detection of TCR hypervariable-region sequences in solid-tissue RNA-seq data

VDJtools: Unifying Post-analysis of T Cell Receptor Repertoires

Vidjil: A Web Platform for Analysis of High-Throughput

High-Throughput Mapping of B Cell Receptor Sequences to Antigen Specificity

NetTCR: sequence-based prediction of TCR binding to peptide-MHC complexes using convolutional neural networks. bioRxiv

T-Scan: A Genome-wide Method for the Systematic Discovery of T Cell Epitopes

A new cloning and expression system yields and validates TCRs from blood lymphocytes of patients with cancer within 10 days

Antibody H3 Structure Prediction

PIGS: automatic prediction of antibody structures

RosettaAntibody: antibody variable region homology modeling server

Revisiting antibody modeling assessment for CDR-H3 loop

Sub-angstrom accuracy in protein loop reconstruction by robotics-inspired conformational sampling

Second antibody modeling assessment (AMA-II)

High-resolution modeling of antibody structures by a combination of bioinformatics, expert knowledge, and molecular simulations

Kotai Antibody Builder: automated high-resolution

TCRmodel: high resolution modeling of T cell receptors from sequence

PIGSPro: prediction of immunoGlobulin structures v2

MAFFT-DASH: integrated protein sequence and structural alignment

Adding unaligned sequences into an existing alignment using MAFFT and LAST

Search and clustering orders of magnitude faster than BLAST

Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences

Identifying specificity groups in the T cell receptor repertoire

Analyzing the Mycobacterium tuberculosis immune response by T-cell receptor clustering with GLIPH2 and genome-wide antigen screening

Quantifiable predictive features define epitope-specific T cell receptor repertoires

A Diverse Lipid Antigen-Specific TCR Repertoire Is Clonally Expanded during Active Tuberculosis

Sequence and structural convergence of broad and potent HIV antibodies that mimic CD4 binding

Vaccine-Induced Antibodies that Neutralize Group 1 and Group 2 Influenza A Viruses

Convergent antibody responses to SARS-CoV-2 in convalescent individuals

Structural diversity of B-cell receptor repertoires along the B-cell differentiation axis in humans and mice

Functional clustering of B cell receptors using sequence and structural features

Large-scale network analysis reveals the sequence space architecture of antibody repertoires

T cell antigen discovery

Predicting antigen-specificity of single T-cells based on TCR CDR3 regions. bioRxiv

TCRGP: Determining epitope specificity of T cell receptors. bioRxiv

Quantitative Prediction of the Landscape of T Cell Epitope Immunogenicity in Sequence Space

Specificity, Privacy, and Degeneracy in the CD4 T Cell Receptor Repertoire Following Immunization. Front Immunol

Tracking global changes induced in the CD4 T-cell receptor repertoire by immunization with a complex antigen using short stretches of CDR3 protein sequence

T-Cell Receptor Cognate Target Prediction Based on Paired alpha and beta Chain Sequence and Structural CDR Loop Similarities

Structural modeling of lymphocyte receptors and their antigens

LYRA, a webserver for lymphocyte receptor structural modeling

TCRpMHCmodels: Structural modelling of TCR-pMHC class I complexes. Sci Rep

Paratome: an online tool for systematic identification of antigen-binding regions in antibodies based on sequence or structure

Prediction of site-specific interactions in antibody-antigen complexes: the proABC method and server

Parapred: antibody paratope prediction using convolutional

Antibody i-Patch prediction of the antibody binding site improves rigid local antibody-antigen docking

Improved method for linear B-cell epitope prediction using antigen's primary sequence

SEPIa, a knowledge-driven algorithm for predicting conformational B-cell epitopes from the amino acid sequence

BepiPred-2.0: improving sequence-based B-cell epitope prediction using conformational epitopes

Novel overlapping subgraph clustering for the detection of antigen epitopes

PEASE: predicting B-cell epitopes utilizing antibody sequence

Improving B-cell epitope prediction and its application to global antibody-antigen docking

MAbTope: A Method for Improved Epitope Mapping

Protein-protein docking tested in blind predictions: the CAPRI experiment

The ClusPro web server for protein-protein docking

PatchDock and SymmDock: servers for rigid and symmetric docking

FRODOCK 2.0: fast protein-protein docking server

SnugDock: paratope structural optimization during antibody-antigen docking compensates for errors in antibody homology models

The HADDOCK2.2 Web Server: User-Friendly Integrative Modeling of Biomolecular Complexes

LightDock goes information-driven

ZDOCK server: interactive docking prediction of protein-protein complexes and symmetric multimers

Updates to the Integrated Protein-Protein Interaction Benchmarks: Docking Benchmark Version 5 and Affinity Benchmark Version 2

Modeling protein-protein and protein-peptide complexes: CAPRI 6th edition

Modeling Antibody-Antigen Complexes by Information-Driven Docking

Modeling complexes of modeled proteins

Computational approaches to therapeutic antibody design: established methods and emerging trends

Large scale characterization of the LC13 TCR and HLA-B8 structural landscape in reaction to 172 altered peptide ligands: a molecular dynamics simulation study

Epitope flexibility and dynamic footprint revealed by molecular dynamics of a pMHC-TCR complex

How peptide/MHC presence affects the dynamics of the LC13 T-cell receptor

Mechano-regulation of Peptide-MHC Class I Conformations Determines TCR Antigen Recognition

Previously Hidden Dynamics at the TCR-Peptide-MHC Interface Revealed

Structural Model of the mIgM B-Cell Receptor Transmembrane Domain From Self-Association Molecular Dynamics Simulations

The trinity of COVID-19: immunity, inflammation and intervention

Potent neutralizing antibodies against SARS-CoV-2 identified by high-throughput single-cell sequencing of convalescent patients' B cells

ASAP - A Webserver for Immunoglobulin-Sequencing Analysis Pipeline

Antigen Receptor Galaxy: A User-Friendly, Web-Based Tool for Analysis and Visualization of T and B Cell Receptor Repertoire Data

bcRep: R Package for Comprehensive Analysis of B Cell Receptor Repertoire Data

sumrep: A Summary Statistic Framework for Immune Receptor Repertoire Comparison and Model Validation

Modeling and docking of antibody structures with Rosetta

FireDock: a web server for fast interaction refinement in molecular docking

SwarmDock: a server for flexible protein-protein docking

pyDockWEB: a web server for rigid-body protein-protein docking using electrostatics and desolvation scoring

HDOCK: a web server for protein-protein and protein-DNA/RNA docking based on a hybrid strategy

HexServer: an FFT-based protein docking server powered by graphics processors

A web interface for easy flexible protein-protein docking with ATTRACT

GRAMM-X public web server for protein-protein docking

Sedat Aybars Nazlica, Jiaqi Xie and Martin de Jesus Loza Lopez contributed to the experimental epitope determination sections. John Rozewicki, Ana Davila and Jan Wilamowski wrote most of the in-house software sections. Floris J. van Eerden wrote the MD section. Zichang Xu wrote the BCR docking section. Daron M. Standley wrote the overall manuscript and coordinated the efforts of the other members.

Standley and Songling Li are shareholders in KOTAI Biotechnologies, Inc. and are co-applicants on US Patent App. 16/333,875. All authors declare no other conflicts of interest