key: cord-0981694-hxisj7fv authors: Wang, Yiquan; Yuan, Meng; Lv, Huibin; Peng, Jian; Wilson, Ian A.; Wu, Nicholas C. title: A large-scale systematic survey reveals recurring molecular features of public antibody responses to SARS-CoV-2 date: 2022-03-25 journal: Immunity DOI: 10.1016/j.immuni.2022.03.019 sha: 52f91a48c2a3a05677a93a526678a8635d036a37 doc_id: 981694 cord_uid: hxisj7fv

Global research to combat the COVID-19 pandemic has led to the isolation and characterization of thousands of human antibodies to the SARS-CoV-2 spike protein, providing an unprecedented opportunity to study the antibody response to a single antigen. Using information derived from 88 research publications and 13 patents, we assembled a dataset of ∼8,000 human antibodies to the SARS-CoV-2 spike protein from >200 donors. By analyzing immunoglobulin V and D gene usages, complementarity-determining region H3 sequences, and somatic hypermutations, we demonstrated that the common (public) responses to different domains of the spike protein were quite different. We further used these sequences to train a deep learning model to accurately distinguish between human antibodies to SARS-CoV-2 spike protein and to influenza hemagglutinin protein. Overall, this study provides an informative resource for antibody research and enhances our molecular understanding of public antibody responses.

From the beginning of the COVID-19 pandemic, many research groups worldwide turned their attention to SARS-CoV-2 and, in particular, to the immune response to infection and vaccination. Since 2020, thousands of human monoclonal antibodies to SARS-CoV-2 have been isolated and characterized (Li et al., 2022a; Raybould et al., 2021).
The major surface antigen to which antibodies are elicited is the SARS-CoV-2 spike (S) protein, which is a homotrimeric glycoprotein

Moreover, our recent study has shown that a public antibody response to influenza hemagglutinin is driven by an IGHD gene with minimal dependence on the IGHV gene (Wu et al., 2018). Therefore, the true extent and molecular characterization of public antibody responses remain to be explored.

Although information on many human monoclonal antibodies to SARS-CoV-2 is now publicly available, it has been difficult to leverage all available information to investigate public antibody responses to SARS-CoV-2. One major challenge is that the data from different studies are rarely in the same format. This inconsistency imposes a huge barrier to data mining. The establishment of the coronavirus antibody database (CoV-AbDab) has enabled researchers to deposit their antibody data in a standardized format and has partially resolved the data formatting issue (Raybould et al., 2021). However, not every SARS-CoV-2 antibody study has deposited its data in CoV-AbDab. Furthermore, IGHD gene identities, nucleotide sequences, and donor IDs are not available in CoV-AbDab, which makes it challenging to study public antibody responses using CoV-AbDab alone. Thus, additional efforts must be made to integrate the information across many different SARS-CoV-2 antibody studies to investigate and decipher public antibody responses.

In this study, we performed a systematic literature survey and assembled a large dataset of human SARS-CoV-2 monoclonal antibodies with donor information. We then analyzed this dataset and uncovered many antibody sequence features that contribute to the public antibody responses to SARS-CoV-2 S.
For example, we identified a public antibody response to RBD that is largely independent of the IGHV gene, as well as involvement of a particular IGHD gene in a public antibody response to S2. Our analysis also revealed a number of recurring somatic hypermutations (SHMs) in different public clonotypes. All of these sequence features provide a foundation for using deep learning to identify SARS-CoV-2 S antibodies.

the ridge region of SARS-CoV-2 RBD (Figure 5A), and are able to potently neutralize multiple variants of concern (VOCs) (Li et al., 2021b; Schmitz et al., 2021; Wang et al., 2021), including Omicron (Zhou et al., 2022a). Furthermore, the therapeutic antibody tixagevimab is derived from a member of this IGHV1-58/IGKV3-20 public clonotype, namely COV2-2196 (Dong et al., 2021). Here, we compared two previously determined structures of IGHV1-58/IGKV3-20 antibodies in complex with RBD (Dejnirattisai et al., 2021; Wheatley et al., 2021). One has the germline-encoded VL S29 (Figure 5B) and the other carries a somatically mutated VL R29 (Figure 5C). While neither VL S29 nor VL R29 directly interacts with RBD, VL R29 is able to form a cation-π interaction with VL Y32, which in turn forms a T-shaped π-π stacking with RBD-F486 and H-bonds with RBD-C480 (Figure 5C). The positioning of VL R29 can be further stabilized by a salt bridge with another SHM, VL G92D (Figure 5C). The RBD binding affinity of COVOX-253, which is an IGHV1-58/IGKV3-20-encoded antibody, was improved >3-fold by the VL S29R/G92D double mutant, but only subtly enhanced by VL S29R alone and subtly diminished by VL G92D alone (Figure 5D), indicating a synergistic effect between VL S29R and VL G92D. In fact, VL G92D seemed to have coevolved with VL S29R, since VL G92D was found in four of the 67 antibodies, and all four carried VL S29R (Figure 5E).
Moreover, a phylogenetic analysis showed that VL G92D emerged from a cluster of antibodies with VL S29R (Figure 5E). These analyses illustrate that recurring SHMs are associated with the public antibody response to SARS-CoV-2 S and further suggest the existence of common affinity maturation pathways that involve emergence of multiple SHMs in a defined order.

Since many sequence features of public antibody responses to the S protein could be observed in our dataset, we postulated that the dataset was sufficiently large to train a deep learning model to identify S antibodies. To provide a proof-of-concept, we trained a deep learning model to distinguish between human antibodies to S and to influenza hemagglutinin (HA). Among different antigens, HA was chosen here because a large number of HA antibodies have published sequences, albeit still fewer than the published SARS-CoV-2 S antibodies. Here, 1,356 unique human antibodies to HA and 3,000 unique human antibodies to SARS-CoV-2 S with complete information for all six CDR sequences were used (Table S3). None of these antibodies had identical sequences in all six CDRs. These antibodies to S and HA were divided into a training set (64%), a validation set (16%), and a test set (20%), with no overlap between the three sets. The overlap of clonotypes was also minimal (Figure S6A). Subsequently, the training set was used to train the deep learning model. The validation set was used to evaluate the model performance during training. The test set was used to evaluate the performance of the final model.

We further tested if a deep learning model could be trained to distinguish antibodies to different domains of S, namely RBD, NTD, and S2.
Since the numbers of NTD and S2 antibodies were small, the model was trained on the heavy chain CDRs (H1, H2, and H3), so that antibodies without sequence information for the light chain could also be used (Table S3). The ROC AUC and PR AUC of the RBD/NTD/S2 model were 0.79 and 0.62, respectively (Figure S6B), which were much worse than those of the S/HA model above. The poorer performance of the RBD/NTD/S2 model may be attributable to the smaller dataset. Since most known antibodies to SARS-CoV-2 S were RBD-specific, we also examined if a deep learning model that was trained to distinguish RBD and HA antibodies could achieve a better performance than the S/HA model above.

infection, in which 44 could cross-react with the ancestral Hu-1 strain and 37 were Beta-specific (Reincke et al., 2022). While these 81 antibodies were not included in the dataset that we assembled (Table S1), they provided an opportunity to further evaluate the performance of our deep learning model. Our deep learning model that was trained on all six CDRs to distinguish between antibodies to S and HA (see above) successfully predicted 77 of the 81 (95%) antibodies as SARS-CoV-2 S antibodies (Figure 6C and Table S5). Of note, since our model was designed to distinguish between antibodies to SARS-CoV-2 S and influenza HA, the prediction on non-RBD/non-HA antibodies was expected to be close to random. Consistent with that expectation, when we applied our model to 691 HIV antibodies from GenBank (Table S6), 46% were predicted to be spike antibodies and 54% were predicted to be HA antibodies (Figure S6C). As different antigenic variants of SARS-CoV-2 emerge and individuals start to accumulate unique SARS-CoV-2 immune histories, the antibody response to SARS-CoV-2 is likely to evolve and diversify.
Although our model still performs well on antibodies that were elicited by the Beta variant (Figure 6C), it remains to be explored if this performance will hold for antibodies that are elicited by SARS-CoV-2 variants that are more antigenically distinct from the ancestral Hu-1 strain originally identified in Wuhan.

DISCUSSION

Through a systematic survey of published information on SARS-CoV-2 antibodies, we identified many molecular features of public antibody responses to SARS-CoV-2. The large amount of published information has allowed us to explore distinct patterns of germline gene usages in antibodies that target different domains on the S protein (i.e. RBD, NTD, and S2). Notably, the types and nature of public antibody responses to different domains appear to be quite different. For example, convergence of CDR H3 sequences can be readily identified in the public antibody responses to RBD and S2. In contrast, the public antibody response to NTD seems to be largely independent of the CDR H3 sequence. Furthermore, an IGHD-dependent public antibody response was enriched against S2, but not RBD or NTD. Together, our study demonstrates the diversity of sequence features that can constitute a public antibody response against a single antigen.

See also Figure S1 and Tables S1-2. See also Figure S5 and Table S1. See also Figure S6 and Tables S3-6.

Further information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact, Nicholas C. Wu

Wec et al., 2020). Since it is unclear if they are the same SARS-CoV survivor, the same donor ID "VRC_SARS1" was assigned to them to avoid overestimation of the public antibody response. If the neutralization activity of a given antibody was only measured at a single concentration, 50% neutralization activity or below was classified as non-neutralizing.
We also downloaded the CoV-AbDab (Raybould et al., 2021) in September 2021 to fill in any additional information. As of September 2021, there were 2,582 human SARS-CoV-2 antibodies in CoV-AbDab. Information in the finalized dataset was manually inspected by three different individuals. Antibodies that were shown to bind to S1 but not RBD were classified as NTD antibodies. Because they have identical nucleotide sequences, IGKV1D-39*01 was classified as IGKV1-39*01, IGHV1-68D*02 as IGHV1-68*02, IGHV1-69D*01 as IGHV1-69*19, IGHV3-23D*01 as IGHV3-23*01, and IGHV3-29*01 as IGHV3-30-42*01.

Putative germline genes for each antibody sequence in these repertoire sequencing datasets from healthy donors were identified by IgBLAST (Ye et al., 2013).

CDR H3 clustering analysis

Using a deterministic clustering approach, antibodies with CDR H3 sequences that had the same length and at least 80% amino-acid sequence identity were assigned to the same cluster. As a result, the CDR H3 of every antibody in a cluster would have >20% difference in amino-acid sequence identity with that of every antibody in another cluster. A cluster would be discarded if all of its antibody members were from the same donor. The number of antibodies within a cluster was defined as the cluster size. Sequence logos were generated by Logomaker in Python (Tareen and Kinney, 2020). For each cluster, epitope assignment was performed using the following scoring scheme. Briefly, there were three scoring categories, namely "RBD", "NTD", and "S2".
• 1 point was added to category "RBD" for each antibody with an epitope label equal to "S:RBD" or "S:S1".
• 1 point was added to category "NTD" for each antibody with an epitope label equal to "S:NTD", "S:S1", "S:non-RBD", or "S:S1 non-RBD".
• 1 point was added to category "S2" for each antibody with an epitope label equal to "S:S2", "S:S2 Stem Helix", or "S:non-RBD".
The category with >50% of the total points would be classified as the epitope for a given cluster. If no category had >50% of the total points, the epitope for the cluster would be classified as "unknown".

In this analysis, a public clonotype was defined as antibodies from at least two donors that had the same IGHV/IGK(L)V genes and CDR H3s from the same CDR H3 cluster (see "CDR H3 clustering analysis" above). For each antibody, ANARCI was used to number the position of each residue according to Kabat numbering (Dunbar and Deane, 2016). The amino-acid identity at each residue position of an antibody was then compared to that of the putative germline gene. CDR H3, CDR L3, and framework region 4 in both heavy and light chains were not included in this analysis. Insertions and deletions were also ignored in this analysis. An SHM that occurred in at least two donors within a public clonotype was defined as a recurring SHM.

Amino acid residues 1-94 (Kabat numbering) in the light chain sequences of IGHV1-58/IGKV3-20 antibodies from CDR H3 cluster 3 were aligned using MAFFT (Katoh and Standley, 2013). The phylogenetic tree was generated using FastTree (Price et al., 2010) and visualized using ggtree (Yu, 2020).

Ramachandran plots were generated using the Ramachandran Plot Server (https://zlab.umassmed.edu/bu/rama/) (Anderson et al., 2005).

The deep learning model consisted of two networks, namely a multi-encoder (ME) and a stack of multi-layered perceptrons (MLP). The CDR amino-acid sequences were taken as input and passed to ME. Specifically, each CDR amino-acid sequence was described by a 21-letter alphabet vector $\vec{X} = (x_1, x_2, \ldots, x_{L-1}, x_L)$, $\vec{X} \in \mathbb{R}^{L}$, where $L$ represented the length of the sequence, and $x_i$ represented the amino acid category.
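The CDR H3 clustering and epitope scoring scheme described in the "CDR H3 clustering analysis" section can be illustrated with a minimal Python sketch. It assumes that the deterministic clustering behaves like single-linkage clustering (connected components of the graph linking same-length CDR H3s with at least 80% identity), which is consistent with the stated property that members of different clusters differ by >20%, but the source does not name the algorithm. The record fields (`cdrh3`, `donor`, `epitope`) and all function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Epitope labels that contribute 1 point to each category (from the scoring scheme).
SCORING = {
    "RBD": {"S:RBD", "S:S1"},
    "NTD": {"S:NTD", "S:S1", "S:non-RBD", "S:S1 non-RBD"},
    "S2": {"S:S2", "S:S2 Stem Helix", "S:non-RBD"},
}


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_cdrh3(antibodies, min_identity=0.8):
    """Single-linkage clustering (an assumption) of antibody records by CDR H3."""
    parent = list(range(len(antibodies)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(antibodies)), 2):
        a, b = antibodies[i]["cdrh3"], antibodies[j]["cdrh3"]
        # Link two antibodies only if their CDR H3s have the same non-zero
        # length and at least 80% amino-acid identity.
        if len(a) == len(b) > 0 and identity(a, b) >= min_identity:
            parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, ab in enumerate(antibodies):
        groups[find(i)].append(ab)
    # Discard clusters whose members all come from the same donor.
    return [g for g in groups.values() if len({ab["donor"] for ab in g}) > 1]


def assign_epitope(cluster):
    """Return the category holding >50% of the total points, else 'unknown'."""
    points = Counter()
    for ab in cluster:
        for category, labels in SCORING.items():
            if ab["epitope"] in labels:
                points[category] += 1
    total = sum(points.values())
    for category, score in points.items():
        if score > 0.5 * total:
            return category
    return "unknown"


example = [
    {"cdrh3": "ARDLYYYGMDV", "donor": "d1", "epitope": "S:RBD"},
    {"cdrh3": "ARDLYYYGMDA", "donor": "d2", "epitope": "S:S1"},
    {"cdrh3": "ARGGGGGGGGG", "donor": "d3", "epitope": "S:S2"},
]
clusters = cluster_cdrh3(example)
```

In this toy input, the first two antibodies share 10 of 11 CDR H3 residues and come from different donors, so they form one retained cluster, while the third is a single-donor singleton and is discarded. Note that a label such as "S:S1" adds a point to both "RBD" and "NTD", so the total can exceed the number of antibodies.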
Each of the 20 canonical amino acids was one category, whereas all the ambiguous amino acids were grouped as the 21st category. Before passing to ME, tokenized amino acid sequences were processed by zero padding, so that the size of each input was the same. Subsequently, the inputs were mapped to the embedding vectors with an additional dimension $d$. The sinusoidal positional encoding vectors were added to the embedding vectors to encode the relative position of tokens (i.e. amino acids) in the sequence. Each embedding vector, $\vec{E} \in \mathbb{R}^{L \times d}$, with size of $L \times d$, was passed into a transformer encoder layer, which uses a self-attention mechanism to learn the sequence features (Vaswani et al., 2017).

For evaluating model performance, S antibodies and HA antibodies were considered "positive" and "negative", respectively. False positives (FP) and false negatives (FN) were samples that were misclassified by the model, while true negatives (TN) and true positives (TP) were samples that were correctly classified. The following metrics were computed to evaluate model performance:

In addition, we also used the receiver operating characteristic (ROC) curve and precision-recall

Table S1. Collection of SARS-CoV-2 antibody information, Related to Figures 1-4.
Table S2. Neutralization activity and binding affinity of antibodies in CDR H3 clusters 5, 10, and 15, RBD antibodies encoded by IGHV3-13/IGKV1-39, and IGHD1-26 S2 antibodies, Related to Figures 1-3.
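The evaluation convention described above (S antibodies as "positive", HA antibodies as "negative") can be illustrated with a short Python sketch that counts TP, FP, TN, and FN and computes the ROC AUC. This is only a sketch under stated assumptions: the function names, the 0.5 decision threshold, and the toy labels and scores are hypothetical, and the ROC AUC is computed with the standard rank-based formulation rather than any procedure confirmed by the source.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN with 1 = S antibody (positive), 0 = HA antibody (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, fp, tn, fn


def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and model scores (hypothetical values, not from the study).
y_true = [1, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.35, 0.8, 0.7]
# Hypothetical decision threshold of 0.5 on the predicted score.
y_pred = [int(s >= 0.5) for s in scores]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

The pairwise formulation is quadratic in the number of samples, which is acceptable for a test set of this scale; a sort-based implementation would be preferable for much larger datasets.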
• Assembled a dataset of ~8,000 published antibodies to SARS-CoV-2 S from >200 donors
• Antibodies to RBD, NTD, and S2 have distinct convergent sequence and molecular features
• Public antibody clonotypes show recurring affinity maturation pathway
• Provided a proof-of-concept for antibody specificity prediction using deep learning

TensorFlow: A system for large-scale machine learning
A broad atlas of somatic hypermutation allows prediction of activation-induced deaminase targets
Main-chain conformational tendencies of amino acids
Shaping a universally broad antibody response to influenza amidst a variable immunoglobulin landscape
SARS-CoV-2 neutralizing antibody structures inform therapeutic strategies
Commonality despite exceptional diversity in the baseline human antibody repertoire
Potent neutralizing antibodies against SARS-CoV-2 identified by high-throughput single-cell sequencing of convalescent patients' B cells
Potent SARS-CoV-2 neutralizing antibodies directed against spike N-terminal domain target a single supersite
Convergent antibody responses to the SARS-CoV-2 spike protein in convalescent and vaccinated individuals
A neutralizing human antibody binds to the N-terminal domain of the Spike protein of SARS-CoV-2
SARS-CoV-2 evolution in an immunocompromised host reveals shared neutralization escape mechanisms
Beyond bulk single-chain sequencing: Getting at the whole receptor
The antigenic anatomy of SARS-CoV-2 receptor binding domain
Molecular mechanisms of antibody somatic hypermutation
Genetic and structural basis for SARS-CoV-2 variant neutralization by a two-antibody cocktail
Highly conserved protective epitopes on influenza B viruses
ANARCI: antigen receptor numbering and receptor classification
Antibody recognition of a highly conserved influenza virus epitope
Inferring processes underlying B-cell repertoire diversity
A coherent interpretation of AUC as a measure of aggregated classification performance
Neutralization potency of monoclonal antibodies recognizing dominant and subdominant epitopes on SARS-CoV-2 Spike is impacted by the B.1.1.7 variant
cAb-Rep: a database of curated antibody repertoires for exploring antibody diversity and predicting antibody prevalence. Front Immunol 10
Restricted, canonical, stereotyped and convergent immunoglobulin responses
Structural basis for potent neutralization of SARS-CoV-2 and role of antibody affinity maturation
Human responses to influenza vaccination show seroconversion signatures and convergent antibody rearrangements
Vaccine-induced antibodies that neutralize group 1 and group 2 influenza A viruses
Unraveling V(D)J recombination; insights into gene regulation
Structure and function analysis of an antibody recognizing all influenza A subtypes
MAFFT multiple sequence alignment software version 7: improvements in performance and usability
Stereotypic neutralizing VH antibodies against SARS-CoV-2 spike protein receptor binding domain in patients with COVID-19 and healthy individuals
Inference of macromolecular assemblies from crystalline state
Antibody 27F3 broadly targets influenza A group 1 and 2 hemagglutinins through a further variation in VH1-69 antibody orientation on the HA stem
Antibody-guided vaccine design: identification of protective epitopes
In vitro and in vivo functions of SARS-CoV-2 infection-enhancing and neutralizing antibodies
SARS-CoV-2 neutralizing antibodies for COVID-19 prevention and treatment
(2021b). Potent SARS-CoV-2 neutralizing antibodies with protective efficacy against newly emerged mutational variants
Structural basis and mode of action for two broadly neutralizing antibodies against SARS-CoV-2 emerging variants of concern
Mapping neutralizing and immunodominant sites on the SARS-CoV-2 spike receptor-binding domain by structure-guided high-resolution serology
Public antibodies to malaria antigens generated by two LAIR1 insertion modalities
Broad betacoronavirus neutralization by a stem helix-specific human antibody
FastTree 2--approximately maximum-likelihood trees for large alignments
CoV-AbDab: the coronavirus antibody database
SARS-CoV-2 Beta variant infection elicits potent lineage-specific and cross-reactive antibodies
Recurrent potent human neutralizing antibodies to Zika virus in Brazil and Mexico
Convergent antibody responses to SARS-CoV-2 in convalescent individuals
The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets
V(D)J recombination: mechanisms of initiation
B cell genomics behind cross-neutralization of SARS-CoV-2 variants and SARS-CoV
A vaccine-induced public antibody protects against SARS-CoV-2 and emerging variants
Multi-donor longitudinal antibody repertoire sequencing reveals the existence of public antibody clonotypes in HIV-1 infection
Cell entry mechanisms of SARS-CoV-2
Cross-reactive coronavirus antibodies with diverse epitope specificities and Fc effector functions
High frequency of shared clonotypes in human B cell receptor repertoires
PyIR: a scalable wrapper for processing billions of immunoglobulin and T cell receptor sequences using IgBLAST
Dropout: A simple way to prevent neural networks from overfitting
SARS-CoV-2 RBD antibodies that maximize breadth and resistance to escape
Structural and functional bases for broad-spectrum neutralization of avian and human influenza A viruses
Sequence signatures of two public antibody clonotypes that bind SARS-CoV-2 receptor binding domain
Logomaker: beautiful sequence logos in Python
Circulating SARS-CoV-2 spike N439K variants maintain fitness while evading antibody-mediated immunity
Memory B cell repertoire for recognition of evolving SARS-CoV-2 spike
Ultrapotent human antibodies protect against SARS-CoV-2 challenge via multiple mechanisms
Identification of antigen-specific B cell receptor sequences using public repertoire analysis
Attention is all you need
Prevalent, protective, and convergent IgG recognition of SARS-CoV-2 non-RBD spike epitopes
Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein
Ultrapotent antibodies against diverse and highly transmissible SARS-CoV-2 variants
Broad neutralization of SARS-related viruses by human monoclonal antibodies
Landscape of human antibody recognition of the SARS-CoV-2 receptor binding domain
Rapid single B cell antibody discovery using nanopens and structured light
Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation
Convergent evolution in breadth of two VH6-1-encoded influenza antibody clonotypes from a single donor
Recurring and adaptable binding motifs in broadly neutralizing antibodies to influenza virus are encoded on the D3-9 segment of the Ig gene
A natural mutation between SARS-CoV-2 and SARS-CoV determines neutralization by a cross-reactive antibody
IgBLAST: an immunoglobulin variable domain sequence analysis tool
Sequence-intrinsic mechanisms that target AID mutational outcomes on antibody genes
Using ggtree to visualize data on tree-like structures
Structural basis of a shared antibody response to SARS-CoV-2
Recognition of the SARS-CoV-2 receptor binding domain by neutralizing antibodies
Potent and protective IGHV3-53/3-66 public antibodies and their shared escape mutant on the spike of SARS-CoV-2
An elite broadly neutralizing antibody protects SARS-CoV-2 Omicron variant challenge
A pneumonia outbreak associated with a new coronavirus of probable bat origin
A human antibody reveals a conserved site on beta-coronavirus spike proteins and confers protection against SARS-CoV-2 infection

and test set. The Adam algorithm was used to optimize the model.
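The input preparation described for the deep learning model (tokenization into 21 amino acid categories, zero padding, and sinusoidal positional encoding) can be sketched in plain Python as follows. Several details are assumptions made for illustration only: the specific token indices, the use of index 0 as the padding token, the alphabetical ordering of amino acids, and the names `tokenize`, `positional_encoding`, `max_len`, and `d_model`. The positional encoding follows the sine/cosine formulation of Vaswani et al. (2017), and the learned embedding layer itself is not shown.

```python
import math

# The 20 canonical amino acids; token indices 1-20 (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
AMBIGUOUS = 21  # all ambiguous amino acids share the 21st category
PAD = 0         # assumed padding index for zero padding


def tokenize(seq, max_len):
    """Map a CDR sequence to integer tokens and zero-pad it to max_len."""
    tokens = [TOKEN.get(aa, AMBIGUOUS) for aa in seq.upper()]
    if len(tokens) > max_len:
        raise ValueError("sequence is longer than max_len")
    return tokens + [PAD] * (max_len - len(tokens))


def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding with shape max_len x d_model:
    even columns hold sin(pos / 10000^(i / d_model)) and the following
    odd column holds the matching cosine, where i is the even column index."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


# Hypothetical CDR fragment containing one ambiguous residue (X).
tokens = tokenize("ARX", max_len=5)
pe = positional_encoding(max_len=4, d_model=6)
```

In a full model, each token would be mapped to a learned embedding vector of size `d_model`, and the matching row of `pe` would be added to it before the transformer encoder layer.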