key: cord-0276495-nd0l9ajf
authors: Hie, Brian L.; Xu, Duo; Shanker, Varun R.; Bruun, Theodora U.J.; Weidenbacher, Payton A.; Tang, Shaogeng; Kim, Peter S.
title: Efficient evolution of human antibodies from general protein language models and sequence information alone
date: 2022-04-11
journal: bioRxiv
DOI: 10.1101/2022.04.10.487811
sha: d43c372241930c6233e9951edae935741ef9d703
doc_id: 276495
cord_uid: nd0l9ajf

Abstract

Natural evolution must explore a vast landscape of possible sequences for desirable yet rare mutations, suggesting that learning from natural evolutionary strategies could accelerate artificial evolution. Here, we report that deep learning algorithms known as protein language models can evolve human antibodies with high efficiency, despite providing the models with no information about the target antigen, binding specificity, or protein structure, and also requiring no additional task-specific finetuning or supervision. We performed language-model-guided affinity maturation of seven diverse antibodies, screening 20 or fewer variants of each antibody across only two rounds of evolution. Our evolutionary campaigns improved the binding affinities of four clinically relevant antibodies up to 7-fold and three unmatured antibodies up to 160-fold across diverse viral antigens, with many designs also demonstrating improved thermostability and viral neutralization activity. Notably, our algorithm requires only a single wildtype sequence and computes recommended amino acid changes in less than a second. Moreover, the same models that improve antibody binding also guide efficient evolution across diverse protein families and selection pressures, indicating that these results generalize to many natural settings. Contrary to prevailing notions of evolution as difficult and resource-intensive, our results suggest that when constrained to a narrow manifold of evolutionary plausibility, evolution can become much easier, which we refer to as the "efficient manifold hypothesis."

Introduction

In vivo, somatic hypermutation evolves or "matures" an antibody lineage to have higher affinity for an antigen via repeated mutagenesis [4], [10], [11]. Ex vivo, affinity maturation is a major application of directed evolution due to the therapeutic potential of antibodies with high affinity for disease targets [12].

To model evolutionary plausibility, we use algorithms known as neural language models (Figure 1C), which are trained on large datasets of sequences to learn patterns that are likely to occur in natural proteins [13]-[21]. Importantly, we use general language models [17], [18] trained on sequence datasets that are meant to represent variation across all observed natural proteins [22], rather than a language model that is restricted to variation among antibodies [23]-[26]. Given a single starting sequence, we use these language models to recommend plausible amino acid substitutions that we then experimentally screen for improved fitness. We design our approach to be highly general: the algorithm requires only a single wildtype sequence, without any initial binding-affinity data, knowledge of the antigen, task-specific supervision, evolutionary homologs, or protein structure information, and it can recommend changes to the wildtype sequence in seconds.

With this approach, we evolve seven human immunoglobulin G (IgG) antibodies that bind to antigens from coronavirus, ebolavirus, and influenza A virus.
We focus on viral antigens given the importance of antibody therapeutics for epidemic and pandemic viral diseases [27]-[30]. When evolving clinically relevant antibodies, which are already highly mature, our best designs improve binding affinities by up to 7-fold.

Results

Efficient affinity maturation with general protein language models

Recent work has demonstrated that language models can predict natural evolution despite having no knowledge of specific selection pressures [9]. However, this prior work only predicted the direction of evolution retrospectively, when given full knowledge of the evolutionary trajectory. We therefore sought to investigate whether the same language models could predict unobserved evolution to prospectively design new proteins.

In particular, we hypothesized that the predictive capabilities of protein language models might enable a researcher to provide only a single, wildtype antibody sequence to the algorithm and receive a small, manageable set (~10^1) of high-likelihood variants to experimentally measure for desirable properties. This is a very general setting that does not assume knowledge of protein structure or task-specific training data, thereby avoiding the resource-intensive processes associated with structure determination [34] or high-throughput screens [33]. A major question, however, is whether higher evolutionary likelihood would efficiently translate to higher fitness.

We tested our hypothesis by conducting separate directed evolution campaigns, guided by language-model likelihood, to affinity-mature seven antibodies representing diverse antigens and degrees of maturity (Supplementary Table 1). These antibodies are:

• MEDI8852: A broadly neutralizing antibody (bnAb) that binds influenza A hemagglutinin (HA) across variants of both major phylogenetic groups (Group 1 and Group 2) and that reached Phase-II clinical trials; this antibody is highly matured, with its parent being isolated from a human followed by substantial artificial evolution [27].
• MEDI8852 unmutated common ancestor (UCA): The unmatured, inferred germline sequence of MEDI8852, which only neutralizes viruses with Group 1 HAs [27].
• mAb114: A patient-derived antibody that neutralizes ebolavirus by binding to its glycoprotein (GP) [28] and has been approved for clinical use by the United States Food and Drug Administration (FDA).
• mAb114 UCA: The unmatured, inferred germline sequence of mAb114, with weak binding to ebolavirus GP [28].
• S309: A patient-derived antibody that cross-neutralizes the sarbecoviruses SARS-CoV-1 (severe acute respiratory syndrome coronavirus 1) and SARS-CoV-2 by binding to the spike glycoprotein (Spike) [29] and is the parent antibody of sotrovimab [35], which currently has FDA emergency-use authorization (EUA) for treatment of COVID-19 (coronavirus disease 2019).
• REGN10987: A patient-derived antibody that binds early variants of SARS-CoV-2 Spike [30] and that had an FDA EUA for use against these variants.
• C143: An unmatured, patient-derived antibody that binds the SARS-CoV-2 Wuhan-Hu-1 Spike but was isolated prior to extensive in-vivo somatic hypermutation [36], [37].

We performed evolution with the ESM-1b language model and the ESM-1v ensemble of five language models (six language models in total) [17], [18]. ESM-1b and ESM-1v were trained on general protein sequence datasets in which antibody-specific sequence variation would not be included.
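These six checkpoints are publicly distributed; as a minimal sketch (assuming the open-source fair-esm package and its checkpoint names, which are our reconstruction of one way to reproduce this setup rather than code released with this study), the ensemble could be assembled as follows:

```python
# Sketch: load the ESM-1b model and the five ESM-1v ensemble members.
# Assumes the open-source fair-esm package (pip install fair-esm).
import esm

CHECKPOINTS = [
    esm.pretrained.esm1b_t33_650M_UR50S,    # ESM-1b
    esm.pretrained.esm1v_t33_650M_UR90S_1,  # ESM-1v ensemble, members 1-5
    esm.pretrained.esm1v_t33_650M_UR90S_2,
    esm.pretrained.esm1v_t33_650M_UR90S_3,
    esm.pretrained.esm1v_t33_650M_UR90S_4,
    esm.pretrained.esm1v_t33_650M_UR90S_5,
]

def load_ensemble():
    """Return [(model, alphabet), ...] for the six pretrained language models."""
    ensemble = []
    for load in CHECKPOINTS:
        model, alphabet = load()
        model.eval()  # inference only; the approach requires no finetuning
        ensemble.append((model, alphabet))
    return ensemble
```

Because all six models are used as pretrained, frozen scorers, no task-specific data or finetuning enters at this stage.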
Therefore, to evolve these antibodies, the language models cannot leverage antigen- or disease-specific biases in the training data and must instead learn more intrinsic evolutionary patterns.

We used these language models to compute likelihoods of all single-residue substitutions to the antibody variable regions of either the heavy chain (VH) or the light chain (VL). We selected substitutions with higher evolutionary likelihood than wildtype across a consensus of six language models; additional details are provided in Methods. In the first round of evolution, we measured, by biolayer interferometry (BLI), the antigen-binding affinity of variants that contain only a single-residue substitution from wildtype. In the second round, we measured variants containing combinations of substitutions, where we selected substitutions that corresponded to preserved or improved binding based on the results of the first round. We performed these two rounds for all seven antibodies, measuring 8 to 14 variants per antibody in round one and 1 to 11 variants per antibody in round two (Figure 2).

We were able to improve the binding affinities for all clinically relevant antibodies tested, despite these antibodies being already highly evolved (starting at low nanomolar or picomolar affinity). MEDI8852 is a potent binder with a sub-picomolar Fab Kd across many HAs and picomolar or nanomolar binding to HAs from subtypes H4 and H7. While we explicitly screened variants using an HA H4 antigen, the best design also improves binding across a broad set of HAs (Supplementary Table 5).

We were also able to improve affinities for all three unmatured antibodies, often involving much higher fold changes than when evolving the matured antibodies, indicating easier evolvability with respect to affinity. For MEDI8852 UCA, the best Fab design achieves a substantial improvement in affinity. For C143, which was isolated prior to extensive affinity maturation [37], our best design achieves a 13-fold improvement for Beta S-6P and a 3.8-fold improvement for Omicron RBD (Supplementary Table 9). Results from our directed evolution campaigns are further summarized in Figure 2, Supplementary Tables 2-9, and Supplementary Data 1. In total, across antibodies representing diverse antigens and degrees of maturity, our approach consistently and efficiently produces higher-affinity variants.

Improved thermostability and neutralization of evolved antibodies

Although we explicitly selected for variants with improved binding to specific antigens, we also sought to establish whether these variants have improved stability (Methods). We found that Fabs for 21 out of the 31 language-model-recommended, affinity-enhancing variants that we tested had a higher melting temperature (Tm) than wildtype, and all variants maintained thermostability (Tm > 70°C). When evolving S309 to have higher affinity, our best design has a Tm of 72.8°C compared to 72.5°C for wildtype, whereas the VH N55Q substitution introduced in sotrovimab decreases the Tm to 69.6°C (Figure 2). Our evolved variants for mAb114, mAb114 UCA, REGN10987, and C143 also preserve or improve Tm; the largest change we observed was an increase from 74.5°C to 82.5°C when evolving mAb114 UCA.
Improved thermostability does not completely explain our affinity maturation results, however, as we observe somewhat decreased Tm for our affinity-matured variants of MEDI8852 and its UCA, though these Fabs are still thermostable (Figure 2).

We also wanted to determine whether our affinity-matured variants have better viral neutralization activity. We tested affinity-enhancing variants of four antibodies using pseudovirus neutralization assays; our best variant neutralizes at a >100-fold lower concentration compared to wildtype (Supplementary Fig. 1). In general, change in binding affinity correlates well with change in neutralization (Spearman r = 0.82, two-sided t-distribution P = 1.9 × 10^-4) (Figure 3B). Given the limited number of variants tested, we also note that alternative versions of our directed evolution campaigns could have instead explicitly screened variants for neutralization activity.

Originality of affinity-enhancing substitutions

While the ability to find any improvement in affinity is itself useful for engineering applications, we were also interested in whether some of the changes recommended by our algorithm demonstrate "originality." We quantified originality by computing the frequency at which a given residue is observed in nature (Methods), where a change to a rarely observed residue indicates that the model learns patterns that go beyond its literal training dataset. While many affinity-enhancing substitutions are indeed observed at high frequency in both the model's training data [22] and in a database of antibody sequences [39], other substitutions involve residues rarely observed in nature (Supplementary Table 10). These results indicate that the language models learn complex rules that are not captured by a multiple sequence alignment or conventional antibody evolution. Conceptually, these low-frequency, affinity-enhancing substitutions are analogous to examples in other disciplines where an artificial-intelligence program occasionally makes unusual but advantageous choices (for example, unintuitive game-playing decisions [40]), and likewise may be worth further study.

Generality across diverse protein families

Given the success of general protein language models at guiding antibody evolution, we also tested how well the same models could acquire high-fitness variants across a range of protein families. Previous work has demonstrated that likelihoods from general protein language models correlate well with experimental phenotypes from high-throughput assays over ~10^3 to 10^4 variants [9], [18]. Previous computational simulations have also indicated that these models can help bias multi-round evolution away from large regions of a sequence landscape with zero or very low fitness [8].

Here, we observe that the same models we used to affinity-mature antibodies can also guide efficient evolution when measuring only a small number (~10^1) of variants according to diverse definitions of extrinsic fitness, including antibiotic resistance, cancer drug resistance, enzyme activity, and viral replication fitness [41]. More specifically, we used the same algorithm and language models as in our affinity-maturation experiments to instead suggest changes to wildtype sequences from human, bacterial, or viral organisms representing eight diverse protein families.
We then used fitness measurements from high-throughput scanning mutagenesis experiments [41], [42] to validate the language-model-recommended predictions (notably, these measurements were not provided to the model).

Across diverse proteins, language-model-recommended variants are significantly enriched (hypergeometric P < 0.05) for high fitness values, and high-fitness variants make up a much larger portion of language-model-recommended variants compared to random guessing in nearly all cases (Figure 4A, Supplementary Fig. 2, and Supplementary Table 11). For example, while ampicillin resistance is observed for just 7% of all single-residue substitutions to β-lactamase, it is observed for 40% of language-model-recommended substitutions; the same set of language models can also help prioritize single-residue substitutions to HA that result in high viral infectivity (from 7% to 31%) and substitutions to PafA that improve enzyme kinetics (from 3% to 20%). Additionally, across all proteins, even the first round of a small-scale evolutionary campaign guided by language models would yield variants that are near the local fitness peak (Supplementary Fig. 2). In total, these results suggest that the evolutionary efficiency that we observed for affinity maturation of human IgGs also generalizes to diverse natural settings.

Discussion

We show that general protein language models can guide highly efficient affinity maturation based on the wildtype antibody sequence alone. We improved the binding affinity of a highly evolved influenza A broadly neutralizing antibody (bnAb), MEDI8852, by up to 7-fold and that of a clinically approved ebolavirus antibody, mAb114, by 3.4-fold. We also evolved S309, a sarbecovirus bnAb, to have higher affinity and thermostability than a rationally designed and clinically available variant, sotrovimab. We improved binding affinities of unmatured antibodies by 13- to 160-fold across diverse antigens, which is within the 3.8- to 580-fold improvement range previously achieved by a state-of-the-art, in-vitro evolutionary system applied to unmatured, anti-RBD nanobodies (in which the computational portion of our approach, which takes seconds, is replaced with rounds of cell culture and sorting, which take weeks) [12]. We also note that in-vitro, cell-surface-display methods encounter physical limits that make it challenging to distinguish better binders when the wildtype binder already has high affinity (<1 nM) [43], which is not a limitation of our approach. Moreover, our algorithm is based on language models (trained on general protein sequence variation) that can also predict high-fitness variants across diverse protein families and engineering applications. We envision our approach as useful within preclinical development as a rapid way to identify improved variants of an existing protein of interest (for example, an antibody isolated from a patient or from a naïve library). We also anticipate that language models will become a key part of the antibody engineer's toolkit.

Interestingly, about half of the language-model-recommended substitutions (and about half of the affinity-enhancing substitutions) fall in framework regions, which are typically not proximal to the binding interface and are therefore sometimes excluded from directed evolution [33].
While some of these framework changes may improve affinity via protein stabilization, others do not appear to increase thermostability and may instead cause larger-scale rearrangements that improve affinity via structural reorientation, which has been observed in natural affinity maturation [44]-[46]. Our algorithm also recommends a number of affinity-enhancing substitutions with low observed frequency in nature. An interesting area for future work is to characterize additional biochemical or structural features of these unconventional changes.

The broader relevance of our results, beyond affinity maturation of human antibodies, arises from asking why this method works. Fundamentally, our results are surprising in that modifying amino acid residues simply based on evolutionary plausibility, or "intrinsic fitness," sufficiently enriches for changes that improve fitness under specific, natural selection pressures, or "extrinsic fitness" (Figure 1B). These results challenge a prevailing notion that evolution is difficult because it is random. Instead, we hypothesize that, in many settings, as long as evolution remains on a naturally plausible manifold, a substantial portion (greater than 10%) of mutations are bound to improve extrinsic fitness; we call this the "efficient manifold hypothesis" (Figure 4B). Our findings for both antibodies and other natural proteins provide direct support for this hypothesis. The efficient manifold hypothesis is also supported by the recent successes of completely unsupervised models in predicting evolution under a variety of specific selection pressures, from clinical variant risk to viral fitness and immune escape potential [15], [47]-[49].

The efficient manifold hypothesis has direct, practical applications for those trying to evolve proteins in the laboratory. Evolution guided by a language model can be used as a drop-in replacement for current evolutionary tools based on randomization; for example, combinatorial libraries [50], [51] can recombine language-model-guided mutations alongside or instead of rationally chosen mutations [33] (a sketch of this recombination step is given below). By leveraging increasingly efficient technologies for nucleic acid printing [42], language-model-guided evolution could also directly replace mutagenesis strategies based on, for example, an error-prone polymerase.

The efficient manifold hypothesis also challenges the notion that explicitly modeling extrinsic fitness with a supervised model should be the best or the default approach to machine-learning-guided directed evolution [32]. To the end user, guiding evolution via pretrained, unsupervised models is less resource-intensive than collecting enough task-specific data to train a supervised model [33]. Our techniques can also be used in conjunction with supervised approaches [8], [31]-[34], [52]-[55], and supervising a model over multiple experimental rounds might ultimately lead to higher fitness. However, in many practical settings (for example, the rapid development of sotrovimab in response to the COVID-19 pandemic [35]), the efficiency of an unsupervised, single-round approach is preferable to a protracted, multi-round (machine-learning-guided) directed evolution campaign.
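To make the recombination idea above concrete, here is a minimal sketch of enumerating combination variants from a handful of recommended single substitutions (the positions and residues in the usage comment are hypothetical, not substitutions reported in this study):

```python
from itertools import combinations

def recombine(wildtype, substitutions, max_order=3):
    """Yield variants carrying 2..max_order of the given single substitutions.

    `substitutions` is a list of (position, residue) pairs, 0-indexed,
    e.g., the affinity-preserving or affinity-enhancing hits from round one.
    """
    for order in range(2, max_order + 1):
        for subset in combinations(substitutions, order):
            positions = [pos for pos, _ in subset]
            if len(set(positions)) < len(positions):
                continue  # two substitutions at the same site cannot coexist
            variant = list(wildtype)
            for pos, residue in subset:
                variant[pos] = residue
            yield "".join(variant)

# Hypothetical usage: recombine three round-one hits in a VH sequence.
# for seq in recombine(vh_wildtype, [(28, "F"), (50, "Y"), (76, "T")]):
#     print(seq)
```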
We note that taking advantage of the efficient manifold hypothesis to improve extrinsic fitness may be more difficult when the selection pressure is unnatural or if the wildtype sequence is already at a fitness peak. Relatedly, a potential limitation of our specific algorithm is that we use language models that are trained only on natural sequences and might therefore be less applicable to unnatural proteins generated via de-novo design [56], [57]. However, in many practical design tasks, natural sequences and selection pressures are already preferable; for example, therapeutic development often prefers human antibodies due to considerations of immunogenicity and toxicity.

Beyond protein engineering applications, the efficient manifold hypothesis may also provide new insight into natural evolution. Our results suggest that many natural evolutionary processes occur on efficient manifolds, which may explain how some proteins are able to quickly and consistently acquire new functions; for example, human immunodeficiency virus relies on rapid and substantial intra-host evolution (on the scale of hours to days) to accomplish both infection and transmission [3]. Nature could support efficient manifolds via a number of mechanisms: for example, a recent analysis of Arabidopsis thaliana evolution suggests that epigenomic features enable mutations to be intrinsically biased away from implausible choices [42]. If epigenomic or other mechanisms predispose mutations to have high intrinsic fitness, then natural evolution on an efficient manifold would also become easy.

Figure legend (fragment): For Beta pseudovirus, of the three higher-affinity variants that we also screened for neutralization activity, the best improvement is the 32-fold improvement of VL G53V; for D614G pseudovirus, the best improvement is the 19-fold improvement of VL T33N-G53V (Supplementary Table 9).

Figure 4 legend: (A) The same strategy and language models that we use to affinity-mature antibodies can also recommend high-fitness changes across a diversity of protein families and selection pressures, as identified experimentally using high-throughput scanning mutagenesis assays [41], [42] (described in Supplementary Table 11). (B) Conceptually, intrinsic fitness forms a manifold, represented in this cartoon by a rainbow road, where ascending corresponds to improving extrinsic fitness and descending corresponds to lowering extrinsic fitness. Under the efficient manifold hypothesis, this manifold of intrinsic fitness is narrow; moving in any direction (for example, via random or brute-force mutagenesis) would therefore most likely decrease extrinsic fitness or fall off the manifold entirely (represented by the green ball). However, if movement is constrained to the narrow manifold of intrinsic fitness (for example, when guided by a language model), then the chance of improving extrinsic fitness increases substantially (represented by the red ball).

Methods

We aim to acquire variants for experimental measurement with high predicted evolutionary plausibility and therefore select amino acid substitutions recommended by a consensus of language models. We take as input a single wildtype sequence $\mathbf{x} = (x_1, \ldots, x_N) \in \mathcal{X}^N$, where $\mathcal{X}$ is the set of amino acids and $N$ is the sequence length. We also require a set of masked language models, which are pretrained to produce conditional likelihoods $p(x_i' \mid \mathbf{x})$ of observing residue $x_i'$ at position $i$ given the sequence context.
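In practice, such conditional likelihoods can be read off the token probabilities of a masked language model. The sketch below scores every possible substitution under the wildtype context and applies the consensus rule that is formalized next; it is a minimal sketch assuming the fair-esm API, and the authors' exact scoring scheme (e.g., masked versus wildtype marginals) may differ:

```python
from collections import Counter
import math
import torch

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def position_log_probs(model, alphabet, seq):
    """Per-position log-probabilities over the vocabulary, conditioned on
    the full wildtype sequence (one forward pass, no masking)."""
    batch_converter = alphabet.get_batch_converter()
    _, _, tokens = batch_converter([("wt", seq)])
    with torch.no_grad():
        logits = model(tokens)["logits"]           # (1, len(seq) + 2, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    return log_probs[0, 1 : len(seq) + 1]          # strip BOS/EOS positions

def consensus_substitutions(ensemble, seq, alpha=1.0, k=2):
    """Substitutions whose likelihood exceeds alpha times the wildtype
    likelihood in at least k of the models (the set A defined below)."""
    votes = Counter()
    log_alpha = math.log(alpha)
    for model, alphabet in ensemble:
        lp = position_log_probs(model, alphabet, seq)
        for i, wt in enumerate(seq):
            for aa in AMINO_ACIDS:
                if aa == wt:
                    continue
                if lp[i, alphabet.get_idx(aa)] - lp[i, alphabet.get_idx(wt)] > log_alpha:
                    votes[(i, aa)] += 1
    return {sub for sub, n in votes.items() if n >= k}
```

With `ensemble = load_ensemble()` from the earlier sketch, `consensus_substitutions(ensemble, vh_sequence)` would return candidate (position, residue) pairs for experimental screening.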
To guide evolution based on a given language model, we first compute the set of substitutions with higher language-model likelihood than the wildtype, i.e., we compute the set

$$\mathcal{M}_m = \left\{ (i, x_i') \;:\; \frac{p_m(x_i' \mid \mathbf{x})}{p_m(x_i^{\text{wt}} \mid \mathbf{x})} > \alpha,\; x_i' \neq x_i^{\text{wt}} \right\},$$

where $m$ denotes the language model, $x_i^{\text{wt}}$ denotes the wildtype residue at position $i$, and $\alpha = 1$. To further filter substitutions to only those with the highest likelihood, we choose substitutions based on a consensus scheme where, for a new amino acid $x_i'$, we compute

$$c(i, x_i') = \sum_{m=1}^{M} \mathbb{1}\left\{ (i, x_i') \in \mathcal{M}_m \right\},$$

where $\mathbb{1}\{\cdot\}$ denotes the indicator function and there are $M$ language models. We then acquire the set of substitutions with higher likelihood than wildtype across multiple language models, i.e., we acquire

$$\mathcal{A} = \left\{ (i, x_i') \;:\; c(i, x_i') \geq k \right\},$$

where $k$ is a user-supplied cutoff that controls the number of corresponding variants to measure. While we focus on values of $k$ that result in small values of $|\mathcal{A}|$ (around 10) that can be screened via low-throughput assays, the number of substitutions can be increased by reducing the value of $k$ or by lowering the cutoff stringency $\alpha$.

We used six large-scale masked language models, namely, the ESM-1b model [17] and the five models of the ESM-1v ensemble [18].

All antigens were His-tagged and purified using HisPur™ Ni-NTA resin (ThermoFisher). We measured thermal melting profiles of proteins by differential scanning fluorimetry.

For viral production, as previously described [58], the Spike vector contained the 21-amino-acid-truncated form of the SARS-CoV-2 Spike sequence from the Wuhan-Hu-1 strain of SARS-CoV-2 (GenBank: BCN86353.1) or the Beta variant of concern (GenBank: QUT64557.1); the other viral plasmids were used as previously described [58]. After adding plasmids to medium, we added 30 μL BioT (BioLand) to form transfection complexes. Transfection reactions were incubated for 10 minutes at room temperature, and then 9 mL of medium was added slowly. The resultant 10 mL was added to plated HEK cells from which the medium had been removed. Culture medium was removed 24 hours post-transfection and replaced with fresh D10 medium. Viral supernatants were harvested 72 hours post-transfection by spinning at 300 × g for five minutes, followed by filtering through a 0.45-μm filter. Viral stocks were aliquoted and stored at -80°C until further use.

The target cells used for infection in SARS-CoV-2 pseudovirus neutralization assays are from a HeLa cell line stably overexpressing human angiotensin-converting enzyme 2 (ACE2) as well as the protease known to process SARS-CoV-2, transmembrane serine protease 2 (TMPRSS2). Production of this cell line is described in detail previously [59].

We computed the frequency of residues involved in affinity-enhancing substitutions by aligning the wildtype VH and VL sequences of our antibodies to databases of protein sequences. The first database we considered is UniRef90, where we use the same database release used to train ESM-1v. For each antibody protein sequence, we obtained the set of 2,000 sequences in the database that are closest to the antibody by sequence similarity based on Levenshtein distance (with the farthest sequences having between 24% and 49% sequence similarity); we compute sequence similarity using the fuzzywuzzy Python package, version 0.18.0. We then use mafft, version 7.475, to perform multiple sequence alignment among the set of sequences, and we used the alignment to compute amino acid frequencies at each site in the VH or VL sequence. The second database we considered is provided by the abYsis webtool: we aligned VH and VL protein sequences using the default settings provided in the "Annotate" tool, using the database of "All" sequences as of March 1, 2022.
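As an illustration of this frequency-estimation procedure, a minimal sketch (assuming the fuzzywuzzy package named above; the mafft alignment itself would be run externally, e.g., `mafft --auto hits.fasta > aln.fasta`, and this pipeline is our reconstruction rather than released code):

```python
from fuzzywuzzy import fuzz  # Levenshtein-based similarity, as in Methods

def closest_sequences(query, database, n=2000):
    """The n database sequences most similar to `query` (fuzz.ratio is a
    normalized Levenshtein similarity scored 0-100)."""
    return sorted(database, key=lambda s: fuzz.ratio(query, s), reverse=True)[:n]

def residue_frequencies(alignment, column):
    """Amino acid frequencies at one column of a multiple sequence alignment
    (equal-length rows; '-' denotes a gap and is excluded)."""
    residues = [row[column] for row in alignment if row[column] != "-"]
    total = len(residues)
    return {aa: residues.count(aa) / total for aa in set(residues)}
```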
While our frequency estimation procedure is based on the entire UniRef90 dataset, we also sought to count the number of annotated immunoglobulin variable regions in UniRef90; to do so, we used the UniRef query tool (https://www.uniprot.org/uniref/) with the queries "name:"immunoglobulin heavy variable" AND identity:0.9", "name:"immunoglobulin kappa variable" AND identity:0.9", and "name:"immunoglobulin lambda variable" AND identity:0.9".

We evaluated the ability of the language models and algorithms used in our study to guide efficient evolution in settings beyond antibodies. We leveraged deep mutational scanning (DMS) datasets to validate that our approach would enable a researcher to acquire high-fitness variants. We used all DMS datasets from the benchmarking study by Livesey and Marsh [41] with 90% or higher coverage of all single-residue substitutions; variants that were not measured were treated as having low fitness. We also used a scanning mutagenesis dataset of the enzyme PafA [42].

To quantify the statistical significance of an enrichment, we assumed that the null distribution of the number of high-fitness, language-model-recommended variants was given by a hypergeometric distribution, with parameters given by the sample size, the number of population successes, and the population size (a sketch of this test follows the table legend below).

Raw data for this study have been deposited to Zenodo at DOI: 10.5281/zenodo.6415457. Kd, IC50, and Tm values across replicate experiments are available as Supplementary Data 1.

Supplementary Table 1: Information on the antibodies considered in each of our directed evolution campaigns. "Matured" indicates extensive somatic hypermutation from germline (and, in the case of MEDI8852, additional in-vitro affinity maturation). "Source" indicates how the antibody sequence was obtained; germline-inferred sequences were obtained from the original publications.

Supplementary Table 10 (fragment, C143 substitutions):

| Substitution | UniRef90 wildtype freq. | UniRef90 mutant freq. | UniRef90 top residue (freq.) | Category | abYsis wildtype freq. | abYsis mutant freq. | abYsis top residue (freq.) |
|---|---|---|---|---|---|---|---|
| VH V29F | 3% | 78% | -- | | 4% | 69% | -- |
| VH L51Y | 4% | 18% | -- | | 1% | <1% | I (85%) |
| VH A77T | 1% | 56% | -- | | <1% | 62% | -- |
| VH G91A | 4% | 95% | -- | | 2% | 97% | -- |
| VL N27S | 11% | 53% | -- | | 3% | 77% | -- |
| VL T33N | 2% | 44% | -- | | 2% | 36% | -- |
| VL L34Y | 4% | 34% | -- | | 2% | 59% | -- |
| VL Y41H | <1% | 6% | K (59%) | Rare to uncommon | <1% | 5% | K (67%) |
| VL G53V | 2% | 10% | A (30%) | Rare to uncommon | 4% | 12% | A (40%) |
| VL A96S | 2% | 57% | -- | | 2% | 36% | -- |

Each row corresponds to an amino acid substitution that enhances the binding affinity of its corresponding variant antibody (some of these also enhance affinity in combination with other substitutions). We computed frequencies of amino acid substitutions among natural sequences using two datasets, UniRef90 and abYsis (Methods); UniRef90 was the sequence database used to train the language models in our algorithm, and abYsis is a separate, curated database of natural antibody sequences. The "wildtype residue frequency" indicates the percentage of sequences in a multiple sequence alignment with the same residue as wildtype at the given position; the "mutant residue frequency" is the same statistic for the mutant residue. The "top residue" indicates the amino acid with the highest frequency observed at the given site, the "top residue frequency" indicates the percentage of sequences that contain the top residue at the given site, and dashes indicate settings in which the mutant residue is also the top residue. Substitutions with frequencies up to 5% are considered "rare," those with frequencies above 5% and up to 10% are considered "uncommon," and those above 10% are considered "common." Blue shading indicates substitutions to rare or uncommon residues according to frequency information from either UniRef90 or abYsis.
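The enrichment statistic reported in Supplementary Table 11 (described next) reduces to a one-sided hypergeometric tail probability; a minimal sketch (the scipy call is standard, and the example numbers in the comment are illustrative rather than values from this study):

```python
from scipy.stats import hypergeom

def enrichment_p_value(sample_size, sample_successes,
                       population_size, population_successes):
    """P(X >= sample_successes) under a hypergeometric null: drawing
    `sample_size` variants uniformly at random from a population that
    contains `population_successes` high-fitness variants."""
    return hypergeom.sf(sample_successes - 1, population_size,
                        population_successes, sample_size)

# Illustrative only: 4 of 10 recommended variants are high-fitness when
# only 7% of the full mutational scan is high-fitness.
# enrichment_p_value(10, 4, 5000, 350)  # -> a small P value (enrichment)
```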
Supplementary Table 11: The scanning mutagenesis datasets used for validation represent 90% or more coverage of all single-residue substitutions, except for that of PafA, which changes every residue to either a glycine or a valine. The "Cutoff" indicates the study-specific criterion for determining a high-fitness variant. The "Sample size" indicates the number of acquired variants ($|\mathcal{A}|$) and "Sample successes" indicates the number of those variants with high fitness according to the cutoff. The "Population size" indicates the number of variants profiled in the scanning mutagenesis assay, and "Population successes" indicates the number of those variants with high fitness according to the cutoff. "Hit rate" indicates the percentage of high-fitness variants among the language-model-recommended variants (sample successes divided by sample size), whereas "Background" indicates the percentage of high-fitness variants among all single-residue variants (population successes divided by population size). The hypergeometric P value computes the enrichment of high-fitness variants among the acquired variants by assuming that the number of sample successes has a hypergeometric null distribution with parameters given by the other values (sample size, population successes, and population size); blue shading indicates a one-sided hypergeometric P value of less than 0.05.

Supplementary Fig. 2: To obtain the set of language-model-recommended variants, we varied two parameters controlling the stringency of acquired variants (where more stringent corresponds to fewer variants): $\alpha$ is a cutoff controlling the likelihood ratio of the mutant probability to the wildtype probability, and $k$ is a cutoff controlling the number of consensus language models (Methods). (A) At varying cutoffs, we computed the percentage of variants in $\mathcal{A}$ that correspond to high-fitness variants, using scanning mutagenesis data for validation. When $\alpha = 0$ and $k = 1$, this value is equivalent to the percentage of high-fitness variants in the full scanning mutagenesis dataset (a black dashed line is also drawn at this value for each protein). In all cases except for P53, we observe that increasing the likelihood stringency generally improves the efficiency at which high-fitness variants are acquired.
In Figure 4, we report values for $\alpha = 1$, $k = 2$, except for when these cutoffs result in $|\mathcal{A}| < 5$ (infA, MAPK1, and PafA), in which case we report $\alpha = 1$, $k = 1$.

Below are the antibody protein sequences defined as wildtype in this study:

References

[1] Climbing Mount Improbable
[2] The evolutionary history of 2,658 cancers
[3] Bottlenecks in HIV-1 transmission: insights from the study of founder viruses
[4] Germinal Centers
[5] Life's solution: Inevitable humans in a lonely universe
[6] Wonderful Life: The Burgess Shale and the Nature of History
[7] Directed Evolution: Bringing New Chemistry to Life
[8] Informed training set design enables efficient machine learning-assisted directed protein evolution
[9] Evolutionary velocity with protein language models predicts evolutionary dynamics of diverse proteins
[10] Variations in Affinities of Antibodies during the Immune Response
[11] Affinity enhancement of antibodies: how low-affinity antibodies produced early in immune responses are followed by high-affinity antibodies later and in memory B-cell responses
[12] Rapid generation of potent antibodies by autonomous hypermutation in yeast
[13] Learning the protein language: Evolution, structure, and function
[14] Learning protein sequence embeddings using information from structure
[15] Learning the language of viral evolution and escape
[16] Unified rational protein engineering with sequence-based deep representation learning
[17] Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences
[18] Language models enable zero-shot prediction of the effects of mutations on protein function
[19] Evaluating Protein Transfer Learning with TAPE
[20] ProGen: Language Modeling for Protein Generation
[21] ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing
[22] UniRef: Comprehensive and non-redundant UniProt reference clusters
[23] Deciphering antibody affinity maturation with language models and weakly supervised learning
[24] Generative Language Modeling for Antibody Design
[25] Antibody design using LSTM based deep generative model from phage display library for affinity maturation
[26] Protein design and variant prediction using autoregressive generative models
[27] Structure and Function Analysis of an Antibody Recognizing All Influenza A Subtypes
[28] Protective monotherapy against lethal Ebola virus infection by a potently neutralizing antibody
[29] Cross-neutralization of SARS-CoV-2 by a human monoclonal SARS-CoV antibody
[30] Studies in humanized mice and convalescent humans yield a SARS-CoV-2 antibody cocktail
[31] Machine-learning-guided directed evolution for protein engineering
[32] Adaptive machine learning for protein engineering
[33] Optimization of therapeutic antibodies by predicting antigen specificity from antibody sequence via deep learning
[34] Deep learning guided optimization of human antibody against SARS-CoV-2 variants with broad neutralization
[35] Antibody therapies for SARS-CoV-2 infection
[36] Evolution of antibody immunity to SARS-CoV-2
[37] Affinity maturation of SARS-CoV-2 neutralizing antibodies confers potency, breadth, and resilience to viral escape mutations
[38] Structure-based design of prefusion-stabilized SARS-CoV-2 spikes
[39] abYsis: Integrated Antibody Sequence and Structure-Management, Analysis, and Prediction
[40] Mastering the game of Go with deep neural networks and tree search
[41] Using deep mutational scanning to benchmark variant effect predictors and identify disease mutations
[42] Revealing enzyme functional architecture via high-throughput microfluidic enzyme kinetics
[43] Cell-Binding Assays for Determining the Affinity of Protein-Protein Interactions
[44] ABangle: Characterising the VH-VL orientation in antibodies
[45] Affinity maturation in an HIV broadly neutralizing B-cell lineage through reorientation of variable domains
[46] Structural insights into the evolution of an antibody combining site
[47] Deep generative models of genetic variation capture the effects of mutations
[48] Disease variant prediction with deep generative models of evolutionary data
[49] Mutation effects predicted from sequence co-variation
[50] Molecular evolution by staggered extension process (StEP) in vitro recombination
[51] Optimal Design of Stochastic DNA Synthesis Protocols based on Generative Sequence Models
[52] Low-N protein engineering with data-efficient deep learning
[53] D-SCRIPT translates genome to phenome with sequence-based, structure-aware, genome-scale predictions of protein-protein interactions
[54] Using deep learning to annotate the protein universe
[55] Leveraging Uncertainty in Machine Learning Accelerates Biological Discovery and Design
[56] De Novo Design of Immunoglobulin-like Domains
[57] The coming of age of de novo protein design
[58] Protocol and Reagents for Pseudotyping Lentiviral Particles with SARS-CoV-2 Spike Protein for Neutralization Assays
[59] Isolation of potent SARS-CoV-2 neutralizing antibodies and protection from disease in a small animal model
[60] Structural and functional characterization of G protein-coupled receptors with deep mutational scanning
[61] Evolvability as a Function of Purifying Selection in TEM-1 β-Lactamase
[62] Experimental Estimation of the Effects of All Amino-Acid Mutations to HIV's Envelope Protein on Viral Replication in Cell Culture
[63] Accurate measurement of the effects of all amino-acid mutations on influenza hemagglutinin
[64] Deep mutational scanning of hemagglutinin helps predict evolutionary fates of human H3N2 influenza variants
[65] RNA Structural Determinants of Optimal Codons Revealed by MAGE-Seq
[66] Phenotypic Characterization of a Comprehensive Set of MAPK1/ERK2
[67] Mutational processes shape the landscape of TP53 mutations in human cancer

Supplementary Table legend (fragment): The wildtype row is highlighted in gray; variants with improved affinity are highlighted in blue. An asterisk (*) indicates examples where binding was observed but BLI data were not suitable for fitting apparent Kd. NB: no binding; ND: not determined; CV: coefficient of variation.

Supplementary Fig. 1: Pseudovirus neutralization of affinity-matured variants. Neutralization curves for wildtype antibodies (gray) and variants obtained by our language-model-guided affinity maturation campaigns. Also see Supplementary Tables 5, 8, and 9 for corresponding IC50 values. Points indicate the mean.

Acknowledgments

We thank Benjamin Bell and Ashwin Narayan for helpful discussions.