key: cord-271849-wxmr8eki authors: Meysman, Pieter; Postovskaya, Anna; De Neuter, Nicolas; Ogunjimi, Benson; Laukens, Kris title: Tracking SARS-CoV-2 T cells with epitope-T-cell receptor recognition models date: 2020-09-09 journal: bioRxiv DOI: 10.1101/2020.09.09.289355 sha: doc_id: 271849 cord_uid: wxmr8eki

Much is still not understood about the human adaptive immune response to SARS-CoV-2, the causative agent of COVID-19. In this paper, we demonstrate the use of machine learning to classify SARS-CoV-2 epitope-specific T-cell clonotypes in T-cell receptor (TCR) sequencing data. We apply these models to public TCR data and show how they can be used to study longitudinal T-cell profiles in COVID-19 patients, to characterize how the adaptive immune system reacts to the SARS-CoV-2 virus. Our findings confirm prior knowledge that SARS-CoV-2 reactive T-cell diversity increases over the course of disease progression. However, our results show a difference between T cells that react to epitopes unique to SARS-CoV-2, which show a more prominent increase, and T cells that react to epitopes common to other coronaviruses, which begin at a higher baseline.

The emergence of a novel coronavirus in 2019, termed SARS-CoV-2, has led to the most prominent global pandemic in recent history. Infection by SARS-CoV-2 manifests as COVID-19, a disease whose symptoms and severity vary greatly and which has caused substantial loss of life all over the world. Characterizing the immune response against this novel virus has become a top priority, in the hope that novel insights can lead to new treatment plans or can aid in vaccine development. Special attention is now being paid to the T-cell response in particular, which was already determined to be an important factor in long-term immunity against coronaviruses during the SARS and MERS outbreaks (1-3).
Moreover, SARS-specific T cells are still found in individuals 17 years later and demonstrate robust cross-reactivity against SARS-CoV-2, suggesting the possibility of long-term protection against SARS-CoV-2 (4). Current findings indicate that T cells play a key role in susceptibility to and severity of COVID-19 as well. In particular, disease severity has been found to be linked with a sub-optimal or excessive T-cell response (5). While most patients present with high antibody titers (6), these antibodies frequently do not provide the necessary virus neutralization (7). Moreover, some studies have reported that around 30% of patients recover with very low levels of neutralizing antibodies or none at all (8). Most patients, however, developed a CD8+ T-cell response (9), with cells predominantly showing activation signals (10). Activated CD4+ T cells were also detected in the majority of patients, sometimes even reaching 100% in convalescent patients (9). Interestingly, 83% of patients were found to have CD4+ T cells specific to the SARS-CoV-2 spike protein in one study (11), and the magnitude of the antibody response was correlated with the CD4+ T-cell response of the same protein specificity in another (9). Surprisingly, T cells specific to SARS-CoV-2 epitopes were even found in a high proportion of unexposed individuals (4, 9, 11), which was later demonstrated to be due to cross-reactivity with common cold coronaviruses (12) and might explain pre-existing protection from SARS-CoV-2 in some individuals. Altogether, these findings support a key role for T cells in the immune defense against SARS-CoV-2 and are important considerations for vaccine development.
A multitude of computational approaches have been applied to discover T-cell epitopes for rational SARS-CoV-2 vaccine design, from different perspectives: genetic similarity between SARS and SARS-CoV-2 (9, 13-15), previous knowledge about SARS immunogenic epitopes (13), de novo epitope prediction with respect to affinity to MHC, estimated using structural biology (16, 17) or machine learning (13, 14, 17), and MHC distribution across populations (14). The different approaches resulted in varying numbers of potential targets across groups. For instance, Lee et al. reported 28 SARS-CoV-2 peptides identical to those of SARS (13), and Fast et al. reported 405 de novo candidate T-cell epitopes (17). For some predictions, experimental validation was later performed by different groups (18).

One key technique that facilitates insights into the makeup of an individual's T-cell repertoire is high-throughput T-cell receptor sequencing. Several TCR sequencing studies have now been performed in COVID-19 patients and have revealed some characteristics of the T-cell receptor repertoire during COVID-19. One study, for example, found clusters of T cells tied to disease severity, which predominantly comprised public clonotypes (19). Another group identified statistically enriched public TCR sequences from a large number of repertoires to distinguish SARS-CoV-2 positive from healthy individuals (20). In addition, several studies have determined SARS-CoV-2 specific TCRs and distinct CDR3 motifs down to the individual epitope level (18, 20). However, one key downside of T-cell receptor sequencing is the high diversity of TCR sequences across individuals. Only a handful of TCR sequences can be found across different individuals. These so-called 'public' clones can therefore be tracked across individuals and stored in databases for reference.
However, TCR repertoires consist mostly of individual-specific sequences, which by their nature cannot be expected to be found in a database. Recently, computational methods have been developed to convert epitope-TCR pairing data into predictive models (21) that generalize the epitope-TCR specificity determinants. Such models can be used to screen TCR repertoires to find additional potential epitope-specific TCRs that are not contained in any database (22). These models are based on the concept that TCRs targeting the same epitope tend to have similar amino acid sequences (23). In this study, we create such prediction models for SARS-CoV-2 epitopes and apply them to track epitope specificity over time.

Epitope-TCR data. A collection of experimentally validated TCR-epitope pairs was established by combining two primary sources:
• The VDJdb database, which contained tetramer-derived data from Shomuradova et al. (18). Accessed on the 26th of May, 2020.
• The ImmuneCODE collection from Adaptive Biotechnologies and Microsoft, which contained pairs derived through the MIRA assay (20). Accessed on the 25th of June, 2020.
For all extracted pairs, several data curation steps were performed. All pairs matching more than one possible SARS-CoV-2 epitope were removed from the training data. Only valid TCR sequences that could be matched to the IMGT standard were kept. Where needed, the data was capped at 5000 unique TCRs, which were selected randomly.

Longitudinal TCR data. Public repertoire data was retrieved through the iReceptor gateway (24) in the week of the 13th of July, 2020. In particular, TCR data was extracted from those studies that had longitudinal tracking of COVID-19 patients, namely Minervina et al. (25) and Schultheiß et al. (19). Only those TCRs that occur at a frequency of at least 1 in 100 000 were retained, to compensate for the different sequencing depths between studies.
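The curation and depth-filtering steps described above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the function names, the representation of a TCR as a plain string, and the assumption that the cap of 5000 unique TCRs applies per epitope are all hypothetical, and the IMGT validity check is omitted.

```python
import random
from collections import defaultdict


def curate_pairs(pairs, max_tcrs=5000, seed=0):
    """Curate (tcr, epitope) pairs into a dict: epitope -> sorted unique TCRs.

    Hypothetical helper. TCRs paired with more than one epitope are dropped,
    and each epitope is capped at `max_tcrs` randomly selected unique TCRs
    (applying the cap per epitope is an assumption).
    """
    epitopes_per_tcr = defaultdict(set)
    for tcr, epitope in pairs:
        epitopes_per_tcr[tcr].add(epitope)

    tcrs_per_epitope = defaultdict(set)
    for tcr, epitopes in epitopes_per_tcr.items():
        if len(epitopes) == 1:  # keep only unambiguous pairs
            tcrs_per_epitope[next(iter(epitopes))].add(tcr)

    rng = random.Random(seed)  # fixed seed so the random cap is reproducible
    curated = {}
    for epitope, tcrs in tcrs_per_epitope.items():
        tcrs = sorted(tcrs)
        if len(tcrs) > max_tcrs:
            tcrs = sorted(rng.sample(tcrs, max_tcrs))
        curated[epitope] = tcrs
    return curated


def filter_by_frequency(clone_counts, per=100_000):
    """Keep clonotypes with a frequency of at least 1 in `per`.

    `clone_counts` maps a TCR to its read (or template) count in one sample.
    Integer arithmetic avoids floating-point edge cases at the threshold.
    """
    total = sum(clone_counts.values())
    return {tcr: c for tcr, c in clone_counts.items() if c * per >= total}
```

Filtering on relative frequency rather than on raw counts is what makes repertoires of different sequencing depth comparable: a clonotype seen once in a shallow sample and one seen ten times in a sample ten times deeper are treated alike.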
The metadata of these data sets was made uniform so that the time points are annotated as days after onset of symptoms.

Protein and sequence data. Protein sequence data for 119 Nidovirales species, which included the human and non-human SARS viruses along with other coronaviruses and single-strand RNA viruses, were downloaded from the Corona OMA Orthology Database (26). The protein amino acid sequences used for SARS-CoV-2 thus correspond to GenBank accession GCA_009858895.3, and those for SARS-CoV to GCA_000864885.1. Epitopes were matched to all proteins of all 119 species with an exact match, as the degree of variation allowed in the epitope space while retaining TCR recognition is still an unsolved question. Matches across all species were tallied for each epitope, and the annotation for SARS-CoV-2 was retained. Sequence identity between proteins was established using a pairwise protein BLAST.

Model training and application. For the machine learning part of this study, we made use of the TCRex framework (22). We trained models for all epitopes that had more than 30 distinct TCRs. Only those models that had an AUC ROC higher than 0.7 and an AUC PR higher than 0.35 in a cross-validated setting were retained, as per the default TCRex criteria. The models were then applied to full TCR repertoires, where a TCR was considered a probable epitope-specific TCR if its score was higher than 0.9 and its BPR lower than 1e-4. For normalization, reported hits were divided by the unique TCR repertoire size.

Recognition model performance. In total, 47 distinct epitope TCRex models could be trained for SARS-CoV-2. An overview of all models and their performance can be found in Table S1. This is almost as many models as are available for all non-SARS-CoV-2 epitopes combined (49), indicating the vast amount of data that has been generated in the past few months compared to what has been collected for all prior pathogens and diseases.
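The model retention and hit-calling thresholds given under "Model training and application" can be written down compactly. The sketch below only encodes those thresholds; the `Prediction` record and the function names are hypothetical and do not mirror the TCRex output format, and counting each TCR once even when it matches several epitopes is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Hypothetical record for one TCR scored by one epitope model."""
    tcr: str
    epitope: str
    score: float  # classifier score for the TCR-epitope pair
    bpr: float    # BPR value reported alongside the score


def keep_model(auc_roc, auc_pr, min_roc=0.7, min_pr=0.35):
    """Retain a model only if both cross-validated metrics exceed the cut-offs."""
    return auc_roc > min_roc and auc_pr > min_pr


def call_hits(predictions, min_score=0.9, max_bpr=1e-4):
    """Return predictions with score > 0.9 and BPR < 1e-4."""
    return [p for p in predictions if p.score > min_score and p.bpr < max_bpr]


def normalized_hit_fraction(predictions, repertoire):
    """Fraction of unique repertoire TCRs with at least one hit.

    Assumption: a TCR hitting several epitopes is counted once.
    """
    unique_tcrs = set(repertoire)
    if not unique_tcrs:
        return 0.0
    hit_tcrs = {p.tcr for p in call_hits(predictions) if p.tcr in unique_tcrs}
    return len(hit_tcrs) / len(unique_tcrs)
```

Dividing by the number of unique TCRs, rather than by the total read count, yields a measure of the diversity of predicted reactive clonotypes and not of their clonal expansion.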
Of the 47 epitopes with a model, 24 match the SARS-CoV-2 replicase protein encoded by ORF1ab, 16 match the SARS-CoV-2 spike protein encoded by ORF2, and the final 7 are distributed across the remaining proteins. In addition, 19 of the epitopes are unique to SARS-CoV-2 in our data set of 119 Nidovirales species. As can be seen in Figure 1, these unique epitopes do not seem to be evenly distributed across the proteins of origin. 9 out of the 16 epitopes derived from the spike protein are unique to SARS-CoV-2, whereas only 6 out of 24 are unique for the ORF1ab replicase protein. As previously reported (27), the spike protein of SARS-CoV-2 shares 76% amino acid sequence identity with that of SARS-CoV. In contrast, the replicase protein has 86% sequence identity between SARS-CoV-2 and SARS-CoV. It is well known that viruses accumulate mutations to avoid potential immunogenic epitopes (28, 29), and the same process may be playing a role here.

As we are integrating models from different resources and diverse experimental methods, we wished to confirm that these data are comparable. Interestingly, one epitope has both tetramer (315 TCRs) and MIRA data (366 TCRs), namely YLQPRTFLL (YLQ). However, in the case of the MIRA data, the TCRs were not uniquely assigned to this epitope, but all were assigned to the trio YLQPRTFL, YLQPRTFLL, and YYVGYLQPRTF. These 366 TCRs were thus excluded from the training data. As a result, the tetramer-based YLQ model can be applied to the MIRA YLQ data as an independent validation. In this manner, TCRex predicted 81 putative YLQ-reactive T cells among the 366 TCRs in the YLQ MIRA data. Note that only 35 TCRs matched between the two data sets based on CDR3 sequence (not accounting for V/J genes), showing that TCRex is able to extrapolate from learned TCR patterns. This number of TCRex predictions was assigned an enrichment P-value of 6.44e-246 based on the built-in binomial test.
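To illustrate the kind of calculation behind such an enrichment P-value, the sketch below computes an upper-tail binomial probability in log space, which keeps the extremely small terms numerically stable. It is not the TCRex implementation: the function names are hypothetical, and using the BPR cut-off of 1e-4 as the per-TCR null rate is an assumption, so the result should not be expected to reproduce the reported value exactly.

```python
import math


def log_binom_pmf(k, n, p):
    """Natural log of the Binomial(n, p) probability mass at k, for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def binomial_enrichment_pvalue(hits, n, p):
    """Upper-tail probability P(X >= hits) for X ~ Binomial(n, p).

    `p` is the assumed probability that a random TCR is called a hit.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if hits <= 0:
        return 1.0
    if hits > n:
        return 0.0
    logs = [log_binom_pmf(k, n, p) for k in range(hits, n + 1)]
    top = max(logs)
    # log-sum-exp: factor out the largest term before exponentiating
    log_tail = top + math.log(sum(math.exp(v - top) for v in logs))
    return min(1.0, math.exp(log_tail))


# 81 predicted hits among 366 TCRs, with an assumed null rate of 1e-4 per TCR
p_value = binomial_enrichment_pvalue(81, 366, 1e-4)
```

Under this assumed null rate, fewer than one hit would be expected by chance among 366 TCRs (366 × 1e-4 ≈ 0.04), which is why 81 hits yields a vanishingly small P-value.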
In the MIRA YLQ data, no other epitope model present in TCRex (including the 49 non-SARS-CoV-2 models) predicted even a single matching TCR. Thus the data are comparable and the models can be used irrespective of their origin.

Longitudinal tracking. Once established, these models can be applied to any TCR repertoire data and thus can be used to study putative SARS-CoV-2 reactive T cells in the currently available COVID-19 data. A large TCR data set was made available by Schultheiß et al. (19), which featured longitudinal samples from both patients with active disease and those who have recovered. As can be seen in Figure 2, the percentage of predicted SARS-CoV-2 reactive TCRs increases with time after onset of symptoms (Spearman rho = 0.36, P-value = 0.0022). This is due both to an expansion of distinct SARS-CoV-2 TCR sequences and to a contraction of the remainder of the TCR repertoire. This matches findings from prior studies investigating T-cell immunity, which have observed that SARS-CoV-2 specific T-cell immunity mounts as disease progresses, alongside an absolute decrease in T-cell population size (30, 31). In addition, the increase is irrespective of final disease outcome.

As can be seen in Figure 3, there seems to be a difference between TCRs predicted to target epitopes that are unique to SARS-CoV-2 and those predicted to target epitopes that are not. The fraction of TCRs that are predicted to match unique SARS-CoV-2 epitopes shows a similar increase to that found for all epitopes (Spearman rho = 0.36, P-value = 0.0021). In contrast, the fraction for epitopes occurring across coronaviruses shows a markedly weaker increase that is not significant (Spearman rho = 0.23, P-value = 0.057). Indeed, cross-reactive TCRs predicted to target epitopes not unique to SARS-CoV-2 start at a higher level, and only seem to gradually increase as the infection progresses. This may be in line with prior reports of existing cross-reactive T cells in uninfected individuals (9, 11, 32).
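The rank correlation used in this longitudinal analysis can be sketched as follows: Spearman's rho is the Pearson correlation of the ranks, with tied values sharing their average rank. The function names are hypothetical and the sketch returns only the coefficient, not the P-value; in practice a statistics library would normally be used for both.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman rank correlation between x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))
```

Here `x` would hold the days after onset of symptoms for each sample and `y` the normalized fraction of predicted SARS-CoV-2 reactive TCRs. A rank-based measure suits this setting because it only assumes a monotonic trend, not a linear one.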
The number of SARS-CoV-2 TCRs does decrease once the patient enters recovery. An example can be seen in Figure S1. This can also be seen in the TCR data from Minervina et al. (25), where both donors were sampled after symptoms had disappeared (Figure S2). In that study, T cells that contracted across time points within a single individual were considered to be associated with SARS-CoV-2. Lists of 661 and 372 contracted TCRs (one set for each donor) were identified using EdgeR in the original study and published as supplemental materials. At the time, no epitope-specific TCR data was available for SARS-CoV-2 and thus these TCRs could not be analysed in this manner. If we apply all TCRex models (both SARS-CoV-2 and other viruses), we find strong enrichment for two SARS-CoV-2 epitopes in one donor (donor M), as can be seen in Figure 4. The TCRs associated with the most prominent epitope, namely YLQPRTFLL, clearly decrease over time in donor M in the original TCR data set, as can be seen in Figure S3. Note that the YLQ epitope originates from the spike protein and is unique to SARS-CoV-2, which matches the previous findings. The other donor had no such enriched epitopes of any origin, indicating that these contracted TCRs may target a set of epitopes not included in our models.

In this paper, we have shown that there is sufficient SARS-CoV-2 epitope-TCR data to create a large number of epitope-specific TCR recognition models. These models can be used to screen TCR data from various individuals to track their T-cell immunity. In addition, using such models on longitudinal data reveals a potential difference in temporal dynamics between T cells predicted to react against epitopes that are unique to SARS-CoV-2 and those that are shared among other coronaviruses.
1. T-cell immunity of SARS-CoV: Implications for vaccine development against MERS-CoV
2. Understanding the T cell immune response in SARS coronavirus infection
3. Long-lived effector/central memory T-cell responses to severe acute respiratory syndrome coronavirus (SARS-CoV) S antigen in recovered SARS patients
4. SARS-CoV-2-specific T cell immunity in cases of COVID-19 and SARS, and uninfected controls
5. T cell responses in patients with COVID-19
6. Kinetics of SARS-CoV-2 specific IgM and IgG responses in COVID-19 patients
7. Convergent Antibody Responses to SARS-CoV-2
8. Neutralizing Antibody Responses to SARS-CoV-2 in a COVID-19 Recovered Patient Cohort and Their Implications
9. A Sequence Homology and Bioinformatic Approach Can Predict Candidate Targets for Immune Responses to SARS-CoV-2
10. Eliisa Kekäläinen, and Petter Brodin. Systems-Level Immunomonitoring from Acute to Recovery Phase of Severe COVID-19
11. Presence of SARS-CoV-2 reactive T cells in COVID-19 patients and healthy donors. medRxiv
12. Selective and cross-reactive SARS-CoV-2 T cell epitopes in unexposed humans
13. In silico identification of vaccine targets for 2019-nCoV
14. Bioinformatic prediction of potential T cell epitopes for SARS-CoV-2
15. Preliminary Identification of Potential Vaccine Targets for the COVID-19 Coronavirus (SARS-CoV-2) Based on SARS
16. Immunoinformatics-aided identification of T cell and B cell epitopes in the surface glycoprotein of 2019-nCoV
17. Potential T-cell and B-cell Epitopes of 2019-nCoV
18. SARS-CoV-2 epitopes are recognized by a public and diverse repertoire of human T-cell receptors. medRxiv
19. Next-Generation Sequencing of T and B Cell Receptor Repertoires from COVID-19 Patients Showed Signatures Associated with Severity of Disease
20. Magnitude and Dynamics of the T-Cell Response to SARS-CoV-2 Infection at Both Individual and Population Levels
21. On the feasibility of mining CD8+ T cell receptor patterns underlying immunogenic peptide recognition
22. Detection of Enriched T Cell Epitope Specificity in Full T Cell Receptor Sequence Repertoires
23. On the viability of unsupervised T-cell receptor sequence clustering for epitope preference
24. iReceptor: A platform for querying and analyzing antibody/B-cell and T-cell receptor repertoire data across federated repositories
25. Longitudinal high-throughput TCR repertoire profiling reveals the dynamics of T cell memory formation after mild COVID-19 infection
26. The OMA orthology database in 2018: Retrieving evolutionary relationships among all domains of life through richer web and programmatic interfaces
27. Phylogenetic Analysis and Structural Modeling of SARS-CoV-2 Spike Protein Reveals an Evolutionary Distinct and Proteolytically Sensitive Activation Loop
28. Viruses selectively mutate their CD8+ T-cell epitopes-a large-scale immunomic analysis
29. Immunological evasion of immediate-early varicella zoster virus proteins
30. Characteristics of Peripheral Lymphocyte Subset Alteration in COVID-19 Pneumonia. The Journal of infectious diseases
31. Longitudinal characteristics of lymphocyte responses and cytokine profiles in the peripheral blood of SARS-CoV-2 infected patients
32. SARS-CoV-2 reactive T cells in uninfected individuals are likely expanded by beta-coronaviruses. bioRxiv

We wish to thank the scientists who have made their data available, and in doing so have made these studies possible.