key: cord-0996079-b3aueaxk
authors: Shoukat, M. S.; Foers, A. D.; Woodmansey, S.; Evans, S. C.; Fowler, A.; Soilleux, E.
title: Use of machine learning to identify a T cell response to SARS-CoV-2
date: 2021-01-16
journal: Cell Rep Med
DOI: 10.1016/j.xcrm.2021.100192
sha: 60f675dafe76a57e7db9405317f697739386086e
doc_id: 996079
cord_uid: b3aueaxk

The identification of SARS-CoV-2-specific T cell receptor (TCR) sequences is critical for understanding T cell responses to SARS-CoV-2. Accordingly, we reanalyse publicly available data from SARS-CoV-2-recovered patients who had low severity disease (n=17) and SARS-CoV-2 infection-naïve (control) individuals (n=39). Applying a machine learning approach to TCR beta (TRB) repertoire data, we can classify patient/control samples with a training sensitivity, specificity and accuracy of 88.2%, 100%, and 96.4%, and a testing sensitivity, specificity and accuracy of 82.4%, 97.4%, and 92.9%, respectively. Interestingly, the same machine learning approach cannot separate SARS-CoV-2-recovered from SARS-CoV-2 infection-naïve individual samples on the basis of B cell receptor (IGH) repertoire data, suggesting that the T cell response to SARS-CoV-2 may be more stereotyped and longer-lived. Following validation in larger cohorts, our method may be useful in detecting protective immunity acquired through natural infection or in determining the longevity of vaccine-induced immunity.

infection-naïve (control) individuals, as a first step towards identifying a signal that might indicate that an individual has protective immunity to SARS-CoV-2. Detection of such a signal could be useful in indicating that an individual has developed protective immunity through natural infection or in post-vaccination follow-up, when considering the longevity of vaccine-induced immunity.
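As a minimal worked check of the sensitivity, specificity and accuracy figures reported in the abstract, the sketch below recomputes them from confusion-matrix counts. The counts themselves are an assumption: they are not stated in the text, but are inferred as the integer counts consistent with the reported percentages and the cohort sizes (17 recovered, 39 control). The function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) as percentages, rounded to 1 d.p."""
    sensitivity = 100 * tp / (tp + fn)          # recovered samples correctly called
    specificity = 100 * tn / (tn + fp)          # control samples correctly called
    accuracy = 100 * (tp + tn) / (tp + fn + tn + fp)
    return tuple(round(v, 1) for v in (sensitivity, specificity, accuracy))


# Assumed counts, inferred from the reported percentages (n=17 recovered, n=39 control).
training = classification_metrics(tp=15, fn=2, tn=39, fp=0)
testing = classification_metrics(tp=14, fn=3, tn=38, fp=1)

print("training:", training)
print("testing: ", testing)
```

Under these assumed counts, the training figures correspond to two misclassified recovered samples, and the testing figures to three misclassified recovered samples and one misclassified control.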
Until a safe and effective vaccine becomes widely available, determining likely protective immunity to SARS-CoV-2 at an individual level is paramount [2], both for healthcare personnel and to enable wider society to return to a level of normality, with attendant economic recovery. Immune status may be assessed by testing for SARS-CoV-2-specific antibodies, although it remains unclear whether the presence of antibodies confers robust SARS-CoV-2 protective immunity [3], and studies of other human coronaviruses have demonstrated re-infection despite the presence of virus-specific antibodies [4]. Antibody decay, a recognised phenomenon in response to other human coronaviruses [5], occurs in post-COVID-19 patients [2, 6]. This is more common in individuals who experienced mild/asymptomatic infection and had low antibody titres in the early convalescent period [7]. Complete absence of a detectable antibody response is also described in some individuals following mild/asymptomatic

To investigate whether TCR/BCR analysis can identify patients with evidence of adaptive

When we applied our bioinformatic algorithm [16] to Schultheiß's TCR beta (TRB) repertoire dataset, samples were classified with a training sensitivity, specificity and overall accuracy of 88.2%, 100%, and 96.4%, respectively (Figure 1B and 1C). Due to the relatively small sizes of the cohorts (low-severity infection: n=17; uninfected: n=39), we were unable to use a fully independent test cohort. We therefore undertook a leave-one-out cross-validation (LOOCV) approach, achieving a testing sensitivity, specificity and overall accuracy of 82.4%, 97.4%, and 92.9%, respectively (Figure 1C).

This is a small and preliminary study, utilising an analytical approach that we have previously successfully applied to the diagnosis of coeliac disease using duodenal biopsy samples [16].
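The leave-one-out cross-validation procedure mentioned above can be sketched generically as follows. This is an illustration only, under stated assumptions: the nearest-centroid rule is a toy stand-in and is not the classifier used in the study, the one-dimensional scores are hypothetical rather than data from the cohort, and the names `nearest_centroid_predict` and `leave_one_out` are invented for this example.

```python
from statistics import mean


def nearest_centroid_predict(train_x, train_y, x):
    """Toy stand-in classifier: assign x to the class whose mean score is closest."""
    centroids = {
        label: mean(v for v, y in zip(train_x, train_y) if y == label)
        for label in set(train_y)
    }
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def leave_one_out(xs, ys, predict):
    """Hold out each sample once, train on the remainder, predict the held-out sample."""
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        predictions.append(predict(train_x, train_y, xs[i]))
    return predictions


# Hypothetical per-sample scores and labels (not data from the study).
scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3, 0.6]
labels = ["recovered"] * 3 + ["naive"] * 4

preds = leave_one_out(scores, labels, nearest_centroid_predict)
correct = sum(p == y for p, y in zip(preds, labels))
print(f"{correct}/{len(labels)} held-out samples classified correctly")
```

Because every sample is predicted by a model that never saw it, the resulting error count estimates test performance without a separate cohort; it does not replace validation on independently collected data.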
A larger dataset, with separate training and test sets, is required to corroborate our findings. Such a dataset will ideally need to have been generated using the same T/B cell receptor repertoire sequencing methodology, preferably in one or more separate laboratories,

in different thirds of the CDR3 region, were treated as non-identical, as described previously [16]. Each kmer was assigned the frequency of its parent CDR3 sequence, and a matrix of kmer sequences and their total frequencies was generated for each sample. The 1000 most frequent kmers in each patient sample were then selected for sample classification.

Sample classification was performed as previously described [16]. Briefly, kmer matrix

Memory T cell responses targeting the SARS coronavirus persist up to 11 years post-infection
Learning the high-dimensional immunogenomic features that predict public and private antibody repertoires
Statistical classifiers for diagnosing disease from immune repertoires: a case study using multiple sclerosis

s.d.%) training accuracy across all permutations

We are grateful to Schultheiß et al. for the provision of the publicly