key: cord-0683093-5x31wvug authors: Li, Chen; Revote, Jerico; Ramarathinam, Sri H.; Chung, Shan Zou; Croft, Nathan P.; Scull, Katherine E.; Huang, Ziyi; Ayala, Rochelle; Braun, Asolina; Mifsud, Nicole A.; Illing, Patricia T.; Faridi, Pouya; Purcell, Anthony W. title: Resourcing, annotating, and analysing synthetic peptides of SARS‐CoV‐2 for immunopeptidomics and other immunological studies date: 2021-04-14 journal: Proteomics DOI: 10.1002/pmic.202100036 sha: 718a11371fb177eeb74e30701040fb743346ab1c doc_id: 683093 cord_uid: 5x31wvug
SARS‐CoV‐2 has caused a significant ongoing pandemic worldwide. A number of studies have examined the T cell mediated immune responses against SARS‐CoV‐2, identifying potential T cell epitopes derived from the SARS‐CoV‐2 proteome. Such studies will aid in identifying targets for vaccination and immune monitoring. In this study, we applied tandem mass spectrometry and proteomic techniques to a library of ∼40,000 synthetic peptides to generate a large dataset of SARS‐CoV‐2 derived peptide MS/MS spectra. On this basis, we built an online knowledgebase, termed virusMS (https://virusms.erc.monash.edu/), to document, annotate and analyse these synthetic peptides and their spectral information. VirusMS incorporates a user‐friendly interface to facilitate searching, browsing and downloading the database content. Detailed annotations of the peptides, including experimental information, peptide modifications, predicted peptide‐HLA (human leukocyte antigen) binding affinities, and peptide MS/MS spectral data, are provided in virusMS.
The SARS (severe acute respiratory syndrome)-CoV-2 virus was first identified in Wuhan, China in December 2019 and has gone on to cause a global pandemic [1, 2]. The pneumonia and other related syndromes caused by SARS-CoV-2 infection were subsequently designated COVID-19 (coronavirus disease 2019) by the WHO [3, 4].
To date, a number of studies using diverse biological techniques, including mass spectrometry, have explored and characterised the human proteome-wide functional disruptions and immune responses upon SARS-CoV-2 infection [5-21]. T cell mediated immunity plays a crucial role in controlling and eliminating viral disease [22-25]. Antigen processing and presentation are two of the most important steps of T cell mediated immunity: peptides derived from viral antigens are generated and then presented at the cell surface by class I and class II molecules of the MHC (major histocompatibility complex; in humans, the human leukocyte antigen, i.e., HLA). These peptide-MHC complexes are scrutinised by the clonally distributed T cell receptors (TCRs) expressed on the surface of T cells, with recognition of foreign peptides triggering immune responses [26-28]. A number of studies have been dedicated to the discovery of T cell epitopes derived from SARS-CoV-2 [29-33]. However, the accurate identification of SARS-CoV-2 peptides presented to the immune system by MHC molecules (collectively termed the immunopeptidome) remains challenging, even though it is critical for a better understanding of human immune responses to SARS-CoV-2, for vaccine design and for clinical monitoring of COVID-19. Given its complexity (particularly against the background of host antigen-derived peptides), mapping the viral immunopeptidome can be hampered by ambiguous spectral assignments, which often require extensive validation of peptide spectra using synthetic peptides [26]. To facilitate rapid SARS-CoV-2 peptide validation, we describe here the generation of an interactive and comprehensive online database of SARS-CoV-2 peptides, termed virusMS, which harbours a total of 39,650 synthetic peptides generated by extensive and diverse proteolytic digestion of peptide precursors derived from the viral proteome.
We synthesised SARS-CoV-2 15-mer peptides and digested them with four proteases, which generated a large dataset of SARS-CoV-2-derived peptides without the need to synthesise each peptide individually. We used two non-specific proteases (pepsin and elastase), together with trypsin and chymotrypsin to mimic the activity of the β2 (trypsin-like) and β5 (chymotrypsin-like) proteasome subunits, respectively. To the best of our knowledge, virusMS is the first database offering MS/MS information for SARS-CoV-2 synthetic peptides. For each peptide, virusMS provides comprehensive annotation regarding the experimental MS/MS information, predicted peptide-HLA (human leukocyte antigen) binding affinity for HLA class I supertypes, and the full MS/MS spectral data. A user-friendly interface supports searching and browsing the entries in virusMS. In addition, all data documented in virusMS are publicly available for customised and bulk download. We anticipate that virusMS will facilitate hypothesis generation for immunological studies and provide foundational data for the validation of immunopeptidomics studies related to SARS-CoV-2.
A peptide library (1809 peptides in total, each 15 amino acids long and overlapping its neighbour by nine amino acids) was synthesised from the entire SARS-CoV-2 proteome (Mimotopes, Australia). Peptides were dissolved in 100 μL of 5% DMSO (Figure 1A).
In this study, we performed two sets of peptide-HLA binding affinity predictions: (i) for the peptides identified experimentally and (ii) proteome-wide. NetMHC 4.0 [35, 36] was employed to predict the binding affinity between peptides and HLA class I molecules ... UniProt database (as described above). For the proteome-wide binding prediction, the peptide length was set to 8 to 14 amino acids to reflect the typical ligand lengths of HLA class I molecules. All other parameters were left at their default settings.
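The two peptide enumerations described above can be illustrated with a minimal sketch. It assumes that "overlapping by nine amino acids" corresponds to a step of six residues between consecutive 15-mers, and that a final window is added to cover the C-terminus; neither detail is stated in the text. The function names `tile_peptides` and `enumerate_kmers` and the toy sequence are hypothetical and are not part of virusMS or NetMHC.

```python
from typing import Iterator, List


def tile_peptides(protein: str, length: int = 15, overlap: int = 9) -> List[str]:
    """Split a protein into fixed-length peptides that overlap their neighbours."""
    step = length - overlap  # 15 - 9 = 6 residues between consecutive starts
    if step <= 0:
        raise ValueError("overlap must be smaller than length")
    if len(protein) <= length:
        return [protein]
    last_start = len(protein) - length
    starts = list(range(0, last_start + 1, step))
    # Assumption: add one final window so the C-terminal residues are covered.
    if starts[-1] != last_start:
        starts.append(last_start)
    return [protein[s:s + length] for s in starts]


def enumerate_kmers(protein: str, min_len: int = 8, max_len: int = 14) -> Iterator[str]:
    """Yield every contiguous sub-peptide of min_len..max_len residues."""
    for k in range(min_len, max_len + 1):
        for start in range(len(protein) - k + 1):
            yield protein[start:start + k]


if __name__ == "__main__":
    toy_protein = "ACDEFGHIKLMNPQRSTVWYACDEFGH"  # made-up 27-residue sequence
    for peptide in tile_peptides(toy_protein):
        print(peptide)
    print(sum(1 for _ in enumerate_kmers(toy_protein)), "candidate 8-14mers")
```

In such a sketch, the candidate 8 to 14-mers would be written to a file and passed to the binding predictor; the predictor call itself is deliberately left out because its exact invocation is not described in the text.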
A strong binder is defined as a peptide with a predicted rank of 0.5% or lower, and a weak binder as one with a predicted rank above 0.5% and up to 2%. Peptides with a predicted rank above 2% are regarded as non-binders.
We cross-referenced two additional databases to help users further interpret the data documented in virusMS, namely IEDB (Immune Epitope Database and Analysis Resource) [37] (data compared on 8th October 2020) and UniProt [38] (downloaded on 23rd September 2020) (Figure 1B). We mapped the peptides documented in ... For any peptide that can be mapped to IEDB, the related information, including the IEDB accession, the protein name and the species, is provided. The UniProt database was mainly used to provide coronavirus-related protein sequences for peptide mapping and proteome-wide peptide-HLA binding prediction.
Three open-source JavaScript plugins were used in virusMS to provide interactive data presentation: the NeXtProt sequence viewer (part of the NeXtProt project [39]), Lorikeet (https://github.com/UWPR/Lorikeet), and Vue.js (https://vuejs.org/). The NeXtProt sequence viewer presents the peptide and its parent protein sequences. Lorikeet provides an interactive presentation of the mass spectrometry data for peptides in virusMS, while Vue.js is used to display the predicted peptide-HLA class I binding affinity.
VirusMS
To generate a comprehensive dataset covering as many SARS-CoV-2 peptides as possible, we first synthesised an overlapping peptide library (15-mers, overlapping by nine amino acids) of the SARS-CoV-2 viral proteome. Next, we subjected peptide pools to proteolytic digestion using four ... peptides were predicted to bind to at least three HLA class I supertypes. Note that these statistics are subject to change as the database is updated. Detailed breakdowns of the peptide-HLA binding predictions can be viewed at and downloaded from virusMS.
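The rank thresholds given above (strong binder at 0.5% or lower, weak binder up to 2%, non-binder above 2%) can be expressed as a short sketch. The function names, the supertype labels and the rank values in the example are hypothetical and chosen only for illustration; they are not virusMS output.

```python
from typing import Dict

STRONG_MAX_RANK = 0.5  # % rank at or below which a peptide is a strong binder
WEAK_MAX_RANK = 2.0    # % rank at or below which a peptide is a weak binder


def classify_binder(rank_percent: float) -> str:
    """Map a predicted % rank to the binder classes used in the text."""
    if rank_percent <= STRONG_MAX_RANK:
        return "strong"
    if rank_percent <= WEAK_MAX_RANK:
        return "weak"
    return "non-binder"


def count_bound_supertypes(ranks_by_supertype: Dict[str, float]) -> int:
    """Count supertypes for which the peptide is a strong or weak binder."""
    return sum(
        1 for rank in ranks_by_supertype.values()
        if classify_binder(rank) != "non-binder"
    )


if __name__ == "__main__":
    # Made-up ranks for one peptide across four placeholder supertypes.
    example_ranks = {"supertype_1": 0.3, "supertype_2": 1.7,
                     "supertype_3": 2.0, "supertype_4": 15.0}
    for name, rank in example_ranks.items():
        print(name, classify_binder(rank))
    print("bound supertypes:", count_bound_supertypes(example_ranks))
```

A peptide would count towards the "at least three supertypes" statistic when `count_bound_supertypes` returns three or more, under the assumption that both strong and weak binders are counted as binding.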
VirusMS provides several major functionalities, as demonstrated in ... Figure 3A) directly to extract all the peptides harvested from the experiment (Figure 3B).
Searching virusMS is straightforward. Four types of keyword are supported for efficient searching: peptide sequence, peptide source protein, peptide modification type, and peptide-HLA binding (Figure 4). VirusMS uses a "fuzzy search" strategy when processing user-provided peptide sequences (Figure 4A). As such, users do not need to provide a full ... Users can perform binding searches within a total of 24 combinations to directly extract the peptides of interest.
We have designed an informative webpage (Figure 5) to present the experimental data (including mass spectrometry data and peptide modifications), prediction information (such as peptide-HLA binding affinity), and other identity information obtained by cross-referencing third-party databases including IEDB and UniProt. For each peptide, the page shows basic peptide information (e.g., mass, ppm, m/z, normalised retention time, and -10logP) (Figure 5A), the peptide-HLA-I binding prediction (Figure 5B), and the MS data displayed using the Lorikeet JavaScript plugin (https://github.com/jmchilton/lorikeet), which allows users to investigate the peptide spectral data interactively, including ions, mass type, peak alignment and labels (Figure 5C). In addition, users can compare the spectral data of the same peptide with different modifications by clicking the "Compare the spectral data with other identical peptides" link. All MS data are then displayed using the Lorikeet plugin on a newly opened webpage. For peptides with modification(s), the modified residue(s) are highlighted by the Lorikeet plugin. We further mapped the peptide sequence to the IEDB platform and extracted the possible IEDB accession for this peptide.
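The two kinds of sequence lookup mentioned in this section can be contrasted in a minimal sketch. It assumes that the "fuzzy search" behaves like a case-insensitive substring match and that the cross-reference to IEDB uses exact sequence identity; the text does not specify how virusMS implements either. The function names, toy peptides and the placeholder accession are all hypothetical.

```python
from typing import Dict, List, Optional


def partial_sequence_search(query: str, peptides: List[str]) -> List[str]:
    """Return every peptide containing the query as a contiguous substring."""
    q = query.strip().upper()
    if not q:
        return []  # an empty query matches nothing rather than everything
    return [p for p in peptides if q in p.upper()]


def lookup_accession(peptide: str, reference: Dict[str, str]) -> Optional[str]:
    """Return an accession only when the peptide sequence matches exactly."""
    return reference.get(peptide.strip().upper())


if __name__ == "__main__":
    # Toy sequences and a placeholder accession, not real database entries.
    library = ["ACDEFGHIK", "DEFGHIKLM", "MNPQRSTVW"]
    reference = {"DEFGHIKLM": "PLACEHOLDER-0001"}

    print(partial_sequence_search("efgh", library))
    print(lookup_accession("DEFGHIKLM", reference))
    print(lookup_accession("DEFGHIK", reference))
```

The design point is that a partial query may return several peptides, whereas the cross-reference returns at most one accession per sequence and none for a fragment of a known sequence.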
Note that the accession ID was obtained by sequence matching against a range of coronavirus species (see Section 2 for more details); the match therefore only indicates that an entry with an identical peptide sequence exists and is intended to provide related information about that sequence. In addition to the experimental information, virusMS provides predicted peptide-HLA binding affinities (Figure 5B). The detailed prediction results, including binding affinity (nM) and rank of binding level (%), are illustrated using two interactive ...
VirusMS allows both bulk and customisable download of the metadata in the database. Several download options are offered (Figure 6). First, users can download either selected or all search results obtained with keywords in virusMS by clicking the "Download Selected" or the "Download All" button, respectively (Figure 6A). Alternatively, for a peptide of interest, users can click the "Download the metadata" button on the webpage showing the detailed peptide information (Figure 6B). The downloaded file is in TSV (Tab Separated Values) format and contains all the information for the selected peptide(s), including the virus, experimental description, basic peptide information, modification, spectral data, and peptide-HLA binding predictions (Figure 6C). In addition, we have exported the entire database to a SQL file, which can be easily imported into the MySQL database workbench or the phpMyAdmin management system (Figure 6D). The peptide spectral data are also available for MS users to generate libraries.
To keep up with research progress, we endeavour to incorporate state-of-the-art literature into virusMS in a timely manner. Individual submissions are also welcome; users should contact us to arrange the next steps.
In this study, two large-scale synthetic peptide datasets from the SARS- ... open-access and all the information in the database is freely available.
We hope virusMS will serve as an essential data resource for immunological and vaccine studies of SARS-CoV-2 and COVID-19. As more data become available, we will continue to expand the content of the virusMS database, including further coverage of the viral proteome and additional MS platforms such as the Bruker TIMS TOF Pro.
[dataset] VirusMS is accessible at https://virusms.erc.monash.edu/. The mass spectrometry proteomics data of SARS-CoV-2 are available at the ProteomeXchange Consortium with the dataset identifier PXD022191.
The authors declare no conflict of interest.
VirusMS is freely available for academic purposes at https://virusms.
[1] Emergence of a novel coronavirus, severe acute respiratory syndrome coronavirus 2: Biology and therapeutic options
[2] Clinical characteristics of refractory COVID-19 pneumonia in Wuhan, China
[3] The Novel Coronavirus Originating in Wuhan, China
[4] Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: A retrospective cohort study
[5] Proteomics of SARS-CoV-2-infected host cells reveals therapy targets
[6] The global phosphorylation landscape of SARS-CoV-2 infection
[7] A SARS-CoV-2 protein interaction map reveals targets for drug repurposing
[8] Proteomic and metabolomic characterization of COVID-19 patient sera
[9] Proteomics and informatics for understanding phases and identifying biomarkers in COVID-19 disease
[10] Site-specific N-glycosylation characterization of recombinant SARS-CoV-2 spike proteins. Molecular and Cellular Proteomics
[11] Harnessing CAR T-cell insights to develop treatments for hyperinflammatory responses in patients with COVID-19
[12] T cell responses in patients with COVID-19
[13] Targets of T cell responses to SARS-CoV-2 coronavirus in humans with COVID-19 disease and unexposed individuals
[14] Suboptimal SARS-CoV-2-specific CD8+ T cell response associated with the prominent HLA-A*02:01 phenotype
[15] Humoral and circulating follicular helper T cell responses in recovered patients with COVID-19
[16] COVID-19 patients display distinct SARS-CoV-2 specific T-cell responses according to disease severity
[17] SARS-CoV-2-specific T cell immunity in cases of COVID-19 and SARS, and uninfected controls
[18] Selective and cross-reactive SARS-CoV-2 T cell epitopes in unexposed humans
[19] Cell-mediated and humoral adaptive immune responses to SARS-CoV-2 are lower in asymptomatic than symptomatic COVID-19 patients
[20] Divergent SARS-CoV-2-specific T and B cell responses in severe but not mild COVID-19 patients
[21] Single-cell landscape of immunological responses in patients with COVID-19
[22] Memory T-cell-mediated immune responses specific to an alternative core protein in hepatitis C virus infection
[23] T cell-mediated immune response to respiratory coronaviruses
[24] Not just antibodies: B cells and T cells mediate immunity to COVID-19
[25] T cell mediated immunity to influenza: Mechanisms of viral control
[26] Mass spectrometry-based identification of MHC-bound peptides for immunopeptidomics
[27] The known unknowns of antigen processing and presentation
[28] Pathways of antigen processing and presentation
[29] Shortlisting SARS-CoV-2 peptides for targeted studies from experimental data-dependent acquisition tandem mass spectrometry data
[30] SARS-CoV-2-derived peptides define heterologous and COVID-19-induced T cell recognition
[31] Two linear epitopes on the SARS-CoV-2 spike protein that elicit neutralising antibodies in COVID-19 patients
[32] Evidence supporting the use of peptides and peptidomimetics as potential SARS-CoV-2 (COVID-19) therapeutics
[33] SARS-CoV-2 infected cells present HLA-I peptides from canonical and out-of-frame ORFs
[34] The ProteomeXchange consortium in 2020: Enabling 'big data' approaches in proteomics
[35] Gapped sequence alignment using artificial neural networks: Application to the MHC class I system
[36] Reliable prediction of T-cell epitopes using neural networks with novel sequence representations
[37] The Immune Epitope Database (IEDB): 2018 update
[38] UniProt: A worldwide hub of protein knowledge
[39] The neXtProt knowledgebase in 2020: Data, tools and usability improvements
[40] The PRIDE database and related tools and resources in 2019: Improving support for quantification data