key: cord-0890655-usr673cu
authors: Dai, Yufeng; Chen, Hongzhi; Zhuang, Siqi; Feng, Xiaojing; Fang, Yiyuan; Tang, Haoneng; Dai, Ruchun; Tang, Lingli; Liu, Jun; Ma, Tianmin; Zhong, Guangming
title: Immunodominant regions prediction of nucleocapsid protein for SARS-CoV-2 early diagnosis: a bioinformatics and immunoinformatics study
date: 2020-11-16
journal: Pathogens and global health
DOI: 10.1080/20477724.2020.1838190
sha: 898cf7b3084253023cebc9fe79e4aee72937396e
doc_id: 890655
cord_uid: usr673cu

COVID-19, caused by SARS-CoV-2, is sweeping the world and posing serious health problems. Rapid and accurate detection, along with timely isolation, is the key to controlling the epidemic. Nucleic acid tests and antibody detection have been applied in the diagnosis of COVID-19, but both have limitations. By comparison, direct detection of viral antigens in clinical specimens is highly valuable for the early diagnosis of SARS-CoV-2. The nucleocapsid (N) protein is one of the predominantly expressed proteins, with high immunogenicity during the early stages of infection. Here, we applied multiple bioinformatics servers to predict the potential immunodominant regions of the SARS-CoV-2 N protein. Given the high homology of the N protein between SARS-CoV-2 and SARS-CoV, we attempted to leverage existing SARS-CoV immunological studies to develop SARS-CoV-2 diagnostic antibodies. Finally, N(229-269), N(349-399), and N(405-419) were predicted to be the potential immunodominant regions; they contain both predicted linear B-cell epitopes and murine MHC class II binding epitopes. These three regions exhibited good surface accessibility and hydrophilicity. All were predicted to be non-allergenic and non-toxic. The final construct was built on the basis of the bioinformatics analysis and could help to develop an antigen-capture system for the early diagnosis of SARS-CoV-2.
Since the outbreak of coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), 33,249,563 cases of infection and 1,000,040 COVID-19-related deaths had been reported as of 29 September 2020 (https://www.who.int/). A large proportion of infected people are asymptomatic or have only mild symptoms, which poses a major challenge in controlling the COVID-19 pandemic because some asymptomatic patients are still contagious [1]. Thus, early diagnosis and large-scale population screening for COVID-19 are urgently needed to contain the pandemic. RT-PCR (short for real-time reverse transcriptase-polymerase chain reaction) is the primary method for the detection of SARS-CoV-2 as well as other respiratory viruses [2, 3]. However, RT-PCR requires time-consuming and labor-intensive RNA preparation and professional operation, which makes on-site detection more difficult. Antibody testing for SARS-CoV-2 is another option for screening infected patients in high-prevalence areas. Because it takes time for hosts to generate antibodies against viruses, antibody detection is suitable for investigating population immunity but not for early diagnosis [4]. Hence, developing an appropriate antibody for viral antigen detection is highly valuable for pandemic containment.

Structural proteins have been regarded as important targets for antigen detection, such as the nucleoprotein of influenza virus, the p24 antigen of human immunodeficiency virus (HIV), and VP6 of rotavirus (RV) [5-7]. The coronavirus nucleocapsid (N) protein is a structural protein that plays a critical role in viral RNA replication [8]. According to studies of severe acute respiratory syndrome coronavirus (SARS-CoV), high levels of circulating N protein could be detected in the serum of SARS-CoV patients as soon as clinical symptoms appeared [9].
A comparison of the detection of SARS-CoV RNA, specific IgG, and N protein during the early period of illness showed that the detection capability of N protein was notably higher than that of the other two indicators [10]. This evidence suggests that the N protein could be an appropriate candidate for early diagnostic testing and screening of SARS-CoV-2.

Bioinformatics is a scientific field combining biology, computing, and information technology. It organizes large volumes of biological information to interpret the genome, transcriptome, and proteome systematically and accurately. It is extensively applied in immunodiagnostics, immunotherapeutics, and vaccine design [11-13]. Moreover, bioinformatics was used in previous studies to identify epitopes of SARS-CoV for raising neutralizing antibodies and diagnostic antibodies [14, 15]. Given the scarcity of biological data on the antigenic epitopes of SARS-CoV-2, bioinformatics is crucial in the early stages of exploring epitope information. Therefore, we used a bioinformatics and immunoinformatics approach to comprehensively deduce the potential immunodominant regions of the SARS-CoV-2 N protein. The complete study workflow is presented in Figure 1. Our study could provide an important complementary strategy for developing early diagnostic systems to combat the current pandemic.

N protein sequences of HCoV-NL63, HCoV-229E, HCoV-HKU1, HCoV-OC43, MERS-CoV, SARS-CoV, and SARS-CoV-2 were downloaded in FASTA format from the NCBI database. Sequence alignment of the N proteins of these seven coronaviruses was performed with Clustal Omega on the EMBL-EBI server. Clustal Omega uses seeded guide trees together with HMM profile-profile techniques to produce alignments between multiple sequences [16]. A phylogenetic analysis was also carried out to find evolutionary relationships among the seven coronaviruses; branch length represents the evolutionary distance between two nodes [17].
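To make the notion of sequence similarity in an alignment concrete, the following is a minimal sketch of a pairwise percent-identity calculation over two already-aligned sequences. It is not part of the Clustal Omega pipeline described above; the function name `percent_identity`, the choice to skip any column containing a gap, and the toy fragments are assumptions made for illustration only.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over columns where neither sequence has a gap.

    Assumption: gapped columns are excluded from the denominator. Other
    conventions (e.g. dividing by alignment length) give different values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy aligned fragments (hypothetical, not real N protein sequences).
toy_a = "MKT-A"
toy_b = "MRTLA"
print(percent_identity(toy_a, toy_b))  # 3 of the 4 compared columns match
```

In practice the two rows would be taken from the multiple sequence alignment exported by the alignment server, and the denominator convention should be stated alongside any reported identity value.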
ABCpred, BepiPred-2.0, and the Antibody Epitope Prediction online server in the Immune Epitope Database (IEDB) were used for linear B-cell epitope prediction. The ABCpred server ranks peptides by scores obtained from a trained recurrent neural network; a higher score indicates a higher probability that the peptide is an epitope [18]. In our study, we used an ABCpred cutoff of ≥0.80 (corresponding to 95.50% specificity) and a peptide length of 16 amino acids (the default window length) [18]. The BepiPred-2.0 server uses a random forest algorithm trained on epitopes annotated from antibody-antigen structures; residues with scores above the threshold are predicted to be part of an epitope. We used a threshold of ≥0.55 to achieve a specificity of 81.66% for epitope prediction [19]. IEDB is a repository of epitope-related information that also provides bioinformatics tools and algorithms [20]. The Antibody Epitope Prediction server was accessed from the B-cell prediction tool page in IEDB, and the threshold was set at 0.35 (the default threshold) [20].

To predict murine T-cell epitopes, we used the TepiTool resource in IEDB, which employs SMM, ANN, and combinatorial library methods [21]. Here, we set the method to 'IEDB recommended'. For MHC (Major Histocompatibility Complex) class I binding prediction, the selected model uses a consensus method comprising CombLib, ANN, and SMM [22]. In this study, we set the predicted consensus percentile rank to ≤1 and the peptide length to 9 amino acids. For MHC class II binding prediction, the selected model uses a consensus method comprising Sturniolo/Combinatorial Library, NN_align, and SMM_align [23]. The predicted consensus percentile rank was set to ≤10 and the peptide length to 15 residues.
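The fixed peptide lengths and cutoffs above can be illustrated with a minimal sketch of how overlapping candidate peptides are enumerated with 1-based coordinates and how a cutoff is applied. This is not the servers' code: the helper names `windows` and `passes` are hypothetical, the placeholder string stands in for a protein of an assumed 419 residues, and the example score and rank are invented values.

```python
from typing import Iterator, Tuple


def windows(sequence: str, length: int) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end, peptide) for every window, 1-based inclusive."""
    for i in range(len(sequence) - length + 1):
        yield i + 1, i + length, sequence[i:i + length]


def passes(value: float, cutoff: float, higher_is_better: bool) -> bool:
    """Apply a cutoff: scores use >=, percentile ranks use <=."""
    return value >= cutoff if higher_is_better else value <= cutoff


# Placeholder sequence (assumed length 419); not the real N protein.
placeholder = "X" * 419

fifteen_mers = list(windows(placeholder, 15))  # MHC class II peptide length
nine_mers = list(windows(placeholder, 9))      # MHC class I peptide length
print(len(fifteen_mers), fifteen_mers[-1][:2])
print(len(nine_mers))

# Hypothetical values, shown only to illustrate the direction of each cutoff.
print(passes(0.85, 0.80, higher_is_better=True))    # score-style cutoff
print(passes(12.0, 10.0, higher_is_better=False))   # percentile-rank cutoff
```

Note the opposite directions of the two kinds of cutoff: B-cell scores are kept when they are at or above the threshold, whereas a lower consensus percentile rank indicates a better predicted binder.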
The selected epitopes were submitted to VaxiJen v2.0 with a threshold of 0.40 (corresponding to 70% accuracy) to assess their antigenic propensity [24]. VaxiJen is an alignment-free method for antigen prediction. It relies on auto cross-covariance (ACC) transformation of protein sequences into uniform vectors of principal amino acid properties, with a prediction accuracy between 70% and 89%. The higher the score, the higher the likelihood of inducing an immune response [24]. Hydrophilicity was evaluated with ProtScale, using the Kyte & Doolittle model as the amino acid scale. This program uses a moving-segment approach: as the window advances along the sequence, it continuously computes the average hydrophilicity within a segment of predetermined length [25]. Surface accessibility of the predicted peptides was evaluated with NetSurfP-2.0, which uses an architecture of convolutional and long short-term memory neural networks trained on solved protein structures to predict the relative exposure of amino acids [26]. The server threshold was set to 25% exposure, but we retained only regions with RSA (Relative Surface Accessibility) ≥50%. The secondary structure was analyzed with SOPMA, which has a success rate of 73.2% for a three-state (α-helix, β-sheet, and aperiodic states) description of secondary structure [27]. Toxicity was assessed with the ToxinPred server, which is based on machine learning and a quantitative matrix that use various properties of peptide fragments to predict toxicity. The accuracy of the dipeptide-based model is 94.50% [28]. Allergenicity was assessed with the AllerTOP v2.0 server, which employs amino acid E-descriptors, auto- and cross-covariance transformation, and several machine learning methods for classification [29]. A Protein BLAST search was carried out to determine the possibility of cross-reactivity between the final construct and other proteins.
It can yield functional and evolutionary clues about amino acid sequences.

Human coronaviruses (HCoVs) include α-coronaviruses and β-coronaviruses. HCoV-229E and HCoV-NL63 belong to the former; HCoV-HKU1, HCoV-OC43, the Middle East respiratory syndrome-related coronavirus (MERS-CoV), SARS-CoV, and SARS-CoV-2 belong to the latter [30]. The N protein is relatively conserved among coronaviruses and has been successfully used as a diagnostic antigen [8, 31]. Amino acid sequences of the N proteins of these HCoVs were obtained from the NCBI database; their accession IDs are listed in Figure S1A. To better understand the divergence of N protein sequences between SARS-CoV-2 and other HCoVs, Clustal Omega was used to compare the full-length N protein sequences of the seven coronaviruses mentioned above. The result revealed that the N protein sequences of SARS-CoV-2 and SARS-CoV are highly similar. The evolutionary relationship of these species, based on their N protein sequences, is presented in the phylogenetic tree (Figure S1A & S1B; File S1).

Given the close homology of the N protein between SARS-CoV-2 and SARS-CoV, we analyzed the immunodominant antigenic regions of the SARS-CoV-2 N protein and compared them with existing immunological studies of SARS-CoV. We discarded epitopes that were identical in both viruses to enhance specificity. The full-length sequence of the SARS-CoV-2 N protein was evaluated with ABCpred, BepiPred-2.0, and IEDB. Antigenicity was calculated with VaxiJen v2.0 at a cutoff of ≥0.40. Using the ABCpred algorithm with a threshold of ≥0.80, we identified 24 peptides (Table S1). Fourteen peptides were obtained with BepiPred-2.0 at a cutoff of ≥0.55 (Table S2). In parallel, a total of 16 peptides were identified with IEDB (Table S3). Short peptides (fewer than six amino acids) were discarded.
To ensure specificity, peptides identical to the SARS-CoV N protein sequence were excluded. Finally, after stringent filtering, 11, 8, and 7 potential linear B-cell epitopes predicted by ABCpred, BepiPred-2.0, and IEDB, respectively, were retained (Table 1). After mapping the positions of the peptides identified by the three servers, we obtained eight regions containing epitopes predicted by different servers (Figure 2).

To further evaluate the potential of these eight antigenic regions as targets for antibody binding, we analyzed their hydrophilicity, surface accessibility, and secondary structure. All eight sequences were predicted to be hydrophilic by ProtScale (Figure S2A). With RSA ≥50%, NetSurfP-2.0 predicted four main regions with good surface accessibility (Figure S2B). After comparing the predicted peptides with the hydrophilic regions and the surface-accessible regions, we eliminated N(58-106), N(275-290), and N(327-349) because of their surface inaccessibility, which might sterically hinder antibody access (Figure 2). At this point, five predicted regions remained: N(1-51), N(164-221), N(229-269), N(361-399), and N(408-416).

Considering that structures such as beta-turns and random coils are more conducive to binding the specific B-cell receptor (BCR), we used the SOPMA online software to predict the secondary structure of the N protein (Figure S2C). To preserve the structural integrity of the predicted epitopes, we adjusted the termini of the selected regions by adding several amino acid residues at the ends where appropriate. Combining the results of the hydrophilicity, surface accessibility, and secondary structure analyses, we extended N(164-221) to N(161-221) and N(361-399) to N(354-399). Thus, N(1-51), N(161-221), N(229-269), N(354-399), and N(408-416) were selected; these regions contained several potential B-cell epitopes with high antigenicity scores.
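The moving-segment averaging that underlies the hydrophilicity analysis can be sketched as follows. This is an illustrative reimplementation, not ProtScale itself: the window size of 9 and the toy peptide are assumptions (the window used in the study is not stated), and no edge weighting or normalization is applied. On the Kyte & Doolittle scale, positive values indicate hydrophobic residues, so hydrophilic stretches appear as negative window averages.

```python
from typing import List

# Kyte & Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 9) -> List[float]:
    """Unweighted mean hydropathy for each window along the sequence."""
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KYTE_DOOLITTLE[residue] for residue in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Toy peptide (hypothetical, chosen only to exercise the function).
toy_peptide = "KDGSRNEQTPGSKD"
for start, score in enumerate(hydropathy_profile(toy_peptide, window=9), start=1):
    label = "hydrophilic" if score < 0 else "hydrophobic"
    print(f"window starting at residue {start}: {score:.2f} ({label})")
```

Each value describes a whole window, so a profile has `len(sequence) - window + 1` points; plotting tools usually assign each value to the residue at the window center.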
To predict potential murine T-cell epitopes, we used the TepiTool resource incorporated in IEDB. Thirteen MHC class I binding peptides and 16 MHC class II binding peptides were identified (Table S4). All predicted T-cell epitopes that overlapped with the selected B-cell epitope regions are shown in Table S5. Theoretically, MHC-I molecules promote the activation of cytotoxic T lymphocytes (CTLs), which kill virus-infected cells [32]. Helper T cells (Th) activated by MHC-II-presented epitopes can provide essential signals to B cells for antibody production [15]. The predicted murine MHC class II binding epitope N(244-258) was included in our predicted B-cell epitope region N(229-269); N(357-371) and N(384-398) were contained in the predicted region N(354-399); and the predicted B-cell epitope N(408-416) was contained in the predicted murine T-cell epitope N(405-419). Therefore, N(229-269), N(354-399), and N(405-419) were finally chosen because they contained potential murine MHC class II binding epitopes but no murine MHC class I binding epitopes, which makes them more conducive to antibody production.

Immunome Browser 3.0 in IEDB compiles the records of existing reference sequences. It produces a response frequency (RF) score that indicates how often residues occur in positive epitopes across independent experimental records [33, 34]. A scan with Immunome Browser 3.0 showed that none of the predicted murine MHC class II binding epitopes of the SARS-CoV-2 N protein had yet been confirmed. We also retrieved the murine MHC class II binding epitopes of SARS-CoV with Immunome Browser 3.0 and found that N(351-365) (Epitope ID: 69035) and N(353-365) (Epitope ID: 985589) in the SARS-CoV N protein had been experimentally verified as murine MHC class II binding epitopes [15, 35]. They correspond to the identical sequences N(350-364) and N(352-364) in SARS-CoV-2, respectively.
Moreover, we noticed that Val350 and Leu352 were located in an extended strand structure. Thus, to preserve the integrity of the secondary structure, we expanded the predicted region N(354-399) to N(349-399). The allergenicity of the selected regions was assessed with AllerTOP v2.0, and all identified regions were predicted to be non-allergenic. The toxicity of the three predicted regions was examined with ToxinPred, and all of them were predicted to be non-toxic (Table 2).

The three predicted regions (N(229-269), N(349-399), and N(405-419)) were connected using flexible (GGGGS)2 linkers. The flexible (GGGGS)2 linker is effective at separating protein fragments, maintaining biological activity, and promoting protein expression [36]. The Pan DR epitope PADRE (AKFVAAWTLKAAA) has been shown to function as a universal T helper epitope that can induce specific high-titer antibodies and lasting antibody responses [37, 38]. Hence, we added it to the N-terminus of our construct to boost the humoral immune response (Figure 3). The Protein BLAST result showed little similarity between the final construct and any known encoded protein, suggesting that antibodies raised against the immune fragment designed in our study are unlikely to cross-react with peptides other than those of SARS-CoV-2. This will be confirmed experimentally in a follow-up study.

h, alpha helix; e, extended strand; t, beta turn; c, random coil. The predicted epitopes, the hydrophilic regions, and the surface-accessible areas are marked in bold, italic, and underline, respectively.

In this study, we used multiple bioinformatics and immunoinformatics approaches to predict potential immunodominant regions of the SARS-CoV-2 N protein. Although several bioinformatic predictions of potential epitopes for SARS-CoV-2 have been reported, these studies mainly focused on vaccine development [33, 39-41]. Compared to these studies, we employed distinctive strategies.
As we aimed to develop diagnostic antibodies against SARS-CoV-2, the mouse was selected as the host species for MHC class II binding epitope prediction. Additionally, peptides that shared identical sequences with SARS-CoV were excluded to enhance specificity. Nevertheless, the results of this study were derived from computational algorithms. Whether the antibody can effectively bind SARS-CoV-2 N protein in clinical samples, and whether the performance of the assay (cross-reactivity, sensitivity, and specificity) reaches the national standard, remains to be verified by in vitro and in vivo experiments.

The N protein has been reported to be a good diagnostic antigen because of its high immunogenicity and abundance during coronavirus infection [8, 31]. The N protein of the influenza virus is also the main target in antigen-detection tests [42, 43]. Influenza antigen detection was especially useful for patients tested within the first 48 hours of illness, when the influenza viral load in the upper respiratory tract was high [44]. Accordingly, we attempted to develop an efficient and accurate N protein-detection assay for SARS-CoV-2. In a follow-up study, newly developed materials such as fluorescent dyes and nanoparticles could be adopted to improve the sensitivity of detection [5]. Influenza virus infection causes respiratory symptoms similar to those of COVID-19, which makes it difficult to distinguish the two diseases by symptoms [45]. Hence, we compared the N protein sequences of the influenza virus and SARS-CoV-2 (File S2). The results showed that the sequence similarity is low, which suggests that direct detection of N protein could distinguish COVID-19 from influenza virus infection. In addition, we noticed that the full-length N protein of SARS-CoV-2 may cross-react with the serum of patients infected with SARS-CoV [4], whereas truncated protein has been shown to reduce cross-reactivity without reducing sensitivity [46, 47].
Hence, we chose to use a truncated recombinant protein rather than the full-length N protein for developing diagnostic antibodies. Recently, cladistic studies based on the N protein sequence of SARS-CoV-2 have been reported [48-51]. None of the reported mutations are located in the immunodominant regions selected in the current study (data not shown). Nevertheless, the diagnostic efficiency of monoclonal antibodies (mAbs) derived from fragments of the SARS-CoV-2 N protein needs to be evaluated experimentally. The SARS-CoV-2 N protein displayed almost the same distribution of hydrophilic regions as the SARS-CoV N protein, which is readily explained by their sequence homology. Yu and colleagues demonstrated that N(122-422) incorporated the main immunogenic sites of the SARS-CoV N protein and could be used for efficient diagnosis [52]. Interestingly, the fragments selected in our study were all located within the corresponding region of the SARS-CoV-2 N protein, suggesting an advantage of these fragments in generating optimal diagnostic antibodies.

In conclusion, we identified three potential immunodominant regions of the SARS-CoV-2 N protein, N(229-269), N(349-399), and N(405-419), that contain both linear B-cell epitopes and murine MHC class II binding epitopes. A construct of 150 amino acids was built (Figure 3). The final construct consists of seven B-cell epitopes, six murine MHC class II binding epitopes, and a PADRE sequence (Table 2). After cloning, expression, and purification of the recombinant protein derived from this study, we will immunize Balb/c mice to generate mAbs with the hybridoma technique. The cross-reactivity, reactivity, specificity, and titer of the mAbs will be further evaluated. After their biological functions are confirmed, these mAbs would be used to develop an antigen-capture-based assay system for the early diagnosis of SARS-CoV-2.
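The reported construct length can be checked with simple arithmetic on the region coordinates, the PADRE sequence, and the (GGGGS)2 linker. The sketch below assumes a layout in which PADRE sits at the N-terminus and one linker precedes each of the three regions; this layout is an assumption, since the exact arrangement is given only in Figure 3, and the names `region_length` and `construct_length` are hypothetical.

```python
from typing import List, Tuple

PADRE = "AKFVAAWTLKAAA"   # universal T helper epitope from the text
LINKER = "GGGGS" * 2      # the (GGGGS)2 flexible linker

# Selected regions as 1-based inclusive coordinates on the N protein.
REGIONS: List[Tuple[int, int]] = [(229, 269), (349, 399), (405, 419)]


def region_length(start: int, end: int) -> int:
    """Length of a region given 1-based inclusive coordinates."""
    return end - start + 1


def construct_length(regions: List[Tuple[int, int]], helper: str, linker: str) -> int:
    """Total length, assuming helper + (linker + region) for each region."""
    return len(helper) + sum(
        len(linker) + region_length(start, end) for start, end in regions
    )


for start, end in REGIONS:
    print(f"N({start}-{end}): {region_length(start, end)} residues")
print(construct_length(REGIONS, PADRE, LINKER))  # 150
```

Under this assumed layout the regions contribute 41, 51, and 15 residues, PADRE contributes 13, and three linkers contribute 30, which agrees with the 150 amino acids reported for the construct.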
[1] Transmission of 2019-nCoV infection from an asymptomatic contact in Germany
[2] Molecular diagnosis of respiratory viruses
[3] Molecular diagnosis of a novel Coronavirus (2019-nCoV) causing an outbreak of Pneumonia
[4] Severe acute respiratory syndrome Coronavirus 2-specific antibody responses in Coronavirus disease 2019 patients
[5] Current approaches for diagnosis of influenza virus infections in humans
[6] Enzyme-linked immunosorbent assay based on recombinant human group C rotavirus inner capsid protein (VP6) to detect human group C rotaviruses in fecal samples
[7] Diagnosing acute HIV infection at point of care: a retrospective analysis of the sensitivity and specificity of a fourth-generation point-of-care test for detection of HIV core protein p24
[8] The coronavirus nucleocapsid is a multifunctional protein
[9] Nucleocapsid protein as early diagnostic marker for SARS
[10] Detection of the nucleocapsid protein of severe acute respiratory syndrome coronavirus in serum: comparison with results of other viral markers
[11] An overview of bioinformatics tools for epitope prediction: implications on vaccine development
[12] A whole-genome bioinformatics approach to selection of antigens for systematic antibody generation
[13] Gapped sequence alignment using artificial neural networks: application to the MHC class I system
[14] Identification of an epitope of SARS-coronavirus nucleocapsid protein
[15] Identification and characterization of dominant helper T-cell epitopes in the nucleocapsid protein of severe acute respiratory syndrome coronavirus
[16] The EMBL-EBI search and sequence analysis tools APIs in 2019
[17] Molecular phylogenetics: state-of-the-art methods for looking into the past
[18] Prediction of continuous B-cell epitopes in an antigen using recurrent neural network
[19] BepiPred-2.0: improving sequence-based B-cell epitope prediction using conformational epitopes
[20] IEDB-AR: immune epitope database-analysis resource in 2019
[21] TepiTool: a pipeline for computational prediction of T cell epitope candidates
[22] Peptide binding predictions for HLA DR, DP and DQ molecules
[23] A systematic assessment of MHC class II peptide binding predictions and evaluation of a consensus approach
[24] VaxiJen: a server for prediction of protective antigens, tumour antigens and subunit vaccines
[25] A simple method for displaying the hydropathic character of a protein
[26] NetSurfP-2.0: improved prediction of protein structural features by integrated deep learning
[27] SOPMA: significant improvements in protein secondary structure prediction by consensus prediction from multiple alignments
[28] In silico approach for predicting toxicity of peptides and proteins
[29] AllerTOP v.2-a server for in silico prediction of allergens
[30] Origin and evolution of pathogenic coronaviruses
[31] The carboxyl-terminal 120-residue polypeptide of infectious bronchitis virus nucleocapsid induces cytotoxic T lymphocytes and protects chickens from acute infection
[32] Understanding the T cell immune response in SARS coronavirus infection
[33] A sequence homology and bioinformatic approach can predict candidate targets for immune responses to SARS-CoV-2
[34] ImmunomeBrowser: a tool to aggregate and visualize complex and heterogeneous epitopes in reference proteins
[35] Airway memory CD4(+) T cells mediate protective immunity against emerging respiratory coronaviruses
[36] Fusion protein linkers: property, design and functionality
[37] Linear PADRE T helper epitope and carbohydrate B cell epitope conjugates induce specific high titer IgG antibody responses
[38] Potent immunogenic short linear peptide constructs composed of B cell epitopes and Pan DR T helper epitopes (PADRE) for antibody responses in vivo
[39] Preliminary identification of potential vaccine targets for the COVID-19 coronavirus (SARS-CoV-2) based on SARS-CoV immunological studies
[40] In silico identification of vaccine targets for 2019-nCoV
[41] Bioinformatics analysis of epitope-based vaccine design against the novel SARS-CoV-2
[42] Detection methods of human and animal influenza virus-current trends
[43] Rapid diagnostic tests for influenza
[44] Very high sensitivity of a rapid influenza diagnostic test in adults and elderly individuals within 48 hours of the onset of illness
[45] Modeling the onset of symptoms of COVID-19
[46] Production of specific antibodies against SARS-coronavirus nucleocapsid protein without cross reactivity with human coronaviruses 229E and OC43
[47] The SARS-CoV nucleocapsid protein: a protein with multifarious activities
[48] Comprehensive analyses of SARS-CoV-2 transmission in a public health virology laboratory
[49] A distinct phylogenetic cluster of Indian SARS-CoV-2 isolates
[50] SARS-CoV2 genome analysis of Indian isolates and molecular modelling of D614G mutated spike protein with TMPRSS2 depicted its enhanced interaction and virus infectivity
[51] Genomic variations in SARS-CoV-2 genomes from Gujarat: underlying role of variants in disease epidemiology
[52] Recombinant truncated nucleocapsid protein as antigen in a novel immunoglobulin M capture enzyme-linked immunosorbent assay for diagnosis of severe acute respiratory syndrome coronavirus infection

We appreciate the help of Dr. Hezhi Fang and Dr. Xiang Wu in proofreading the manuscript. Special thanks to Xiaoke Huang for her revision of the language of the manuscript.

The authors declare no conflict of interest.