key: cord-1040997-ly6jfukc authors: Kruglikov, Alibek; Rakesh, Mohan; Wei, Yulong; Xia, Xuhua title: Applications of Protein Secondary Structure Algorithms in SARS-CoV-2 Research date: 2021-02-22 journal: J Proteome Res DOI: 10.1021/acs.jproteome.0c00734 sha: 4a4e4b09e9f0e29be19e54cc74ab554f8e4bf6d5 doc_id: 1040997 cord_uid: ly6jfukc
Since the outset of COVID-19, the pandemic has prompted immediate global efforts to sequence SARS-CoV-2, and over 450 000 complete genomes have been publicly deposited over the course of 12 months. Despite this, comparative nucleotide and amino acid sequence analyses often fall short in answering key questions in vaccine design. For example, the binding affinity between different ACE2 receptors and the SARS-CoV-2 spike protein cannot be fully explained by amino acid similarity at ACE2 contact sites because protein structure similarities are not fully reflected by amino acid sequence similarities. To comprehensively compare protein homology, secondary structure (SS) analysis is required. While protein structure is slow and difficult to obtain, SS predictions can be made rapidly, and a well-predicted SS may serve as a viable proxy to gain biological insight. Here we review algorithms and information used in predicting protein SS to highlight its potential application in pandemics research. We also show examples of how SS predictions can be used to compare ACE2 proteins and to evaluate the zoonotic origins of viruses. As computational tools are much faster than wet-lab experiments, these applications can be important for research, especially when quickly obtained biological insights can help speed up the response to pandemics.
Since the outbreak of COVID-19 in late December of 2019, more than 450 000 full genomes of SARS-CoV-2 have been sequenced and deposited in the GISAID database (https://www.gisaid.org/, last accessed February 1, 2021).
Both SARS-CoV-2 1 and SARS-CoV 2−4 encode a Spike (S) protein, hereafter referred to as SARS-2-S and SARS-S, respectively. The S1 receptor binding domain (RBD) binds to the host angiotensin-converting enzyme 2 (ACE2) receptor to mediate cell entry. The efficacy of this interaction determines host specificity and severity of infection. 4−6 Given a mammalian species, a high similarity between human ACE2 (hACE2) and mammalian ACE2 at S protein contact sites implies high susceptibility, and one can expect to determine species susceptibility to SARS-CoV or SARS-CoV-2 infections by comparative amino acid sequence analyses at the contact sites of the ACE2 receptors.
TO UNDERSTAND HOST SUSCEPTIBILITY TO SARS-COV-2
The above expectation, while largely correct, is not completely accurate. For example, of the 18 amino acid (aa) sites in contact between hACE2 and the RBD of SARS-S, nine differ between ferret ACE2 and hACE2, yet both ferret ACE2 and hACE2 are effective receptors for binding to the RBD and mediating viral entry into host cells. In contrast, ACE2 from mouse and rat also differs from hACE2 at nine aa sites, but it cannot support viral RBD binding and viral entry. 2 This discrepancy suggests two simple explanations. First, aa sites beyond the 18 contact sites may also contribute to structural interactions, and those sites might be more similar between hACE2 and ferret ACE2 than between hACE2 and mouse and rat ACE2. Second, structural similarity is not fully reflected in sequence similarity; i.e., structural similarity between hACE2 and ferret ACE2 may be greater than that between hACE2 and mouse and rat ACE2. Only through structural studies can we hope to gain mechanistic insights into the differences in mammalian susceptibility to SARS-CoV-2. However, protein structure is difficult to obtain, and well-predicted protein secondary structure (SS) may serve as the next best answer.
The Protein Data Bank (PDB) is the main repository of experimentally determined 3D protein structures, with around 160 000 structures deposited. 7 In comparison, over 216 million aa sequences were available in the NCBI GenBank database as of May 2020. 8 This disparity arises because experimental determination of structures is an expensive and lengthy process. 9, 10 In silico structure prediction techniques are faster and cheaper, and they have been useful in many research areas. For example, SS predictions have been used in enzyme structure similarity calculations, 11 ribosomal protein comparison, 12 protein activity mechanisms, 13 COVID-19 proteomics, 14 and many other areas. In section 3 we review examples of protein secondary structure prediction (PSSP) algorithms, and in section 4 we review their practical uses in pandemics research. In section 5, we describe examples of our own PSSP analyses of S protein-ACE2 binding to study species' susceptibility to SARS-CoV and SARS-CoV-2. The examples described in this review highlight how PSSP can be a useful tool in pandemics research.
In protein structure models, aa sequences are used to predict secondary and tertiary protein structures. SS is usually classified into either three or eight states. Early PSSP models predicted three SS states: helix (H), strand (E), and coil (C), whereas in recent years PSSP models have shifted to predicting eight states. Figure 1 summarizes PSSP programs developed over the years. In addition to PSSP, protein structures can be modeled at the 2D level as contact maps 15 and at the 3D level as tertiary structures. 16, 17 While modeling in 2D or 3D is appealing, there are several reasons why PSSP can be practical. First, unlike 2D or 3D structures, a PSSP result is reported as a sequence and can be used together with aa chains in multiple sequence alignments.
This makes PSSP modeling useful in identifying proteins that might be more similar in structure than in nucleotide or aa sequence. Second, the sequential nature allows alignment of SS elements with known or exploratory protein hotspots. Lastly, PSSP is faster and less computation-heavy than 3D prediction.
Typically, three metrics are used to evaluate the accuracy of PSSP programs: Q3, Q8, and Segment Overlap (SOV) scores. Q3 and Q8 represent the percentages of SS sequence positions correctly predicted by the models using three or eight structure states, respectively. SOV is a more complex measure that represents the percentage of segment overlap between predicted and correct sequences. Different protein databases can be used for the evaluation, and the best practice is to use multiple data sets. Tables 1 and 2 show a collection of different PSSP models' accuracies calculated using various protein data sets. 27−33 Note that models are continually retrained with new protein structures, so there are discrepancies in reported accuracy values. Also, depending on the data sets and metrics used, the results of comparisons between PSSP programs vary.
In addition to prediction accuracy, it is important to consider the programs' usability and their limitations. While some programs are readily available through web servers, predictions through servers are often limited by sequence length or number. For example, Mufold-SS only accepts sequences of up to 700 aa and Jpred4 only accepts sequences of up to 800 aa. In addition, most web servers only allow prediction of one protein sequence at a time, which is often impractical when working with a large number of sequences. Standalone versions of the programs do not have the restrictions of the web servers.
Lu et al. 34 explored the structure of the SARS-CoV nsp5 gene. Using SARS-CoV strain GD as the reference, comparative sequence analyses of nsp5 across 110 strains showed that five strains carried nsp5 mutations.
Secondary structure predictions were performed for the five mutated strains using PSIPRED, and the analysis showed that all five had identical predicted secondary structure. This implies that nsp5-encoded proteins retain a conserved structure and may be a better therapeutic target than more rapidly evolving genes.
Bull et al. 35 examined RNA polymerase and capsid protein similarities in five norovirus genogroups, of which the GII.4 genogroup was associated with global outbreaks of acute gastroenteritis. To evaluate whether this highly pathogenic genogroup had a greater epidemiological fitness than the other four genogroups, mutation rates at the RNA polymerase and capsid secondary structures were modeled using the CPHmodels Server. 36 The PSSP model revealed that the 15 varying amino acid residues on the capsid were located on the exposed loops in GII.4. Moreover, more pathogenic genogroups were structurally more similar to GII.4 than less pathogenic ones.
Seniya et al. 37 studied the potential of the Boesenbergia pandurata metabolite 4-hydroxy panduratin A to inhibit the spread of Influenza A H1N1 (swine flu) infection. Influenza has two major surface proteins, neuraminidase (NA) and hemagglutinin (HA), that facilitate viral entry into the host cell. To evaluate the potential of 4-hydroxy panduratin A to dock into active binding pockets of H1N1 NA, a homology-based protein structure prediction program, Modeller 9.10, 38 was used. In addition, I-TASSER 39 prediction was used in combination with ab initio modeling methods. These steps required secondary structure templates, which were predicted using the PSIPRED server and rated using Z scores in LOMETS. 40 The combination of PSSP and I-TASSER enabled the downstream analysis of protein interactions between the viral NA and the plant metabolite.
Sarkar et al. 16 examined the Avian Influenza A (H7N9) hemagglutinin (HA) protein to determine conserved HA regions that could serve as potential peptide vaccines. As mentioned above, HA is one of the two major surface proteins that facilitate viral entry into host cells. In addition, HA can elicit an antibody response during infection. The PSSP server SABLE 41 was used to predict accessible surface area (ASA) in 120 HA sequences from H7N9 strains, and Jpred 42 and HHpred 43 were used to verify the results. ASA, like secondary structure, is a 1D prediction; the aa sequence is converted to a sequence of numerical values between 0 and 100 that describe the solvent accessibility of each aa site. Eight highly accessible regions were predicted by ASA and, through epitope prediction, four regions were found to have promising immunogenic potential.
Strong binding between SARS-2-S and the host ACE2 receptor is crucial for viral entry into host cells. This interaction has been extensively explored by experimental research as a COVID-19 vaccine target and by computational research aiming to design competitive binding peptides 44 that could open new avenues for COVID-19 treatment. Using the computational tools EvoEF2 45 and EvoDesign, 46 Huang et al. 44 designed peptide sequences that potentially bind competitively to SARS-2-S to limit viral entry. On the basis of a hACE2 structure template, they explored thousands of peptide designs through 3D modeling and selected the best candidates by SARS-2-S binding affinity, scored using PSSP performed in EvoDesign. The computational nature of this study allowed results to be obtained rapidly; currently, the computationally designed peptides are being evaluated experimentally. 44
Focusing on SARS-CoV-2, we tested the ability of several PSSP programs to predict the SS of hACE2 and the SARS-2-S S1 domain.
We used experimentally derived SS from structures available in the PDB (ACE2: 1r42:A, 6m0j:A, 6m18:B, 6m1d:B, and 6m17:B; S1: 6vxx:A, 6vyb:A, 6m0j:E, and 6m17:E) to compare with SS predictions. Table 3 shows that the accuracy metrics of SS predicted for ACE2 and for SARS-2-S S1 were much lower than the test scores in Tables 1 and 2, possibly because membrane protein structures are hard to predict. Another possible reason is that the training data used for the PSSP programs were not specific enough to predict ACE2 and S1 proteins more accurately. The Q8 results for PSIPRED and JPRED4, which only predict three structure states, were expected to be lower than those of PORTER5 and MUFOLD-SS, which predict eight structure states. However, Q8 results were similar for all four programs (Table 3), possibly because the extra types of secondary structure are rare in the studied proteins.
As previously mentioned, mammalian susceptibility to SARS-CoV cannot always be accurately predicted by differences in ACE2 aa sequences. This problem can be viewed as a mismatch between empirical and theoretical results, which we attempt to explain using ACE2 PSSP instead of aa sequences. To show that PSSP can circumvent this mismatch, we report in Table 4 the P_distance, a measurement of differences in predicted SS between hACE2 and other species' ACE2. Here, we chose Mufold-SS to predict ACE2 SS (Table 3). P_distance is based on Q3 and Q8 scores, and the formula used for calculation is shown in eq 1, where M is the number of residues that are the same in both windows and L is the sequence length (analogous to Q3/Q8 evaluations). Mufold-SS can be robust with three states but not with eight states, as it assumes equal weight for all SS differences. Hence, all calculated P_distances (Table 4) were based on three-state SS predictions. The P_distance shows that SS variations better explain patterns of SARS-CoV infectivity than hotspot aa differences.
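As a minimal sketch of the quantities discussed above, the Python below computes a Q3/Q8-style accuracy and a P_distance between two aligned SS strings. Eq 1 is not reproduced in this text, so the form P_distance = 1 − M/L is an assumption inferred from the description of M and L. It is consistent with the values quoted here (for example, 38 mismatches over 805 positions gives 0.0472, if L is taken to be the 805-residue length of hACE2, which is also an inference). The function names, the eight-to-three-state mapping, and the requirement of equal-length ungapped input are illustrative choices and are not taken from the authors' code.

```python
from typing import Dict

# Conventional DSSP eight-state to three-state reduction. This mapping is an
# assumption: other reductions exist, and the source does not say which was used.
EIGHT_TO_THREE: Dict[str, str] = {
    "H": "H", "G": "H", "I": "H",             # helix types
    "E": "E", "B": "E",                       # strand and isolated bridge
    "T": "C", "S": "C", "C": "C", "-": "C",   # turn, bend, coil
}


def to_three_state(ss8: str) -> str:
    """Collapse an eight-state SS string to the H/E/C alphabet."""
    return "".join(EIGHT_TO_THREE[state] for state in ss8)


def _check(a: str, b: str) -> None:
    """Both strings must be non-empty and aligned position by position."""
    if not a or len(a) != len(b):
        raise ValueError("SS strings must be non-empty and of equal length")


def q_score(predicted: str, reference: str) -> float:
    """Percentage of positions whose predicted state matches the reference.

    Gives Q3 on three-state strings and Q8 on eight-state strings.
    """
    _check(predicted, reference)
    matches = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * matches / len(reference)


def p_distance(ss_a: str, ss_b: str) -> float:
    """Assumed form of eq 1: (L - M) / L, i.e. 1 - M/L.

    M is the number of positions with the same state in both strings and
    L is the string length. Every mismatch is weighted equally.
    """
    _check(ss_a, ss_b)
    mismatches = sum(a != b for a, b in zip(ss_a, ss_b))
    return mismatches / len(ss_a)


if __name__ == "__main__":
    # Toy three-state strings, not real ACE2 predictions.
    human = "CCHHHHHHHHCCEEEECC"
    other = "CCHHHHHHHCCCEEEECC"
    print(round(p_distance(human, other), 4))  # 0.0556 (1 mismatch in 18)
```

Because every mismatch counts equally in this sketch, a change between two helix types would weigh as much as a helix-to-strand change in the eight-state alphabet, which is one plausible reason for collapsing to three states before computing the distance.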
First, unlike differences in ACE2 aa, differences in ACE2 SS corroborate the finding that rats 47 are less susceptible to SARS-CoV than palm civets 48 and mice, 49 with P_distances of 0.0509 (rats) vs 0.0472 (palm civets and mice). Second, ACE2 SS explains why Chinese horseshoe bats (P_distance = 0.0335) are more susceptible to SARS-CoV than Pearson's horseshoe bats (P_distance = 0.0410). 50 Nonetheless, our findings cannot be generalized further, as not all patterns of infectivity are explained through P_distance. For example, P_distance cannot explain why palm civets (0.0472) are more susceptible to SARS-CoV than Pearson's horseshoe bat (0.0410). 48, 50
To further examine the ACE2 of the species shown in Table 4, we calculated aa sequence similarities using the Lake94 51 phylogenetic distance with hACE2 as the reference. Indeed, with respect to hACE2, aa sequence similarities as measured by Lake94 poorly reflect similarities at the SS level as measured by P_distance in many species (Figure 2: R² = 0.179, P = 0.150); one example is Rhinolophus sinicus. We next performed multiple sequence alignment (MSA) using MAFFT 52 on the ACE2 aa sequence and on the predicted ACE2 SS sequence of Rhinolophus sinicus (highlighted in red in Figure 2). Hotspot sites were highlighted in the alignment, representing hACE2 sites S19, Q24, D30, K31, H34, E35, E37, D38, Y41, Q42, L79, M82, Y83, K353, and R393 that form contacts with SARS-2-S at sites K417, G446, Y449, L455, F456, A475, F486, N487, Y489, Q498, T500, N501, G502, and Y505, as previously identified through X-ray crystallography experiments. 53, 54 Rhinolophus sinicus ACE2 seems to be more conserved at hotspot locations (boxed in light blue) than in other regions at the SS level (Figure 3). Furthermore, the lack of SS differences at some aa substitution sites can be explained by the nature of the substitutions: some aa substitutions are considered conservative because the residues involved have similar physicochemical properties. 55 Indeed, Figure 3 ); these amino acids have similar properties and reduced substitution effects on predicted SS folding. On the other hand, some regions have many SS differences but relatively conserved aa (Figure 3: boxed in light red). One explanation for this discrepancy is that aa substitutions may influence SS at distant loci rather than closer ones because of the complexities of hydrogen bond formation. Moreover, lysine has been reported as a preferred amino acid at the C-terminus of proteins for α-helix formation, 56 and reduced helix stabilization in the light red region could be caused by the K → N substitution.
Here we reviewed potential applications of PSSP programs to gain biological insights. These fast methods can help obtain important answers as an immediate response in pandemics research. Because some mutations, especially substitutions, might not induce structural changes, analysis of SS expands upon analysis of aa. In this review, we evaluated some of the current PSSP programs and discussed PSSP applications in pandemics research. Additionally, we offered examples of PSSP analyses with a focus on SARS-CoV and SARS-CoV-2. Because coronavirus infection is achieved through binding between the viral Spike protein and the host ACE2 receptor, mammals with similar ACE2 structures could potentially be susceptible to these viruses. To identify ACE2 similarities between mammals and humans, comparisons were made at the aa and SS levels. We showed that variations between predicted SS are not always consistent with variations in the corresponding aa sequences. Specifically, differences at the aa level rarely led to different SS at ACE2 hotspot locations in Rhinolophus sinicus.
A Pneumonia Outbreak Associated with a New Coronavirus of Probable Bat Origin
Bat-to-Human: Spike Features Determining 'Host Jump' of Coronaviruses SARS-CoV, MERS-CoV, and Beyond
Coronavirus Spike Protein and Tropism Changes
Priming Time: How Cellular Proteases Arm Coronavirus Spike Proteins. Activation of Viruses by Host Proteases
The Spike Glycoprotein of the New Coronavirus 2019-NCoV Contains a Furin-like Cleavage Site Absent in CoV of the Same Clade
The Proximal Origin of SARS-CoV-2
Protein Data Bank (PDB): The Single Global Macromolecular Structure Archive
Lessons from Structural Genomics
Anticipating the $1,000 Genome
Structure and Functional Analysis of the Legionella Pneumophila Chitinase ChiA Reveals a Novel Mechanism of Metal-Dependent Mucin Degradation
Structures of the Human and Drosophila 80S Ribosome
A Genetically Encoded Photoactivatable Rac Controls the Motility of Living Cells
Computational predictions of protein structures associated with COVID-19
Protein Contact Map Prediction Basic Characterization
H7N9 Influenza Outbreak in China 2013: In Silico Analyses of Conserved Segments of the Hemagglutinin as a Basis for the Selection of Peptide Vaccine Targets
Prediction of Protein Conformation
Protein Secondary Structure Prediction Based on the GOR Algorithm Incorporating Multiple Sequence Alignment Information
Prediction of Protein Secondary Structure by the Hidden Markov Model
Protein Secondary Structure Prediction Using Nearest-Neighbor Methods
A Novel Method of Protein Secondary Structure Prediction with High Segment Overlap Measure: Support Vector Machine Approach
PHD-an Automatic Mail Server for Protein Secondary Structure Prediction
The PSIPRED Protein Structure Prediction Server
JPred4: A Protein Secondary Structure Prediction Server
Protein 8-Class Secondary Structure Prediction Using Conditional Neural Fields
Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields
Art Ab Initio Prediction of Protein Secondary Structure in 3 and 8 Classes. bioRxiv
Capturing Non-Local Interactions by Long Short-Term Memory Bidirectional Recurrent Neural Networks for Improving Prediction of Protein Secondary Structure, Backbone Angles, Contact Numbers and Solvent Accessibility
Prediction of 8-State Protein Secondary Structures by a Novel Deep Learning Architecture
Sixty-Five Years of the Long March in Protein Secondary Structure Prediction: The Final Stretch?
Protein Secondary Structure Prediction: A Review of Progress and Directions
Sequence Analysis and Structural Prediction of the Severe Acute Respiratory Syndrome Coronavirus Nsp5
Rapid Evolution of Pandemic Noroviruses of the GII.4 Lineage
CPH Models 2.0: X3M a Computer Program to Extract 3D Models
In-Silico Modelling and Identification of a Possible Inhibitor of H1N1 Virus
Comparative Protein Modelling by Satisfaction of Spatial Restraints
The I-TASSER Suite: Protein Structure and Function Prediction
LOMETS: A Local Meta-Threading-Server for Protein Structure Prediction
Combining Prediction of Secondary Structure and Solvent Accessibility in Proteins
JPred: A Consensus Secondary Structure Prediction Server
The HHpred Interactive Server for Protein Homology Detection and Structure Prediction
De Novo Design of Protein Peptides to Block Association of the SARS-CoV-2 Spike Protein with Human ACE2
EvoEF2: Accurate and Fast Energy Function for Computational Protein Design
EvoDesign: Designing Protein−Protein Binding Interactions Using Evolutionary Interface Profiles in Conjunction with an Optimized Physical Energy Function
Adaptation of SARS Coronavirus to Humans
Isolation and Characterization of Viruses Related to the SARS Coronavirus from Animals in Southern China
Efficient Replication of Severe Acute Respiratory Syndrome Coronavirus in Mouse Cells Is Limited by Murine Angiotensin-Converting Enzyme 2
Angiotensin-Converting Enzyme 2 (ACE2) Proteins of Different Bat Species Confer Variable Susceptibility to SARS-CoV Entry
Reconstructing Evolutionary Trees from DNA and Protein Sequences: Paralinear Distances
MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability
Structure of the SARS-CoV-2 Spike Receptor-Binding Domain Bound to the ACE2 Receptor
Structural Basis of Receptor Recognition by SARS-CoV-2
The Exchangeability of Amino Acids in Proteins
Stabilization of Alpha-Helical Structures in Short Peptides via End Capping