key: cord-323871-2hx4fuk2 authors: Ho, Sheau Ling; Wang, Andrew H.-J. title: Structural bioinformatics analysis of free cysteines in protein environments date: 2009-03-14 journal: J Taiwan Inst Chem Eng DOI: 10.1016/j.jtice.2008.07.015 sha: doc_id: 323871 cord_uid: 2hx4fuk2

Cysteine has been considered a “hydrophilic” amino acid because of its pKa and its ability to form (weak) hydrogen bonds. However, cysteines are found mostly in hydrophobic environments, either in S–S (disulphide) form or in free cysteine form. When free cysteines are found on the surface of proteins, they are often catalytic residues, as in cysteine proteases, P-loop phosphatases, etc. Additionally, a unique property of cysteines is that their side-chain volume is different from all other amino acids. This study focuses on the discrimination between structural and active free cysteines based on a local environment analysis, which does not appear to have been attempted previously. We have demonstrated the corresponding structural positions associated with free cysteines in their three-dimensional local environment. We examined protein samples including nine sequenced coronavirus proteases and cysteine-rich non-membrane proteins. Our present study shows that the sequential environments of free cysteines of coronavirus proteases are rather hydrophobic and that the free cysteines of non-membrane proteases have more contacts with hydrophobic residues and fewer contacts with polar or charged residues.

© 2008 Taiwan Institute of Chemical Engineers. Published by Elsevier B.V. All rights reserved.

of cysteines has enabled us to define their association in a sequence alignment by grouping residues into families. Supplementary research included a consensus approach to cysteine residues within the 3CLpro and extended it to a number of other proteins. The identification of the free cysteines, combined with classification based on functional features and spatial schematics, provides a basis for experimental validation and association of new molecules involved in cysteine activities. Since neighboring residues share physical characteristics (Zvelebil et al., 1987), we have undertaken a more detailed study of the surroundings of cysteine residues in protein structures. The patterns discerned in the distribution of various residues and their constituents around cysteine should be useful in improving our understanding of protein stability, molecular recognition and binding.

2.1. Database search and sequence analysis

2.1.1. Cysteine-rich proteinases

Sequences of cysteine-containing proteases were retrieved from the public 3D structural databases (Berman et al., 2000) using combinations of sequence and conserved-motif searches to choose identifiable categories. All of the selected proteins employed in this study were characterized by X-ray crystallography at a resolution of 3.5 Å or better. Membrane proteins are known to be complicated targets for structure determination and were excluded. Therefore, according to their environmental differences, a total of 21 cysteine-rich proteins were randomly collected, which further resulted in a sample of 15 distinct types of non-membrane proteins. The PDB codes and the corresponding protein names are listed in Table 1. Moreover, based on the literature data available for each, the 15 non-membrane proteins were classified according to common functions into nine different categories: cysteine proteases, phosphatases, metabolic enzymes, kinases, interleukins, transcription factors, motility proteins, virus capsid proteins, and ribosomes. All sequence data were downloaded, and Perl scripts (www.perl.org) were used to count the cysteines and report the counts (length).

Several methodical approaches have been proposed for such a study. They include: analysis of the amino acid characteristics of the spatial neighbors of the target residue (free cysteine, Cys_SH), measured within a 3.7 Å radius sphere centered on the sulfur atom of the cysteine residue; analysis of the hydrophobicity distribution around the target residue; and structure-based threading. When choosing the sphere radius of 3.7 Å, we took into account that the spheres should cover as much of the space between the atoms as possible but, then again, should not overlap too strongly.
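The neighbor analysis described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Atom` record and the function name `cysteine_neighbors` are hypothetical, and the sketch assumes that a residue counts as a neighbor when any of its atoms lies within 3.7 Å of the cysteine sulfur (SG), a criterion the text does not spell out.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Atom:
    """Hypothetical minimal atom record (not a real library type)."""
    res_name: str  # three-letter residue name, e.g. "CYS"
    chain: str     # chain identifier
    res_seq: int   # residue sequence number
    name: str      # atom name, e.g. "SG"
    xyz: tuple     # (x, y, z) coordinates in angstroms


def cysteine_neighbors(atoms, radius=3.7):
    """Map each cysteine (chain, res_seq) to the sorted list of residues
    that have at least one atom within `radius` angstroms of its SG atom.

    Assumption: any-atom contact defines a neighbor.
    """
    result = {}
    for sg in atoms:
        if sg.res_name != "CYS" or sg.name != "SG":
            continue
        center = (sg.chain, sg.res_seq)
        near = set()
        for other in atoms:
            if (other.chain, other.res_seq) == center:
                continue  # skip the cysteine's own atoms
            if dist(sg.xyz, other.xyz) <= radius:
                near.add((other.res_name, other.chain, other.res_seq))
        result[center] = sorted(near)
    return result


# Toy coordinates, invented purely for illustration.
toy = [
    Atom("CYS", "A", 10, "SG", (0.0, 0.0, 0.0)),
    Atom("LEU", "A", 25, "CD1", (3.5, 0.0, 0.0)),  # 3.5 A away: inside
    Atom("ASP", "A", 55, "OD1", (4.0, 0.0, 0.0)),  # 4.0 A away: outside
]
print(cysteine_neighbors(toy))
```

The double loop is quadratic in the number of atoms, which is acceptable for single proteins; a spatial index would be the usual choice for larger data sets.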
The secondary structure classes from the HSSP (Dodge et al., 1998) files were grouped as follows: helices were defined by the classes G, I and H, strands by B and E, turns by T and bends by S (Kabsch and Sander, 1983). Geometric alignments and backbone superpositions of two protein structures were made using the log procedure of program O (Jones et al., 1991).

The main coronavirus proteinases were extracted from the public DDBJ/EMBL/GenBank database (abbreviations in parentheses): SARS 3CLpro coronavirus (SARS, PDB ID code 1UJ1), human coronavirus 229E (229E, PDB ID code 1P9S), transmissible gastroenteritis virus (TGEV, PDB ID code 1LVO), human coronavirus OC43 (OC43), bovine coronavirus (BCoV), murine hepatitis virus (MHV), porcine epidemic diarrhea virus (PEDV), avian infectious bronchitis virus (IBV), and feline infectious peritonitis virus (FIPV). Multiple sequence alignment of the nine coronavirus proteinases with their homologs was performed using the CLUSTAL W program (Thompson et al., 1994). Residues in the selected query sequences were characterized as cysteine, identical, hydrophobic, small, and charged/hydrophilic similar amino acid residues. Accordingly, through comparative sequence analysis, we examined their identities, similarities and differences. In addition, more precise distributions of residues around the sulfur atom of free cysteine were analyzed.

Table 1 lists the PDB codes and names of the 15 non-membrane proteins, together with the concise descriptions used throughout the following discussion of our results. The columns of Table 1 give the total aa, total hydrophobic aa, C total, C free, secondary structure (the locations of cysteine residues), and the residues surrounding cysteine in a 3.7 Å radius sphere.
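As a rough illustration of how several of the Table 1 columns could be tallied for one protein, the sketch below combines a one-letter sequence with a per-residue secondary-structure string. The helix, strand, turn and bend grouping follows the class letters given in the text; the function name `table_row` and the hydrophobic residue set are assumptions made for this example and are not taken from the paper. The "C free" column is not computed, since it requires disulfide assignments.

```python
# Secondary-structure grouping as stated in the text
# (helix: G, I, H; strand: B, E; turn: T; bend: S).
SS_GROUPS = {
    "G": "helix", "I": "helix", "H": "helix",
    "B": "strand", "E": "strand",
    "T": "turn",
    "S": "bend",
}

# Assumption: an illustrative hydrophobic set, not the paper's definition.
HYDROPHOBIC = set("AVLIMFWP")


def table_row(sequence, ss_string):
    """Summarise one protein: length, hydrophobic residue count, cysteine
    count, and the secondary-structure class at each cysteine position."""
    if len(sequence) != len(ss_string):
        raise ValueError("sequence and secondary-structure string differ in length")
    cys_ss = [
        SS_GROUPS.get(ss, "other")
        for aa, ss in zip(sequence, ss_string)
        if aa == "C"
    ]
    return {
        "total_aa": len(sequence),
        "total_hydrophobic_aa": sum(aa in HYDROPHOBIC for aa in sequence),
        "c_total": len(cys_ss),
        "cys_secondary_structure": cys_ss,
    }


# Invented eight-residue example; '-' marks an unassigned position.
print(table_row("MCLVECAG", "-EEEEHHT"))
```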
Table 1 shows a total of 87 cysteine residues collected from the non-membrane protein data sets, which were further divided into two forms: 2 disulfide-bonded cysteines (Cys_SS) and 85 free cysteines (Cys_SH). Our data show that free cysteines of non-membrane proteins prefer a β-strand environment. In non-membrane proteins, hydrophobic residues such as leucine, valine, isoleucine and alanine were more frequently seen in the spatial neighborhood around free cysteines; the same was observed for the aromatic residue phenylalanine. Thus, the sequential differences in the positions between Cys_SH and its local neighborhood among the non-membrane protein data sets suggest that the surrounding residues are mostly hydrophobic (Table 1). Moreover, free cysteines in non-membrane proteins have a high number of leucine contacts.

To determine what is unique about cysteine within 3CL cysteine proteases, we compared SARS 3CLpro to other coronavirus proteases. We applied a structure-based sequence alignment of the nine coronavirus main proteinases to identify any spatial correspondences involving cysteine among them. Residues comprising the nine coronavirus main proteases are illustrated in Fig. 1. The multiple sequence alignment shows a strong conservation of hydrophobic residues (valine, leucine, isoleucine, alanine, phenylalanine and proline) and small residues (serine and glycine) in proximity to cysteine residues. The presence of 28 cysteine residues (around 9%) in well-conserved positions was also noticed. This indicates that these homologous proteins have a higher proportion of cysteine residues than others.
Moreover, our findings show that the environment of free cysteines is rather hydrophobic, in fair agreement with the results reported by Fiser et al. (1992) and Muskal et al. (1990), who showed that hydrophobic residues accumulated in the vicinity of free cysteines.

Fig. 1. [...] 229E, gi|30024078; TGEV, gi|30146762; OC43, gi|50844478; BCoV, gi|26008084; MHV, gi|25121563; PEDV, gi|30138155; IBV, gi|25121547; FIPV, gi|37999875). The α-helices, β-strands and the domains as revealed in the SARS 3CLpro crystal structure are shown above the sequence alignment. The alignment was produced using CLUSTAL W (Thompson et al., 1994). Colored outlines indicate cysteine, identical, hydrophobic, small, and charged/hydrophilic similar amino acid residues (yellow, blue, green, red, and purple, respectively). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)

and α-chymotrypsin protein

SARS 3CLpro (PDB ID code 1UJ1) forms a dimer with the two protomers oriented almost at right angles to each other. Initially, we identified SARS 3CLpro domains that somewhat matched the α-chymotrypsin domains. Fig. 2 shows that each monomer is folded into three domains, the first two of which are antiparallel β-barrels and together resemble the architecture of serine proteinases of the chymotrypsin family (PDB ID code 4CHA). Domain II of SARS 3CLpro is smaller than domain I and also smaller than the homologous domain II of α-chymotrypsin. Quantitatively, with a distance limit of 3.8 Å, the RMSD over 110 atoms spanning domains I and II of the two structures is 1.906 Å. With a limit of 2.0 Å, the RMSD over 48 atoms of domain I is 0.990 Å, and the RMSD over 40 atoms of domain II is 1.156 Å. These results show that the two proteins differ little in conformation.
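The RMSD values quoted above depend on which atom pairs are counted. The sketch below shows one plausible reading of "with a limit of": after superposition, only equivalent atom pairs closer than the limit contribute to the RMSD. This interpretation, the function name `rmsd_within_limit`, and the assumption that the coordinates are already superposed and paired are all illustrative; the source does not describe how program O applied the limit.

```python
from math import dist, sqrt


def rmsd_within_limit(coords_a, coords_b, limit):
    """RMSD over equivalent atom pairs whose separation is <= limit.

    coords_a and coords_b are equal-length lists of (x, y, z) tuples for
    atoms that are already superposed and paired one-to-one.
    Returns (number_of_pairs_used, rmsd); rmsd is NaN if no pair qualifies.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be paired one-to-one")
    kept = [d for d in (dist(a, b) for a, b in zip(coords_a, coords_b))
            if d <= limit]
    if not kept:
        return 0, float("nan")
    return len(kept), sqrt(sum(d * d for d in kept) / len(kept))


# Invented coordinates: the three pairs are 1, 3 and 5 angstroms apart.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
b = [(0.0, 0.0, 1.0), (1.0, 0.0, 3.0), (2.0, 0.0, 5.0)]
print(rmsd_within_limit(a, b, 3.8))  # uses the 1 A and 3 A pairs
print(rmsd_within_limit(a, b, 2.0))  # uses only the 1 A pair
```

Under this reading, a tighter limit keeps fewer pairs and gives a lower RMSD, which matches the pattern of the values reported for the 3.8 Å and 2.0 Å limits.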
We have highlighted the side chains of the cysteines and of the residues at the spatially equivalent positions in the two proteins (Fig. 2(b)). α-Chymotrypsin has a total of eight residues (tryptophan, threonine, aspartic acid, glycine, proline, serine, alanine, valine) at the positions corresponding to cysteine residues in SARS 3CLpro, whereas SARS 3CLpro has a total of seven residues (leucine, valine, tyrosine, phenylalanine, asparagine) at the positions corresponding to cysteine residues in α-chymotrypsin. Apparently, most of the spatially equivalent residues are hydrophobic, which may be somehow involved in catalysis. A superimposition (stereo image) of the structures of SARS 3CLpro demonstrates that cysteine residues not only favor positioning in a hydrophobic environment but also adopt a hunched posture in the surroundings of aromatic residues (see Fig. 3).

Fig. 2. A MOLSCRIPT diagram showing the SARS 3CLpro monomer and α-chymotrypsin (PDB code 4CHA) structures. (a) A monomer of SARS 3CLpro is presented as ribbons, with cysteines exposed. It contains two β-barrel domains and the α-helical C-terminal domain. The β-barrels of domains I and II are each composed of six-stranded β-sheets. Domain III is composed mainly of α-helices. The first two domains are antiparallel β-barrels reminiscent of those found in the chymotrypsin family. (b) α-Chymotrypsin (PDB code 4CHA) presented as a ribbon, with the side chains of the residues corresponding to the cysteines of SARS 3CLpro highlighted. Domain II of SARS 3CLpro is not only smaller than domain I but also smaller than the homologous domain II of α-chymotrypsin.

Cysteines and the residues surrounding cysteine in a 3.7 Å radius sphere were identified in a Cα-trace outline of the SARS 3CLpro monomer (PDB ID code 1UJ1); see Fig. 4(a). This figure shows that hydrophobic residues such as leucine, valine, phenylalanine, and isoleucine were more frequently seen in the spatial neighborhood around free cysteines.
SARS 3CLpro (PDB ID code 1UJ1) was then superposed with the other two 3CL proteases, 229E (PDB ID code 1P9S) and TGEV (PDB ID code 1LVO). These three proteins are similar to each other and can be superposed well (RMSD 1.12-1.18 Å). Fig. 4(b) shows the superimposed Cα profiles of SARS 3CLpro, 229E and TGEV. The side chains are also highlighted for the spatially conserved residues that correspond to the free cysteines. The residues spatially corresponding to free cysteines within each are serine, tyrosine, alanine and valine. These residues are well conserved according to a Risler matrix (Risler et al., 1988).

Proteins having similar functions but from different sources can be identified by their sequences. A statistical analysis of the amino acid frequencies in the alignment of the nine 3CLpro sequences gave the results shown in Fig. 5(a), which plots the frequency of occurrence of alternative residues at the conserved cysteine positions. Our findings reveal that alanine, valine, serine, leucine, and threonine were more frequently observed among these nine coronavirus main proteinases (3CLpro). Thus, we can conclude: (1) cysteines are frequently found in hydrophobic regions containing alanine, valine, and leucine; (2) small residues (serine and threonine) are favored as substitutes for cysteine residues. These residues are relatively smaller than the existing cysteine, thereby producing conservative changes that would not disrupt the native structure.

We have analyzed the distribution of residues around the sulfur atom of cysteine. The eight most frequent residues surrounding cysteine (within a 3.7 Å radius sphere) in non-membrane proteins were leucine, phenylalanine, valine, alanine, isoleucine, proline, threonine and serine. These are classified as either hydrophobic or small residues (see Fig. 5(b)).
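A tally of the kind plotted in Fig. 5(a) can be sketched as follows. The example assumes the alignment is available as equal-length gapped strings, and it treats every column containing at least one cysteine as a cysteine position; the paper's exact criterion for a conserved cysteine location is not stated, so this rule and the function name `cysteine_column_substitutions` are assumptions.

```python
from collections import Counter


def cysteine_column_substitutions(aligned_seqs):
    """Count the non-cysteine residues found in alignment columns that
    contain at least one cysteine. Gap characters ('-') are ignored."""
    lengths = {len(s) for s in aligned_seqs}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must be non-empty and equal in length")
    counts = Counter()
    for column in zip(*aligned_seqs):
        if "C" in column:
            counts.update(aa for aa in column if aa not in ("C", "-"))
    return counts


# Invented three-sequence alignment for illustration only.
toy_alignment = ["ACLS-", "ASLC-", "ACVTG"]
print(cysteine_column_substitutions(toy_alignment))
```

In the toy alignment, column 2 contributes one serine and column 4 contributes one serine and one threonine, so serine is reported twice and threonine once.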
It has been reported that, to stabilize interactions involving cysteine residues, the free sulfhydryl group prefers to interact closely with the face of aromatic rings (Klingler and Brutlag, 1994; Pal and Chakrabarti, 1998). Our data show that, based on the 3D spatial orientation, the aromatic ring of a selected residue, i.e. phenylalanine, is most likely to be in contact with the sulfur atom of cysteines. We carefully examined some other non-membrane proteins. Approximately 62% and 60% of the aromatic ring faces of phenylalanine and tyrosine, respectively, were clustered contiguously to the sulfur atom of cysteines, contrary to the way tryptophan (W) behaved (see Fig. 6).

Fig. 5. Plots of occurrences of residues. (a) 28 conserved cysteine locations were observed in the sequence alignment of the 9 coronavirus main proteinases (3CLpro). The occurrence of alternative residues at the conserved cysteine positions is shown. This indicates that cysteines are highly conserved in hydrophobic regions and that smaller residues are the favored substitutions. (b) Differences in the occurrence of residues surrounding cysteine (at a distance of less than 3.7 Å, with the side-chain S of cysteine as the center) for various classes of proteins (cysteine proteases, phosphatases, kinases, interleukins, transcription factors, motility proteins, metabolic enzymes, virus capsid proteins, and ribosomes). This indicates that either hydrophobic or small residues cluster with cysteines.

Fig. 6. Distribution of the relative position between the sulfur atom of cysteine and the aromatic ring of F, Y, W in a 3.7 Å radius sphere (non-membrane proteins). The aromatic ring faces of phenylalanine and tyrosine were clustered contiguously to the sulfur atom of cysteines, contrary to the way tryptophan (W) behaved.

This might be because tryptophan not only has the largest nonpolar accessible surface area, but was also observed less frequently in our selected proteins.
In this regard, further investigation to provide a quantitative explanation is needed. From the data collected, we observed that approximately 23% of free cysteine residues of non-membrane proteins were located in α-helices and 77% in β-strands (Table 1). This observation is consistent with the tendency for cysteine to occur preferentially in β-strands (Williams et al., 1987; Wilmot and Thornton, 1988).

Knowledge of the details of protein structures offers a range of possibilities for investigating their biological functions (Baker et al., 2003; Eisenstein et al., 2000; Teichmann et al., 2000). Mapping conserved residues onto certain characteristics may likewise identify a key site or perhaps a sensible function. Accordingly, the approaches we have presented were based on structures. Although spatial contacts have been studied to derive contact potentials for the different amino acid interactions (Brocchieri and Karlin, 1995; Miyazawa and Jernigan, 1996, 1999), the common strategy is to study the number of contacts within a given distance cut-off. To our knowledge, structural bioinformatics detection of cysteines has not been attempted previously. This study has revealed, for the first time, the discrimination of structural versus active free cysteines based on local environment analysis. The computational prediction and annotation of free cysteines in protein environments has been described through analysis of 3D spatial correspondences. Essentially, we have demonstrated the corresponding structural positions associated with free cysteines in their three-dimensional environment and the frequency of occurrence of the residues surrounding the free cysteines in selected proteases.
The types of residues involved in spatial contacts with free cysteines of non-membrane proteins found in the present study indicate that free cysteines have a higher capacity for contacts with hydrophobic residues and a lower capacity for contacts with polar/charged residues. We also examined nine sequenced coronavirus proteases, including three (SARS 3CL, 229E, TGEV) whose structures have been solved. For these it was shown that the sequential environments around free cysteines were rather hydrophobic. The combined use of database mining approaches for 3CL main proteases and cysteine-rich proteins (membrane/non-membrane proteins) allowed for the classification of free cysteines in proteins. The validity of this approach was supported by the identification of some known proteases. The identification and functional characterization of the free cysteines will have implications in many aspects of biology. Moreover, the sets of these proteins and the knowledge-based methods used to identify them will form the foundation of algorithms for detection, in particular within the protein sequence.

Protein Structure Prediction and Analysis as a Tool for Functional Genomics
The Protein Data Bank
How Are Close Residues of Protein Structures Distributed in Primary Sequence?
The HSSP Database of Protein Structure-Sequence Alignments and Family Profiles
Biological Function Made Crystal Clear - Annotation of Hypothetical Proteins via Structural Genomics
Different Sequence Environments of Cysteines and Half Cysteines in Proteins: Application to Predict Disulfide Forming Residues
Improved Methods for Building Protein Models in Electron Density Maps and the Location of Errors in These Models
Dictionary of Protein Secondary Structure: Pattern Recognition of Hydrogen-Bonded and Geometrical Features
Discovering Structural Correlations in Alpha-Helices
Oligopeptide Biases in Protein Sequences and Their Use in Predicting Protein Coding Regions in Nucleotide Sequences
Residue-Residue Potentials with a Favorable Contact Pair Term and an Unfavorable High Packing Density Term, for Simulation and Threading
Evaluation of Short-Range Interactions as Secondary Structure Energies for Protein Fold and Sequence Recognition
Prediction of the Disulfide-Bonding State of Cysteine in Proteins
Chemical Synthesis of Proteins
Different Types of Interactions Involving Cysteine Sulfhydryl Group in Proteins
Amino Acid Substitutions in Structurally Related Proteins. A Pattern Recognition Approach. Determination of a New and Efficient Scoring Matrix
Fast Assignment of Protein Structures to Sequences Using the Intermediate Sequence Library PDB-ISL
Clustal W: Improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting, Position-Specific Gap Penalties and Weight Matrix Choice
Secondary Structure Predictions and Medium Range Interactions
Analysis and Prediction of the Different Types of Beta-Turn in Proteins
Prediction of Protein Secondary Structure and Active Sites Using the Alignment of Homologous Sequences

The authors are truly grateful to Academia Sinica, R.O.C. for supporting this study.