An expert system to predict protein thermostability using decision tree Expert Systems with Applications 36 (2009) 9007–9014 Contents lists available at ScienceDirect Expert Systems with Applications j o u r n a l h o m e p a g e : w w w . e l s e v i e r . c o m / l o c a t e / e s w a An expert system to predict protein thermostability using decision tree Li-Cheng Wu a, Jian-Xin Lee b, Hsien-Da Huang c, Baw-Juine Liu d, Jorng-Tzong Horng a,b,e,* a Institute of System Biology and Bioinformatics, National Central University, Taiwan b Department of Computer Science and Information Engineering, National Central University, Taiwan c Institute of Bioinformatics, National Chiao-Tung University, Taiwan d Computer Science and Information Engineering, Yuan Ze University, Taiwan e Department of Bioinformatics, Asia University a r t i c l e i n f o Keywords: Expert system Machine learning Bioinformatics Protein thermostability Decision Tree 0957-4174/$ - see front matter � 2008 Elsevier Ltd. A doi:10.1016/j.eswa.2008.12.020 * Corresponding author. Address: Department of Central University, Taiwan. E-mail address: horng@db.csie.ncu.edu.tw (J.-T. Ho a b s t r a c t Protein thermostability information is closely linked to commercial production of many biomaterials. Recent developments have shown that amino acid composition, special sequence patterns and hydrogen bonds, disulfide bonds, salt bridges and so on are of considerable importance to thermostability. In this study, we present a system to integrate these various factors that predict protein thermostability. In this study, the features of proteins in the PGTdb are analyzed. We consider both structure and sequence features and correlation coefficients are incorporated into the feature selection algorithm. Machine learning algo- rithms are then used to develop identification systems and performances between the different algorithms are compared. In this research, two features, (E + F + M + R)/residue and charged/non-charged, are found to be critical to the thermostability of proteins. Although the sequence and structural models achieve a higher accuracy, sequence-only models provides sufficient accuracy for sequence-only thermostability prediction. � 2008 Elsevier Ltd. All rights reserved. 1. Introduction Protein stability is intimately connected with protein folding and all proteins have to be folded into their final active state to be active and stable. In biotechnology, chemical reactions need to be performed at high temperatures to decrease reaction time. However, many proteins are not very stable when heated. Research is needed that helps proteins to remain active and stable at high temperatures, which will overcome many limitations to their industrial applications (Huang et al., 2004). There are four thermo- stability classes of protein, psychrophilic, mesophilic, thermophilic and hyperthermophilic, separated by denaturation temperature of 20 �C, 50 �C and 80 �C (Vieille & Zeikus, 2001). Current research has shown that the properties of proteins such as number of ion pairs, and salt bridges are considerably related to protein thermostability (Vieille & Zeikus, 2001). Research has shown that single point mutation, when related to thermostability, may allow thermosta- bility prediction (Capriotti, Fariselli, & Casadio, 2004; Capriotti, Fariselli, & Casadio, 2005; Saraboji, Gromiha, & Ponnuswamy, 2006). Instead of analyzing mutant proteins, research is needed that investigates the global relationships between thermostability, protein structure and sequence properties. Recent investigations have shown that features such as amino acid composition and ion pairs are related to the thermostability. ll rights reserved. Computer Science, National rng). However, a general model of how these features are related to thermostability remains a scientific challenge. Our previous work (Huang et al., 2004) has identified a general model relating ther- mostability and ion pairs. Since structured folding is a complex question, practically a model of thermostability based on sequence information would be more useful than one based on other types of analysis. Our goal is to develop a predictive model for thermostability based on sequence and structural features. Given a protein se- quence/structure, the system aims to classify proteins into three different thermostability classes: mesophilic, thermophilic and hyperthermophilic. We aim to achieve high accuracy, sensitivity, specificity and precision. Higher accuracy may be available if both specific protein structure and sequence information are available, but a general model that will accept sequence-only information is also constructed. 2. Related works The optimal growth temperature of an organism indicates the temperature at which the organism’s growth is most rapid. The prokaryotic Growth Temperature Database (PGTdb1) is a database of prokaryotic thermostability information. We adapted the thermo- stability information from the PGTdb for this work. The PDB2 is the 1 http://pgtdb.csie.ncu.edu.tw/. 2 http://www.rcsb.org/pdb/. mailto:horng@db.csie.ncu.edu.tw http://pgtdb.csie.ncu.edu.tw/ http://www.rcsb.org/pdb/ http://www.sciencedirect.com/science/journal/09574174 http://www.elsevier.com/locate/eswa 9008 L.-C. Wu et al. / Expert Systems with Applications 36 (2009) 9007–9014 single worldwide repository for the processing and distribution of 3- D structure data of large molecules including proteins and nucleic acids (Berman et al., 2000). Protein sequence and structure informa- tion are collected through the PDB database. Prediction of protein thermostability based on single mutation points has shown that prediction of thermostability is possible (Capriotti et al., 2004, 2005; Farias, van der Linden, Rego, Araujo, & Bonato, 2004; Saraboji et al., 2006). Previous research (Haney, Stees, & Konisky, 1999; Vieille & Zeikus, 2001) has shown that many special properties of proteins are related to protein thermo- stability. Usually, protein stability is intimately connected with the protein three-dimensional structure. It is believed that sequence determines the final folding of protein although three-dimensional free folding of an arbitrary protein sequence is still a scientific challenge. The thermostability of proteins is directly connected to its structural stability. Thus, the features maintaining the structure folding are often considered as significant factors in protein ther- mal stability. It has been demonstrated that there is a relationship between thermostability and sequence information. Amino acid composition and intrinsic propensity are thought to play important roles in thermal stability (Vieille & Zeikus, 2001). For instance, the protein’s content of hydrophobic amino acids is reasonably related to its thermal stability. Many mutant rules, such as Trp ? Tyr, Cys ? Ile, have been proposed to enhance thermostability (Grom- iha, Oobatake, & Sarai, 1999). In addition to the amino acid compo- sition, research has also found that there is a relationship between sequence pattern and optimal growth temperature. The pattern [EdH] and [EdT] are favored by mesophilic proteins (Liang, Huang, Ko, & Hwang, 2005). Sometimes sequence analysis can focus on a special pattern and its corresponding structure. The sequence pat- tern is transformed into a favored secondary structure by the data- base records giving a good linear relationship between local structure and melting temperature (Chan et al., 2004). There are various structure features that show a relationship with thermostability. Secondary structure, such as a-helices, and intrahelical interactions within the protein have been analyzed and it has been concluded that high a-helical stability is necessary for protein thermostability (Petukhov, Kil, Kuramitsu, & Lanzov, 1997). Hydrogen bonds, ion pairs, hydrophobic interactions and disulfide bonds are thought to be the major forces affecting protein tertiary structure. Hydrogen bonds are non-covalent bonds be- tween donor and acceptor atoms and have been widely discussed in protein structure research. Some studies have indicated that an increase in hydrogen bonds contributes to thermostability (Vogt, Woell, & Argos, 1997), and that side-chain–side-chain hydrogen bonding plays the major role in this (Ragone, 2001). Ion pairs are another feature that has been widely discussed with respect to thermostability. Researchers have found that �1.1 ions pairs are lost for every 10 �C fall in thermostability per subunit (Gianese, Bossa, & Pascarella, 2002). When the distance between the ion pair is subdivided into <4 Å, 4–6 Å and 6–8 Å, it was found that the 6–8 Å group seem to play a more significant role in ther- mostability than the other two (Szilagyi & Zavodszky, 2000). Other possible properties, such as a hydrophobic core (Haney et al., 1999), electrostatic interactions and dielectric response (Dominy, Minoux, & Brooks, 2004), aromatic clusters on protein surface (Kannan & Vishveshwara, 2000) and B values reported in high-res- olution X-ray crystal structure (Parthasarathy & Murthy, 2000), are all considered to be related to thermostability. An important restriction is that high-resolution and a high-quality structure are needed to calculate such tertiary structure properties. Researchers usually utilize homology tools to produce theoretical protein structures when the real structure is unknown, and then compare those properties across the different thermostability classes. In addition to structural and sequence features, combination of features may also act as a key to thermostability (Farias et al., 2004). Research (Farias et al., 2004) has shown that the ratio between glutamic acid plus lysine and glutamine plus histidine gives a higher thermostability. This means that combination of amino acid (E + K)/(Q + H) may be important to protein thermo- stability. The above research shows that almost all properties of a protein are related to the thermostability based on observations that com- pare homologous protein pairs or several protein mutants. A global view of how these features are related to thermostability is needed. In this study, we incorporate several machine learning approach such as Naïve Bayes (Huang et al., 2004), SVM and neural network together with k-NN (Baumgartner et al., 2004) into an investiga- tion of the relationship between protein features and thermostability. 3. Materials and methods 3.1. System flow We have developed a data-mining system to analysis the rela- tionship between protein features and thermostability. The system builds a model based on given protein features and thermostability information. The system then takes the inputted protein features and tries to predict the thermostability of a given protein as one of three classes, mesophilic, thermophilic and hyperthermophilic. The system incorporates feature selection to eliminate low-con- tributing features and incorporates several machine learning ap- proaches. Fig. 1 shows the system flow. 3.2. Materials 3.2.1. Sampling data The optimal growth temperature of an organism indicates the temperature at which the organism’s growth is most rapid. Re- search on the source organism’s optimal growth temperature can be used in place of experimental thermostability data when com- paring proteins. Most proteins of an organism are reasonably ac- tive and quite stable at the optimal growth temperature of the organism. Two databases, PGTdb and PDB, are used in our research. We identified 5487 proteins for which there are both optimal growth temperature information in PGTdb and structural information in PDB. Fig. 2 show the data distribution for these proteins with re- spect to the temperature. In total, there are 41 different optimal temperatures for the pro- teins. Fig. 2 shows that the number of proteins with an optimal temperature of 37 �C is far more than others. Most studied pro- karyotic organisms are mesophilic and therefore over-sampling at this temperature in the PGTdb is not unexpected. As a result, the distribution is unbalanced and therefore any prediction will fo- cus on proteins with an optimal temperature of 37 �C, In other words, a prediction model that predicts every protein to have an optimum temperature of 37 �C will achieve high accuracy, but will fail to reach the goal of linking features and thermostability. In or- der to balance the temperature difference sample size, we carried out a Z-test to pick the temperatures that contain significantly more proteins then other temperatures. There were seven temper- atures that exceeded the Z-test limit, which indicated a barrier of 125 proteins. We select 125 proteins randomly by computer from each of these seven temperatures to balance the data. As a result, 1810 proteins from different temperatures form our sample data- set. The dataset consists of 878 mesophilic, 580 thermophilic and 352 hyperthermophilic proteins. Fig. 1. System flow. 3 NBC (http://www.cs.pdx.edu/~timm/dm/nbc.html). L.-C. Wu et al. / Expert Systems with Applications 36 (2009) 9007–9014 9009 3.2.2. Candidate features Several candidate features are considered to be globally related to the thermostability of proteins. We categorize these features into four categories as follows. (A) Primary structure: amino acid composition of the protein sequence. (B) Secondary structure: the amount of helix structure found within the protein and the number of atoms making up the helices. (C) Tertiary structure: Ion pairs, hydrogen bonds, disulfide bonds and accessible surface area (ASA). (D) Extended properties: Normalize the value obtained from (A), (B) and (C) above and the ratio of various other features such as polar/nonpolar. 3.2.3. Data generation Several tools were used to calculate the secondary and tertiary structural features of each protein as needed. Helix packing pair (Dalton, Michalopoulos, & Westhead, 2003) was used to calculate the helix properties of the proteins. HBPLUS v3.0 (McDonald & Thornton, 1994) was used to calculate the hydrogen bonds in the proteins using the default parameters and with the distance be- tween donor and accept atoms set at <3 Å for strong hydrogen bonds. EDPDB v03c (Matthews, 1995) was used to calculate the disulfide bonds and the accessible surface areas (ASA) of the pro- teins using the default parameters. In total, 111 features were cre- ated for each protein. 3.3. Method 3.3.1. Feature selection The total number of features was 111. Some features will con- tribute in only a minor way to the thermostability, but others will have a major effect. Clearly, not all of these candidate features are critical related to protein thermostability. Inputting the whole dataset will therefore result in the data analysis process screening a lot of noisy information and this will give rise to weak results. Therefore, we adapted the correlation coefficient method in order to carry out feature selection. The use of correlation coefficients is similar to the way they are normally used in statistical (Wacker- ly & Scheaffer, 1996) and genetic analysis (Baumgartner et al., 2004). In statistics, correlation coefficients are used to measure- ment the linear relation between two random variables. Let the two random variables, X and Y, have n pair elements each. Under these circumstances, the correlation coefficient (r) between X and Y is defined as follow: r ¼ Pn i¼1ðxi � �xÞðyi � �yÞffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiPn i¼1ðxi � �xÞ 2Pn i¼1ðyi � �yÞ 2 q where (xi, yi) is the ith pair and �x�y is the mean of X and Y, respectively. Here we use correlation coefficients to measure the significance of features. We calculate the correlation between optimal growth temperature and each feature. After calculating all absolute values for the correlation coefficients, we extracted the features with |r| = 0.3. 3.3.2. Naïve Bayes Naïve Bayes is a data-mining and machine learning tool base on the Bayes theorem and assumes that each variable (property) is independence of each other. Compared with other machine learn- ing approach, Naïve Bayes usually shows great speed and high accuracy when analyzing a large dataset. Let c denote a class, x denote a piece of evidence available to the machine learning algo- rithm. The conditional probability P(c|x) represents the probability of class c when evidence x occurs. By Bayes’ rule: PðcjxÞ¼ pðc \ xÞ pðxÞ ¼ pðcÞpðxjcÞ pðxÞ where p(x) is the prior probability of evidence x, p(c) is the prior probability of class c, p(x|c) is the probability of occurence that the evidence x is found under class c. For each test data with an un- known class label, we can assign a class label ci which has maxi- mum P(ci|x). We used the implementation of the Naïve Bayes classifier (NBC) by Tim Menzies.3 http://www.cs.pdx.edu/~timm/dm/nbc.html Fig. 2. Distribution of data. 9010 L.-C. Wu et al. / Expert Systems with Applications 36 (2009) 9007–9014 3.3.3. Decision Tree Decision Tree is a flow-chart-like tree structure (Han and Kam- ber, 2005). The topmost node is the root node, each internal node denotes an attribute test, each branch represents an outcome of the test, and each leaf node represents classes. In order to classify an unknown dataset, the attribute values are tested against the Decision Tree. A path is traced from the root to a leaf node that holds the class prediction for the test data, Different to Naïve Bayes and Neural Network, the output model built by Decision Tree is interpretable and can easily be converted to classification rules. We use Decision Tree tool C4.54 to training our dataset. 3.3.4. Neural Network Neural Network is a machine learning approach that was in- spired by biological nervous systems. Usually, the network consists of three major layers, input layer, hidden layer, output layer. Each layer consists of several perceptions unit (nerves), and sometimes the hidden layer consists of more than one layer when solving a complex problem. We use a neural network tool (SNNS5) to design a network using the standard backpropagation algorithm, 10,000 cycles, learning rate 0.2, weight range [�1, 1] and one hidden layer .The amount of units in input layer is equal to the dimension of feature vector after feature selection. The number of units in hidden layer is double that of the input layer. 3.3.5. Model evaluation An evaluation process is needed to measure the performance of models created by machine learning approach. We used 10-fold cross-validation to evaluate the model. Thus, 1810 proteins are separated into 10 subsets randomly and then each subset is taken as test data in turn. Four performance indices, accuracy, specificity, sensitivity and precision, were used during our evaluation. Accu- 4 C4.5 (http://www2.cs.uregina.ca/~hamilton/courses/831/notes/ml/dtrees/c4.5/ tutorial.html). 5 Stuttgart Neural Network Simulator (http://www-ra.informatik.uni-tuebin- gen.de/SNNS/). racy is the percentage of all correct decisions made by classifica- tion algorithm. Sensitivity is the proportion of the data in class A that is also classified into class C. Specificity represent the propor- tion of data not in class C that are not classified into class C either. Precision is the proportion of data predicted in class C that is really in class A. For example, each protein in data set has two thermostability class labels after the K-fold process. One is the protein’s real ther- mostability class (Creal), and another is the thermostability class (Cpredict) assigned by the model built by machine learning. If we fo- cus on thermostability class mesophilic, then there is a subset la- beled REAL that is made up of proteins whose Creal = mesophilic. After the K-fold process, another subset os produced labeled PRE- DICT, which consists of protein whose Cpredict = mesophilic. Let the four subsets TP, FP, TN and FN refer to True Positive, False Po- sitive, True Negative, and False Negative, respectively, then TP is consists of proteins whose both Creal and Cpredict are mesophilic. FP is consists of proteins whose Cpredict = mesophilic, but Creal is not mesophilic. TN is consists of proteins whose neither Creal nor Cpredict are mesophilic. FN is consists of proteins whose Creal = mesophilic, but Cpredict is not mesophilic. REAL is the union of TP and FN. PREDICT is the union of FP and TP Then the four performance indices are as follows: Accuracy ¼ TP þ TN TP þ TN þ FP þ FN Sensitivity ¼ TP TP þ FN Specificivity ¼ TN TN þ FP Precision ¼ TP TP þ FP http://www2.cs.uregina.ca/~hamilton/courses/831/notes/ml/dtrees/c4.5/tutorial.html http://www2.cs.uregina.ca/~hamilton/courses/831/notes/ml/dtrees/c4.5/tutorial.html http://www-ra.informatik.uni-tuebingen.de/SNNS/ http://www-ra.informatik.uni-tuebingen.de/SNNS/ Table 1 Top 20 features with the highest correlation coefficients. Feature Correlation coefficient Glutamate (E)/residue 0.561 Charged/non-charged 0.496 Non-charged/residue 0.496 Charged/residue 0.489 Basic/residue 0.458 (E + F + M + R)/residue 0.437 Lysine (K)/residue 0.421 L.-C. Wu et al. / Expert Systems with Applications 36 (2009) 9007–9014 9011 To minimize the bias within such a random process, we re- peated the above procedure one hundred times and used the aver- age of the 100 results to measure our system. Fig. 3 show the training and predicting processes. A model was created using a machine learning approach. The model contains several interpretable or non-interpretable rules depend on the training algorithm. From a given feature set for a protein, the mod- el will calculate the thermostability class of protein following these rules. Glutamine (Q)/residue 0.404 Threonine (T)/residue 0.388 Acidic/residue 0.363 Isoleucine (I)/residue 0.350 Glutamate (E) 0.328 Serine (S)/residue 0.325 Asparagine (N)/residue 0.286 Lysine (K) 0.285 Alanine (A)/residue 0.282 Glutamine (Q) 0.280 Ion pair (GLU_ARG_24) 0.330 Ion pair (GLU_HIS_24) 0.370 Ion pair (GLU_LYS_24) 0.380 4. Results 4.1. Correlation coefficient We considered the 111 proteins features and the correlation coefficients showed that some feature only contributed in a minor way to thermostability. Table 1 shows the top 20 sequence features and their correlation coefficients. A full table of all correlation coef- ficients is given in the Appendix. For this study, we selected a correlation coefficient threshold of 0.3, which gave a subset of eleven sequence features and three structural features. The sequence features were made up of gluta- mate (E)/residue, charged/non-charged, charged/residue, basic/ residue, (E + F + M + R)/residue, lysine (K)/residue, glutamine (Q)/ residue, threonine (T)/residue, acidic/residue, isoleucine (I)/residue and serine (S)/residue. The structural features consisted of the three strong ion pair formed by glutamate/arginine, glutamate/his- tidine and glutamate/lysine. 4.2. Cross validation result 4.2.1. Three test cases For this study, we designed three test cases to examine the models built by machine learning. We constructed three datasets that divided proteins into different temperature types, MT, TH and MTH. The MT dataset divided the protein thermostability by 50 �C. Using this division, we tried to identify the difference be- Fig. 3. Training and pred tween mesophilic proteins and thermophilic proteins in the MT dataset. Thus, hyperthermophilic proteins are treated as thermo- philic proteins in the MT dataset. Similarly, the TH dataset divided the protein thermostability by 80 �C and mesophilic proteins are treated as thermophilic proteins in the TH dataset. In the MTH dataset, the data is divided into three classes by 50 �C and by 80 �C and we attempted to identify all three different thermosta- bility classes during the same process. 4.2.2. Comparing the three machine learning approaches Table 2 show the accuracy results for mesophilicity based on the eleven features. This table reveals that the Decision Tree and Neural Network approaches gave better performance than Naïve Bayes. Overall, Decision Tree gave the highest performance overall because Neural Network gave a significantly poorer performance with class TH. icting system flow. Table 2 Accuracy results based on the 11 sequence-only features. Classifier MT TH MTH Naïve Bayes 0.76 0.78 0.77 Decision Tree 0.87 0.84 0.85 Neural Network 0.88 0.75 0.81 Table 3 Accuracy results based on all 14 features. Classifier MT TH MTH Naïve Bayes 0.76 0.78 0.75 Decision Tree 0.87 0.84 0.86 Neural Network 0.85 0.82 0.74 Table 5 Detailed results for the four indices based on all 14 features using the MT dataset. Mesophilic Thermophilic AVG. (%) STD. AVG. (%) STD. Accuracy 86.8 0.005 86.8 0.005 Sensitivity 85.8 0.007 87.7 0.010 Specificity 87.7 0.010 85.8 0.007 Precision 87.5 0.009 86.1 0.006 Table 6 Detailed results for the four indices based on 11 features using the TH dataset. Thermophilic Hyperthermophilic AVG. (%) STD. AVG. (%) STD. Accuracy 84.3 0.016 84.3 0.016 Sensitivity 81.1 0.024 87.6 0.018 Specificity 87.6 0.018 81.1 0.024 Precision 86.8 0.018 82.3 0.019 Table 7 Detailed results for the four indices based on all 14 features using the TH dataset. Thermophilic Hyperthermophilic AVG. (%) STD. AVG. (%) STD. Accuracy 83.5 0.016 83.5 0.016 Sensitivity 80.5 0.021 86.5 0.019 Specificity 86.5 0.019 80.5 0.021 Precision 85.6 0.019 81.7 0.017 Table 8 Detailed results for the four indices based on 11 features using the MTH dataset. Mesophilic Thermophilic Hyperthermophilic AVG. (%) STD. AVG. (%) STD. AVG. (%) STD. Accuracy 87.0 0.0066 83.6 0.0075 89.3 0.0056 Sensitivity 87.3 0.0090 73.1 0.0144 73.2 0.0181 Specificity 86.8 0.0086 88.5 0.0084 93.2 0.0057 Precision 86.2 0.0080 75.0 0.0143 72.2 0.0171 Table 9 Detailed results for the four indices based on all 14 features for the MTH dataset. Mesophilic Thermophilic Hyperthermophilic AVG. (%) STD. AVG. (%) STD. AVG. (%) STD. Accuracy 87.7 0.006 81.2 0.007 89.6 0.006 Sensitivity 85.2 0.009 75.8 0.015 69.7 0.023 Specificity 90.0 0.007 83.7 0.010 94.3 0.006 Precision 88.9 0.007 68.9 0.013 74.5 0.020 9012 L.-C. Wu et al. / Expert Systems with Applications 36 (2009) 9007–9014 Table 3 show the accuracy result for mesophilicity based on all 14 sequence and structural features. Again, like the results shown in Table 2, Decision Tree gave the best result overall. 4.3. Decision Tree detailed results In the above section, Decision Tree gave good performance and the accuracy was generally better than 84%. In order to analyze the stability of the model, we repeat the cross-validation process 100 times and calculate the mean value and standard deviation of these 100 results. Using the MT database, we can see from Table 4 that the mean values (AVG.) and standard deviations (STD.) for the four perfor- mance indices based on 100 results using 11 features. Firstly, accu- racy, sensitivity, specificity and precision for mesophilicity and thermophilicity are all higher than 86%. Secondly, the standard deviations of all indices are 0.01 or lower. This means that our pre- diction system built by Decision Tree is both stable and gives good performance. The features were then increased to 14 and Table 5 shows the mean values and standard deviations for the four performance indices based on 100 results. Compared to Table 4, the result still show good performance, which is all better than 85%. For the TH dataset, the same format results are show in Tables 6 and 7. For 11 features system, all these indices are higher than 81% and the standard deviations are all lower than 0.024; thus the per- formance and stability are still good. For the 14 features system, all the indices are higher than 80% and the standard deviations are all lower than 0.021. The same approach was used for the MTH dataset and Table 8 shows the results for the 11 features system, while Table 9 shows the results for the 14 features system. For 11 features, only preci- sion for thermophilicity (65.0%), sensitivity for hyperthermophilic- ity (63.0%) and precision for hyperthermophilicity (69.4%) are lower than 70%. For mesophilicity, all four indices are better than 83%, while specificity for hyperthermophilicity is even higher than 93%. The standard deviations for the indices for three of the ther- mostability classes are all lower than 0.03. For 14 features, only precision in thermophilicity (68.9%) and sensitivity in hyperther- mophilicity (69.7%) are lower than 70%, and all indices have im- proved compared with 11 features. Thus, based on standard Table 4 Detailed results for the four indices based on 11 features using the MT dataset. Mesophilicity Thermophilicity AVG. (%) STD. AVG. (%) STD. Accuracy 87.3 0.005 87.3 0.005 Sensitivity 86.1 0.008 88.5 0.010 Specificity 88.5 0.010 86.1 0.008 Precision 88.2 0.009 86.4 0.007 deviation and error classification, we can say that our model shows stable performance under cross-validation. In addition to these four indices, we also carried out a special test to evaluate the model’s consistency comparing the model cre- ated by the MT dataset and that created by the TH dataset. For each protein, we use both models to predict the thermostability. Incon- sistencies between the models may occur if the MT model specifies a protein is mesophilic while the TH model specifies it is hyper- thermophilic. We calculate the percentage of proteins that show such inconsistency. Using two random models, the result should be an average of inconsistency of 25%. The output inconsistencies based on 11 features and 14 features were 1.8% and 1.6%, respec- tively. This meant that the models created by MT and TH datasets showed a low level of inconsistency. Table 10 New protein sequence dataset prediction results. Index MT TH MTH Accuracy 0.67 0.74 0.68 Sensitivity 0.56 0.82 0.54 Specificity 0.77 0.51 0.80 Precision 0.68 0.82 0.70 Appendix A All features and their correlation coefficients. Features Correlation coefficient Alanine (A) �0.117 Arginine (R) 0.171 Asparagine (N) �0.197 Aspartic acid (D) �0.126 Cysteine (C) �0.009 Glutamic acid (E) 0.328 Glutamine (Q) �0.280 Glycine (G) �0.082 Histidine (H) �0.049 Isoleucine (I) 0.177 Leucine (L) 0.101 Lysine (K) 0.285 Methionine (M) �0.042 Phenylalanine (F) �0.013 Proline (P) �0.002 Serine (S) �0.181 Threonine (T) �0.216 Tryptophan (W) �0.195 Tyrosine (Y) �0.049 Valine (V) 0.139 Length 0.014 Acidic 0.161 Basic 0.218 Polar �0.001 Non-polar 0.032 Cyclic �0.057 Acyclic 0.030 Aliphatic 0.037 Aromatic �0.075 Hydrophobic 0.032 Hydrophobic (+G) 0.014 Hydrophilic �0.001 Charged 0.193 Non-charged �0.056 Alanine (A)/residue �0.282 Arginine (R)/residue 0.246 Asparagine (N)/residue �0.286 Aspartic acid (D)/residue �0.233 Cysteine (C)/residue 0.003 Glutamic acid (E)/residue 0.561 Glutamine (Q)/residue �0.404 Glycine (G)/residue �0.183 Histidine (H)/residue �0.092 Isoleucine (I)/residue 0.350 Leucine (L)/residue 0.158 Lysine (K)/residue 0.421 Methionine (M)/residue �0.040 Phenylalanine (F)/residue �0.022 L.-C. Wu et al. / Expert Systems with Applications 36 (2009) 9007–9014 9013 4.4. Prediction results Here we applied new protein datasets to our model. First, 11415 proteins were downloaded from SWISS-Pro via PGTdb. Among these, there were 9959 proteins with an optimal growth tempera- ture lower than 50 �C, 869 proteins with one between 50 �C and 80 �C and 990 proteins with one higher than 80 �C. In order to make this test fair, we also used sampling of this data set. After sampling, the new dataset consisted of a total of 2943 proteins made up of 1377 mesophilic, 752 thermophilic and 814 hyper- thermophilic proteins. Table 10 shows the performance result. These proteins had only sequence information available and therefore the structural fea- tures aspect of the analysis could not be applied. This dataset is made up of many difference sequences, and the results showed a lower performance index than with the previously used dataset, especially that including structural information. The model based on the TH dataset showed highest sensitivity, which suggests that the TH model is better at detecting hyperthermophilic proteins. 5. Discussion and conclusions A large amount of useful thermostability data on protein was obtained from PGTdb (Huang et al., 2004) and thermostability pre- diction based on this data was carried out. Several important fea- tures need to be highlighted from this research and the comparison of the various different machine learning techniques. Important structural and sequence features are highlighted. Those features with a higher correlation coefficient absolute values are important not only to our prediction but will also be useful to pro- tein engineering. After comparing the difference methods, Decision Tree showed the best performance. Four performance indices, accuracy, sensitivity, specificity and precision were obtained for each dataset and overall high accuracy was achieved. The predic- tion results from the test data made up of 2943 protein from SWISS-Prot showed that an accuracy of better then 67% could be obtained when doing real predictions. The sequence and structure features together gave a better per- formance than sequence features alone, which confirms previous studies, which have shown that structural feature contribute to ther- mostability. The sequence and structural model increased not only the four indices of accuracy, sensitivity, specificity and precision, but also reduced their standard deviations. Thus the sequence and structural model gives higher stability than the sequence alone model. Two non-standard features were used in our research. (E + F + M + R)/residue and charged/non-charged get high linear correlation coefficients with optimal growth temperature. In Far- ias’s research (Farias et al., 2004), a special ratio (E + K)/(Q + H) was also considered a significant factor when distinguishing ther- mophilic and mesophilic proteins. In our research, charged amino acids consisted of aspartic acid, glutamic acid, lysine, arginine and histidine. Non-charged amino acids are the remaining 15 ami- no acids. Although the biological significance of these two features has not been mentioned in previous research, the features (E + F + M + R)/residue and charged/non-charged were important factors in this work and need to be consider as new key features when studying protein thermostability. Our final prediction system is based on Decision Tree. Decision Tree is an interpretable machine learning approach, which means that the output from Decision Tree consists of many explicable rules. Thus, hundreds of such rules were made available by this re- search. Each rule shows a series of conditions for determining pro- tein thermostability. Future works will be related to how to use these rules to alter protein thermostability and this has been initiated. (continued on next page) Appendix A (continued) Features Correlation coefficient Proline (P)/residue �0.027 Serine (S)/residue �0.325 Threonine (T)/residue �0.388 Tryptophan (W)/residue �0.235 Tyrosine (Y)/residue �0.121 Valine (V)/residue 0.249 Acidic/residue 0.363 Basic/residue 0.458 Polar/residue �0.087 Non-polar/residue 0.087 Cyclic/residue �0.185 Acyclic/residue 0.185 Aliphatic/residue 0.081 Aromatic/residue �0.195 Hydrophobic/residue 0.087 Hydrophobic (+G)/residue 0.004 Hydrophilic/residue �0.087 Charged/residue 0.489 Non-charged/residue �0.489 Acidic/basic �0.109 Basic/acidic 0.124 Polar/nonpolar �0.121 Nonpolar/polar 0.052 Cyclic/acyclic �0.171 Acyclic/cyclic 0.188 Charged/non-charged 0.496 Non-charged/charged �0.393 Hydrophobic/hydrophilic 0.052 Hydrophilic/hydrophobic �0.121 Hydrophobic (+G)/hydrophilic 0.012 Hydrophilic/hydrophobic (+G) �0.074 (E + F + M + R)/residue 0.437 ASA 0.029 Disulfide bond 0.017 Hydrogen bond �0.038 Strong hydrogen bond �0.019 Hydrogen bond/residue �0.034 Strong hydrogen bond/residue �0.015 Helix 0.034 Atom in helix 0.063 Helix/residue 0.050 Atom in helix/residue 0.091 Ion pair (ASP_ARG_24) 0.013 Ion pair (ASP_HIS_24) 0.025 Ion pair (ASP_LYS_24) �0.008 Ion pair (GLU_ARG_24) 0.330 Ion pair (GLU_HIS_24) 0.370 Ion pair (GLU_LYS_24) 0.380 All ion pair_24 0.224 Ion pair (ASP_ARG_46) 0.233 Ion pair (ASP_HIS_46) 0.245 Ion pair (ASP_LYS_46) �0.067 Ion pair (GLU_ARG_46) �0.201 Ion pair (GLU_HIS_46) �0.174 Ion pair (GLU_LYS_46) 0.033 All ion pair_46 0.082 Ion pair (ASP_ARG_68) 0.064 Ion pair (ASP_HIS_68) �0.042 Ion pair (ASP_LYS_68) �0.057 Ion pair (GLU_ARG_68) �0.035 Ion pair (GLU_HIS_68) 0.181 Ion pair (GLU_LYS_68) 0.194 All ion pair_68 0.180 9014 L.-C. Wu et al. / Expert Systems with Applications 36 (2009) 9007–9014 References Baumgartner, C., Bohm, C., Baumgartner, D., Marini, G., Weinberger, K., Olgemoller, B., et al. (2004). Bioinformatics, 20(17), 2985–2996. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., et al. (2000). Nucleic Acids Research, 28(1), 235–242. Capriotti, E., Fariselli, P., & Casadio, R. (2004). Bioinformatics, 20(Suppl. 1), I63–I68. Capriotti, E., Fariselli, P., & Casadio, R. (2005). Nucleic Acids Research, 33(Web Server issue), W306–310. Chan, C. H., Liang, H. K., Hsiao, N. W., Ko, M. T., Lyu, P. C., & Hwang, J. K. (2004). Proteins, 57(4), 684–691. Dalton, J. A., Michalopoulos, I., & Westhead, D. R. (2003). Bioinformatics, 19(10), 1298–1299. Dominy, B. N., Minoux, H., & Brooks, C. L. 3rd, (2004). Proteins, 57(1), 128–141. Farias, S. T., van der Linden, M. G., Rego, T. G., Araujo, D. A., & Bonato, M. C. (2004). In Silico Biology, 4(3), 377–380. Gianese, G., Bossa, F., & Pascarella, S. (2002). Proteins, 47(2), 236–249. Gromiha, M. M., Oobatake, M., & Sarai, A. (1999). Biophysical Chemistry, 82(1), 51–67. Haney, P. J., Stees, M., & Konisky, J. (1999). Journal of Biological Chemistry, 274(40), 28453–28458. Huang, S. L., Wu, L. C., Huang, H. D., Liang, H. K., Ko, M. T., & Horng, J. T. (2004). Applied Bioinformatics, 3(1), 21–29. Huang, Shir-Ly, Wu, Li-Cheng, Huang, Hsien-Da, Liang, Han-Kuen, Ko, Ming-Tat, & Horng, J.-T. (2004). Applied Bioformatics, 3(1), 21–29. Huang, S. L., Wu, L. C., Liang, H. K., Pan, K. T., Horng, J. T., & Ko, M. T. (2004). Bioinformatics, 20(2), 276–278. Han, J., & Kamber, M. (2005). Data mining concepts and techniques. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. Kannan, N., & Vishveshwara, S. (2000). Protein Engineering, 13(11), 753–761. Liang, H. K., Huang, C. M., Ko, M. T., & Hwang, J. K. (2005). Proteins, 59(1), 58–63. Matthews, X. Z. a. B. W. (1995). Journal of Applied Crystallography, 28, 624–630. McDonald, I. K., & Thornton, J. M. (1994). Journal of Molecular Biology, 238(5), 777–793. Parthasarathy, S., & Murthy, M. R. (2000). Protein Engineering, 13(1), 9–13. Petukhov, M., Kil, Y., Kuramitsu, S., & Lanzov, V. (1997). Proteins, 29(3), 309–320. Ragone, R. (2001). Protein Science, 10(10), 2075–2082. Saraboji, K., Gromiha, M. M., & Ponnuswamy, M. N. (2006). Biopolymers. Szilagyi, A., & Zavodszky, P. (2000). Structure with Folding and Design, 8(5), 493–504. Vieille, C., & Zeikus, G. J. (2001). Microbiology and Molecular Biology Reviews, 65(1), 1–43. Vogt, G., Woell, S., & Argos, P. (1997). Journal of Molecular Biology, 269(4), 631–643. Wackerly, D. D. W. M., III, & Scheaffer, R. L. (1996). Mathematical statistics with applications. New York: Duxbury Press. An expert system to predict protein thermostability using decision tree Introduction Related works Materials and methods System flow Materials Sampling data Candidate features Data generation Method Feature selection Naïve Bayes Decision Tree Neural Network Model evaluation Results Correlation coefficient Cross validation result Three test cases Comparing the three machine learning approaches Decision Tree detailed results Prediction results Discussion and conclusions Appendix A References