key: cord-0001424-qdld7hdc authors: Fan, Yue-Nong; Xiao, Xuan; Min, Jian-Liang; Chou, Kuo-Chen title: iNR-Drug: Predicting the Interaction of Drugs with Nuclear Receptors in Cellular Networking date: 2014-03-19 journal: Int J Mol Sci DOI: 10.3390/ijms15034915 sha: ee55aea26f816403476a7cb71816b8ecb1110329 doc_id: 1424 cord_uid: qdld7hdc Nuclear receptors (NRs) are closely associated with various major diseases such as cancer, diabetes, inflammatory disease, and osteoporosis. Therefore, NRs have become a frequent target for drug development. During the process of developing drugs against these diseases by targeting NRs, we are often facing a problem: Given a NR and chemical compound, can we identify whether they are really in interaction with each other in a cell? To address this problem, a predictor called “iNR-Drug” was developed. In the predictor, the drug compound concerned was formulated by a 256-D (dimensional) vector derived from its molecular fingerprint, and the NR by a 500-D vector formed by incorporating its sequential evolution information and physicochemical features into the general form of pseudo amino acid composition, and the prediction engine was operated by the SVM (support vector machine) algorithm. Compared with the existing prediction methods in this area, iNR-Drug not only can yield a higher success rate, but is also featured by a user-friendly web-server established at http://www.jci-bioinfo.cn/iNR-Drug/, which is particularly useful for most experimental scientists to obtain their desired data in a timely manner. It is anticipated that the iNR-Drug server may become a useful high throughput tool for both basic research and drug development, and that the current approach may be easily extended to study the interactions of drug with other targets as well. With the ability to directly bind to DNA ( Figure 1 ) and regulate the expression of adjacent genes, nuclear receptors (NRs) are a class of ligand-inducible transcription factors. 
They regulate various biological processes, such as homeostasis, differentiation, embryonic development, and organ physiology [1–3]. The NR superfamily has been classified into seven families: NR0 (knirps or DAX like) [4,5], NR1 (thyroid hormone like), NR2 (HNF4-like), NR3 (estrogen like), NR4 (nerve growth factor IB-like), NR5 (fushi tarazu-F1 like), and NR6 (germ cell nuclear factor like). Since they are involved in almost all aspects of human physiology and are implicated in many major diseases such as cancer, diabetes and osteoporosis, nuclear receptors have become major drug targets [6,7], along with G protein-coupled receptors (GPCRs) [8–17], ion channels [18–20], and kinase proteins [21–24]. Identification of drug-target interactions is one of the most important steps in the development of new medicines [25,26]. The method usually adopted in this step is molecular docking simulation [27–43]. However, to make a molecular docking study feasible, a reliable 3D (three-dimensional) structure of the target protein is a prerequisite. Although X-ray crystallography is a powerful tool for determining protein 3D structures, it is time-consuming and expensive. In particular, not all proteins can be successfully crystallized. For example, membrane proteins are very difficult to crystallize and most of them will not dissolve in normal solvents. Therefore, so far very few membrane protein 3D structures have been determined. Although NMR (Nuclear Magnetic Resonance) is indeed a very powerful tool for determining the 3D structures of membrane proteins, as indicated by a series of recent publications (see, e.g., [44–51] and a review article [20]), it is also time-consuming and costly.
To acquire the 3D structural information in a timely manner, one has to resort to various structural bioinformatics tools (see, e.g., [37]), particularly the homology modeling approach, as utilized for a series of protein receptors urgently needed during the process of drug development [19,52–57]. Unfortunately, the number of dependable templates for developing high-quality 3D structures by means of homology modeling is very limited [37]. To overcome the aforementioned problems, it would be helpful to develop a computational method for predicting the interactions of drugs with nuclear receptors in cellular networking based on the sequence information of the latter. The results thus obtained can be used to exclude in advance the compounds identified as not interacting with the nuclear receptors, so as to avoid wasting time and money on those unpromising compounds [58]. Actually, based on functional groups and biological features, a powerful method was developed recently [59] for this purpose. However, further development in this regard is needed for the following reasons: (a) He et al. [59] did not provide a publicly accessible web-server for their method, and hence its practical application value is quite limited, particularly for the broad community of experimental scientists; (b) the prediction quality can be further enhanced by incorporating some key features into the formulation of NR-drug (nuclear receptor and drug) samples via the general form of pseudo amino acid composition [60]. The present study was initiated in an attempt to develop a new method for predicting the interaction of drugs with nuclear receptors by addressing these two points.
As demonstrated by a series of recent publications [10,18,61–70] and summarized in a comprehensive review [60], to establish a really effective statistical predictor for a biomedical system, we need to consider the following steps: (a) select or construct a valid benchmark dataset to train and test the predictor; (b) represent the statistical samples with an effective formulation that can truly reflect their intrinsic correlation with the object to be predicted; (c) introduce or develop a powerful algorithm or engine to operate the prediction; (d) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (e) establish a user-friendly web-server for the predictor that is accessible to the public. Below, we elaborate on how to deal with these steps.

The data used in the current study were collected from KEGG (Kyoto Encyclopedia of Genes and Genomes) [71] at http://www.kegg.jp/kegg/. KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies. Here, the benchmark dataset can be formulated as

$$\mathbb{S} = \mathbb{S}^{+} \cup \mathbb{S}^{-} \qquad (1)$$

where $\mathbb{S}^{+}$ is the positive subset that consists of the interactive drug-NR pairs only, $\mathbb{S}^{-}$ is the negative subset that contains the non-interactive drug-NR pairs only, and the symbol $\cup$ represents the union in set theory. The so-called "interactive" pair here means a pair whose two counterparts interact with each other in the drug-target networks as defined in the KEGG database [71], while a "non-interactive" pair means that its two counterparts do not interact with each other in the drug-target networks. The positive dataset $\mathbb{S}^{+}$ contains 86 drug-NR pairs, which were taken from He et al. [59].
The negative dataset $\mathbb{S}^{-}$ contains 172 non-interactive drug-NR pairs, which were derived according to the following procedure: (a) separating each of the pairs in $\mathbb{S}^{+}$ into a single drug and a single NR; (b) re-coupling each of the single drugs with each of the single NRs into pairs in such a way that none of them occurred in $\mathbb{S}^{+}$; (c) randomly picking the pairs thus formed until their number reached twice the number of pairs in $\mathbb{S}^{+}$. The 86 interactive drug-NR pairs and 172 non-interactive drug-NR pairs are given in Supplementary Information S1, from which we can see that the 86 + 172 = 258 pairs in the current benchmark dataset are formed by 25 different NRs and 53 different compounds.

Since each of the samples in the current network system contains a drug (compound) and an NR (protein), the following procedures were taken to represent the drug-NR pair sample. First, for the drug part in the current benchmark dataset, we can use a 256-D vector to formulate it as given by

$$\mathbf{D} = [d_1 \; d_2 \; \cdots \; d_i \; \cdots \; d_{256}] \qquad (2)$$

where $\mathbf{D}$ represents the vector for a drug compound, and $d_i$ its $i$-th ($i = 1, 2, \ldots, 256$) component, which can be derived by following the "2D molecular fingerprint procedure" as elaborated in [10]. The 53 molecular fingerprint vectors thus obtained for the 53 drugs in $\mathbb{S}$ are given in Supplementary Information S2.

The protein sequences of the 25 different NRs in $\mathbb{S}$ are listed in Supplementary Information S3. Suppose the sequence of a nuclear receptor protein $\mathbf{P}$ with $L$ residues is generally expressed by

$$\mathbf{P} = R_1 R_2 R_3 \cdots R_L \qquad (3)$$

where $R_1$ represents the 1st residue of the protein sequence $\mathbf{P}$, $R_2$ the 2nd residue, and so forth. Now the problem is how to effectively represent the sequence of Equation (3) with a non-sequential or discrete model [72].
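The re-coupling procedure (a) to (c) used above to build the negative subset can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function name `build_negative_pairs`, the `ratio` and `seed` parameters, and the toy identifiers are all hypothetical. Note that 25 NRs and 53 compounds allow at most 25 × 53 = 1325 re-coupled pairs, so after excluding the 86 positive pairs there are 1239 candidates from which to draw the 172 negatives.

```python
import random


def build_negative_pairs(positive_pairs, ratio=2, seed=0):
    """Sample non-interactive (drug, receptor) pairs by re-coupling the positives.

    Hypothetical helper: name, parameters and seeding are illustrative assumptions.
    """
    positives = set(positive_pairs)
    # (a) separate every positive pair into its single drug and single receptor
    drugs = sorted({drug for drug, _ in positives})
    receptors = sorted({receptor for _, receptor in positives})
    # (b) re-couple each drug with each receptor, skipping pairs already in the positive set
    candidates = [(d, r) for d in drugs for r in receptors if (d, r) not in positives]
    # (c) randomly pick ratio * |positives| of the re-coupled pairs
    wanted = ratio * len(positives)
    if wanted > len(candidates):
        raise ValueError("not enough re-coupled pairs to reach the requested ratio")
    return random.Random(seed).sample(candidates, wanted)


# Toy example with made-up identifiers (not taken from the benchmark dataset).
toy_positives = [("D1", "N1"), ("D2", "N2"), ("D3", "N3")]
toy_negatives = build_negative_pairs(toy_positives)
```

Fixing the random seed makes the sampled negative subset reproducible, which matters because a different random draw would give a different benchmark dataset.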
A discrete model is needed because all the existing operation engines, such as covariance discriminant (CD) [17,65,73–79], neural network [80–82], support vector machine (SVM) [62–64,83], random forest [84,85], conditional random field [66], nearest neighbor (NN) [86,87], K-nearest neighbor (KNN) [88–90], OET-KNN [91–94], and Fuzzy K-nearest neighbor [10,12,18,69,95], can only handle vector but not sequence samples. However, a vector defined in a discrete model may completely lose all the sequence-order information and hence limit the quality of prediction. Facing such a dilemma, can we find an approach to partially incorporate the sequence-order effects? Indeed, one of the most challenging problems in computational biology is how to formulate a biological sequence with a discrete model or a vector, yet still keep considerable sequence-order information. To avoid completely losing the sequence-order information for proteins, the pseudo amino acid composition [96,97], or Chou's PseAAC [98], was proposed.
Ever since the concept of PseAAC was proposed in 2001 [96], it has penetrated into almost all the areas of computational proteomics, such as predicting anticancer peptides [99], predicting protein subcellular location [100–106], predicting membrane protein types [107,108], predicting protein submitochondria locations [109–112], predicting GABA(A) receptor proteins [113], predicting enzyme subfamily classes [114], predicting antibacterial peptides [115], predicting supersecondary structure [116], predicting bacterial virulent proteins [117], predicting protein structural class [118], predicting the cofactors of oxidoreductases [119], predicting metalloproteinase family [120], identifying cysteine S-nitrosylation sites in proteins [66], identifying bacterial secreted proteins [121], identifying antibacterial peptides [115], identifying allergenic proteins [122], identifying protein quaternary structural attributes [123,124], identifying risk type of human papillomaviruses [125], identifying cyclin proteins [126], identifying GPCRs and their types [15,16], discriminating outer membrane proteins [127], classifying amino acids [128], and detecting remote homologous proteins [129], among many others (see a long list of papers cited in the References section of [60]). Moreover, the concept of PseAAC was further extended to represent the feature vectors of nucleotides [65], as well as other biological samples (see, e.g., [130–132]). Because it has been widely and increasingly used, two powerful software packages, called "PseAAC-Builder" [133] and "propy" [134], were recently established for generating various special modes of Chou's pseudo amino acid composition, in addition to the web-server "PseAAC" [135] built in 2008.
According to a comprehensive review [60], the general form of PseAAC for a protein sequence $\mathbf{P}$ is formulated by

$$\mathbf{P} = [\psi_1 \; \psi_2 \; \cdots \; \psi_u \; \cdots \; \psi_\Omega] \qquad (4)$$

where the subscript $\Omega$ is an integer, and its value as well as the components $\psi_u$ ($u = 1, 2, \ldots, \Omega$) will depend on how to extract the desired information from the amino acid sequence of $\mathbf{P}$ (cf. Equation (3)). Below, we describe how to extract useful information to define the components of PseAAC for the NR samples concerned.

First, many earlier studies (see, e.g., [136–141]) have indicated that the amino acid composition (AAC) of a protein plays an important role in determining its attributes. The AAC contains 20 components, each representing the occurrence frequency of one of the 20 native amino acids in the protein concerned. Thus, these 20 AAC components were used here to define the first 20 elements in Equation (4); i.e.,

$$\psi_i = f_i^{(1)} \quad (i = 1, 2, \ldots, 20) \qquad (5)$$

where $f_i^{(1)}$ is the normalized occurrence frequency of the $i$-th type of native amino acid in the nuclear receptor concerned. Since the AAC does not contain any sequence-order information, the following steps were taken to make up for this shortcoming.

To avoid completely losing the local or short-range sequence-order information, we considered the approach of dipeptide composition. It contains 20 × 20 = 400 components [142]. These 400 components were used to define the next 400 elements in Equation (4); i.e.,

$$\psi_{20+j} = f_j^{(2)} \quad (j = 1, 2, \ldots, 400) \qquad (6)$$

where $f_j^{(2)}$ is the normalized occurrence frequency of the $j$-th dipeptide in the nuclear receptor concerned.

To incorporate the global or long-range sequence-order information, let us consider the following approach. According to molecular evolution, all biological sequences have developed starting out from a very limited number of ancestral samples.
Driven by various evolutionary forces such as mutation, recombination, gene conversion, genetic drift, and selection, these sequences have undergone many changes, including changes of single residues, insertions and deletions of several residues [143], gene doubling, and gene fusion. With the accumulation of these changes over a long period of time, many original similarities between the initial and resultant amino acid sequences have gradually faded, but the corresponding proteins may still share many common attributes [37], such as having basically the same biological function and residing at the same subcellular location [144,145]. To extract the sequential evolution information and use it to define the components of Equation (4), the PSSM (Position Specific Scoring Matrix) was used as described below. According to Schaffer [146], the sequence evolution information of a nuclear receptor protein $\mathbf{P}$ with $L$ amino acid residues can be expressed by a 20 × $L$ matrix, as given by Equation (7). The elements of Equation (7) were generated by using PSI-BLAST [147] to search the UniProtKB/Swiss-Prot database (The Universal Protein Resource (UniProt); http://www.uniprot.org/) through three iterations with 0.001 as the E-value cutoff for multiple sequence alignment against the sequence of the nuclear receptor concerned.
In order to scale every element in Equation (7) from its original score range into the interval [0, 1], we performed a conversion through the standard sigmoid function, $t \mapsto 1/(1 + e^{-t})$, to obtain Equation (8). The useful information was then extracted from Equation (8) to define the next 20 components of Equation (4). Moreover, we used the grey system model approach as elaborated in [68] to further define the next 60 components of Equation (4), indexed by $j = 1, 2, \ldots, 20$. In that equation, $w_1$, $w_2$, and $w_3$ are weight factors, which were all set to 1 in the current study, and $f_j^{(1)}$ has the same meaning as in Equation (5). Combining Equations (5), (6), (10) and (12), we found that the total number of the components obtained via the current approach for the PseAAC of Equation (4) is $\Omega = 20 + 400 + 20 + 60 = 500$, and each of the 500 components is given by Equation (18).

Since the elements in Equations (2) and (4) are well defined, we can now formulate the drug-NR pair by combining the two equations as given by

$$\mathbf{G} = \mathbf{D} \oplus \mathbf{P} = [d_1 \; \cdots \; d_{256} \; \psi_1 \; \cdots \; \psi_{500}] \qquad (19)$$

where $\mathbf{G}$ represents the drug-NR pair, $\oplus$ the orthogonal sum, and the 256 + 500 = 756 components are defined by Equations (2) and (18). For the sake of convenience, let us use $x_i$ ($i = 1, 2, \ldots, 756$) to represent the 756 components in Equation (19); i.e.,

$$\mathbf{G} = [x_1 \; x_2 \; \cdots \; x_i \; \cdots \; x_{756}] \qquad (20)$$

To optimize the prediction quality with a time-saving approach, similar to the treatment in [148–150], let us convert Equation (20) to

$$\mathbf{G} = [y_1 \; y_2 \; \cdots \; y_i \; \cdots \; y_{756}], \qquad y_i = \frac{x_i - \langle x \rangle}{\mathrm{SD}(x)}$$

where the symbol $\langle \; \rangle$ means taking the average of the quantity therein, and SD means the corresponding standard deviation.

In this study, the SVM (support vector machine) was used as the operation engine. SVM has been widely used in the realm of bioinformatics (see, e.g., [62–64,151–154]). The basic idea of SVM is to transform the data into a high-dimensional feature space, and then determine the optimal separating hyperplane using a kernel function. For a brief formulation of SVM and how it works, see the papers [155,156]; for more details about SVM, see a monograph [157].
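Returning to the sample formulation above, the parts that are fully specified in the text (the 20 AAC components of Equation (5), the 400 dipeptide components of Equation (6), the sigmoid scaling of PSSM scores, and the concatenation of Equation (19)) can be illustrated with a minimal Python sketch. Several details are assumptions made only for illustration: the function names, the alphabetical ordering of the amino acid and dipeptide indices, the silent skipping of non-standard residue codes, and the counting of overlapping dipeptides. The grey-model components and the PSSM-derived components are not reproduced here.

```python
import math
from itertools import product

# Assumed index order: the 20 native amino acids, alphabetical by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 entries


def _clean(seq):
    """Upper-case the sequence and drop non-standard residue codes (an assumption)."""
    cleaned = "".join(r for r in seq.upper() if r in AMINO_ACIDS)
    if len(cleaned) < 2:
        raise ValueError("sequence needs at least two standard residues")
    return cleaned


def aac(seq):
    """Normalized occurrence frequency of each amino acid, as in Equation (5)."""
    seq = _clean(seq)
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Normalized frequency of each overlapping dipeptide, as in Equation (6)."""
    seq = _clean(seq)
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1
    total = len(seq) - 1
    return [counts[d] / total for d in DIPEPTIDES]


def sigmoid(score):
    """Standard sigmoid mapping a raw PSSM score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))


def pair_vector(drug_fingerprint, receptor_features):
    """Orthogonal sum of Equation (19), realized as plain concatenation."""
    return list(drug_fingerprint) + list(receptor_features)


# Toy usage: the first 420 receptor components for a made-up sequence.
toy_features = aac("ACCA") + dipeptide_composition("ACCA")
```

With a 256-D fingerprint and a 500-D receptor vector, `pair_vector` returns the 756 components of Equation (20).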
In this study, the LIBSVM package [158], which can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/, was used as the implementation of SVM, and the popular radial basis function (RBF) was taken as the kernel function. For the current SVM classifier, there were two uncertain parameters: the penalty parameter $C$ and the kernel parameter $\gamma$. The method for determining the two parameters will be given later. The predictor obtained via the aforementioned procedure is called iNR-Drug, where "i" means identify, and "NR-Drug" means the interaction between nuclear receptor and drug compound. To provide an intuitive overall picture, a flowchart is provided in Figure 2 to show how the predictor works in identifying the interactions between nuclear receptors and drug compounds.

To provide a more intuitive and easier-to-understand method to measure the prediction quality, the following set of metrics based on the formulation used by Chou [159–161] in predicting signal peptides was adopted. According to Chou's formulation, the sensitivity, specificity, overall accuracy, and Matthews correlation coefficient can be respectively expressed as [62,65–67]

$$\begin{aligned} \mathrm{Sn} &= 1 - \frac{N^{+}_{-}}{N^{+}} \\ \mathrm{Sp} &= 1 - \frac{N^{-}_{+}}{N^{-}} \\ \mathrm{Acc} &= 1 - \frac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}} \\ \mathrm{MCC} &= \frac{1 - \left( \dfrac{N^{+}_{-}}{N^{+}} + \dfrac{N^{-}_{+}}{N^{-}} \right)}{\sqrt{\left( 1 + \dfrac{N^{-}_{+} - N^{+}_{-}}{N^{+}} \right) \left( 1 + \dfrac{N^{+}_{-} - N^{-}_{+}}{N^{-}} \right)}} \end{aligned} \qquad (23)$$

where $N^{+}$ is the total number of the interactive NR-drug pairs investigated, while $N^{+}_{-}$ is the number of the interactive NR-drug pairs incorrectly predicted as non-interactive NR-drug pairs; $N^{-}$ is the total number of the non-interactive NR-drug pairs investigated, while $N^{-}_{+}$ is the number of the non-interactive NR-drug pairs incorrectly predicted as interactive NR-drug pairs. From Equation (23) we can easily see the following. When $N^{+}_{-} = 0$, meaning that none of the interactive NR-drug pairs was mispredicted to be a non-interactive NR-drug pair, we have the sensitivity Sn = 1; while when $N^{+}_{-} = N^{+}$, meaning that all the interactive NR-drug pairs were mispredicted to be non-interactive NR-drug pairs, we have the sensitivity Sn = 0.
Likewise, when $N^{-}_{+} = 0$, meaning that none of the non-interactive NR-drug pairs was mispredicted, we have the specificity Sp = 1; while when $N^{-}_{+} = N^{-}$, meaning that all the non-interactive NR-drug pairs were mispredicted, we have the specificity Sp = 0. When $N^{+}_{-} = N^{-}_{+} = 0$, meaning that no pair in either subset was mispredicted, we have Acc = 1 and MCC = 1; when $N^{+}_{-} = N^{+}$ and $N^{-}_{+} = N^{-}$, meaning that every pair was mispredicted, we have Acc = 0 and MCC = −1, meaning total disagreement between prediction and observation; whereas when $N^{+}_{-} = N^{+}/2$ and $N^{-}_{+} = N^{-}/2$, we have Acc = 0.5 and MCC = 0, meaning that the prediction is no better than a random guess. As we can see from the above discussion, it is much more intuitive and easier to understand when using Equation (23) to examine a predictor for its four metrics, particularly for its Matthews correlation coefficient. It is instructive to point out that the metrics as defined in Equation (23) are valid for single-label systems; for multi-label systems, a set of more complicated metrics should be used, as given in [162].

How to properly test a predictor for its anticipated success rates is very important for its development as well as its potential application value. Generally speaking, the following three cross-validation methods are often used to examine the quality of a predictor and its effectiveness in practical application: the independent dataset test, the subsampling or K-fold (such as five-fold, seven-fold, or 10-fold) crossover test, and the jackknife test [163]. However, as elaborated by a penetrating analysis in [164], considerable arbitrariness exists in the independent dataset test. Also, as demonstrated in [165], the subsampling (or K-fold crossover validation) test cannot avoid arbitrariness either. Only the jackknife test is the least arbitrary, as it always yields a unique result for a given benchmark dataset [73,74,156,166–168]. Therefore, the jackknife test has been widely recognized and increasingly utilized by investigators to examine the quality of various predictors (see, e.g., [14,15,68,99,106,107,124,169,170]). Accordingly, in this study the jackknife test was also adopted to evaluate the accuracy of the current predictor.

As mentioned above, the SVM operation engine contains two uncertain parameters, $C$ and $\gamma$. To find their optimal values, a 2-D grid search was conducted by the jackknife test on the benchmark dataset $\mathbb{S}$.
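The four metrics of Equation (23) and the jackknife (leave-one-out) procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the label 1 is assumed to mark an interactive pair, MCC is returned as 0 by convention when its denominator vanishes, and a simple nearest-neighbor rule stands in for the SVM engine so that the example needs nothing beyond the standard library.

```python
import math


def chou_metrics(n_pos, n_pos_missed, n_neg, n_neg_missed):
    """Sn, Sp, Acc and MCC of Equation (23).

    n_pos / n_neg: totals of interactive / non-interactive pairs (both > 0).
    n_pos_missed / n_neg_missed: how many of each were mispredicted.
    """
    sn = 1 - n_pos_missed / n_pos
    sp = 1 - n_neg_missed / n_neg
    acc = 1 - (n_pos_missed + n_neg_missed) / (n_pos + n_neg)
    denom = math.sqrt((1 + (n_neg_missed - n_pos_missed) / n_pos)
                      * (1 + (n_pos_missed - n_neg_missed) / n_neg))
    numer = 1 - (n_pos_missed / n_pos + n_neg_missed / n_neg)
    mcc = numer / denom if denom else 0.0  # convention when MCC is undefined
    return sn, sp, acc, mcc


def jackknife_counts(samples, labels, fit_predict):
    """Leave each sample out once, train on the rest, and count mispredictions."""
    n_pos_missed = n_neg_missed = 0
    for k, (query, truth) in enumerate(zip(samples, labels)):
        train_x = samples[:k] + samples[k + 1:]
        train_y = labels[:k] + labels[k + 1:]
        predicted = fit_predict(train_x, train_y, query)
        if truth == 1 and predicted != 1:
            n_pos_missed += 1
        elif truth != 1 and predicted == 1:
            n_neg_missed += 1
    n_pos = labels.count(1)
    return n_pos, n_pos_missed, len(labels) - n_pos, n_neg_missed


def nearest_neighbor(train_x, train_y, query):
    """Stand-in classifier (NOT the SVM of the paper): label of the closest sample."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, query)) for x in train_x]
    return train_y[dists.index(min(dists))]


# Toy data: two well-separated clusters, so every left-out sample is recovered.
toy_samples = [[0.0], [0.1], [1.0], [1.1]]
toy_labels = [1, 1, 0, 0]
toy_result = chou_metrics(*jackknife_counts(toy_samples, toy_labels, nearest_neighbor))
```

Because every sample is left out exactly once, the counts, and hence the four metrics, are uniquely determined by the dataset and the classifier. A grid search would repeat `jackknife_counts` for each candidate pair of parameter values and keep the pair with the best overall accuracy.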
The results of this grid search are shown in Figure 3, from which it can be seen that the iNR-Drug predictor reaches its optimal status when $C = 2^{3}$ and $\gamma = 2^{-9}$. The corresponding rates for the four metrics (cf. Equation (23)) are given in Table 1, where, to facilitate comparison, the overall accuracy Acc reported by He et al. [59] on the same benchmark dataset is also given, although no results were reported by them for Sn, Sp and MCC. It can be observed from the table that the overall accuracy obtained by iNR-Drug is remarkably higher than that of He et al. [59], and that the rates achieved by iNR-Drug for the other three metrics are also quite high. These facts indicate that the current predictor not only can yield higher overall prediction accuracy but also is quite stable, with low false prediction rates.

As mentioned above (Section 3.2), the jackknife test is the most objective method for examining the quality of a predictor. However, as a demonstration of how to use the current predictor in practice, we took 41 NR-drug pairs from the study by Yamanishi et al. [171] that had been confirmed by experiments as interactive pairs. For this independent dataset, 34 pairs were correctly identified by iNR-Drug as interactive pairs, i.e., Sn = 34/41 = 82.93%, which is quite consistent with the rate of 79.07% achieved by the predictor on the benchmark dataset via the jackknife test as reported in Table 1.

It is anticipated that the iNR-Drug predictor developed in this paper may become a useful high-throughput tool for both basic research and drug development, and that the current approach may be easily extended to study the interactions of drugs with other targets as well. Since user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful predictors [98,172], a publicly accessible web-server for iNR-Drug was established.
For the convenience of the vast majority of biologists and pharmaceutical scientists, we provide here a step-by-step guide showing how users can easily get the desired result by using the iNR-Drug web-server, without the need to follow the complicated mathematical equations presented in this paper for the process of developing the predictor and its integrity.

Step 1. Open the web server at the site http://www.jci-bioinfo.cn/iNR-Drug/ and you will see the top page of the predictor on your computer screen, as shown in Figure 4. Click on the Read Me button to see a brief introduction to the iNR-Drug predictor and the caveats when using it.

Step 2. Either type or copy/paste the query NR-drug pairs into the input box at the center of Figure 4. Each query pair consists of two parts: one is for the nuclear receptor sequence, and the other for the drug. The NR sequence should be in FASTA format, while the drug should be given as its KEGG code beginning with the symbol #. Examples of the query pair input and the corresponding output can be seen by clicking on the Example button right above the input box.

Step 3. Click on the Submit button to see the predicted result. For example, if you use the three query pairs in the Example window as the input, after clicking the Submit button you will see on your screen that the "hsa:2099" NR and the "D00066" drug are an interactive pair, and that the "hsa:2908" NR and the "D00088" drug are also an interactive pair, but that the "hsa:5468" NR and the "D00279" drug are not an interactive pair. All these results are fully consistent with the experimental observations. It takes about 3 minutes before each of these results is shown on the screen; of course, the more query pairs there are, the more time is usually needed.

Step 4. Click on the Citation button to find the relevant paper that documents the detailed development and algorithm of iNR-Drug.

Step 5.
Click on the Data button to download the benchmark dataset used to train and test the iNR-Drug predictor.

Step 6. The program code is also available by clicking the Download button on the lower panel of Figure 4.

Nuclear receptors in cell life and death Nuclear Receptors Nuclear Receptors: Current Concepts and Future Challenges The nuclear receptor superfamily Non-steroid nuclear receptors: What are genetic studies telling us their role in renal life? Nuclear receptor drug discovery Nuclear receptors and drug disposition gene regulation A web-server for identifying G-protein coupled receptors and their families with grey incidence analysis Bioinformatical analysis of G-protein-coupled receptors iGPCR-Drug: A web server for predicting interaction between GPCRs and drugs in cellular networking A cellular automaton image approach for predicting G-protein-coupled receptor functional classes GPCR-2L: Predicting G protein-coupled receptors and their types by hybridizing two different modes of pseudo amino acid compositions Prediction of G-protein-coupled receptor classes in low homology using Chou's pseudo amino acid composition with approximate entropy and hydrophobicity patterns Prediction of G-protein-coupled receptor classes based on the concept of Chou's pseudo amino acid composition: An approach from discrete wavelet transform Using ensemble SVM to identify human GPCRs N-linked glycosylation sites based on the general form of Chou's PseAAC Identifying GPCRs and their types with Chou's pseudo amino acid composition: An approach from multi-scale energy representation and position specific scoring matrix Prediction of G-protein-coupled receptor classes Identify the channel-drug interaction in cellular networking with PseAAC and molecular fingerprints Insights from modelling three-dimensional structures of the human potassium and sodium channels Influenza M2 proton channels A Model of the complex between cyclin-dependent kinase 5 (Cdk5) and the activation domain of
neuronal Cdk5 activator Rapid and accurate structure determination of coiled-coil domains using NMR dipolar couplings: Application to cGMP-dependent protein kinase Ialpha The three-dimensional structure of the cGMP-dependent protein kinase I-α leucine zipper domain and its interaction with the myosin binding subunit Determination of the packing mode of the coiled-coil domain of cGMP-dependent protein kinase Ialpha in solution using charge-predicted dipolar couplings A guide to drug discovery: Target selection in drug discovery Target discovery A fast flexible docking method using an incremental construction algorithm Binding mechanism of coronavirus main proteinase with ligands and its implication to drug design against SARS NMR studies on how the binding complex of polyisoprenol recognition sequence peptides and polyisoprenols can modulate membrane structure Review: Progress in computational approach to drug development against SARS Molecular modelling and chemical modification for finding peptide inhibitor against SARS CoV Mpro An in-depth analysis of the biological functional studies based on the NMR M2 channel structure of influenza A virus Energetic analysis of the two controversial drug binding sites of the M2 proton channel in influenza A virus Investigation into adamantane-based M2 inhibitors with FB-QSAR Designing inhibitors of M2 proton channel against H1N1 swine influenza virus Insights from investigating the interaction of oseltamivir (Tamiflu) with neuraminidase of the 2009 H1N1 swine flu virus Review: Structural bioinformatics and its impact to biomedical science Identification of proteins interacting with human SP110 during the process of viral infections Docking and molecular dynamics study on the inhibitory activity of novel inhibitors on epidermal growth factor receptor (EGFR) Novel inhibitor design for hemagglutinin against H1N1 influenza virus by core hopping method Design novel dual agonists for treating type-2 diabetes by targeting peroxisome 
The authors would like to express their gratitude to the three anonymous reviewers, whose constructive comments are very helpful for strengthening the presentation of the paper. The authors declare no conflict of interest.