key: cord-0036193-8mjrrlni authors: Yang, Zheng Rong; Young, Natasha title: Bio-kernel Self-organizing Map for HIV Drug Resistance Classification date: 2005 journal: Advances in Natural Computation DOI: 10.1007/11539087_20 sha: d8f8d47f8aef5e697a4550b91d2917198016c50e doc_id: 36193 cord_uid: 8mjrrlni
The kernel self-organizing map has recently been studied by Fyfe and his colleagues [1]. This paper investigates the use of a novel bio-kernel function for the kernel self-organizing map. For verification, the proposed kernel self-organizing map is applied to HIV drug resistance classification using mutation patterns in protease sequences. The original self-organizing map, combined with the distributed encoding method, was used for comparison. It has been found that the use of the kernel self-organizing map with the novel bio-kernel function leads to better classification and a faster convergence rate ...
In analysing molecular sequences, we need to select a proper feature extraction method which can convert the non-numerical attributes in sequences to numerical features prior to using a machine learning algorithm. Suppose we denote by $x$ a sequence and by $\phi(x)$ a feature extraction function that maps the sequence to a numerical feature vector. Finding an appropriate feature extraction approach is a nontrivial task. It is known that each protein sequence is an ordered list of residues drawn from 20 amino acids, while a DNA sequence is an ordered list drawn from four nucleic acids. Both amino acids and nucleic acids are non-numerical attributes, so they must be converted to numerical attributes through a feature extraction process before a machine learning algorithm can be used. The distributed encoding method [2] was proposed in 1988 for extracting features from molecular sequences. The principle is to find orthogonal binary vectors to represent amino (nucleic) acids.
With this method, the amino acid Alanine is represented by 0000000000 0000000001 while Cysteine is represented by 0000000000 0000000010, etc. With the introduction of this feature extraction method, the application of machine learning algorithms to bioinformatics has been very successful. For instance, this method has been applied to the prediction of protease cleavage sites [3], signal peptide cleavage sites [4], linkage sites in glycoproteins [5], enzyme active sites [6], phosphorylation sites [7] and water active sites [8]. However, as indicated in earlier work [9], [10], [11], such a method has inherent limits in two respects. First, the dimension of the input space is enlarged 20 times, weakening the significance of a set of training data. Second, the biological content in a molecular sequence may not be efficiently coded. This is because the similarity between any pair of different amino (nucleic) acids varies, while the distance between the encoded orthogonal vectors of two different amino (nucleic) acids is fixed.
The second method for extracting features from protein sequences is to calculate the frequency of each amino acid in a sequence. It has been used for the prediction of membrane protein types [12], the prediction of protein structural classes [13], subcellular location prediction [14] and the prediction of secondary structures [15]. However, the method ignores the coupling effects among neighbouring residues in sequences, leading to potential bias in modelling. Therefore, the di-peptide method was proposed, where the frequency with which each pair of amino acids occurs as neighbouring residues is counted and regarded as a feature. Di-peptides, gapped (up to two gaps) transitions and the occurrence of some motifs as additive numerical attributes were used for the prediction of subcellular locations [16] and gene identification [17].
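The two feature extraction schemes described above can be illustrated with a minimal Python sketch. The residue ordering in `AMINO_ACIDS`, the helper names and the example peptide are assumptions made here for illustration; in particular, which bit position is assigned to which residue is arbitrary and need not match the bit patterns quoted in the text.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids in one-letter code. The ordering is an
# assumption; any fixed ordering gives an equivalent orthogonal encoding.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def distributed_encoding(sequence):
    """Concatenate one 20-bit orthogonal (one-hot) vector per residue."""
    bits = []
    for residue in sequence:
        one_hot = [0] * len(AMINO_ACIDS)
        one_hot[AMINO_ACIDS.index(residue)] = 1
        bits.extend(one_hot)
    return bits


def composition(sequence):
    """Relative frequency of each amino acid (20 features)."""
    counts = Counter(sequence)
    return [counts[a] / len(sequence) for a in AMINO_ACIDS]


def dipeptide_frequency(sequence):
    """Relative frequency of each ordered pair of neighbouring residues."""
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    counts = Counter(pairs)
    total = max(len(pairs), 1)  # avoid division by zero for 1-residue input
    return [counts[a + b] / total for a, b in product(AMINO_ACIDS, repeat=2)]


def hamming(u, v):
    """Number of positions at which two equal-length bit vectors differ."""
    return sum(a != b for a, b in zip(u, v))


peptide = "ACDA"  # hypothetical 4-residue peptide
print(len(distributed_encoding(peptide)))  # 4 residues x 20 bits each
print(len(dipeptide_frequency(peptide)))   # 20 * 20 ordered pairs

# The encoded distance between two different residues is the same
# whichever pair is chosen, regardless of biological similarity.
print(hamming(distributed_encoding("A"), distributed_encoding("C")))
print(hamming(distributed_encoding("A"), distributed_encoding("W")))
```

The last two lines make the second limitation concrete: every pair of different residues is equally far apart after encoding, so the encoding carries no information about which substitutions are biologically mild.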
Descriptors were also used, for instance, to predict multi-class protein folds [18], to classify proteins [19] and to recognise rRNA-, RNA-, and DNA-binding proteins [20], [21]. To take into account the high-order interactions among residues, multi-peptides can also be used. There are 400 di-peptides, 8,000 tri-peptides and 160,000 tetra-peptides, so such a feature space can be computationally impractical for modelling.
The third class of methods uses profile measurement. A profile of a sequence can be generated by subjecting it to a homology alignment method or Hidden Markov Models (HMMs) [22], [23], [24], [25].
It can be seen that either finding an appropriate approach to define $\phi(x)$ is difficult, or the defined approach may lead to a very large dimension, i.e., $d \to \infty$. If an approach which can quantify the distance or similarity between two molecular sequences is available, an alternative learning method can be proposed to avoid the difficulty of searching for a proper and efficient feature extraction method. This means that we can define a reference system to quantify the distance among molecular sequences. With such a reference system, all sequences are quantitatively featured by measuring their distance or similarity to the reference sequences.
One of the important issues in using machine learning algorithms for analysing molecular sequences is investigating the sequence distribution or visualising the sequence space. The self-organizing map (SOM) [26] has been one of the most important machine learning algorithms for this purpose. For instance, SOM has been employed to identify motifs and families in the context of unsupervised learning [27], [28], [29], [30], [31]. SOM has also been used for partitioning gene data [32]. In these applications, feature extraction methods like the distributed encoding method were used.
In order to enable SOM to deal with complicated applications where feature extraction is difficult, the kernel method has recently been introduced by Fyfe and his colleagues [1]. Kernel methods were first used in cluster analysis for K-means algorithms [33], where the Euclidean distance between an input vector $x$ and a mean vector $m$ is minimized in a feature space spanned by kernels. In the kernel feature space, both $x$ and $m$ are expansions on the training data. Fyfe and his colleagues developed so-called kernel self-organizing maps [34], [35]. This paper aims to introduce a bio-kernel function for the kernel SOM. The method is verified on HIV drug resistance classification. A stochastic learning process is used with a regularization term.
Consider a training data set of sequences whose residues take values in $S$ ($S$ is a set of possible values, and $|S|$ can be either definite or indefinite), together with a mapping function which maps a sequence to a numerical feature vector. In most situations, [...]. In designing the bio-kernel machine, a key issue is the design of an appropriate kernel function for analysing protein or DNA sequences. As in [9], [10], [11], we use the bio-basis function as the bio-kernel function, where $x$ is a training sequence and $b_i$ is a basis sequence, both having $D$ residues. Here $x_d$ and $b_{id}$ are the $d$th residues of the two sequences, and the similarity between two residues can be found in a mutation matrix [36], [37]. The bio-basis function has been successfully used for the prediction of Trypsin cleavage sites [8], HIV cleavage sites [9], signal peptide cleavage sites [10], Hepatitis C virus protease cleavage sites [38], disordered proteins [39], [40], phosphorylation sites [41], O-linkage sites in glycoproteins [42], Caspase cleavage sites [43], SARS-CoV protease cleavage sites [44] and signal peptides [45].
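The explicit expression for the bio-basis function is not reproduced in this text. As a hedged illustration, the sketch below assumes the form commonly given in the earlier bio-basis function work, $K(x, b_i) = \exp\left(\gamma\,\frac{s(x, b_i) - s(b_i, b_i)}{s(b_i, b_i)}\right)$ with $s(x, b_i) = \sum_{d=1}^{D} M(x_d, b_{id})$, where $M$ is a residue similarity score. The scores in `residue_similarity` are a toy stand-in invented for this example, not values from a real mutation matrix, and `gamma` is a hypothetical scaling parameter.

```python
import math

# Toy residue similarity, NOT a real mutation matrix: identical residues
# score 2, a few hand-picked "similar" pairs score 1, all others score -1.
SIMILAR_PAIRS = {frozenset("IL"), frozenset("DE"), frozenset("KR")}


def residue_similarity(a, b):
    """Stand-in for a mutation-matrix lookup M(a, b)."""
    if a == b:
        return 2
    if frozenset((a, b)) in SIMILAR_PAIRS:
        return 1
    return -1


def sequence_similarity(x, b):
    """s(x, b): sum of residue similarities over the D aligned positions."""
    if len(x) != len(b):
        raise ValueError("sequences must have the same number of residues")
    return sum(residue_similarity(xd, bd) for xd, bd in zip(x, b))


def bio_basis(x, b, gamma=1.0):
    """Assumed bio-basis form: exp(gamma * (s(x, b) - s(b, b)) / s(b, b))."""
    s_bb = sequence_similarity(b, b)
    return math.exp(gamma * (sequence_similarity(x, b) - s_bb) / s_bb)


basis = "DIKL"                    # hypothetical basis sequence
print(bio_basis("DIKL", basis))   # identical sequence gives the maximum, 1.0
print(bio_basis("DLKL", basis))   # one conservative substitution (I -> L)
print(bio_basis("DWKL", basis))   # one dissimilar substitution (I -> W)
```

Under these assumptions the kernel value is 1 for a sequence identical to the basis sequence and decays as substitutions become less similar, which is the property that lets non-numerical sequences be compared without an explicit feature vector.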
Drug resistance is a widespread phenomenon, and modeling it is a very important issue in medicine. In computer-aided drug design, it is desirable to study how genomic information is related to therapy effect [46]. Predicting whether an HIV drug may fail in therapy, using the information contained in viral protease sequences, is regarded as a genotype-phenotype correlation problem. Many researchers have worked on discovering such relationships. For instance, the original self-organizing map was used on two types of data, i.e., structural information and sequence information [46]. In using sequence information, frequency features were used as the inputs to SOM. The prediction accuracy was between 68% and 85%. Instead of neural networks, statistical methods and decision trees were also used [47], [48], [49].
Data (46 mutation patterns) were obtained from [50]. Based on this data set, the bio-kernel SOM was run using different values of the regularization factor. The original SOM was also used for comparison. Both SOMs used the same structure (36 output neurons) and the same learning parameters, i.e., the initial learning rate ($\eta_h = 0.01$). Both algorithms were terminated when the mean square error was less than 0.001 or after 1000 learning iterations. Fig. 1 shows the error curves for the two SOMs. It can be seen that the bio-kernel SOM (bkSOM) converged much faster, with very small errors. Fig. 2 shows a map of bkSOM, where "n.a." means that no patterns are mapped onto the corresponding output neuron, "5:5" means that all five patterns mapped onto the corresponding neuron are mutation patterns which are resistant to the drug, and "0:9" means that all nine patterns mapped onto the corresponding neuron are mutation patterns which are not resistant to the drug. Table 1 shows the comparison in terms of classification accuracy, where "NR" means non-resistance and "R" resistance.
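The neuron labels used in the map description above ("n.a.", "5:5", "0:9") can be produced from winner assignments with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `label_map`, the flat neuron indexing and the example assignments are hypothetical, and the data below are invented to reproduce the two labels quoted in the text rather than taken from the paper's experiment.

```python
from collections import defaultdict


def label_map(winners, resistant, n_neurons):
    """Return one label per neuron: 'n.a.' or '<resistant>:<total>'.

    winners[i]   -- index of the output neuron that pattern i maps onto
    resistant[i] -- True if pattern i is resistant to the drug
    """
    counts = defaultdict(lambda: [0, 0])  # neuron -> [resistant, total]
    for neuron, is_resistant in zip(winners, resistant):
        counts[neuron][0] += int(is_resistant)
        counts[neuron][1] += 1

    labels = []
    for neuron in range(n_neurons):
        if neuron not in counts:
            labels.append("n.a.")  # no pattern mapped onto this neuron
        else:
            n_resistant, total = counts[neuron]
            labels.append(f"{n_resistant}:{total}")
    return labels


# Hypothetical assignments: five resistant patterns on neuron 0 and nine
# non-resistant patterns on neuron 3 of a 36-neuron map.
winners = [0] * 5 + [3] * 9
resistant = [True] * 5 + [False] * 9
labels = label_map(winners, resistant, n_neurons=36)
print(labels[0], labels[3], labels[1])
```

A neuron whose label has equal or zero resistant counts relative to its total is pure, so a map dominated by such labels indicates that the two classes are well separated.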
It can be seen that bkSOM performed better than SOM in terms of classification accuracy. The non-resistance prediction power indicates the likelihood that a predicted non-resistance pattern is a true non-resistance pattern. Likewise, the resistance prediction power indicates the likelihood that a predicted resistance pattern is a true resistance pattern. For instance, the non-resistance prediction power using SOM is 90%. This means that for every 100 predicted non-resistance patterns, 10 would actually be resistance patterns.
This paper has presented a novel method, referred to as the bio-kernel self-organizing map (bkSOM), which embeds the bio-kernel function into the kernel self-organizing map for the purpose of modeling protein sequences. The basic principle of the method is to use the "kernel trick" to avoid tedious feature extraction work for protein sequences, which has proven to be a non-trivial task. The computational simulation on the HIV drug resistance classification task has shown that bkSOM outperformed SOM in two respects: convergence rate and classification accuracy.
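The prediction power defined in the discussion of the results can be computed directly from true and predicted labels. The following minimal Python sketch assumes labels are the strings "NR" and "R"; the function name and the example data are hypothetical, and the data are constructed only to mirror the 90% worked example, not to reproduce the values in Table 1.

```python
def prediction_power(true_labels, predicted_labels, target):
    """Fraction of patterns predicted as `target` whose true label is `target`."""
    predicted = sum(1 for p in predicted_labels if p == target)
    if predicted == 0:
        raise ValueError(f"no pattern was predicted as {target!r}")
    hits = sum(
        1
        for t, p in zip(true_labels, predicted_labels)
        if p == target and t == target
    )
    return hits / predicted


# Hypothetical data: 100 patterns predicted "NR", of which 10 are truly "R",
# plus 20 patterns correctly predicted "R".
true_labels = ["NR"] * 90 + ["R"] * 10 + ["R"] * 20
predicted_labels = ["NR"] * 100 + ["R"] * 20

print(prediction_power(true_labels, predicted_labels, "NR"))  # 90 of 100
print(prediction_power(true_labels, predicted_labels, "R"))   # 20 of 20
```

In standard terminology these two quantities are the per-class precision values, which complement the overall classification accuracy when the classes are unbalanced.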
[1] Relevance and kernel self-organising maps
[2] Predicting the secondary structure of globular proteins using neural network models
[3] Neural network prediction of the HIV-1 protease cleavage sites
[4] Reliable prediction of T-cell epitopes using neural networks with novel sequence representations
[5] Prediction of O-glycosylation of mammalian proteins: specificity patterns of UDP-GalNAc:polypeptide N-acetylgalactosaminyltransferase
[6] Using a neural network and spatial clustering to predict the location of active sites in enzymes
[7] Sequence and structure based prediction of eukaryotic protein phosphorylation sites
[8] Prediction of protein hydration sites from sequence by modular neural networks
[9] Characterising proteolytic cleavage site activity using bio-basis function neural networks
[10] A novel neural network method in mining molecular sequence data
[11] Orthogonal kernel machine in prediction of functional sites in proteins
[12] Application of SVMs to predict membrane protein types
[13] Prediction of protein structural classes by support vector machines
[14] Support vector machine approach for protein subcellular localization prediction
[15] Cancer diagnosis and protein secondary structure prediction using support vector machines
[16] Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs
[17] A computational approach to identify genes for functional RNAs in genomic sequences
[18] Multi-class protein fold recognition using support vector machines and neural networks
[19] Protein function classification via support vector machine approach
[20] Support vector machines for predicting rRNA-, RNA-, and DNA-binding proteins from amino acid sequence
[21] Conserved codon composition of ribosomal protein coding genes in Escherichia coli, Mycobacterium tuberculosis and Saccharomyces cerevisiae: lessons from supervised machine learning in functional genomics
[22] Using the Fisher kernel method to detect remote protein homologies
[23] A discriminative framework for detecting remote protein homologies
[24] Classifying G-protein coupled receptors with support vector machines
[25] Combining protein secondary structure prediction models with ensemble methods of optimal complexity
[26] Self organization and associative memory
[27] Identification of a new motif on nucleic acid sequence data using Kohonen's self-organising map
[28] Efficient recognition of immunoglobulin domains from amino acid sequences using a neural network
[29] Topological maps of protein sequences
[30] Self-organising tree growing network for classifying amino acids
[31] A hybrid method to cluster protein sequences based on statistics and artificial neural networks
[32] Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation
[33] The kernel trick for distances
[34] A kernel method for classification
[35] Epsilon-insensitive Hebbian learning
[36] A model of evolutionary change in proteins: matrices for detecting distant relationships. Atlas of Protein Sequence and Structure
[37] A structural basis for sequence comparisons: an evaluation of scoring methodologies
[38] Reduced bio-basis function neural networks for protease cleavage site prediction
[39] Predict disordered proteins using bio-basis function neural networks
[40] RONN: use of the bio-basis function neural network technique for the detection of natively disordered regions in proteins
[41] Reduced bio-basis function neural network for identification of protein phosphorylation sites: comparison with pattern recognition algorithms
[42] Bio-basis function neural networks for the prediction of the O-linkage sites in glyco-proteins
[43] Prediction of Caspase cleavage sites using Bayesian bio-basis function neural networks
[44] Mining SARS-CoV protease cleavage data using decision trees, a novel method for decisive template searching
[45] Predict signal peptides using bio-basis function neural networks
[46] Predicting HIV drug resistance with neural networks
[47] Geno2pheno: estimating phenotypic drug resistance from HIV-1 genotypes. NAR
[48] Diversity and complexity of HIV-1 drug resistance: a bioinformatics approach to predicting phenotype from genotype
[49] Comparative evaluation of three computerized algorithms for prediction of antiretroviral susceptibility from HIV type 1 genotype
[50] Analysis of the protease sequences of HIV-1 infected individuals after Indinavir monotherapy