Tomasz LATKOWSKI(1), Stanisław OSOWSKI(1,2)
Military University of Technology (1), Warsaw University of Technology (2)

Feature selection methods in application to gene expression: autism data

Abstract. The paper presents the application of several different feature selection methods for recognizing the most significant genes and gene sequences (treated as features) stored in a gene expression microarray dataset related to autism. The outcomes of each method have been examined by analyzing the gene expression profiles of the selected genes. In the next step, fusion of the most relevant features selected by the different methods has been implemented. The optimal number of features has been defined as the set providing the best clustering purity.

Streszczenie. The paper investigates selected methods of diagnostic feature selection aimed at extracting the most significant gene sequences from a gene expression microarray related to autism. The expression levels of the selected features have been analyzed. In the next stage the selected features have been fused. The optimal feature set has been determined on the basis of the purity of the clustering space. (Feature selection methods in application to gene expression: the autism database).

Keywords: feature selection, gene expression microarrays, clustering, autism.
Słowa kluczowe: feature selection, gene expression microarrays, clustering, autism.

doi:10.12915/pe.2014.08.47

Introduction
The gene expression microarray is a sophisticated technique used in molecular biology. DNA microarrays are mainly applied for analyzing the expression of genes in specific cells at a given time and under certain conditions [4]. The technique generates a huge number of features (genes and gene sequences), which create a large information dataset. The main problem from the data mining point of view is the limited number of observations relative to the very large number of gene expressions: the number of observations is usually in the hundreds, while the number of genes is in the tens of thousands. Figure 1 shows the organization scheme of a DNA microarray.

Fig. 1. DNA microarray organization in the form of a matrix.

Each row of the matrix in Fig. 1 corresponds to one observation (patient) and each column to one gene or gene sequence (feature). Because of the large imbalance between the numbers of genes and patients, the selection is an ill-conditioned problem. Moreover, data stored in medical databases are typically noisy and some gene sequences have large variance [19]. This makes gene selection in DNA microarrays a very difficult task.

Progress in bioengineering and data mining observed in recent years has created solid foundations for discovering the genes that are best associated with a particular disease. Data analysis of microarrays is widely examined in the literature [2,9-11,13,18-22], starting with the pioneering Golub investigation in 1999 [7]. This area of science is intensively studied because it is closely connected with early disease prediction (especially of tumors). It is thought that some diseases, e.g., autism, breast cancer or leukemia, are reflected in genetic changes [1]. From the practical point of view, biologists need to identify only a small number of the most significant genes that can be used as biomarkers in tracing the disease.
In the next step, these selected markers can be observed in order to understand or find their association with the disease.

Present data mining methods allow the gene selection problem to be viewed from many points of view. It is known that selection algorithms may provide different results for different datasets (diseases) [19]. This makes the selection issue more complicated, because a model developed for one problem may not be suitable for another dataset. Many publications have examined the gene selection problem using only one ranking method. Taking into account that DNA microarrays pose an ill-conditioned problem, such an approach may not provide the globally optimal result. Moreover, applying different methods to one dataset may give various outcomes [20]. In this paper we follow the direction pointed out in [20] and present a more complex approach based on using several methods simultaneously. The results of the individual selection methods are fused, leading to the final set of genes. The application of several methods gives the opportunity to look at the selection problem from different points of view. In this way we increase the probability of proper selection of the most important genes. The developed model is verified on a publicly available database related to autism. The expression levels of the selected genes have been examined and compared to randomly chosen ones. The cluster purity idea has been applied to obtain the optimal number of the most relevant genes. The results of selection have been illustrated graphically using the PCA mapping.

Dataset description
This paper presents the problem of gene selection applied to a dataset related to autism. The database is publicly available and was downloaded from the GEO (NCBI) repository [23]. Autism is a severe neurodevelopmental disorder with characteristic social and communication deficits and ritualistic or repetitive behaviors [1]. Many etiologies have been suggested and numerous risk factors identified. Autism is associated with a high degree of heritability. Few specific genetic mutations have been identified, accounting for a minority of cases [1], while the majority of cases are considered sporadic.

The number of observations in this dataset is 146 and the number of genes is 54613. The database consists of two classes: the first one is related to children with autism (n=82) and the second to control (healthy) children (n=64). Blood draws for all subjects were done between the spring and summer of 2004. Total RNA was extracted for microarray experiments with Affymetrix Human U133 Plus 2.0 3' Expression Arrays. The important challenge is to find a small subset of genes with good class discriminative ability. This problem is resolved by using several gene selection methods combined into one system.
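As an illustration of the data organization described above, the following minimal sketch loads the expression matrix and class labels in MATLAB. The file name is a hypothetical placeholder; the sketch assumes the GEO series matrix has been exported to CSV with one row per patient (autism cases first) and one column per gene probe.

% Minimal loading sketch (assumption: the GEO series matrix has been
% exported to the hypothetical file 'autism_expression.csv').
X = readmatrix('autism_expression.csv');   % 146 x 54613 expression matrix
y = [ones(82,1); 2*ones(64,1)];            % class labels: 1 = autism, 2 = control
X1 = X(y == 1, :);                         % observations of the autism class
X2 = X(y == 2, :);                         % observations of the control class

The matrices X1 and X2 (genes in columns) and the label vector y are reused in the method sketches below.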
Feature selection methods
Feature selection is an important operation in processing the data stored in gene microarrays. The most relevant genes (treated as features) increase our understanding of the mechanism of disease formation and allow the potential danger of being affected by the disease to be predicted. The application of feature selection methods allows a small number of important genes to be identified that can be used as biomarkers of the appropriate disease. In this paper, different feature selection methods will be examined and integrated into the final system. Using a set of methods will increase the probability of finding the globally optimal set of genes, i.e., those best associated with the particular disease.

The paper will study the following methods: Fisher discriminant analysis, the ReliefF algorithm, the two-sample t-test, the Kolmogorov-Smirnov test, the Kruskal-Wallis test, the stepwise regression method, feature correlation with a class, and the multi-input linear Support Vector Machine network. These methods base their operation on different principles and thanks to this allow the selection problem to be viewed from different points of view. The following sections present a short description of each method used in the investigation.

Fisher discriminant analysis
In the Fisher approach, the greatest weight is assigned to the feature characterized by a large difference of the mean values in the two studied classes and small standard deviations within each class. The two-class discrimination measure of the feature f is defined in the form [14]:

(1)  $S_{12}(f) = \frac{|c_1 - c_2|}{\sigma_1 + \sigma_2}$

where c1 and c2 represent the mean values of the feature for classes 1 and 2, respectively, while σ1 and σ2 are the corresponding standard deviations. A large value of S12(f) indicates good class discriminative ability of the feature. On the other hand, a small value is an indication of the insignificance of the feature in the recognition of these two classes.

ReliefF algorithm
The ReliefF algorithm ranks the features according to their correlation with the observed class. It can be applied to incomplete and noisy data. According to the reported results [15], it represents an important approach to gene selection. The main idea of the ReliefF algorithm is to estimate the quality of the features according to how well their values distinguish between observations that are near to each other. ReliefF randomly selects an instance Ri and then searches for k of its nearest neighbors from the same class, called nearest hits Hj, and also k nearest neighbors from each of the different classes, called nearest misses Mj(C). It updates the quality estimation W[A] for all attributes A depending on their values for Ri, the hits Hj and the misses Mj(C). If instances Ri and Hj have different values of the attribute A, then attribute A separates two instances of the same class, which is not desirable, so the quality estimation W[A] is decreased. If instances Ri and Mj have different values of the attribute A, then this attribute separates two instances of different classes, which is desirable, so the quality estimation W[A] is increased. The algorithm averages the contributions of all hits and misses. The contribution for each class of misses is weighted with the prior probability of that class, P(C), estimated from the training set. The process is repeated m times, where m is a user-defined parameter. A detailed description of the procedure can be found in [15].
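A minimal sketch of these two rankings, under the assumptions of the loading sketch above; the number of neighbors k = 10 passed to relieff is an arbitrary, user-defined choice, not a value stated in the paper.

% Fisher score (eq. 1) computed for every gene at once; mean and std
% operate column-wise, so S12 is a 1 x 54613 vector of scores.
S12 = abs(mean(X1) - mean(X2)) ./ (std(X1) + std(X2));
[~, orderFD] = sort(S12, 'descend');   % largest score = best discriminator
topFD = orderFD(1:100);                % the 100 most relevant genes

% ReliefF ranking (Statistics and Machine Learning Toolbox);
% k = 10 nearest neighbors is an assumed parameter value.
orderRF = relieff([X1; X2], y, 10);
topRF = orderRF(1:100);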
Two-sample t-test
The next selection method used is the two-sample t-test. One explicit assumption of the t-test is that each of the two compared populations follows a normal distribution. The null hypothesis of the t-test is that the data in classes 1 and 2 are independent random samples from normal distributions with equal means and equal but unknown variances, against the alternative hypothesis that the means are not equal. The test statistic is formulated in the form

(2)  $t = \frac{c_1 - c_2}{\sqrt{\dfrac{\sigma_1^2}{n} + \dfrac{\sigma_2^2}{m}}}$

where n and m represent the sample sizes of both classes [12]. The two-sample t-test is implemented in MATLAB as the ttest2 function [12]. The test returns h, which equals 1 or 0. The value of 1 indicates rejection of the null hypothesis at the 5% significance level, while h=0 indicates a failure to reject the null hypothesis at the same significance level. The function also returns the p-value of the test. A low value of p indicates that the compared populations are significantly different. Figure 2 illustrates the histogram of the distribution of values of a randomly selected gene for the autism and reference classes.

Fig. 2. The histograms and their Gaussian approximations for a randomly selected gene representing the two classes (autism and controls).

In the above case, the null hypothesis was rejected at the 5% significance level (h=1). It means that the distributions of these classes differ, so this randomly selected gene can be accepted as a candidate for a relevant feature. Checking the normality condition of the gene distributions, we found that it was fulfilled in about 80% of cases.

Kolmogorov-Smirnov test
Another statistical feature selection method applied in the research was the Kolmogorov-Smirnov (KS) test. It determines whether two samples are drawn from the same population [12]. The null hypothesis is that both classes are drawn from the same continuous distribution; the alternative hypothesis is that they are drawn from different distributions. The KS test statistic is based on the relation

(3)  $KS = \max_x \left| F_1(x) - F_2(x) \right|$

where F1(x) and F2(x) are the cumulative distributions of the samples of feature f belonging to classes 1 and 2. A high value of this coefficient indicates that the feature has good class discriminative ability. On the other hand, a small value of this factor indicates that the feature should be rejected at the selection stage. Figure 3 illustrates the result of the KS test for a randomly selected gene.

Fig. 3. The cumulative distribution functions of the two classes (autism and controls) for a randomly selected gene in the Kolmogorov-Smirnov test.

From the discriminative point of view this gene is not significant (p=0.8136) and should not be taken into account in the further selection procedure.

Kruskal-Wallis test
In this method the medians of the samples are compared, but the test uses the ranks of the data rather than their numeric values. It finds the ranks by ordering the data from the smallest to the largest value across all groups and taking the numeric index of this ordering. The Kruskal-Wallis test does not make any assumptions about normality; it assumes only that the observations in each group come from populations with the same shape of distribution. It returns the p value for the null hypothesis that all samples are drawn from the same population. The Kruskal-Wallis test is implemented in MATLAB as the kruskalwallis function. A sufficiently small returned value of p indicates that the test rejects, at the 1% significance level, the null hypothesis that the data of the two classes come from the same distribution.
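A sketch of the per-gene statistical rankings, under the same assumptions about X1 and X2 as before (observations in rows, genes in columns); the loop form is chosen for clarity rather than speed.

% Per-gene p-values of the three statistical tests; the smaller the
% p-value, the better the class discriminative ability of the gene.
nGenes = size(X1, 2);
pT  = zeros(1, nGenes);
pKS = zeros(1, nGenes);
pKW = zeros(1, nGenes);
grp = [ones(size(X1,1),1); 2*ones(size(X2,1),1)];   % group labels for kruskalwallis
for g = 1:nGenes
    [~, pT(g)]  = ttest2(X1(:,g), X2(:,g));    % two-sample t-test
    [~, pKS(g)] = kstest2(X1(:,g), X2(:,g));   % Kolmogorov-Smirnov test
    pKW(g) = kruskalwallis([X1(:,g); X2(:,g)], grp, 'off');  % 'off' suppresses figures
end
[~, orderTT] = sort(pT, 'ascend');   % smallest p = most discriminative
topTT = orderTT(1:100);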
Stepwise regression method
Stepwise regression is a systematic method for adding features to and removing them from the set of input attributes based on their statistical significance in a regression. The method begins with an initial model and then compares the explanatory power of incrementally larger and smaller models. At each step, the p value of an F-statistic [14] is computed to test the models with and without a selected feature. Based on this statistic, the algorithm decides whether the feature should be included in the model. If a feature is not currently in the model, the null hypothesis is that the term would have a zero coefficient if added to the model. If there is sufficient evidence to reject this null hypothesis, the feature is added to the model. Conversely, if a feature is currently in the model, the null hypothesis is that the term has a zero coefficient; if there is insufficient evidence to reject it, the term is removed from the model. The method proceeds as follows [12]:
1. Fit the initial model.
2. If any terms not in the model have p-values less than an entrance tolerance, add the one with the smallest p value and repeat this step; otherwise go to step 3.
3. If any terms in the model have p-values greater than an exit tolerance, remove the one with the largest p value and go to step 2; otherwise end.
The algorithm terminates when neither step leads to an increase of the model accuracy. The method may build different sets of features depending on the initial model and the order of adding and removing features. Since the outcome may not be reproducible, the stepwise regression method provides a locally optimal result.

Feature correlation with class
In this method, the correlation of the feature values with a class is examined. The discriminative value S(f) of the feature f for recognizing one class from the other K classes is defined as follows [14]:

(4)  $S(f) = \frac{\sum_{k=1}^{K} P_k (m_k - m)^2}{\sigma^2(f) \sum_{k=1}^{K} P_k (1 - P_k)}$

where m is the mean value of the feature for all data, mk is the mean value of the feature for the data of the k-th class, σ²(f) is the variance of the feature, and Pk is the probability of occurrence of the k-th class in the dataset (a uniform distribution is assumed). In this paper the number of classes is 2 (K=2). At a uniform distribution of both classes the above equation can be simplified to the formula:

(5)  $S_{12}(f) = \frac{\left(m_1(f) - m(f)\right)^2 + \left(m_2(f) - m(f)\right)^2}{2\sigma^2(f)}$

A large value of S12(f) indicates good discriminative ability of the feature for recognition of the two classes.

Multi-input linear SVM network
The SVM network is mainly used in regression and classification tasks [8]. It can also be configured to solve the selection problem. In this approach an SVM network with a linear kernel is used. The network is trained with all available features applied simultaneously as inputs. The sign (sgn) function is added for mapping the network response to the appropriate class label [14]. The output y at presentation of the features organized in the form of a vector f is defined by the equation

(6)  $y(\mathbf{f}) = \mathrm{sgn}(u) = \mathrm{sgn}(\mathbf{w}^T \mathbf{f} + b)$

where w = [w1, w2, ..., wn]^T is the weight vector, f = [f1, f2, ..., fn]^T is the vector of features and b is a bias. A large absolute value of the weight connecting feature fi with the network denotes a strong ability of this feature to distinguish the two classes. In practice the recursive feature elimination (RFE) approach is used [20]. In RFE, the features are eliminated step by step according to an assumed criterion related to their support in the discrimination of the classes. The SVM is re-trained at each step using a smaller and smaller population of features. In the first step the linear SVM network is trained using all features. The weights are adapted and then sorted in descending order. In the next steps the features associated with the smallest absolute weights are eliminated. The process is repeated until we obtain the required number of the most important features.
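A minimal RFE sketch using MATLAB's fitcsvm with a linear kernel, under the same assumptions about X and y as above. Halving the feature set at each pass is an assumed elimination schedule; the paper does not state the step size.

% Recursive feature elimination with a linear SVM (sketch). For a
% linear-kernel model the Beta property holds the weight vector w of eq. (6).
feats = 1:size(X, 2);                 % start from all features
while numel(feats) > 100
    mdl = fitcsvm(X(:,feats), y, 'KernelFunction', 'linear');
    [~, order] = sort(abs(mdl.Beta), 'descend');
    nKeep = max(100, floor(numel(feats)/2));   % assumed: drop half each pass
    feats = feats(order(1:nKeep));             % keep the largest-weight features
end
topSVM = feats;                       % the 100 surviving features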
Numerical results of experiments
This section describes the numerical experiments using the autism data. We present the results concerning the selection, clustering and PCA transformation. In the first stage, eight feature selection methods are used to discover the sequence of the most significant genes. In the next stage, we consider only the 100 top selected genes from each method. These subsets are fused into one reduced common set. To find the optimal number of genes we use the notion of cluster purity. The final outcome of the clustering is illustrated and examined by using the PCA transformation.

Gene selection results
The feature selection methods described in the previous sections have been applied to order the genes according to decreasing significance. As a result of these experiments, eight subsets of the 100 most relevant genes were selected. The following abbreviations are used: FD - Fisher discriminant analysis, RF - the ReliefF algorithm, TT - the two-sample t-test, KS - the Kolmogorov-Smirnov test, KW - the Kruskal-Wallis test, SWR - the stepwise regression method, COR - feature correlation with a class, SVM - the multi-input linear SVM network. As expected, the methods selected different sets of genes. Table 1 shows how many identical genes among the first 100 most important ones were selected by the different methods (a sketch of this overlap computation is given at the end of this subsection).

Table 1. The percentage of identical genes among the top 100 selected by different methods.
      FD   RF   TT   KS   KW  SWR  COR  SVM
FD   100   46   90   30   66    3   90    1
RF    46  100   46   33   46    4   46    1
TT    90   46  100   30   68    3  100    0
KS    30   33   30  100   46    3   30    1
KW    66   46   68   46  100    3   68    1
SWR    3    4    3    3    3  100    3    1
COR   90   46  100   30   68    3  100    0
SVM    1    1    0    1    1    1    0  100

The contents of the selected sets differ from method to method. Analyzing them, we may find that a few methods identified a large number of the same genes. For example, feature correlation with a class and the t-test produced identical sets of genes, and the overlap between the Fisher and correlation-with-a-class methods covers 90% of the genes. On the other hand, some methods have produced very different sets, e.g., the stepwise regression and t-test outcomes overlap in only 3% of the genes.

The quality of the selection process has been confirmed by analyzing the expression profiles of the identified genes. Fig. 4 illustrates the expression levels over all patients for the most important gene selected by the Fisher method. As we can see, the mean value of the observations belonging to the autism class differs significantly from that of the reference class.

Fig. 4. Expression levels for a gene belonging to the most significant group.

For comparison, the corresponding expression levels of a gene selected randomly from the least significant subset are shown in Fig. 5. This time the difference between the means of both classes is unnoticeable.

Fig. 5. Expression levels for a gene representing the least significant group.

In the case of the carefully selected gene (Fig. 4), the mean was 60.94±9.50 (autism) and 53.74±7.00 (control group). The difference of the mean values is 7.20 (11.81% in relative terms). For the gene representing the least significant subset (Fig. 5), the means were 9.13±1.13 (autism) and 9.16±1.49 (control group). The difference of the mean values in this case is only 0.03 (0.33%). These differences confirm the advantage of the proposed selection procedure.
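The pairwise overlaps of Table 1 follow directly from set intersections. A sketch, assuming the cell array top collects the eight index vectors produced by the individual rankings (the names topKS, topKW, topSWR and topCOR are hypothetical counterparts of the vectors built in the earlier sketches):

% Pairwise overlap of the top-100 gene sets. Because every set has
% exactly 100 elements, the intersection count is also a percentage.
top = {topFD, topRF, topTT, topKS, topKW, topSWR, topCOR, topSVM};
nM = numel(top);
overlap = zeros(nM);
for i = 1:nM
    for j = 1:nM
        overlap(i,j) = numel(intersect(top{i}, top{j}));
    end
end
disp(overlap)   % reproduces the structure of Table 1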
The next important step is fusing the results of the individual methods into one common outcome. To find the most important genes valid for all analyzed methods, we have to assign a global weight to each selected gene. The following formula has been proposed for assigning the global weight w of the gene f:

(7)  $w(f) = \sum_{i=1}^{k} w_i(f)$

where k is the number of applied selection methods (k=8 in this research) and wi(f) is the position of gene f in the ranking of the i-th selection method. The best gene is the one with the smallest value of the global weight w. Analyzing the contents of all selected sets, we found 427 different genes. They have been sorted according to their global weights in increasing order (the best gene is the one with the smallest weight). The 10 best genes selected by the fusion algorithm are: 16274 (206827_s_at), 46183 (236933_at), 38133 (228878_s_at), 335 (1552729_at), 50099 (240849_at), 18235 (208819_at), 45248 (235998_at), 2923 (1556314_a_at), 26879 (217593_at), 4077 (1558154_at).

Clusterization of gene space
A good way of assessing the quality of the selected genes is to apply clustering of the data in the multidimensional space. A good set of genes should provide clusters of high purity with respect to the class membership. Different approaches to clustering are possible: K-means, fuzzy c-means or the expectation maximization algorithm [14]. In this work we apply the simplest one, K-means. It is a method of vector quantization that aims to partition the n observations into K clusters (K < n), each observation being assigned to the cluster with the nearest mean.
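A sketch of the fusion rule of eq. (7) and of the cluster purity evaluation, assuming rankPos is an 8 x nGenes matrix whose entry (i, g) is the position of gene g in the ranking of method i (a hypothetical arrangement of the rankings obtained above); the subset size nSel = 10 is only an example value.

% Global weight of eq. (7): sum of ranking positions over all methods;
% the smaller the weight, the more relevant the gene.
wGlobal = sum(rankPos, 1);
[~, fused] = sort(wGlobal, 'ascend');        % best fused genes first

% Cluster purity of the space spanned by the nSel best fused genes:
% run K-means with K = 2 and count the majority class in each cluster.
nSel = 10;                                   % example size of the gene subset
Xsel = X(:, fused(1:nSel));
idx = kmeans(Xsel, 2, 'Replicates', 10);     % 10 restarts for stability
hits = 0;
for c = 1:2
    yc = y(idx == c);
    hits = hits + max([sum(yc == 1), sum(yc == 2)]);   % majority class count
end
purity = hits / numel(y);                    % fraction of correctly grouped cases

Repeating this evaluation for increasing nSel and keeping the subset with the highest purity corresponds to the paper's criterion for choosing the optimal number of genes.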