authors: Nakao, Eduardo K.; Levada, Alexandre L. M.
title: Unsupervised Learning and Feature Extraction in Hyperspectral Imagery
date: 2020-08-24
journal: Computational Science and Its Applications - ICCSA 2020
DOI: 10.1007/978-3-030-58799-4_57
funding: This study was financed in part by the Coordenacao de Aperfeicoamento de Pessoal de Nivel Superior - Brasil (CAPES) - Finance Code 001.

Remotely sensed hyperspectral scenes are typically defined by large area coverage and hundreds of spectral bands. Those characteristics imply smooth transitions in the spectral and spatial domains. As a consequence, subtle differences in the scene become evident, benefiting precision applications, but values at neighboring locations and wavelengths are highly correlated. Strongly nondiagonal covariance matrices and wide autocorrelation functions can be observed as a result, implying increased intraclass and decreased interclass variation in both the spectral and spatial domains. This leads to lower interpretation accuracies and makes it reasonable to investigate whether hyperspectral imagery suffers from the Curse of Dimensionality. Moreover, as this Curse can compromise the Euclidean behavior assumed by linear methods, it is relevant to compare linear and nonlinear dimensionality reduction performance. So, in this work we verify these two aspects empirically using multiple nonparametric statistical comparisons of Gaussian Mixture Model clustering performance in three cases: absence of feature extraction, linear and nonlinear unsupervised feature extraction. Experimental results indicate the presence of the Curse of Dimensionality and the adequacy of nonlinear methods.

Hyperspectral imaging [12] captures the energy-matter interaction in hundreds of channels that can precisely characterize the chemical composition of different materials through spectral similarity discrimination. A hyperspectral image (HI) can be viewed as a cube of two-dimensional single-band images stacked along a third dimension of sampled wavelengths. So, if m spectral bands were collected, each pixel can be viewed as an m-dimensional vector whose j-th coordinate is the reflectance measured at the j-th wavelength. This vector is the geometric pixel representation in the spectral domain. Due to this presence of hundreds of channels, HIs are spectrally smooth and spatially piecewise smooth. Values at neighboring locations and wavelengths are highly correlated. This can be observed in strongly nondiagonal covariance matrices and wide autocorrelation functions [4]. So there is increased intraclass variation and decreased interclass variation in both the spectral and spatial domains [34], which leads to lower interpretation accuracies. Thus, the assumption of linearly independent basis vectors of a Euclidean space may not be suitable. If pixel vectors are seen as random variables, their high correlation suggests that a statistical independence assumption is not suitable either. These facts raise the supposition that the pixel vector representation is embedded in a high-dimensional space and suffers from the Curse of Dimensionality. From the Machine Learning perspective, the problem is described by the Hughes phenomenon [14], which states that, when the dimension grows beyond a certain point relative to the number of samples, classification performance is degraded.
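As a concrete illustration of this pixel-vector view (a minimal sketch with assumed array names and sizes, not code or data from the paper), a hyperspectral cube with m bands can be flattened into a matrix of spectral vectors:

```python
import numpy as np

# Hypothetical cube: H x W pixels, m spectral bands (sizes are assumed for illustration).
H, W, m = 145, 145, 200
cube = np.random.rand(H, W, m)      # stand-in for measured reflectances

# Row i of X is one pixel's m-dimensional spectral vector:
# coordinate j holds the reflectance measured at the j-th wavelength.
X = cube.reshape(-1, m)             # shape (n, m), with n = H * W pixels
print(X.shape)                      # (21025, 200)
```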
From a geometric point of view, in high dimensions it can also be proved [20] that the volume of a hypercube concentrates in its corners [17, 31] and that the volume of a hypersphere or hyperellipsoid concentrates in an outside shell [17]. These facts imply the following consequences: space is mostly empty (so data can be projected to a lower dimension without losing significant class separability); local neighborhoods are almost empty (requiring large estimation bandwidths and losing detailed density estimation); normal distributions have their data concentrated in the tails instead of around the means, and uniformly distributed data is more likely to be found in the corners (making density estimation more difficult). Other important high-dimensionality characteristics are [20]: diagonal vectors become orthogonal to the basis vectors (the projection of a cluster onto one of these "wrong directions" can ruin the information within the data [31]); the data covariance matrix becomes ill-conditioned; Euclidean distance is not fully reliable [1]; and low-dimensional linear projections tend to be Normal.

So, in the presence of the Curse of Dimensionality, the Gaussian Mixture Model (GMM) clustering method can have its behavior compromised, because it uses Euclidean distance in neighborhood search and relies on Gaussian and statistical independence assumptions. On the other hand, in the presence of few bands (i.e. low dimensionality), GMM is efficient at determining the class distribution even with few samples in some classes (which is common in HIs due to the subtle discrimination of substances afforded by high spectral resolution). The model is efficient even if some classes are not in fact normally distributed [19], and it also takes advantage of the high-dimensional-space property that projections from high to low dimensionality tend to be Normal. Another problem is that, like GMM, linear Dimensionality Reduction (DR) methods also assume Euclidean data behavior, so those methods can be compromised too. Given an image, it is not possible to know beforehand, theoretically, whether its pixel vectors will be better represented by a linear or a nonlinear DR approach. Therefore, empirical computational experiments are necessary to investigate linear against nonlinear data representation adequacy for HIs.

Considering all those inherent properties and possible problems, the focus of this study is to analyze how unsupervised feature extraction methods affect clustering by GMM and whether nonlinear DR is more adequate than linear DR for HIs. The common lack of class information in hyperspectral imagery justifies the interest in unsupervised feature extraction and its impact on unsupervised learning. We compare Principal Component Analysis (PCA), the most traditional linear DR method, against Isometric Feature Mapping (ISOMAP) and Locally Linear Embedding (LLE), both pioneering nonlinear methods with global and local approaches respectively. All three methods yield an orthonormal basis of a Euclidean space in which the data vectors are written as linear combinations of the basis vectors. Besides that, as the vectors are seen as random variables by GMM, it is also assured that those variables are uncorrelated. So the goal of adjusting the original HI data to a feasible condition for GMM is attained.
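As a rough sketch of the comparison just described (this is an assumed configuration, not the authors' exact setup: the number of clusters, the target dimensionality d and the neighborhood sizes are illustrative), the Scikit-learn implementations can be chained with GMM clustering as follows:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding
from sklearn.mixture import GaussianMixture

X = np.random.rand(2000, 200)   # stand-in for n pixel vectors with m = 200 bands
n_clusters, d = 16, 10          # assumed number of classes and target dimensionality

reducers = {
    "NDR": None,                                          # no dimensionality reduction
    "PCA": PCA(n_components=d),
    "ISOMAP": Isomap(n_neighbors=12, n_components=d),
    "LLE": LocallyLinearEmbedding(n_neighbors=12, n_components=d),
}

labels = {}
for name, reducer in reducers.items():
    Y = X if reducer is None else reducer.fit_transform(X)
    gmm = GaussianMixture(n_components=n_clusters, random_state=0).fit(Y)
    labels[name] = gmm.predict(Y)   # one cluster assignment per pixel vector
```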
The remainder of the paper is organized as follows: the next section reviews related work; Section 3 presents the unsupervised feature extraction methods used in this work; Section 4 describes how the experiments were conducted; Section 5 explains the experimental results; finally, Sect. 6 presents conclusions and final remarks.

Feature Selection for Classification of Hyperspectral Data by SVM [27]. The main conclusion of this study is that, in HIs, the classification accuracy of the Support Vector Machine (SVM) technique is influenced by the number of features used; thus, it is shown that SVM is affected by the Hughes phenomenon. The results presented in that paper show, for the AVIRIS sensor dataset, that adding features led to a statistically significant decline in accuracy for all sizes of training set evaluated. With the DAIS dataset, such a decline was also observed, both for small training sets (25 samples per class or fewer) and for a large training sample. It is therefore shown that feature selection provides a reduced feature set which gives classification accuracy similar to that of larger feature sets. In addition, the analysis of four different selection methods showed a wide variation in the number of features selected by each method, which highlights the influence of the choice of method on the process. As in our study, this reference showed the manifestation of the Hughes phenomenon in HIs and the influence of the DR method on the estimation of intrinsic dimensionality. However, classification was used instead of clustering, and feature selection was used in place of extraction.

Feature Mining for Hyperspectral Image Classification [16]. Arguing that the Hughes problem can be avoided by the use of a few discriminating features, this article provides an overview of fundamental and modern feature extraction and selection methods applicable to HI classification. The common unavailability of class labeling in this type of data is discussed. It is emphasized that, in supervised parametric approaches, class separation measures require a large amount of training data, which may not be accessible or physically feasible in practice; therefore, nonparametric and unsupervised methods are preferred. It is noted that current research in the area focuses on nonlinear feature extraction. So this article shows that DR bypasses the Hughes problem in HIs and complements our work with a description of feature extraction and selection methods. One more argument is provided for the relevance of the nonparametric, unsupervised and nonlinear categories (which are used in our work). Finally, the current relevance of research in nonlinear feature extraction is highlighted (which is also investigated here).

Subspace Feature Analysis of Local Manifold Learning for Hyperspectral Remote Sensing Images Classification [9]. This reference shows more specifically that, in HIs, the use of features obtained by nonlinear extraction provides greater classification accuracy than the original features. This type of conclusion also reinforces the evidence shown in our work: the benefit of feature extraction for HIs and, in particular, of nonlinear extraction.

Unsupervised Clustering and Active Learning of Hyperspectral Images with Nonlinear Diffusion [24]. In this work it is also shown that, in the HI context, linear DR methods generate better results than DR absence in the classification process. Additionally, it is shown that nonlinear DR is superior to linear DR in this context. This indicates that HIs are mapped onto structures whose intrinsic dimensionality is lower than the original, and also that such structures are particularly nonlinear. The study also shows that, in HIs, clustering methods that use Euclidean distance have lower performance than methods that use other strategies. So this related work is another source showing the benefit of DR for HIs (and particularly of nonlinear DR).
Additionally, that work also brings information about the failure of Euclidean strategies in this type of image, reinforcing suspicions along the same lines as those confirmed in our case.

Evaluation of Nonlinear Dimensionality Reduction Techniques for Classification of Hyperspectral Images [25]. Here it is argued that, in many cases, the combination of PCA with SVM provides good HI classification results. However, hyperspectral scenes can bring complexity due to nonlinear effects, which can make PCA inappropriate. It is stated that, for such scenes, it is therefore necessary to choose the output dimensionality carefully and to consider the use of nonlinear DR techniques. It is also reported that the main disadvantage of nonlinear methods is their high computational complexity in comparison with PCA.

Unsupervised Exploration of Hyperspectral and Multispectral Images [22]. Some of the most used unsupervised exploration methods (mainly DR and clustering techniques) are reviewed and discussed, focusing on hyperspectral and multispectral imagery. Usage instructions are provided and advantages and disadvantages are highlighted in order to demystify common errors. It is argued that, when the focus of the analysis is prediction, exploratory data analysis is frequently underrated. It is stated that, in fact, exploratory analysis provides a wealth of information by itself and a first insight into the data that can be very useful, even if exploration is not the final goal. It follows that unsupervised methods provide valuable information and can be applied as a first approach to any hyperspectral or multispectral analysis, given that such methods are hypothesis free.

The goal of Feature Extraction is to find a transformation function f : R^m → R^d, with d ≤ m, applied to all n points in the dataset, such that most of the data information is kept. The term f can be a linear or a nonlinear transformation. In the next three subsections, we present the algorithms selected for this work.

Linear DR is widely used for analyzing high-dimensional and noisy data [7]. Part of the success of linear methods comes from the fact that they have simple geometric interpretations and typically attractive computational properties [7]. Basically, in linear methods, we seek a projection matrix T that maps the samples in the original feature space (R^m) to a linear subspace of R^d, with d < m. Given n m-dimensional data points X = [x_1, x_2, ..., x_n] ∈ R^{m×n} and a choice of dimensionality d < m, the goal is to optimize some objective function f_X(.) to produce a linear transformation T ∈ R^{d×m}, and call Y = TX ∈ R^{d×n} the lower-dimensional transformed data.

Principal Component Analysis is the most widely known linear method for data compression and Feature Extraction. It assumes that the vectors lie in a Euclidean space and, from this assumption, searches for a new basis with d spanning vectors that maximizes the variance of the new representation. In this case, d is the intrinsic dimensionality and it is less than the original data dimension.
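A minimal NumPy sketch of this formulation (illustrative only; the data, d and the centering step are assumptions rather than details taken from the paper) makes the role of the projection matrix T explicit:

```python
import numpy as np

def pca(X, d):
    """PCA as a linear projection Y = T X, with X in R^{m x n} (one sample per column)."""
    mu = X.mean(axis=1, keepdims=True)            # sample mean
    Xc = X - mu                                   # mean-centered data
    Sigma = (Xc @ Xc.T) / X.shape[1]              # sample covariance matrix (m x m)
    eigvals, eigvecs = np.linalg.eigh(Sigma)      # eigendecomposition of Sigma
    order = np.argsort(eigvals)[::-1][:d]         # indices of the d largest eigenvalues
    T = eigvecs[:, order].T                       # transformation T in R^{d x m}
    return T @ Xc                                 # Y = T X in R^{d x n}

Y = pca(np.random.rand(200, 1000), d=10)          # e.g. 1000 pixels with 200 bands (assumed)
print(Y.shape)                                    # (10, 1000)
```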
ISOMAP was one of the pioneering algorithms in nonlinear DR. It combines the major algorithmic advantages of PCA and Multidimensional Scaling (MDS), namely computational efficiency, global optimality, and asymptotic convergence guarantees, with the flexibility to learn a broad class of nonlinear structures [33]. The basic idea of the ISOMAP algorithm is first to build a graph by joining the k-nearest neighbors (KNN) in the input space, then to compute the shortest paths between each pair of vertices in the graph and, knowing the approximate distances between points, to find a mapping to a Euclidean subspace of R^d that preserves those distances. The hypothesis of the ISOMAP algorithm is that the shortest paths in the KNN graph are good approximations of the true distances in the original data structure. The ISOMAP algorithm can be divided into three main steps:

1. From the input data x_1, x_2, ..., x_n ∈ R^m, build an undirected proximity graph using the KNN rule or the ε-neighborhood rule [21];
2. Compute the pairwise distance matrix D using n executions of Dijkstra's algorithm or one execution of the Floyd-Warshall algorithm;
3. Estimate the new coordinates of the points in a Euclidean subspace of R^d by preserving those distances through the MDS method.

The ISOMAP algorithm is a global method in the sense that, to find the new coordinates of a given input vector x_i ∈ R^m, it uses information from all the samples through the matrix B. On the other hand, LLE, as the name emphasizes, is a local method in which the new coordinates of any x_i ∈ R^m depend only on the point's neighborhood. The main hypothesis behind LLE is that, for a sufficient density of samples, a vector x_i and its neighbors are expected to define a linear patch (i.e. all belonging to a Euclidean subspace [29]). In this way it is possible to characterize the local geometry by linear coefficients: that is, we can reconstruct a vector as a linear combination of its neighbors. Basically, the LLE algorithm requires as inputs an n × m data matrix X, with rows x_i, a desired number of dimensions d < m and an integer k > d + 1 for finding local neighborhoods. The output is an n × d matrix Y, with rows y_i. The LLE algorithm can be divided into three main steps [29, 30]:

1. For each x_i ∈ R^m, find its k nearest neighbors;
2. Find the weight matrix W which minimizes the reconstruction error Σ_i ||x_i − Σ_j w_ij x_j||^2, where w_ij = 0 unless x_j is one of x_i's k nearest neighbors and, for each i, Σ_j w_ij = 1;
3. Find the coordinates Y which minimize the reconstruction error Σ_i ||y_i − Σ_j w_ij y_j||^2 using the optimum weights, subject to the constraints Σ_i Y_ij = 0 for each j and Y^T Y = I.

In pseudocode, the three methods can be summarized as follows.

1: function PCA(X)
2:   Compute the sample mean μ_x = (1/n) Σ_i x_i and the sample covariance matrix Σ_x = (1/n) Σ_i (x_i − μ_x)(x_i − μ_x)^T
3:   Compute the eigenvalues and eigenvectors of Σ_x
4:   Define the transformation matrix T = [w_1, w_2, ..., w_d] with the d eigenvectors associated with the d largest eigenvalues
5:   Project the data X into the PCA subspace: Y = T^T (X − μ_x)
6:   return Y
7: end function

1: function ISOMAP(X)
2:   From the input data X_{m×n}, build a KNN graph
3:   Compute the pairwise distance matrix D_{n×n}
4:   Compute A = −(1/2) D
5:   Compute H = I − (1/n) U, where U is an n × n matrix of 1's
6:   Compute B = HAH
7:   Find the eigenvectors and eigenvalues of the matrix B
8:   Select the top d < m eigenvectors and eigenvalues of B and define the embedding Y from them
9:   return Y
10: end function

1: function LLE(X)
2:   From the input data X_{m×n}, build a KNN graph
3:   for each x_i ∈ X^T do
4:     Compute the K × K matrix C_i from x_i and its K nearest neighbors
5:     Solve the linear system C_i w_i = 1 to estimate the weights w_i ∈ R^K
6:     Normalize the weights in w_i so that Σ_j w_i(j) = 1
7:   end for
8:   Construct the n × n matrix W, whose rows are the estimated w_i
9:   Compute M = (I − W)^T (I − W)
10:  Find the eigenvalues and eigenvectors of the matrix M
11:  Select the bottom d nonzero eigenvectors of M and define the embedding Y from them
12:  return Y
13: end function
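The ISOMAP pipeline (steps 1-3 above) can be sketched from scratch as follows. This is an illustration under assumptions: the experiments themselves use the Scikit-learn implementation, the KNN graph is assumed connected, and the classical MDS step here double-centers the squared geodesic distances.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, d=2, k=10):
    """Minimal ISOMAP sketch: KNN graph -> geodesic distances -> classical MDS.
    X is an (n, m) data matrix; returns the (n, d) embedding."""
    # Step 1: undirected KNN proximity graph weighted by Euclidean distances.
    G = kneighbors_graph(X, n_neighbors=k, mode="distance")
    G = G.maximum(G.T)                            # symmetrize (undirected graph)
    # Step 2: pairwise geodesic distance matrix via shortest paths (Dijkstra).
    D = shortest_path(G, method="D", directed=False)
    # Step 3: classical MDS on the geodesic distances.
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n           # centering matrix H = I - (1/n) U
    B = -0.5 * H @ (D ** 2) @ H                   # double-centered matrix B = H A H
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:d]           # top d eigenpairs of B
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))

Y = isomap(np.random.rand(300, 50), d=3, k=10)    # illustrative data only
```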
Seven traditional HIs were used in the experiments: Indian Pines, Salinas, SalinasA, Botswana, Kennedy Space Center, Pavia University, and Pavia Centre. The purpose of this work is to verify whether unsupervised Feature Extraction benefits HI clustering by GMM and whether a nonlinear method is more efficient than a linear one in this case. To conduct such verification, the performance of clusterings obtained after linear reduction (PCA), nonlinear reduction (ISOMAP and LLE) and no DR (NDR) are compared. As Ground Truths were available, performances were evaluated by external measures. The measures used are: Rand [28], Jaccard [15], Kappa [5], Entropy and Purity [32]. Performances were then compared by the Friedman [10] and Nemenyi [26] tests. All steps are shown in Fig. 1.

Regarding the application of the nonlinear algorithms, most images were too large for our 24 GB of RAM; this problem occurred especially in the second step of the ISOMAP algorithm. So, for those images, sets of representative pixels were chosen using the Bagging [2] technique. Those sets were used by both the linear and the nonlinear DR methods in place of the original image. Clustering was performed on those sets in the NDR case as well.

For each image, the three DR methods were applied. The choice of target dimensionality strongly affects the results and is usually defined by the intrinsic dimensionality. This value can be estimated by several methods [3, 6, 13, 23] and is strongly related to each image itself and its particular properties. The topic is currently under investigation in the area and, as it is not the focus of this work, we have chosen a well-known, basic empirical method. The method consists of analyzing the ratio between the sum of the n largest eigenvalues and the sum of all eigenvalues of the transformation matrix of each DR method [6]. We seek the point where 95% of the data variance is retained; the number of eigenvalues that reaches this point is taken as the intrinsic dimensionality.

In the image pre-processing step, all mixed pixels were assigned to a background class whose label is zero. So, after DR and before clustering, we adopted the simple procedure of setting those pixel vector coordinates to (0, ..., 0), as the background class is not of discrimination interest. The goal of this procedure is to avoid distorting the original geometry during DR and also to raise clustering performance considerably (as this class is a mixture of pixels that do not belong to any other class defined by the expert). We noticed that the procedure did not produce much distortion in the clustering (most class-zero pixels in the Ground Truth were kept in the same group after clustering).

For each vector set in the reduced dimensional space, several executions of the clustering procedure were applied, due to the known effect of random centroid initialization. As this particular image set is already labeled by specialists, external clustering validation is more robust than internal validation, so external measures were used. All metrics adopted assume higher values for better clusterings. Specifically for the Kappa measure, a reliable semantic correspondence between the clustering and the expert's labeling must be ensured. We mapped this problem to that of finding a minimum-cost pairing in a bipartite graph, in which labels are associated so as to minimize a cost function; the Munkres algorithm was used to solve this optimal assignment problem [18]. All measures were applied to all clustering results, and the representative value for each measure was given by the maximum value obtained.

Next, for each image, the Friedman test indicates whether there was a statistically significant difference between some pair of measure-value lists (each list corresponding to a DR strategy and containing the maximum values of the external indices). The Nemenyi test then tells which pairs were different.
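A sketch of this comparison step is shown below. The experiments used STAC for these tests; the SciPy and scikit-posthocs calls here are an assumed substitute, and the score lists are made up for illustration.

```python
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# One list of external-index maxima per DR strategy (illustrative values only).
ndr    = [0.61, 0.42, 0.35, 0.55, 0.48]
pca    = [0.66, 0.47, 0.41, 0.58, 0.52]
isomap = [0.71, 0.53, 0.46, 0.60, 0.57]
lle    = [0.69, 0.51, 0.44, 0.61, 0.55]

# Friedman test: was there a significant difference among the four strategies?
stat, p = friedmanchisquare(ndr, pca, isomap, lle)
print(f"Friedman p-value: {p:.4f}")

if p < 0.05:
    # Nemenyi post-hoc test: which pairs of strategies differ?
    scores = np.array([ndr, pca, isomap, lle]).T   # rows = measures, columns = methods
    print(sp.posthoc_nemenyi_friedman(scores))
```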
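Likewise, the label matching adopted earlier for the Kappa measure can be sketched with SciPy's implementation of the Hungarian (Munkres) method. The negated contingency counts used as costs are an assumption for illustration, not necessarily the exact cost function of the experiments, and class and cluster labels are assumed to be coded 0..k-1.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix, cohen_kappa_score

def match_labels(y_true, y_pred):
    """Relabel clusters so that the cluster-to-class correspondence maximizes
    agreement, i.e. a minimum-cost assignment in a bipartite graph."""
    C = confusion_matrix(y_true, y_pred)             # class x cluster co-occurrence counts
    rows, cols = linear_sum_assignment(-C)           # Hungarian / Munkres algorithm
    mapping = {c: r for r, c in zip(rows, cols)}     # cluster id -> class id
    return np.array([mapping.get(c, c) for c in y_pred])

# Kappa is computed only after the semantic correspondence has been fixed:
# kappa = cohen_kappa_score(y_true, match_labels(y_true, y_pred))
```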
The results of these tests make it possible to analyze whether DR was beneficial to clustering and in which images nonlinear reduction overcame linear reduction. The test results are shown in the next section.

The programming language and libraries used were: Python 3.6.5, NumPy 1.15.0, SciPy 1.0.0, Scikit-learn 0.20.0, Matplotlib 3.0.0, and STAC. We used the Scikit-learn implementations of all DR methods. Tables 1, 2, 3, 4 and 5 show the clustering validation measure maximum values by DR method, where each table corresponds to an image for which the Friedman test pointed to a significant difference between some pair of DR strategies. The usual criterion of p-value > 0.05 to indicate equivalence was adopted. The last row of each table presents the mean of the measure values for each DR method. In Table 6 we present the Friedman p-values by image. The Friedman test only tells us whether some pair of methods differed; a post-hoc test is needed to discover which pairs were different [8, 11]. We show in Table 7 the Nemenyi p-values smaller than 0.05 and the pair in which each occurred, by image.

We can see for Botswana that, at a confidence level of 95% (i.e. a significance level of 5%), there is evidence that ISOMAP produces better clustering results than the absence of DR. This implies that DR was beneficial for this image. There is also evidence that ISOMAP obtained better attributes than PCA, which supports the better adequacy of a nonlinear DR method for this data. Following the same reasoning, for KSC we can say that LLE is superior to PCA. For Pavia, LLE is superior to NDR and PCA. For PaviaU, PCA showed a significant difference in comparison with NDR and was also superior to both nonlinear methods. For SalinasA we can only conclude that PCA yielded better results in comparison to NDR. So, in four images DR brought a benefit to clustering; additionally, in three images nonlinear DR performed better than linear DR.

As one cannot state beforehand, theoretically, whether HIs naturally suffer from the Curse of Dimensionality, in this work we have shown empirical evidence that unsupervised Feature Extraction significantly increased GMM clustering performance in the majority of the studied images. A statistically significant difference between DR and NDR was found using nonparametric multiple-comparison hypothesis tests on several large real images. Unsupervised DR and unsupervised learning approaches were used to reach this conclusion, and the Ground Truth was used only to compute the external measures for clustering evaluation. Additionally, label matching by the Munkres algorithm increased the Kappa values and emphasized performance differences between DR methods in this respect. The efficiency of nonlinear DR compared to linear DR in the HI context also cannot be shown beforehand theoretically. Our experiments have shown that nonlinear DR is more appropriate for the majority of the HIs in which DR brought a significant contribution to clustering. We gave theoretical justification for why nonlinear DR can be more adequate by presenting the PCA, ISOMAP and LLE strategies and algorithms. Future work may include: other Feature Extraction techniques, especially modern nonlinear algorithms; more sophisticated intrinsic dimensionality discovery strategies; other HI images; more clustering methods; the use of internal clustering validity indexes to test a completely unsupervised procedure; and the use of a supervised classifier ensemble to obtain a supervised perspective.
We believe that such expansions can bring more evidence about the Curse of Dimensionality in HIs and build a stronger foundational study of unsupervised Feature Extraction in HIs, compiling the main points of interest in the area and serving as a guide for a first contact with it.

References

1. On the surprising behavior of distance metrics in high dimensional space
2. Bagging predictors
3. Data dimensionality estimation methods: a survey. Pattern Recognition
4. Remote sensing image processing
5. A coefficient of agreement for nominal scales
6. Multidimensional scaling
7. Linear dimensionality reduction: survey, insights, and generalizations
8. Statistical comparisons of classifiers over multiple data sets
9. Subspace feature analysis of local manifold learning for hyperspectral remote sensing images classification
10. The use of ranks to avoid the assumption of normality implicit in the analysis of variance
11. Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power
12. Imaging spectrometry for earth remote sensing
13. Intrinsic dimensionality estimation based on manifold assumption
14. On the mean accuracy of statistical pattern recognizers
15. The distribution of the flora in the alpine zone 1
16. Feature mining for hyperspectral image classification
17. A course in the geometry of N dimensions
18. The Hungarian method for the assignment problem
19. Multispectral data analysis: a signal theory perspective
20. Information extraction principles and methods for multispectral and hyperspectral image data
21. A tutorial on spectral clustering
22. Chapter 2.4 - Unsupervised exploration of hyperspectral and multispectral images
23. Geometric data analysis based on manifold learning with applications for image understanding
24. Unsupervised clustering and active learning of hyperspectral images with nonlinear diffusion
25. Evaluation of nonlinear dimensionality reduction techniques for classification of hyperspectral images
26. Distribution-Free Multiple Comparisons
27. Feature selection for classification of hyperspectral data by SVM
28. Objective criteria for the evaluation of clustering methods
29. Nonlinear dimensionality reduction by locally linear embedding
30. Think globally, fit locally: unsupervised learning of low dimensional manifolds
31. Multivariate Density Estimation: Theory, Practice, and Visualization
32. A mathematical theory of communication
33. A global geometric framework for nonlinear dimensionality reduction
34. On combining multiple features for hyperspectral remote sensing image classification