Gaussian bandwidth selection for manifold learning and classification
Authors: Ofir Lindenbaum, Moshe Salhov, Arie Yeredor, Amir Averbuch
Journal: Data Mining and Knowledge Discovery, published online 2020-07-02. DOI: 10.1007/s10618-020-00692-x

Abstract: Kernel methods play a critical role in many machine learning algorithms. They are useful in manifold learning, classification, clustering and other data analysis tasks. Setting the kernel's scale parameter, also referred to as the kernel's bandwidth, highly affects the performance of the task at hand. We propose to set a scale parameter that is tailored to one of two types of tasks: classification and manifold learning. For manifold learning, we seek a scale which is best at capturing the manifold's intrinsic dimension. For classification, we propose three methods for estimating the scale, which optimize the classification results in different senses. The proposed frameworks are simulated on artificial and on real datasets. The results show a high correlation between optimal classification rates and the estimated scales. Finally, we demonstrate the approach on a seismic event classification task.

1 Introduction

Dimensionality reduction is an essential step in numerous machine learning tasks. Methods such as Principal Component Analysis (PCA) (Jolliffe 2002), Multidimensional Scaling (MDS) (Kruskal 1977), Isomap (Tenenbaum et al. 2000) and Local Linear Embedding (Roweis and Saul 2000) aim to extract essential information from high-dimensional data points based on their pairwise connectivities. Graph-based kernel methods, such as Laplacian Eigenmaps (Belkin and Niyogi 2001) and Diffusion Maps (Coifman and Lafon 2006), construct a positive semi-definite kernel based on the multidimensional data points to recover the underlying structure of the data. Such methods have been proven effective for tasks such as clustering (Luo 2011), classification (Lindenbaum et al. 2015), manifold learning (Lin et al. 2006) and many more.

Kernel methods rely on computing a distance function (usually Euclidean) between all pairs of data points x_i, x_j ∈ X ∈ R^{D×N} and applying a data-dependent kernel function. This kernel should encode the inherent relations between the high-dimensional data points. An example of a kernel that encapsulates the Euclidean distance takes the form
K_{i,j} = exp( −‖x_i − x_j‖² / 2ε ),   (1.1)
where ‖·‖ denotes the Euclidean norm and ε > 0 is the scale (bandwidth) parameter. As shown, for example, in Roweis and Saul (2000), Coifman and Lafon (2006), spectral analysis of such a kernel provides an efficient representation of the lower d-dimensional data (where d ≪ D) embedded in the ambient space. Devising the kernel to work successfully in such contexts requires expert knowledge for setting two parameters, namely the scale ε (Eq. 1.1) and the inferred dimension d of the low-dimensional space. We focus in this paper on setting the scale parameter ε, sometimes also called the kernel bandwidth.

The scale parameter is related to the statistics and to the geometry of the data points. The Euclidean distance, which is often used for learning the geometry of the data, is meaningful only locally when applied to high-dimensional data points. Therefore, a proper choice of ε should preserve local connectivities and neglect large distances. If ε is too large, there is almost no preference for local connections and the kernel method essentially reduces to PCA (Lindenbaum et al. 2015). On the other hand, if ε is too small, the matrix K (Eq. 1.1) has many small off-diagonal elements, which is an indication of poor connectivity within the data.
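To make the role of ε tangible, the following minimal sketch (Python/NumPy; the function name, the toy data and the 1e-3 "significance" threshold are illustrative choices and not part of the paper) builds the kernel of Eq. (1.1) and reports how the number of significant neighbours per point grows with ε, from a nearly diagonal matrix to a nearly all-ones matrix.

```python
import numpy as np

def gaussian_kernel(X, eps):
    """Gaussian affinity kernel of Eq. (1.1).

    X   : (D, N) data matrix, one sample per column.
    eps : scale (bandwidth) parameter.
    """
    # squared Euclidean distances between all pairs of columns
    sq_norms = (X ** 2).sum(axis=0)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X.T @ X
    d2 = np.maximum(d2, 0.0)              # guard against tiny negatives
    return np.exp(-d2 / (2.0 * eps))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((10, 200))    # D=10, N=200 toy data
    for eps in (0.01, 1.0, 100.0):
        K = gaussian_kernel(X, eps)
        # average number of "significant" neighbours per point
        conn = (K > 1e-3).sum(axis=1).mean()
        print(f"eps={eps:7.2f}  mean significant neighbours={conn:6.1f}")
```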
Several studies have proposed different approaches for setting ε. A study by Lafon et al. (2006) suggests a method which enforces connectivity among most data points; it is a rather simple method, which is nonetheless sensitive to noise and to outliers. The sum of the kernel is used by Singer et al. (2009) to find a range of valid scales; this method provides a good starting value but is not fully automated. An approach by Zelnik-Manor and Perona (2004) sets an adaptive scale for each point, which is applicable for spectral clustering but might deform the geometry; as a result, there is no guarantee that the rescaled kernel has real eigenvectors and eigenvalues. Others simply use the squared standard deviation (the mean squared Euclidean distance from the mean) of the data as ε; this again is very sensitive to noise and to outliers.

Kernel methods are also used for Support Vector Machines (SVMs, Scholkopf and Smola 2001), where the goal is to find a feature space that best separates given classes. Methods such as Gaspar et al. (2012), Staelin (2003) use cross-validation to find the scale parameter which achieves peak classification results on a given training set. The study by Campbell et al. (1999) suggests an iterative approach that updates the scale until reaching maximal separation between classes. Chapelle et al. (2002) relate the scale parameter to the feature selection problem by using a different scale for each feature; this framework applies gradient descent to a designated error function to find the optimal scales. These methods work well for classification, but they require actually re-classifying the points in order to test each scale. In this paper we propose methods to estimate the scale which do not require the repeated application of a classifier to the data.

Since the value of ε defines the connectivity of the resulting kernel matrix (Eq. 1.1), it is clearly crucial for the performance of kernel-based methods. Nonetheless, the performance of such methods depends on the training data and on the optimization problem at hand. Thus, in principle we cannot define an 'optimal' scaling parameter value independently of the data. We therefore focus on developing tools to estimate a scale parameter based on a given training set. We found that there are almost no simple methods focusing on finding element-wise (rather than one global) scaling parameters dedicated to manifold learning, nor are there methods that try to maximize classification performance without directly applying a classifier. For these reasons we propose new methodologies for setting ε, dedicated either to manifold learning or to classification.

For the manifold learning task, we start by estimating the manifold's intrinsic dimension. Then, we introduce a vector of scaling parameters ε = [ε_1, ..., ε_D], such that each value ε_i, i = 1, ..., D, rescales the corresponding feature. We propose a special greedy algorithm to find the scaling parameters which best capture the estimated intrinsic dimension. This approach is analyzed and simulated to demonstrate its advantage. For the classification task, we propose three methods for finding a scale parameter. The first, extending our study in Lindenbaum et al. (2015), seeks a scale which provides the maximal separation between the classes in the extracted low-dimensional space. The second is based on the eigengap of the kernel; it is justified by an analysis of a perturbed kernel.
The third method sets the scale which maximizes the within-class transition probability; this approach does not require computing an eigendecomposition. Additionally, we provide new theoretical justifications for the eigengap-based method, as well as new simulations to support all methods. Interestingly, we also show empirically that all three methods converge to a similar scale parameter ε.

The structure of the paper is as follows: Preliminaries are given in Sect. 2. Section 3 presents and analyzes two frameworks for setting the scale parameter: the first is dedicated to a manifold learning task, while the second fits a classification task. Section 4 presents experimental results. Finally, in Sect. 5 we demonstrate the applicability of the proposed methods for the task of learning seismic parameters from raw seismic signals.

2 Preliminaries

We begin by providing a brief description of two methods used in this study: a kernel-based method for dimensionality reduction called Diffusion Maps (DM) (Coifman and Lafon 2006), and Dimensionality from Angle and Norm Concentration (DANCo, Ceruti et al. 2014), which estimates the intrinsic dimension of a manifold based on the ambient high-dimensional data. In the following, vectors and matrices are denoted by bold letters, and their components are denoted by the respective plain letters, indexed using subscripts or parentheses.

2.1 Diffusion Maps

DM (Coifman and Lafon 2006) is a nonlinear dimensionality reduction framework that extracts the intrinsic geometry of a high-dimensional dataset. The framework is based on the construction of a stochastic matrix from the graph of the data; the eigendecomposition of this stochastic matrix provides an efficient representation of the data. Given a high-dimensional dataset X ∈ R^{D×N}, the DM construction consists of the following steps:
1. A kernel matrix K ∈ R^{N×N}, with K_{i,j} = K(x_i, x_j), is computed, satisfying the following properties: (i) symmetry, K_{i,j} = K_{j,i}; (ii) positive semi-definiteness, v^T K v ≥ 0 for all v ∈ R^N. These properties guarantee that K has real-valued eigenvectors and non-negative real-valued eigenvalues. In this study, we focus on the common choice of a Gaussian kernel as the affinity measure between two multidimensional data vectors x_i and x_j,
K_{i,j} = exp( −‖x_i − x_j‖² / 2ε ).   (2.1)
Obviously, choosing this kernel function entails the selection of an appropriate scale ε, which determines the degree of connectivity expressed by the kernel.
2. By normalizing the rows of K, the row-stochastic matrix
P = D^{−1} K   (2.2)
is computed, where D ∈ R^{N×N} is a diagonal matrix with D_{i,i} = Σ_j K_{i,j}. P can be interpreted as the transition-probability matrix of a (fictitious) Markov chain on X, such that (P^t)_{i,j} ≜ p_t(x_i, x_j) (where t is an integer power) describes the implied probability of transition from point x_i to point x_j in t steps.
3. Spectral decomposition is applied to P, yielding a set of N eigenvalues {λ_n} (in descending order) and associated normalized eigenvectors {ψ_n} satisfying P ψ_n = λ_n ψ_n, n ∈ {0, ..., N−1}.
4. A new representation for the dataset X is defined by
Ψ_t^ε(x_i) ≜ [ λ_1^t ψ_1(i), λ_2^t ψ_2(i), ..., λ_{N−1}^t ψ_{N−1}(i) ]^T,   (2.3)
where ε is the scale parameter of the Gaussian kernel (Eq. 2.1) on which the construction depends, and ψ_m(i) denotes the i-th element of ψ_m. Note that λ_0 = 1 and ψ_0 = 1 are excluded, as the constant eigenvector ψ_0 = 1 does not carry information about the data. In practice the representation is truncated after the leading d coordinates, yielding a d-dimensional embedding Ψ_d(x_i).

The main idea behind this representation is that the Euclidean distance between two multidimensional data points in the new representation is equal to the weighted L2 distance between the conditional probabilities p_t(x_i, :) and p_t(x_j, :), i, j = 1, ..., N, which are the i-th and j-th rows of P^t. This diffusion distance is defined by
D_t²(x_i, x_j) = Σ_{n=1}^{N} ( p_t(x_i, x_n) − p_t(x_j, x_n) )² / φ_0(x_n),   (2.4)
where φ_0 is the stationary distribution of the Markov chain (Coifman and Lafon 2006).
This equality is proved in Coifman and Lafon (2006) . We chose to use DM in our analysis as it provides an intuitive interpretation based on its Markovian construction. Nonetheless, the methods in this manuscript could also be adapted to Laplacian Eigenmaps (Belkin and Niyogi 2001) and to other kernel methods. Given a high-dimensional dataset X = {x 1 , x 2 , ..., x N } ⊆ R D×N , which describes an ambient space with a manifold M containing the data points x 1 , x 2 , ..., x N , the intrinsic dimensiond is the minimum number of parameters needed to represent the manifold. Definition 2.2.1 Let M be a manifold. The intrinsic dimensiond of the manifold is a positive integer determined by how many independent "coordinates" are needed to describe M. By using a parametrization to describe the manifold, the intrinsic dimension is the smallest integerd such that there exists a smooth map f (ξ ) for all data points on the manifold M = f (ξ ), ξ ⊆ Rd . Methods proposed by Fukunaga and Olsen (1971) or by Verveer and Duin (1995) use local or global PCA to estimate the intrinsic dimensiond. The dimension is set as the number of eigenvalues greater than some threshold. Others, such as Trunk (1976) or Pettis et al. (1979) , use k-neaserst-neighbors (KNN) distances to find a subspace around each point and based on some statistical assumption estimated. A survey of different approaches is provided in Camastra (2003) . In this study we use Dimensionality from Angle and Norm Concentration (DANCo, by Ceruti et al. (2014) ) based algorithm (which we observed to be the most robust approach in our experiments) to estimated. DANCo jointly uses the normalized distances and mutual angles to extract a robust estimate ofd. This is done by finding the dimension that minimizes the Kullback-Leibler divergence between the estimated probability distribution functions (pdf-s) of artificially-generated data and the observed data. A full description of DANCo is presented in the "Appendix" of this manuscript. In Sect. 3.2, we propose a framework which exploits the resulting estimated ofd for choosing the scale parameter . 3 Setting the scale parameter DM as described in Sect. 2 is an efficient method for dimensionality reduction. The method is almost completely automated, and does not require tuning many hyperparameters. Nonetheless, its performance is highly dependent on proper choice of (Eq. 1.1), which, along with the decaying property of Gaussian affinity kernel K , defines the affinity between all points in X ∈ R D×N . If the ambient dimension D is high, the Euclidean distance becomes meaningless once it takes large values. Thus, a proper choice of should preserve local connectivities and neglect large distances. We argue that there is no one 'optimal' way for setting the scale parameter; rather, one should define the scale based on the data and on the task in hand. In the following subsection we describe several existing method for setting . In Sect. 3.2, we propose a novel algorithm for setting in the context of manifold learning. Finally in Sect. 3.3, we present three methods for setting in the context of classification tasks, so as to optimize the classification performance (in certain senses) in the low-dimensional space. Our goal is to optimize the scale prior to the application of the classifier. Several studies propose methods for setting the scale parameter . Some choose as the empirical squared standard deviation (mean squared Euclidean deviations from the empirical mean) of the data. 
This approach is reasonable when the data is sampled from a uniform distribution. A max-min measure is suggested in Lafon et al. (2006), where the scale is set to
ε = C · max_j [ min_{i, i≠j} ( ‖x_i − x_j‖² ) ],   (3.1)
with C ∈ [2, 3]. This approach attempts to set a small scale that maintains local connectivities. Another scheme (Singer et al. 2009) aims to find a range of valid values for ε. The idea is to compute the kernel K from Eq. (2.1) at various values of ε and to examine the logarithm of its sum, L(ε) = log( Σ_{i,j} K_{i,j}(ε) ); one then searches for the range of values over which L(ε), plotted against log(ε), gives rise to a well-pronounced linear region between its two asymptotes. The scheme in Singer et al. (2009) is implemented using Algorithm 3.1. Note that L(ε) has two asymptotes, L(ε) → log(N) as ε → 0 and L(ε) → log(N²) = 2 log(N) as ε → ∞, since when ε → 0, K (Eq. 2.1) approaches the identity matrix, whereas for ε → ∞, K approaches an all-ones matrix. We denote by ε_0 the minimal value within the range ε̄ (defined in Algorithm 3.1). This value is used in the simulations presented in Sect. 4.

A dynamic scale is proposed in Zelnik-Manor and Perona (2004), suggesting to calculate a local scale σ_i for each data point x_i, i = 1, ..., N. The scale is chosen using the L1 distance from the r-th nearest neighbor of the point x_i. Explicitly, the calculation for each point is
σ_i = ‖x_i − x_r‖₁,   (3.2)
where x_r is the r-th nearest (Euclidean) neighbor of the point x_i. The value of the kernel for points x_i and x_j is then
K_{i,j} = exp( −‖x_i − x_j‖² / (σ_i σ_j) ).   (3.3)
This dynamic scale guarantees that at least half of the points are connected to r neighbors.

All the methods mentioned above treat ε as a scalar. Thus, when the data is sampled from various types of sensors, these methods may be dominated by the features (vector elements) with the highest energy (or variance). In such cases, each feature x_i[l], l = 1, ..., D, of a data vector may require a different scale. In order to re-scale the feature vector, a diagonal D × D positive-definite (PD) scaling matrix A ≻ 0 is introduced, and the rescaling of the feature vector x_i is set as x̃_i = A x_i, 1 ≤ i ≤ N. The kernel matrix is rewritten as
K_{i,j}(ε, A) = exp( −‖A(x_i − x_j)‖² / 2ε ).   (3.4)
A standard way to set the scaling elements A_{l,l} is to use the empirical standard deviation σ̂_l of the respective feature and then set ε = 1. More specifically,
A_{l,l} = 1 / σ̂_l,  l = 1, ..., D.   (3.5)
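The baseline selections reviewed above can be summarized in the following sketch (Python/NumPy; the function names and the candidate grid are illustrative, and the local scale is computed here with the Euclidean distance): it evaluates the max-min scale of Eq. (3.1), the log-sum curve L(ε) that Algorithm 3.1 scans, and per-point local scales in the spirit of Eq. (3.2).

```python
import numpy as np

def pairwise_sq_dists(X):
    """Squared Euclidean distances between columns of X (D x N)."""
    s = (X ** 2).sum(axis=0)
    return np.maximum(s[:, None] + s[None, :] - 2.0 * X.T @ X, 0.0)

def maxmin_scale(d2, C=2.0):
    """Eq. (3.1): C * max_j min_{i != j} ||x_i - x_j||^2, with C in [2, 3]."""
    off = d2 + np.diag(np.full(d2.shape[0], np.inf))  # ignore self-distances
    return C * off.min(axis=0).max()

def log_sum_curve(d2, eps_grid):
    """L(eps) = log(sum_ij K_ij(eps)); the linear region of L vs. log(eps)
    is the range of valid scales in the scheme of Singer et al. (2009)."""
    return np.array([np.log(np.exp(-d2 / (2.0 * e)).sum()) for e in eps_grid])

def local_scales(d2, r=7):
    """Eq. (3.2)-style local scale: distance to the r-th nearest neighbor
    (Euclidean variant used here for simplicity)."""
    return np.sqrt(np.sort(d2, axis=1)[:, r])

if __name__ == "__main__":
    X = np.random.default_rng(1).standard_normal((5, 300))
    d2 = pairwise_sq_dists(X)
    eps_grid = np.logspace(-3, 3, 25)
    L = log_sum_curve(d2, eps_grid)
    print("max-min scale:", maxmin_scale(d2))
    print("L(eps) range: ", L.min(), "->", L.max())   # ~log(N) .. ~2 log(N)
    print("median local scale:", np.median(local_scales(d2)))
```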
In this subsection we propose a framework for setting the scale parameter when the dataset X has some low-dimensional manifold structure M. We start by revisiting an analysis from Coifman et al. (2008), Hein and Audibert (2005), which relates the scale parameter to the intrinsic dimension d̄ (Definition 2.2.1) of the manifold. In Coifman et al. (2008) a range of valid values is suggested for ε; here we expand the results from Coifman et al. (2008), Hein and Audibert (2005) by introducing a diagonal PD scaling matrix A (as used in Eq. 3.4). This diagonal matrix enables a feature selection procedure which emphasizes the latent structure of the manifold.

Let K(ε) ∈ R^{N×N} and A ∈ R^{D×D} be the kernel matrix and the diagonal PD matrix (resp.) from Eq. (3.4). By taking the double sum of all elements in Eq. (3.4), we get
S(ε) = Σ_{i=1}^{N} Σ_{j=1}^{N} K_{i,j}(ε, A).   (3.6)
By assuming that the data points in X are independently and uniformly distributed over the manifold M, this sum can be approximated using the mean value theorem as
S(ε) ≈ (N² / Vol(M)²) ∫_M ∫_M exp( −‖y − y′‖² / 2ε ) dy dy′,   (3.7)
where Vol(M) = ∫_M dy is the (weighted) volume of the d̄-dimensional manifold M, with dy, dy′ being infinitesimal parallelograms on the manifold, carrying the dependence on A (see Moser 1965 for a more detailed discussion). When ε is sufficiently small, the integrand in the internal integral in Eq. (3.7) takes non-negligible values only when y′ is close to the hyperplane tangent to the manifold at y. Thus, the integration over y′ within a small patch around each y can be approximated by integration in R^d̄, so that
S(ε) ≈ (N² / Vol(M)²) ∫_M ∫_{R^d̄} exp( −‖t‖² / 2ε ) dt dy,   (3.8)
where d̄ is the intrinsic dimension of M and t is a d̄-dimensional vector of coordinates on the tangent plane. The integral in Eq. (3.8) has a closed-form solution (the internal integral yields (2πε)^{d̄/2}, so that the outer integral yields (2πε)^{d̄/2} Vol(M)), and we therefore obtain the relation
S(ε) ≈ N² (2πε)^{d̄/2} / Vol(M).   (3.9)

The key observation behind our suggested selection of A and ε is that, due to the approximation in Eq. (3.9), an "implied" intrinsic dimension d_ε(A) can be obtained for different selections of A and ε, as follows. Taking the log of Eq. (3.9) we have
log S(ε) ≈ (d̄/2) log(ε) + (d̄/2) log(2π) + 2 log(N) − log(Vol(M)).   (3.10)
Differentiating w.r.t. ε we obtain ∂ log S(ε)/∂ε ≈ d̄/(2ε), so that the implied dimension can be expressed as
d_ε(A) ≜ 2ε · ∂ log S(ε)/∂ε = 2 · ∂ log S(ε)/∂ log(ε) ≈ d̄.   (3.11)
We propose to choose the scaling so as to minimize the difference between the estimated dimension d̄ (see Sect. 2.2) and the implied dimension d_ε(A). We therefore set A (and ε) by solving the following optimization problem
(Â, ε̂) = argmin_{A, ε} | d̄ − d_ε(A) |.   (3.12)
We note that this minimization problem has one degree of freedom, which can be resolved, e.g., by arbitrarily setting A_{1,1} = 1 (see Algorithm 3.2 below). When working with a sufficiently small ambient dimension D, the minimization can be solved using an exhaustive search (e.g., on some pre-defined grid of scaling values). However, for large D an exhaustive search may become infeasible in practice, so we propose a greedy algorithm, outlined below as Algorithm 3.2, for computing both the scaling matrix A and the scale parameter ε.

The proposed algorithm (Algorithm 3.2) operates by iteratively constructing the normalized dataset X̃ ≜ AX/√ε row by row. To this end, the algorithm is first initialized by normalizing the first d̄ rows (coordinates) using their empirical standard deviations (the estimated, or known, intrinsic dimension d̄ is either provided as an input or estimated using DANCo, Ceruti et al. 2014). Then, in the l-th iteration (l = d̄+1, ..., D), only the scaling factor A_{l,l} for the next (l-th) row of X and a new overall scale ε are found, using a (two-dimensional) exhaustive search. The resulting A_{l,l} is then applied to the l-th row, and the entire l × N data block is normalized by √ε before the next iteration. The computational complexity of this algorithm, with k hypotheses of ε and A_{l,l} at each iteration, is O(N² k D), since N² operations are required for the computation of a single scaling hypothesis, and this is repeated for each coordinate l = 1, ..., D.

Due to the greedy nature of the algorithm, its performance depends on the order of the D features. We further propose to reorder the features using a soft feature-selection procedure. The studies in Cohen et al. (2002), Lu et al. (2007), Song et al. (2010) suggest an unsupervised feature selection procedure based on PCA; the idea is to use the features which are most correlated with the top principal components. We propose an algorithm for reordering the features based on their correlation with the leading coordinates of the DM embedding. The algorithm (Algorithm 3.3 below) is called Correlation Based Feature Permutation (CBFP), and uses the correlation between the D features and the d̄ leading embedding coordinates; this correlation value provides a natural measure of the influence of each feature on the extracted embedding. Specifically, the algorithm computes a vector c ∈ R^D of feature-embedding correlation scores,
c_l = Σ_{m=1}^{d̄} | corr( x[l], ψ_m ) |,  l = 1, ..., D,   (3.14)
where corr(·, ·) denotes the correlation coefficient between its two vector arguments and x[l] ∈ R^N is the l-th feature (row) of X, and then permutes the features in decreasing order of their scores.
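As a concrete illustration of Eqs. (3.9)-(3.12), the sketch below (Python/NumPy; the finite-difference slope estimate, the circle example and the grid of candidate scales are illustrative choices and do not reproduce the exact Algorithm 3.2) computes S(ε) on a grid of scales, estimates the implied dimension d_ε = 2·∂log S/∂log ε, and picks the candidate that best matches a given d̄.

```python
import numpy as np

def kernel_sum(X, eps, a=None):
    """S(eps) of Eq. (3.6) for the scaled kernel of Eq. (3.4).
    X: (D, N) data; a: optional per-feature scaling (diagonal of A)."""
    Xs = X if a is None else a[:, None] * X
    s = (Xs ** 2).sum(axis=0)
    d2 = np.maximum(s[:, None] + s[None, :] - 2.0 * Xs.T @ Xs, 0.0)
    return np.exp(-d2 / (2.0 * eps)).sum()

def implied_dimension(X, eps_grid, a=None):
    """d_eps = 2 * d(log S)/d(log eps), Eq. (3.11), via finite differences."""
    logS = np.log([kernel_sum(X, e, a) for e in eps_grid])
    return 2.0 * np.gradient(logS, np.log(eps_grid))

def best_scale(X, eps_grid, d_bar, a=None):
    """Candidate scale minimizing |d_bar - d_eps|, in the spirit of Eq. (3.12)."""
    d_eps = implied_dimension(X, eps_grid, a)
    i = np.argmin(np.abs(d_bar - d_eps))
    return eps_grid[i], d_eps[i]

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    t = rng.uniform(0, 2 * np.pi, 1500)          # 1-d manifold (a circle)
    X = np.vstack([np.cos(t), np.sin(t), 0.01 * rng.standard_normal(1500)])
    eps, d_eps = best_scale(X, np.logspace(-4, 1, 40), d_bar=1)
    print(f"selected eps={eps:.4g}, implied dimension={d_eps:.2f}")
```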
Some remarks on the uniform-distribution and finite-sample assumptions are in order. The derivation in Eq. (3.7) is based on the assumption that the distribution of X is uniform. If the density is not uniform, consider a measure μ with probability density μ(x); the integration in Eqs. (3.7) and (3.8) should then be with respect to μ(x). This would change the result in Eq. (3.9) and in the algorithm that follows. In Hein and Audibert (2005), the authors show that under certain assumptions on μ(x) the integral in Eq. (3.9) can still be used to approximate the intrinsic dimension d̄. The approximations in Eqs. (3.7) and (3.9) are unbiased for proper choices of N and ε. As shown in Hein and Audibert (2005), the bias error in Eq. (3.9) is proportional to ε²C(M), where C(M) is the curvature of the manifold. This means that ε should be sufficiently small for this term to vanish; however, due to the finite-sample approximation of the integral, ε should not be too small either, since the variance term (which decays with N) grows as ε decreases. This quantity could be used to validate the scaling parameter estimated by Algorithm 3.2. In practice, a proper range of scales ε can be validated by evaluating the slope of log S(ε) (see Eq. 3.9): if log S(ε), viewed as a function of log(ε), appears linear, this indicates that the approximation holds and that the bias and variance errors are negligible. In Sect. 4.1 below, we evaluate the performance of Algorithms 3.2 and 3.3 when applied to synthetic data embedded in artificial manifolds.

Classification algorithms use a metric space and an induced distance to compute the category of unlabeled data points. Dimensionality reduction is effective for capturing the essential intrinsic geometry of the data and neglecting undesired information (such as noise); therefore, dimensionality reduction can drastically improve classification results (Lindenbaum et al. 2015). As previously mentioned, ε plays a crucial role in the performance of kernel methods for dimensionality reduction. Various methods have been proposed for finding a scale that would potentially optimize classification performance. Studies such as Gaspar et al. (2012) and Staelin (2003) use a cross-validation procedure and select the scale that maximizes the performance on the validation data. Chapelle et al. (2002) apply gradient descent to a classification-error function to find an 'optimal' scale ε. Although these methods share our goal, they require performing classification on a validation set in order to select the scale parameter. To the best of our knowledge, the only method that estimates the scale without using a validation set was proposed by Campbell et al. (1999), where for binary classification it was suggested to use the scale that maximizes the margin between the support vectors; the authors show empirically that their suggested value correlates with peak classification performance on a validation set.

In this subsection we focus on DM for dimensionality reduction and demonstrate the influence of the scale parameter ε on classification performance in the low-dimensional space. The contribution in this section is threefold:
• We use the Davis-Kahan theorem to analyze a perturbed version of ideally separated classes. This allows us to optimize the choice of ε based merely on the eigengap of the perturbed kernel.
• Based on our study in Lindenbaum et al. (2015), we present an intuitive geometric metric to evaluate the separation in a multi-class setting. We show empirically that the scale which maximizes the ratio between class separation and the average class spread also optimizes classification performance.
• Finally, to reduce the computational complexity involved in the spectral decomposition of the affinity kernel, we present a heuristic that allows estimating the scale parameter based on the stochastic version of the affinity kernel.
(Fig. 1 illustrates the three resulting schemes: the probabilistic approach estimates the scale from modified transition matrices, the geometric approach from the geometry of the embedding, and the spectral approach from a generalized eigengap computed for each scale within the hypothesis range.)

Next, we develop tools to estimate a scale parameter based on a given training set. The training set, denoted T ⊂ R^{D×N}, consists of N_C classes, denoted C_1, ..., C_{N_C}. In this study we focus on the balanced setting, where the number of samples in each class is N_P, so that the total number of data points is N = N_P N_C. We use a scalar scaling factor ε; however, the analysis provided in this subsection could be expanded to a vector scaling (namely, to the use of a diagonal PD scaling matrix) in a straightforward way (Fig. 1).

The following approach for setting ε is based on the geometry of the extracted embedding. The idea is to extract low-dimensional representations for a range of candidate values of ε, and then choose the ε which maximizes the among-classes to within-class variance ratio. In other words, we choose a scale such that the classes are dense and far apart from each other in the resulting low-dimensional embedding space. This is done by maximizing the ratio between the inter-class variance and the sum of intra-class variances. We have explored a similar approach for an audio-based classification task in Lindenbaum et al. (2015). The geometric approach is implemented using the following steps:
1. Compute DM-based embeddings Ψ_d(x_n), n = 1, ..., N (Eq. 2.5), for various candidate values of ε.
2. Denote by μ_i the center of mass of class C_i, i = 1, ..., N_C, and by μ_a the center of mass of all the data points. All μ_i and μ_a are computed in the low-dimensional DM-based embedding Ψ_d(x_n) (see step 1).
3. For each class C_i, compute the average square distance (in the embedding space) of its N_P data points from the center of mass μ_i:
σ_i² = (1/N_P) Σ_{x_n ∈ C_i} ‖Ψ_d(x_n) − μ_i‖².   (3.15)
4. Compute the same measure for all data points:
σ_a² = (1/N) Σ_{n=1}^{N} ‖Ψ_d(x_n) − μ_a‖².   (3.16)
5. Compute the ratio between the overall spread and the sum of the within-class spreads:
ρ_ε = σ_a² / Σ_{i=1}^{N_C} σ_i².   (3.17)
6. Choose the ε which maximizes ρ_ε:
ε_Ψ = argmax_ε ρ_ε.   (3.18)
The idea is that ρ_ε (Eq. 3.17) inherits the inner structure of the classes and neglects the mutual structure. In Sect. 4.2, we describe experiments that empirically evaluate the influence of ε on the performance of classification algorithms. We note, however, that this approach requires an eigendecomposition computation for each candidate ε; thus, its computational complexity is of the order of O(N²d) (d being the number of required eigenvectors).
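A sketch of this geometric criterion follows (Python/NumPy; the toy two-class data, the diffusion time t = 1 and the function names are illustrative assumptions): for each candidate ε it builds a small DM-style embedding and evaluates the ratio ρ_ε of Eq. (3.17), whose maximizer gives ε_Ψ of Eq. (3.18).

```python
import numpy as np

def dm_embedding(X, eps, d):
    """d-dimensional diffusion-maps style embedding (t = 1) for scale eps.
    X: (D, N). Returns an (N, d) array of coordinates lambda_m * psi_m(i)."""
    s = (X ** 2).sum(axis=0)
    d2 = np.maximum(s[:, None] + s[None, :] - 2.0 * X.T @ X, 0.0)
    K = np.exp(-d2 / (2.0 * eps))
    deg = K.sum(axis=1)
    A = K / np.sqrt(np.outer(deg, deg))          # symmetric conjugate of P
    w, V = np.linalg.eigh(A)                     # ascending eigenvalues
    w, V = w[::-1], V[:, ::-1]
    psi = V / np.sqrt(deg)[:, None]              # right eigenvectors of P
    return psi[:, 1:d + 1] * w[1:d + 1]          # drop the constant psi_0

def rho_geometric(emb, labels):
    """rho_eps of Eq. (3.17): overall spread over sum of within-class spreads."""
    mu_a = emb.mean(axis=0)
    sigma_a = ((emb - mu_a) ** 2).sum(axis=1).mean()
    within = sum(((emb[labels == c] - emb[labels == c].mean(axis=0)) ** 2)
                 .sum(axis=1).mean() for c in np.unique(labels))
    return sigma_a / within

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    X = np.hstack([rng.normal(m, 0.5, size=(6, 100)) for m in (-2.0, 2.0)])
    y = np.repeat([0, 1], 100)
    for eps in (0.05, 0.5, 5.0, 50.0):
        print(eps, round(rho_geometric(dm_embedding(X, eps, d=2), y), 3))
```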
In this subsection, we analyze the relation between the spectral properties of the kernel and its corresponding low-dimensional representation. We start the analysis by constructing an ideal training set with well-separated classes. Then, we add a small perturbation to the training set and compute the perturbed affinity matrix K. Based on the spectral properties of the perturbed kernel, we suggest a scaling that captures the essential information for class separation.
The Ideal Case: We begin the discussion by considering an ideal classification setting, in which the N C classes are assumed to be well-separated in the ambient space X (a similar setting for spectral clustering was described in Ng et al. (2002) ). The separation is formulated using the following definitions: 1. The Euclidean gap is defined as This is the Euclidean distance between the two closest data points belonging to two different classes. 2. The maximal class width is defined as This is the maximal Euclidean distance between two data points belonging to the same class. We assume that D Class D Gap such that the classes are well separated. Using this assumption and the decaying property of the Gaussian kernel, the matrix K (Eq. 2.1) converges to the following block form The eigenvectors ψ i , i = 2, ..., N C have the same structure but cyclically shifted to the right by (i − 1) · N P bins. Proof Recall thatP =D −1K (row-stochastic). Due to the special block structure of K (Eq. 3.21), each block P (i) , i = 1, ..., N P , is row stochastic. Thus, Each eigenvector ψ i , i = 1, .., N p consists of a block of 1-s at the row indices that correspond to P (i) (Eq. 3.21), padded with zeros. ψ i is the right eigenvector (1 · ψ i =P · ψ i ), with the eigenvalue λ = 1. We now have an eigenvalue λ = 1 with multiplicity N P and piecewise constant eigenvectors denoted as ψ i , i = 1, ..., N P . Each data point x i ∈ C , = 1, ..., N C , corresponds to a row within the respective sub-matrix P ( ) . Therefore, using N C (T ) = [1 · ψ 1 , ..., 1 · ψ N C ] T as the lowdimensional representation of T , all the data points from within a class are mapped to a point in the embedding space. Proof Based on the representation in Proposition 3.3.1 along with Eq. (3.19), we get In a similar manner, we get by Eq. (3.20) Corollary 3.3.1 implies that we can compute an efficient representation for the N C classes. We denote this representation by¯ The Perturbed Case In real datasets, we cannot expect that the off block-diagonal elements of the affinity matrix K would be zero. The data points from different classes in real datasets are not completely disconnected, and we can assume they are weakly connected. This low connectivity implies that off-block-diagonal values of K are non-zeros. We analyze this more realistic scenario by assuming that K is a perturbed version of the "Ideal" block form ofK . Perturbation theory addresses the question of how a small change in a matrix relates to a change in its eigenvalues and eigenvectors. In the perturbed case, the off-block-diagonal terms are non-zeros and the obtained (perturbed) matrix K takes the form where W is assumed to be a symmetrical small perturbation of the form (m, ) , ,m = 1, ..., N C . (3.25) The analysis of the "Ideal case" has provided an efficient representation for classification tasks as described in Proposition 3.3.1. We propose to choose the scale parameter such that the extracted representation based onK (Eq. 3.21) is similar to the extracted representation usingK (Eq. 3.24). For this purpose we use the following theorem. Then the distance where V 1 , V 1 is a diagonal matrix with the principal angles on the diagonal, and · F denotes the Frobenius norm. In other words, the theorem states that the eigenspace spanned by the perturbed kernel K is similar, to some extent, to the eigenspace spanned by the ideal kernelK . The distance between these eigenspaces is bounded by 1 δ W F . 
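To make the perturbation picture concrete, the sketch below (Python/NumPy; the two-blob toy data and the use of the symmetric conjugate D^{-1/2} K D^{-1/2}, which shares the eigenvalues of P, are illustrative choices) computes the leading eigenvalues of the normalized kernel for several candidate scales and reports the gap λ_{N_C} − λ_{N_C+1} on which the spectral criterion developed next is built.

```python
import numpy as np

def normalized_spectrum(X, eps, k=5):
    """Leading eigenvalues of P = D^{-1} K, computed via the symmetric
    conjugate D^{-1/2} K D^{-1/2}, which has the same eigenvalues."""
    s = (X ** 2).sum(axis=0)
    d2 = np.maximum(s[:, None] + s[None, :] - 2.0 * X.T @ X, 0.0)
    K = np.exp(-d2 / (2.0 * eps))
    deg = K.sum(axis=1)
    A = K / np.sqrt(np.outer(deg, deg))
    return np.sort(np.linalg.eigvalsh(A))[::-1][:k]

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    n_c, n_p = 2, 80                       # two well-separated classes
    X = np.hstack([rng.normal(c * 6.0, 1.0, size=(4, n_p)) for c in range(n_c)])
    for eps in (0.1, 1.0, 10.0, 100.0):
        lam = normalized_spectrum(X, eps)
        gap = lam[n_c - 1] - lam[n_c]      # lambda_{N_C} - lambda_{N_C + 1}
        print(f"eps={eps:7.1f}  eigengap={gap:.3f}  top eigs={np.round(lam, 3)}")
```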
Theorem 3.2 provides a measure which helps to minimize the distance between the ideal representation¯ N C (proposition 3.3.1) and the realistic (perturbed) representation N C . Theorem 3.2 The distance between¯ N C ∈ R N C and N C ∈ R N C in the DM representations based on the matricesP and P, respectively, is bounded such that where W is the perturbation matrix defined in Eq. (3.24) andD is a diagonal matrix whose elements are the sums of rowsD i,i = jK i, j . Proof DefineĀ D −1/2KD−1/2 =D 1/2PD−1/2 . Based on Eq. (3.24), we have We are now ready to use Theorem 3.1. Assume that the eigenvalues {λ i } ofĀ and { λ i } of A are ordered in descending order, and set S = [λ( A) N C , 1], denoting the first N C eigenvectors ofĀ and of A asV 1 and V 1 , respectively. Obviously, by construction we have λ i ∈ S, i = 1, ..., N C . Based on the analysis of the "ideal" matrixP, we know that its first N C eigenvalues are equal to 1. Noting thatĀ is algebraically similar toP, they have the same eigenvalues, implying thatλ i ∈ S, i = 1..., N C , as well. Using the definition of δ from Eq. (3.26), we conclude that δ = λ N C − λ N C +1 . Settinḡ A D −1/2KD−1/2 and B =D −1/2 WD −1/2 , the Davis-Kahan Theorem 3.1 asserts that the distance between the eigenspacesV 1 and V 1 is bounded such that The eigen-decomposition ofĀ is written asĀ =V¯ V T . Note thatP = D −1/2ĀD1/2 which means that the eigen-decomposition ofP could be written as P =D −1/2V¯ V TD1/2 and the right eigenvectors ofP are¯ =D −1/2V . Using the same argument for A and choosing the eigenspaces using the first N C eigenvectors, we get that decreasing the term d(V 1 , V 1 ) also decreases d(¯ N C , N C ). The perturbation matrix W (Eq. 3.24) changes only slightly over the range of values of ∈ (D Class , D Gap ). Explanation For two data points x i , x j from different classes x i ∈ C , x j ∈ C m , = m and ∼ D Class , the values of W i, j 1. The decaying property of the Gaussian kernel provides a range of values for ∼ D Class such that the perturbation matrix W is indeed small. In Sect. 4.2 below we evaluate this assumption using a mixture of Gaussians. The scale parameter , which maximizes Ge This approach also requires computing an eigendecomposition for each value, thus, its computational complexity is of the order of O(N 2 N C ). We introduce here notations from graph theory to compute a measure of the class separation based on the stochastic matrix P (Eq. 2.2). Based on the values of the matrix P, a Cut (Shi and Malik 2000) is defined for any two subsets A, B ⊂ T (3.32) Given N C classes C 1 , C 2 , C 3 , ..., C N C ⊂ T , we define the Classification Cut by the following measure (3.33) In clustering problems, a partition is searched such that the normalized version of the cut is minimized (Dhillon et al. 2004; Ding et al. 2005) . We use this intuition for a more relaxed classification problem. We first define a Generalized cut using the following matrix (3.34) Based on P (which carries the dependence on ), the Generalized cut is then defined as Gcut( A, B) The idea is to remove the probability of "staying" at a specific node from the withinclass transition probability. Now let We search for which maximizes ρ P , namely ρ P = arg max (ρ P ). (3.37) By the stochastic model, the implied probability of transition between point x i and point x j is equal to p(x i , x j ) = P i, j , therefore by maximizing ρ P , the sum of withinclass transition probabilities is maximized. Based on the definition of the diffusion distance (Eq. 
2.4), this implies that the within-class diffusion distance would be small, followed by a small Euclidean distance in the DM space. The heuristic approach entailed in Eq. (3.37) provides yet another criterion for setting a scale parameter which captures the geometry of the given classes. This approach does not require computing an eigendecomposition for each candidate value, thus, its computational complexity is of order of O(N 2 ). In this section we provide some experimental results, showing and comparing the different approaches (outlined in the previous section) in their respective contexts. We begin by demonstrating our proposed manifold-based scaling in Sect. 4.1, and then demonstrate the classification-based scaling approaches in Sect. 4.2. In this subsection we evaluate the performance of the proposed manifold-based approach by embedding a low-dimensional manifold which lies in a high-dimensional space. We consider two datasets: A synthetic set and a set based on an image taken from the MNIST database. The first experiment is constructed by projecting a 3-dimensional synthetic manifold into a high-dimensional space, then concatenating it with Gaussian noise. Data generation is done according to the following steps: • First, a 3-dimensional Swiss Roll is constructed based on the following function 1, ..., N ) , are drawn from Uniform distributions within the intervals [ 3·π 2 , 9·π 2 ], [0, 100], respectively. In our experiment we chose N = 2000. • We project the Swiss roll into a high-dimensional space by multiplying the data by a random matrix N T ∈ R D 1 ×3 , D 1 > 3. The elements of N T are drawn from a Gaussian distribution with zero mean and variance of σ 2 T . • Finally, we augment the projected Swiss Roll with a vector of Gaussian noise, obtaining ..., N , (4.2) where each component of n 1 i ∈ R D 2 , i = 1, ..., 2000, is an independent Gaussian variable with zero mean and variance of σ 2 N . We define the datasets Y = [y 1 , ..., y N ] ∈ R 3×N and X = [x 1 , ..., x N ] ∈ R (D 1 +D 2 )×N To evaluate the proposed framework, we apply Algorithm 3.1 followed by Algorithm 3.2, and extract a low-dimensional embedding (Fig. 2) . Different high-dimensional datasets X were generated using various values of σ N , σ N T , D 1 and D 2 . DM is applied to each X using: • The standard deviation normalization as defined in Eq. (3.5). • The 0 scale, which is described in 3.1 and in Singer et al. (2009) . • The MaxMin scale, as defined in Eq. (3.1) and in Lafon et al. (2006) . The extracted embedding is compared to the embedding extracted from the clean Swiss roll Y defined in Eq. (4.1). Each embedding is computed using an eigendecomposition, therefore, the embedding's coordinates could be the same up to scaling and rotation. To overcome this ambiguity, we search for an optimal translation and rotation matrix of the following form¯ where R is the rotation matrix and T is the translation matrix, which minimizes the mis-match error (4.4) defined as the sum of square distances between values of the clean mapping (Y ) and the "aligned" mapping¯ (X). We repeat the experiment 40 times and compute the empirical Mean Square error in the embedding space defined as (4.5) An example of the extracted embedding based on all the different methods is presented in Fig. 3 , followed by the MSE in Fig. 4 . It is evident that Algorithm 3.2 is able to extract a more precise embedding than the alternative scaling schemes. 
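A sketch of the synthetic data generation described above is given below (Python/NumPy; the Swiss-roll parametrization shown is one common choice, since Eq. (4.1) is not reproduced in full here, and the dimensions and noise levels are illustrative).

```python
import numpy as np

def make_noisy_swiss_roll(n=2000, d1=10, d2=10, sigma_t=1.0, sigma_n=0.1, seed=0):
    """Synthetic data in the spirit of Eqs. (4.1)-(4.2): a 3-d Swiss roll,
    projected to d1 dimensions by a random Gaussian matrix and padded with
    d2 coordinates of pure Gaussian noise."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(1.5 * np.pi, 4.5 * np.pi, n)      # [3*pi/2, 9*pi/2]
    h = rng.uniform(0.0, 100.0, n)
    Y = np.vstack([t * np.cos(t), h, t * np.sin(t)])  # clean roll, (3, n)
    N_T = rng.normal(0.0, sigma_t, size=(d1, 3))      # random projection
    noise = rng.normal(0.0, sigma_n, size=(d2, n))    # pure-noise features
    X = np.vstack([N_T @ Y, noise])                   # ambient data, (d1+d2, n)
    return Y, X

if __name__ == "__main__":
    Y, X = make_noisy_swiss_roll()
    print("clean manifold:", Y.shape, " ambient data:", X.shape)
```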
The strength of Algorithm 3.2 is that it emphasizes the coordinates which are essential for the embedding and neglects the coordinates which were contaminated by noise. In the following experiment, we create an artificial low-dimensional manifold by rotating a handwritten image of a digit. First, we rotate the handwritten digit '6' from MNIST dataset by N = 320 angles that are uniformly sampled over [0, 2π ]. Next, we add random zero-mean Gaussian noise with a variance of σ 2 N independently to each pixel. An example of the original and noisy version of the handwritten '6' are shown in Fig. 5 . Note that the values of the original image are in the range [0, 1]. In order to capture the circular structure of the manifold we apply DM to the rotated images. An example of the expected circular structure extracted by DM is depicted in Fig. 5 . We apply the different scaling schemes to the noisy images and extract a 2dimensional DM-based embedding. For the vectorized scaling schemes (proposed and standard deviation approach) we apply the scalings to the top 50 principle components. This allows us to reduce the computational complexity, as the dimension of the feature space is reduced from 784 to 50. In this experiment we further compare to two additional local scaling schemes. The first (Vasiloglou et al. 2006) , which we refer to as Harmonic, and the second, presented in Taseska et al. (2019), which we refer to as LocalDM. To evaluate the performance of the different scaling schemes, we propose the following metric to compare the extracted embedding to a perfect circle. Given a 2-dimensional representation (X), we use a polar transformation to evaluate the implied radius at each point. The squared radius is defined by r 2 i = ( 1 (x i )) 2 + ( 2 (x i )) 2 . Next, we normalize the radius values by their empirical mean, such that r i = r i μ r , and μ r is the empirical mean of the radii μ r = i r i N . Finally, we compute the empirical variance of the normalized radiusr . Explicitly, this value is computed by σ 2 r = i (r i −1) 2 N . A scatter plot of σ 2 r vs. the variance of the additive noise σ 2 N is presented in Fig. 6 . As evident in this figure, up to a certain variance of the noise, the proposed scaling scheme suppresses the noise and captures the correct circular structure of the data. At some level of noise our method breaks. It seems that the standard deviation and Singer's approach also break at a similar noise level. An explanation for this phenomenon could be that at the lower SNRs all these methods start to "amplify" the noise, rather than the signal. In this subsection we provide empirical support for the theoretical analysis from Sect. (3.3) . We evaluate the influence of on the classification results using four datasets: a mixture of Gaussians, artificial classes lying on a manifold, handwritten digits and seismic recordings. We focus on evaluating how the proposed measures ρ P , ρ , Ge (Eqs. (3.37), (3.18), (3.31), resp.) are correlated with the quality of the classification. In the following experiment we focus on a simple classification test using a mixture of Gaussians. We generate two classes using two Gaussians, based on the following steps: 1. Two vectors μ 1 and μ 2 ∈ R 6 were drawn from a Gaussian distribution N (0, σ M · I 6×6 ). These vectors are the centers of masses for the generated classes C 1 and C 2 . (resp.). 2. N = 100 data points were drawn for each class C 1 and C 2 with a Gaussian distribution N (μ 1 , σ V · I 6×6 ) and N (μ 2 , σ V · I 6×6 ), respectively. 
Denote these 2N data points by C 1 ∪ C 2 = T ⊂ R 6×200 . The first experiment evaluates the Spectral Approach (Sect. 3.3.2). Therefore, we set σ v < σ M such that the class variance is smaller than the variance of the center of mass. Then, we apply DM using a scale parameter such that ∼ σ 2 v < σ 2 M . In Fig. 8 (left) , we present the first extracted diffusion coordinate using various values of . It is evident that the separation between classes is highly influenced by . A comparison between ρ P ,ρ and Ge is presented in Fig. 8 (right) . This comparison provides evidence of the high correlation between ρ P (Eq. 3.37), ρ (Eq. 3.18) and the generalized eigengap (Eq. 3.31) (Fig. 7) . To evaluate the validity of Assumption 1, we calculate the Frobenius norm of the perturbation matrix W for various values of . The results with the approximated Ge are presented in Fig. 9 . Indeed, as evident from Fig. 9 , the value of || W || F is nearly constant for a small range of values around Ge . For the non-ideal case, we generate classes using a non-linear function. This non-linear function is designed to model an unknown underlying nonlinear physical process governed by a small number of parameters. Consequently, the classification task is essentially expected to provide an estimate of these hidden parameters. An example for such a problem is studied, e.g., in Lindenbaum et al. (2015) , where a musical key is estimated by applying a classifier to a low-dimensional representation extracted from the raw audio signals. In the following steps we describe how we generate classes from a Spiral structure: 1. Set the number of classes N C and a gap parameter G. Each class C , = 1, ..., N C , consists of N P data points drawn from a uniformly dense distribution within the line [( − 1) · L C , · L C − G], = 1, ..., N C . L C is the class-length, set as L C = 1 N C . Let N = N C N P denote the total number of points. as the set of all points from all classes. 3. Project each r i into the ambient space using the following spiral-like function where n 2 i ∈ R 3 are drawn independently from a zero-mean Gaussian distribution with covariance = σ S · I 3 . Two examples of the spiral-based classes are shown in Fig. 10 . For both examples, we use N C = 4, N P = 100, σ S = 0.4 with different values for the gap parameter G. To evaluate the advantage of the proposed scale parameters and P (Eqs. (3.18) and (3.37), resp.) for classification tasks, we calculate the ratios ρ P and ρ for various Fig. 10 values of , and then we evaluate the resulting classification (which is based on the lowdimensional embedding). Examples of embeddings of the two spirals from Fig. 10 are shown in Fig. 11 . This merely demonstrates the effect of on the quality of separation. We apply classification in the low-dimensional space using KNN (k = 1). The KNN classifier is evaluated based on Leave-One-Out cross validation. The results are shown in Fig. 12 , where it is evident that the classification results in the ambient space are highly influenced by the scale parameter . Furthermore, peak classification results occur at a value of corresponding to the maximal values of ρ Ge and ρ . The value of ρ P did not indicate the peak classification scale, however, its computation complexity is lighter compared to ρ Ge and ρ . In the following experiment, we use the dataset from the UCI machine learning repository (Lichman 2013). 
The dataset consists of 2000 data points describing 200 instances of each digit from 0 to 9, extracted from a collection of Dutch utility maps. The dataset consists of multiple features of different dimensions. We use a concatenation of the Zerkine moment (ZER), morphological (MOR), profile correlations (FAC) and the Karhunen-loéve coefficients (KAR) as our features space. We compute the proposed ratios ρ P and ρ for various values of , and estimate the optimal scale based on Eqs. (3.18), (3.37). We evaluate the extracted embedding using 20-fold cross validation (5% left out as a test set). The classification is done by applying KNN (with k = 1) in the d-dimensional embedding. In Fig. 13 , we present the classification results and the proposed optimal scales for classification. Our proposed scale concurs with the scale that provides maximal classification rate. In the next evaluation, we focus on classifying individuals that were infected by COVID-19. In certain individuals, the COVID-19 disease may cause symptoms of pneumonia; these individuals could be further diagnosed using chest X-ray images (Abbas et al. 2020) . Convolutional neural networks (CNNs) have demonstrated promising results in classification of COVID-19 based on X-ray (Sethy and Behera 2020; Wang and Wong 2020) or CT (Shuai et al. 2020; Song et al. 2020 ) of chest images. Here, we focus on the task of classifying COVID-19 patients based on the front view chest X-ray images. The data is collected from the Kaggle database, 1 from which we have used 112 images that are split equally between two classes: healthy and COVID-19 infected. The feature space is defined by resizing all the chest X-rays to 200 × 200 pixel images. This feature space is still sensitive to translations and scales, which are inherent to the X-ray modality. This sensitivity could clearly be partly mitigated by defining a translation-invariant feature space, e.g. by using CNNs, however, this is beyond the scope of this study. To evaluate the proposed schemes, we compute the ratios ρ P and ρ for various values of , and estimate optimal scales based on Eqs. (3.18), (3.37). For each scale value, we extract a 2-dimensional embedding and perform classification using KNN (with k = 2) and SVM classifiers. Accuracy is averaged using a 10-folds cross-validation. Fig. 14 demonstrates that the proposed scales are good candidates for selecting a scale that yields high classification performance. Our numerical simulations demonstrate that none of the proposed scaling schemes consistently outperforms the others. Nonetheless, across all of the evaluated datasets, the best performance was obtained within the range where the values were bounded by ψ , ρ P and Ge . The probabilistic approach that estimates ρ P is based on the transition matrix P and requires O(N 2 ) computations. This complexity could be further reduced by computing a k-sparse approximation for the matrix P. Specifically, methods such as k-sparse graph (Wang et al. 2013) can be computed with complexity O (NlogN + N k) . The probabilistic approach is based on a heuristic, and we consider it to be the least accurate of all of the proposed schemes. Both the Spectral approach and the Geometric approach require a spectral decomposition. Assuming that both methods use the same number of coordinates (i.e., the embedding dimension is the same as the number of classes, d = N C ), they both require O(N 2 N C ) computations. 
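The following rough sketch (Python/NumPy) combines the probabilistic criterion with the sparsification idea mentioned above; the within-class transition mass with a zeroed diagonal is used here as a stand-in paraphrase of the generalized cut of Eqs. (3.34)-(3.37), and the per-row k-NN truncation is an illustrative substitute for a k-sparse graph construction.

```python
import numpy as np

def rho_p(X, labels, eps, k=20):
    """Within-class transition mass of a k-NN-sparsified diffusion kernel;
    a stand-in for the probabilistic criterion rho_P (cf. Eq. (3.37))."""
    s = (X ** 2).sum(axis=0)
    d2 = np.maximum(s[:, None] + s[None, :] - 2.0 * X.T @ X, 0.0)
    K = np.exp(-d2 / (2.0 * eps))
    np.fill_diagonal(K, 0.0)                        # drop "staying" probability
    # keep only the k largest affinities in each row (sparse approximation)
    idx = np.argsort(K, axis=1)[:, :-k]
    np.put_along_axis(K, idx, 0.0, axis=1)
    P = K / K.sum(axis=1, keepdims=True)            # row-stochastic
    same = labels[:, None] == labels[None, :]
    return P[same].sum() / len(labels)              # average within-class mass

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    X = np.hstack([rng.normal(m, 1.0, size=(6, 100)) for m in (-3.0, 3.0)])
    y = np.repeat([0, 1], 100)
    for eps in (0.1, 1.0, 10.0, 100.0):
        print(f"eps={eps:7.1f}  rho_P~={rho_p(X, y, eps):.3f}")
```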
The spectral approach is derived by analyzing a perturbed version of a kernel K , constructed based on well-separated classes. Empirically, this analysis seems to hold for certain cases; however, if there is no spectral gap (i.e., λ N C − λ N C +1 1) we do not recommend to use this method. Finally, the Geometric approach is purely based on the extracted representation. Therefore, we consider it as the most reliable method for estimating a scale parameter that is most appropriate for classification. In this section we demonstrate the capabilities of the proposed approach for extracting meaningful parameters from raw seismic recordings. Extracting reliable seismic parameters is a challenging task. Such parameters could help discriminate earthquakes from explosions, moreover, they can enable automatic monitoring of nuclear experiments. Traditional methods such as Rodgers et al. (1997) ; Blandford (1982) use signal processing to try to analyze seismic recordings. More recent methods, such as Kort- 2003); Tiira (1996) use machine learning to construct a classifier for a variety of seismic events. Here, we extend our result from Rabin et al. (2016a) ; Lindenbaum et al. (2018) , in which we have demonstrated the strength of DM for extracting seismic parameters. Our proposed method performs a vector scaling for manifold learning. Thus, essentially if the data lies on a manifold, our scaling combined with DM will extract the manifold from high-dimensional seismic recordings. Moreover, it will provide a natural feature selection procedure, thus if some features are corrupt, the proposed scaling may be able to reduce their influence. As a test case, we use a dataset from (Rabin et al. 2016a, b; Lindenbaum et al. 2018) , which was recorded in Israel and Jordan between 2005-2015. All recordings were collected in HRFI (Harif) station located in the south of Israel. The station collects three signals from north (N), east (E) and vertical (Z). Each signal is sampled using a broadband seismometer at 40[Hz] and consists of 10,000 samples. Seismic events usually generate two waves, primary-waves (P) and secondary waves (S). The primary wave arrives directly from the source of the event to the recorder, while the secondary wave is a shear wave and thus arrives at some time delay. Both waves pass through different material thus have different spectral properties. This motivates the use of a time-frequency representation as the feature vector for each seismic event. The time-frequency representation used in this study is a Sonogram (Joswig 1990), which offers computation simplicity while retaining the sufficient spectral resolution for the task in hand. The Sonogram is basically a spectrogram, renormalized and rearranged in a logarithmic manner. Given a seismic signal z(n), the Sonogram is extracted using the following steps: 1. Compute a discrete-time Short Term Fourier Transform: 3. Reshape the time frequency representation into a vector, by concatenating columns. The resulted vector representation for a signal z(n) is denoted by x. These steps are applied to each of the channels separately. This results in three sets X E for the east channel, X N for the north channel and X Z for the horizontal. Examples for seismic recording of an explosion and of an earthquake are presented on Fig. 15a , b. Examples of corresponding Sonograms are presented in Fig. 15a , b . To evaluate the proposed scaling for manifold learning, we use a subset of the seismic recording with 352 quarry blasts. 
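A rough sketch of a sonogram-style feature extraction, as outlined earlier in this section, is given below (Python/NumPy; the window length, hop size, number of logarithmic bands and the normalization are illustrative choices, since step 2 of the construction is not reproduced above); it would be applied to each of the three channels separately.

```python
import numpy as np

def sonogram_features(z, n_fft=256, hop=128, n_bands=12):
    """Spectrogram-like features on a logarithmic frequency axis,
    normalized and flattened into a single feature vector."""
    # short-time Fourier transform via framed FFTs with a Hann window
    win = np.hanning(n_fft)
    n_frames = 1 + (len(z) - n_fft) // hop
    frames = np.stack([z[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2        # (frames, bins)
    # group FFT bins into logarithmically spaced bands
    edges = np.unique(np.geomspace(1, spec.shape[1] - 1, n_bands + 1).astype(int))
    bands = np.stack([spec[:, a:b].sum(axis=1) for a, b in zip(edges[:-1], edges[1:])],
                     axis=1)                               # (frames, bands)
    bands /= bands.sum() + 1e-12                           # simple normalization
    return bands.reshape(-1)                               # concatenate columns

if __name__ == "__main__":
    z = np.random.default_rng(6).standard_normal(10_000)   # 250 s at 40 Hz
    x = sonogram_features(z)
    print("feature vector length:", x.shape[0])
```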
The explosions have occurred at 4 known quarries surrounding the recording station HRFI. Our study in Rabin et al. (2016a) , Lindenbaum et al. (2018) , has demonstrated that most of the variability of quarry blasts stems from the source location of each quarry, therefor, we assume that the 352 blasts lie on some low-dimensional manifold. Where the parameters of the manifold should correlate with location parameters. Our approach for setting the scale parameter provides a natural feature selection procedure. To evaluate the capabilities of this procedure we "destroy" the information in some of the features. We do this by applying a deformation function to one channel out of the three seismometer recordings. We define the input for Algorithm 3.2 as where g(·) is an element-wise deformation function. In the first test case the deformation function is defined by g(y) = y 0.1 . Then, we apply DM with various scaling schemes and examine the extracted representation. In Fig.16 the two leading DM coordinates of different scaling methods are presented. The quarry cluster separation is clearly evident in Fig. 16d . To further evaluate how well the low-dimensional representation correlates with the source location we use a list of source locations. A list with the explosions locations is provided to us based on manual calculations, performed by an analyst by considering the phase difference Fig. 16d) between the signals' arrival times to different stations. We note that this estimation is accurate up to a few kilometers. A map of the location estimates colored by source quarry is presented in Fig. 17a . Then, we apply Canonical Correlation Analysis (CCA) to find the most correlated representation. The transformed representations U and V are presented in Fig. 17b , c respectively. The two correlation coefficients between coordinates of U and V are 0.88 and 0.72. In the second test case, we use additive Gaussian noise to "degrade" the signal. We use g(y) = y + n 1 as the defoemation function, where n 1 is drawn from a zeromean Gaussian distribution with variance of σ 2 N . We estimate the scaling based on the proposed and alternative methods. Then, we apply CCA to the two leading DM coordinated and the estimated source locations. The top correlation coefficients for various values of σ 2 N are presented in Fig. 18 . Both the max-min method Max Min and Singer's 0 (Singer et al. 2009 ) scheme seem to break at the same noise level. The standard deviation approach is robust to the noise level, this is because it essentially performs whitening of the data. However, this also obscures some of the information content when the noise is of low power. The proposed approach seems to outperform all alternative schemes for this test case. Automatic classification of seismic events is useful as it may reduce false alarm warnings on one hand, and enable monitoring nuclear events on the other hand. To evaluate the proposed scaling for classification of seismic events, we use a set with 46 earthquakes and 62 explosions all of which were recorded in Israel. A low-dimensional mapping is extracted by using DM with various values of , and binary classification was applied using KNN (k = 5) and Support Vector Machine (SVM) in a leaveone-out fashion. The accuracy of the classification for each value of is presented in Fig. 19 . The estimated values of Ge , ρ P and ρ were annotated. It is evident that for classification the estimated values are indeed close to the optimal values, although they do not fully coincide. 
Nevertheless, they all achieve high classification accuracy.

The scaling parameter of the widely used Gaussian kernel is often crucial for machine learning algorithms. As happens in many tasks in the field, there does not seem to be one global scheme that is optimal for all applications. For this reason, we propose two new frameworks for setting a kernel's scale parameter, each tailored to a specific task. The first approach is useful when the high-dimensional data points lie on some lower dimensional manifold: by exploiting the properties of the Gaussian kernel, we extract a vectorized scaling factor that provides a natural feature selection procedure. Theoretical justification and simulations on artificial data demonstrate the strength of the scheme over alternatives. The second approach could improve the performance of a wide range of kernel-based classifiers. The capabilities of the proposed methods are demonstrated using artificial and real datasets. Finally, we present an application of the proposed approach that helps learn meaningful seismic parameters in an automated manner. In the future, we intend to generalize the approach to the multi-view setting recently studied in Lindenbaum et al. (2020), Salhov et al. (2019), Lederman and Talmon (2014).

Acknowledgements This work was supported by the Israel Science Foundation (ISF 1556/17). This research was partially supported by the US-Israel Binational Science Foundation (BSF 2012282), the Blavatnik Computer Science Research Fund, Blavatnik ICRC Funds and the Pazy Foundation.

References
Classification of COVID-19 in chest X-ray images using DeTraC deep convolutional neural network
Laplacian eigenmaps and spectral techniques for embedding and clustering
Constructing a hidden Markov model based earthquake detector: application to induced seismicity
Seismic event discrimination
Data dimensionality estimation methods: a survey
Dynamically adapting kernels in support vector machines
DANCo: dimensionality from angle and norm concentration
Choosing multiple parameters for support vector machines
Feature selection using principal feature analysis. Univ. of Illinois at Urbana-Champaign
Coifman RR, Lafon S (2006) Diffusion maps
Graph Laplacian tomography from unknown random projections
Discrimination of earthquakes and underwater explosions using neural networks
Kernel k-means: spectral clustering and normalized cuts
On the equivalence of nonnegative matrix factorization and spectral clustering
An algorithm for finding intrinsic dimensionality of data
On the parameter optimization of support vector machines for binary classification
Classifying seismic waveforms from scratch: a case study in the alpine environment
Intrinsic dimensionality estimation of submanifolds in R^D
Principal component analysis. Wiley Online Library
Joswig M (1990) Pattern recognition for earthquake detection
Automatic classification of seismic events within a regional seismograph network
Data fusion and multicue data matching by diffusion maps
Common manifold learning using alternating-diffusion, submitted
Musical key extraction using diffusion maps
Multi-view kernels for low-dimensional modeling of seismic events
Multiview diffusion maps
Bandwidth selection for kernel-based classification
Riemannian manifold learning for nonlinear dimensionality reduction
Feature selection using principal feature analysis
Face recognition based on Laplacian eigenmaps
On the volume elements on a manifold
On spectral clustering: analysis and an algorithm
Continuous automatic classification of seismic signals of volcanic origin at Mt. Merapi, Java, Indonesia
An intrinsic dimensionality estimator from near-neighbor information
Earthquake-explosion discrimination using diffusion maps
Multi-channel fusion for seismic event detection and classification
A comparison of regional-phase amplitude ratio measurement techniques
Nonlinear dimensionality reduction by local linear embedding
Seismic detection using support vector machines
Applied and Computational Harmonic Analysis
Scholkopf B, Smola AJ (2001) Learning with kernels: support vector machines, regularization, optimization, and beyond
Detection of Coronavirus Disease (COVID-19) based on deep features
Normalized cuts and image segmentation
A deep learning algorithm using CT images to screen for Corona Virus Disease
Detecting intrinsic slow variables in stochastic dynamical systems by anisotropic diffusion maps
Feature selection using principal component analysis
Deep learning enables accurate diagnosis of novel coronavirus (COVID-19) with CT images
Parameter selection for support vector machines, Hewlett-Packard Company
A global geometric framework for nonlinear dimensionality reduction
Discrimination of nuclear explosions and earthquakes from teleseismic distances with a local network of short period seismic stations using artificial neural networks
Statistical estimation of the intrinsic dimensionality of a noisy signal collection. 16th IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing
An evaluation of intrinsic dimensionality estimators
Fast algorithm for approximate k-nearest neighbor graph construction
COVID-net: a tailored deep convolutional neural network design for detection of COVID-19 cases from chest radiography images
Self-tuning spectral clustering

Appendix: Dimensionality from Angle and Norm Concentration (DANCo) (Ceruti et al. 2014)

DANCo is a recent method for estimating the intrinsic dimension based on high-dimensional measurements. The estimate is based on the following steps:
1. For each data point x_i, find its nearest neighbors and denote the farthest of them by S(x_i). The number of neighbors used depends on the density of the dataset, and is usually not higher than 10.
2. Calculate the normalized closest distance for each data point x_i.
4. For each point x_i, find the nearest neighbors and center them relative to x_i; the translated points are denoted as x̃_{s_j} = x_{s_j} − x_i.
The set of nearest neighbors of point x_i is denoted by S̄(x_i) = {x̃_{s_j}}_j. The distribution model is explained in Camastra (2003).
5. Calculate the angles for all pairs of vectors within S̄(x_i). The angles are calculated using
θ(x̃_{s_j}, x̃_{s_m}) = arccos( ⟨x̃_{s_j}, x̃_{s_m}⟩ / (‖x̃_{s_j}‖ ‖x̃_{s_m}‖) ).   (6.2)
6. For each point x_i, concatenate all angles from Eq. (6.2) into a vector θ̄_i, and estimate the parameters of their distribution based on a maximum-likelihood (ML) estimation using the von Mises (VM) distribution with respect to X. The VM pdf describes the probability of θ given the mean direction ν and the concentration parameter τ ≥ 0. The VM pdf, as well as the ML solution, are presented in Ceruti et al. (2014). The means of ν̂ and τ̂ are denoted by μ̂_ν and μ̂_τ, respectively.
7. For each hypothesis d = 1, ..., D, draw a set of N data points Y_d from a d-dimensional unit hypersphere.
8. Repeat steps 1-6 for the artificial dataset Y_d, and denote the corresponding maximum-likelihood estimates obtained for Y_d by d̂_ML^{Y_d}, ν̂^{Y_d}, τ̂^{Y_d}, μ̂_ν^{Y_d}, μ̂_τ^{Y_d}.
9. Obtain d̂ by minimizing the Kullback-Leibler (KL) divergence between the distributions estimated from X and from Y_d. The estimator takes the following form:
d̂ = argmin_{d=1,...,D} [ KL( g(·; d̂_ML), g(·; d̂_ML^{Y_d}) ) + KL( q(·; μ̂_ν, μ̂_τ), q(·; μ̂_ν^{Y_d}, μ̂_τ^{Y_d}) ) ],
where g is the pdf of the normalized distances and q is the VM pdf; both g and q are described in Ceruti et al. (2014).
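A small sketch of the two raw statistics that DANCo builds on, the normalized nearest-neighbor distances and the pairwise angles of centered neighbors (steps 1-5 above), is given below (Python/NumPy; the neighborhood size and the simple summaries printed here are illustrative, and the ML/KL machinery of steps 6-9 is not reproduced).

```python
import numpy as np

def danco_statistics(X, k=10):
    """Normalized nearest-neighbor distances and pairwise neighbor angles,
    the raw ingredients of the DANCo estimator (steps 1-5 above)."""
    D, N = X.shape
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    order = np.argsort(d2, axis=1)                 # order[:, 0] is the point itself
    r_norm, angles = [], []
    for i in range(N):
        nbrs = order[i, 1:k + 1]
        dists = np.sqrt(d2[i, nbrs])
        r_norm.append(dists[0] / dists[-1])        # closest / farthest neighbor
        V = X[:, nbrs] - X[:, [i]]                 # neighbors centered at x_i
        V = V / np.linalg.norm(V, axis=0)
        C = np.clip(V.T @ V, -1.0, 1.0)
        iu = np.triu_indices(k, 1)
        angles.append(np.arccos(C[iu]))
    return np.array(r_norm), np.concatenate(angles)

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    t = rng.uniform(0, 2 * np.pi, 500)             # points on a circle in R^3
    X = np.vstack([np.cos(t), np.sin(t), np.zeros_like(t)])
    r, th = danco_statistics(X)
    print("mean normalized NN distance:", r.mean().round(3))
    print("mean pairwise neighbor angle:", th.mean().round(3))
```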