key: cord-0811786-3m1zet9t authors: Cheng, Wei-Chen; Liou, Cheng-Yuan title: Manifold construction based on local distance invariance date: 2010-02-11 journal: Memet Comput DOI: 10.1007/s12293-009-0030-y sha: 9c9a77d2ad48020088cb7dbb56293ce27109e59d doc_id: 811786 cord_uid: 3m1zet9t This paper presents a distance invariant manifold that preserves the neighborhood relations among data patterns. All patterns have their corresponding cells in the manifold space. The constellation of neighborhood cells closely resembles that of patterns. The manifold is invariant under the translation, rotation and scale of the pattern coordinates. The neighborhood relations among cells are adjusted and improved in each iteration according to the reduction of the distance preservation energy. Dimension reduction [10] in the manifold space can display meaningful relationships among patterns. Foundations for various data manifolds have been set down for factorial components [19] and for generalized adaline [34] . They have been successfully applied in various temporal data analyses [35] . The principal component analysis (PCA) and multidimensional scaling (MDS) [32] are well established linear models that have been developed for such reduction. Nonlinear reduction algorithms have been devised with varying degrees of success. The Isomap [31] and the conformal C-Isomap [4] extend MDS by using the geodesic distance to construct the nonlinear manifold. The Locally Linear Embedding (LLE) [27] computes certain linear model coefficients to maintain the local geometric properties in the manifold. Both Isomap and LLE have distinguished performances. Isomap has been extended to find the intrinsic curvature manifold [4] , such as a fishbowl surface. Distortion analysis [12, 21, 23] has been introduced to study the formation mechanism of the self organizing map (SOM) [10] . The lack of a precise energy function for the SOM is studied in [7] . The ill-posed problem of SOM is conjectured by Kohonen, refer the Preface in [13] . It is very difficult to use the SOM for the non-vector data [14] . This paper devised a distance preservation energy to construct the manifold. It is a precise energy for the coordinate transformation of the patterns. The ill-posed problem does not exist for such a precise energy function. As this energy is used to manipulate the pattern coordinates, there is no probability manipulation in the manifold construction. The manifold is developed for data visualization only. It is not designed for LVQ or for the clustering purpose. This energy is devised to display the local relations among patterns directly. Since it uses the relative distances among local patterns to construct the manifold, it is invariant under the translation, rotation and scale of the pattern coordinate. Many existing manifolds are heavily sensitive to the setting of coordinates and obtain serious unreliable results, for example, the eigenvector system and LLE. This kind of invariance is very useful in many applications. The manually assigned numerical codes for the physical entities of patterns are usually arbitrary and abstract. The absolute values of these codes are meaningless. We expect that the distance between two neighborhood patterns carries rich and reliable meaning. The distance preservation manifold is capable of mapping the whole distribution of very high dimensional patterns to a perceptible space, 2D or 3D. We show experiments on market states [5, 18] and influenza protein sequences. Suppose there are P patterns distributed in a D-dimensional pattern space, X = {x j , j = 1, . . . , P}. Each pattern, x j , is a D-dimensional column vector and has a corresponding mapped cell, y j , in the manifold space. The positions of the cells in the manifold space are Y = {y j , j = 1, . . . , P}. Each cell position, y j , is a M-dimensional column vector. In the pattern space, giving a pattern x p , the set of those patterns whose distances to x p are less than r are included in the set U ( p, r ), where r denotes the radius of neighborhood region in the pattern space. The notation |U ( p, r )| denotes the number of patterns in the set U ( p, r ). Note that the space, M, is a pre-designed space that is continuous without fixed borders. One can set a different M in a different application. This preset M can drastically simplify the manifold problem. All other manifold methods, except SOM, attempt to seek such a space, for example a wrapped surface in the pattern space. Consider the local distance invariant (LDI) energy [16] , The energy is a function of y and r . The main difference between this energy (1) and that of MDS [32] is that (1) is a function of r . All patterns are used in (1) when r is very large. Few neighbors are used when r is small. These near neighbors are much reliable for setting and supporting, collectively, the position of the cell y p in the manifold space. The algorithm for the LDI manifold that adjusts the cell position, y j , in the manifold space is as follows. 1. Initialize the cell set Y . 2. Assign a value to r . 3. For each epoch t from t 0 to t 1 4. For every pair patterns x p and x q , adjust their cell positions by 5. Reduce r . In the above algorithm, η is the learning rate. The computational complexity for calculating the pairwise distance is O D P 2 and finding the neighbors for every pattern is O(P 2 ). The computational complexity of Isomap is O P 3 . It is higher than that of LLE. M is a small number, M = 2 or 3 in many cases, in the dimension reduction. To explain the idea of this algorithm, one may imagine that there are P labeled balls (cells) on a flat table. The surface of this table resembles the low dimensional space, M = 2. These balls are free to move on the table. They are confined on this M-dimensional space. Each ball is labeled with its corresponding pattern x p . The position of the pth ball on the table is y p . The algorithm seeks a ball distribution Y in M that can resemble the neighborhood relations of X in D. The system energy (1) exerts an implicit, distant and remote, force to the current distribution Y to force it maximally similar (resemble) to the distribution X . One may construct a Gibbs type system for the energy (1) to relax the table distribution as a whole and solve the distribution Y . Here, we prefer a batch operation, Step 4, by using the forces to redistribute the balls. There are many ways to assign the initial positions of cells in Step 1. For example, one may apply the linear projection methods, such as PCA and MDS, to project all patterns onto the M-dimensional space. Other developed methods, LLE and Isomap, can also be used for the initial assignment. Using the manifold by LLE or Isomap as the initial arrangement is computationally expensive for certain applications. In Step 2, r denotes the neighborhood radius of the hypersphere in the pattern space. The patterns inside the hypersphere are used in the energy (1) . During the end of the relaxation process, the value of r is shrunk to the minimum distance among all pattern pairs. We set the initial value r to be the maximum distance among all pattern pairs and reduce r linearly or exponentially. The computational complexity of updates is This algorithm is different from SOM. The LDI manifold organizes the positions of the cells based on the distance relations instead of the absolute pattern vectors used in the SOM [11] . These relative distances are readily available in many applications. The positions of the cells will not be fixed in regular positions as those in SOM. There is no synaptic weight attached to each cell that indicates its constellation in the pattern space. The differences between SOM and LDI are listed in Table 1 . Table 2 lists the energy functions of LDI, Isomap and LLE. In Table 2 , the term, d G (x p , x q ), computes the shortest path distance between x p and x q in a weighted graph G. The weight, W i j , summarizes the contribution of the point x j to the x i . The pseudocode of LDI is contained in Table 3 . In the pseudocode, the parameter r is reduced End For exponentially. Alternatively, r can be decreased linearly, All distance relations will be included in the tuning of each cell's position in the beginning of the relaxation algorithm. The number of neighbors of a pattern, |U (i, r )|, will be reduced to zero, Δy i will approach to zero. This will stop the movement of the cell position on the manifold space and reach the convergence. In this example, we show that the sampling density affects the LLE manifold. The swiss roll equation [17] is We uniformly sample data points as patterns along the variable v in the range − 3 10 , 3 10 and non-uniformly along u. Let r (u) be the equation of the curve, The arc length of r (u) is where f (u) is the arc length with respect to u. We sample along f (u) and use f −1 to calculate the u value. With this u value and the v, we can find a corresponding point on the plane by using the roll formula (5) . Using this sampling technique, the probability densities of points may be uniform or non-uniform along the arc length. We can design any sampling densities along the variables, u and v. Figure 1 shows the manifolds obtained by different algorithms using different density distributions. The left diagrams plot the density distributions with respect to f (u). In these diagrams, the horizontal coordinate, x-axis, is f (u), and the vertical coordinate, y-axis, is the density. Figure 1a shows the manifolds of an unbalanced sampling. The blue area of the roll has dense sampling points in unit area and the red area has low density of points. Figure 1b shows the manifolds where the red part of the roll has high density. Figure 1c shows the manifolds where the density is even along the arc length. Figure 1d shows the manifolds where the yellow area (middle portion) has high density. From Fig. 1 , we see that the Isomap is not affected much by the density distributions. The sinusoid curves in the LLE manifolds will fluctuate with the densities. The LDI manifolds are not affected by the density distributions. This is because every pattern has its correspondent cell in the manifold space. The number of cells is equals to the number of patterns. The probability density of each pattern is preserved and equal to the density of its cell. The LDI algorithm is a coordinate transformation algorithm. It maps all events (patterns) from the event space (the pattern space) to their corresponding cells in the manifold space. This algorithm accomplishes the coordinate transformation from the pattern space to the manifold space automatically. The LDI is a precise energy for such transformation. There is no manipulation on the probability density function in the construction of the manifold. The ill-posed problem in the Preface in Kohonen's book [13] does not exist in the LDI mani-fold. The LDI manifold has a perfect energy function and bypasses the problem skillfully. Note that the conformal self-organizing map [20] is also devised without this problem. The LDI manifold fully utilizes the manifold space to maintain both global and local structures. As an example, the curve pattern in Fig. 2 has two kinds of structures. It consists of 240 pattern points. These points globally form a S-curve on the x y-plane and locally form a sinusoidal curve along the z-axis. The points that have the same color are from the same pattern data. The LDI manifold reveals these two kinds of structures. Both Isomap with the parameter, K = 7, and LLE with the parameter, K = 12, derive incorrect structures. MDS maps the points onto a projection plane and reveals the S-curve structure only. Table 4 . These 18 indices can reveal certain national policies. We use the monthly data in the analysis. Each pattern vector, x p , contains the 18 values of the normalized indices of the pth month and is regarded as the month state of the economy. Each country indices I p is normalized by 2000) I (J anuar y,2000) . The normalization uses the country indices in January, 2000 as the base. There are P = 117 states and each state is an 18-dimension vector (D = 18 ). Figure 4 displays the LDI manifold (M = 2). The black solid circle denotes the state cell in January 2000 and the concentric circles denote the cell in September, 2009. The green line shows the LDI manifold when r is set to the maximum distance,r = 6.474. The red line is the converged manifold when r is decreased to the minimum distance,ř = 0.070. We record the relaxation processes between the green line and the red line, from light gray to dark gray. In Fig. 4 , the red line reveals much more detailed information than that of the green line. It clearly shows that the subprime mortgage mess in July, 2007 is the turning point of the global economy. We list several important events in Table 5 One may apply the manifold technique to interpolate the incomplete data records and extrapolate the predicted states. As for the interpolation of missing data, suppose there are three missing records during certain month. We use all P − 1 available months to construct a manifold with P − 1 cells. These P − 1 cell positions are then fixed. Then, we insert one new cell in the manifold for the incomplete month and train its position using the rest 15 = 18 − 3 records in the vectors, x p , of all P states. Note that the rest P(P − 1)/2 distances are calculated by using the 15 rest records of all P vectors. This inserted cell will converge to a new position and we obtain a fixed manifold with P cells. The distances of those near neighbors of this inserted cell in the manifold can be used as the weights to interpolate the missing three records. By a weighted fitting of these three missing records among the neighbors, one can estimate the missing three records. With the state trend and distances among neighbors, the extrapolation of the predicted state can be accomplished in a similar way. We also select five indices, GDP growth, consumer price index, import value index, export value index, and separately. The 210 = 21 × 20 ÷ 2 pair distances among the twenty-one year vectors are calculated and used in the LDI manifold algorithm. We set the learning rate η = 0.01. The algorithm is operated with 100 millions epochs. The parameter r is exponentially decreased. The maximumr and minimumř are listed in Table 6 . Figure 6 shows the 3D manifolds for four countries. They have different market patterns and trends. We mark several abnormal states. The results in [18] are consistent with those in Fig. 6 . Due to the horizontal gene transfer [25] , the life tree may not be an adequate structure to fully display the evolution relations among generations. Alternatively, we show how to display the generations in the manifold space. The H1N1 and H5N1 are subtypes of the Influenza A virus which can cause illness in humans and many animal species. The H5N1 subtype has been reported in 445 human cases and has caused 263 human deaths [33] . The pathway of H5N1 spread is still unclear. We use the protein HA segments of all available DNA sequences saved in NCBI's Influenza Virus Resource [2] . We select 184 distinct Influenza A (H1N1) protein sequences and 196 distinct H5N1 sequences. All 380 sequences are full lengths. The minimum and maximum lengths of the H1N1 sequences are 554 and 567 amino acids respectively. The minimum and maximum lengths of the H5N1 sequences are 552 and 575 amino acids. The host of all the selected isolates is human. All the selected Influenza A (H1N1) sequences are recorded during the year 2009. The H5N1 sequences are recorded during the 13 years from 1997 to 2009. We align all H1N1 sequences together. Multiple-sequence alignments were performed using the Clustal W2 program [15] . We compute the Hamming distances between every two sequences. Figure 7 shows the 2D manifold for the P = 184 H1N1 sequences. Each aligned sequence consists of 567 amino acids. There are 184×183÷2 = 16,836 Hamming distances among the 184 sequences. The neighborhood region, r , is reduced fromr = 20 toř = 1 in the algorithm. The algorithm is operated with 500,000 epoches. There exists only one cluster. The average center of the cluster is marked with a black square. The two sequences close to the center, y mean = 1 184 184 p=1 y p , are marked with two black circles. They are the isolate ACQ99610 and the isolate ACR81633. There is a overlap between these two black circles. We may expect that certain cells near the center may be the grandmother of the H1N1 virus. Finding the grandmother cell is useful for many medicine goals, such as the design of vaccine and the trace of the virus source. The evolution trend of the DNA mutations is displayed by colors. One can monitor newly sampled isolates in the trend. The three gray ellipses are the one, two and three times of the covariance. Both the covariance and the center are calculated in the 2D manifold space, Y . Figure 8 shows the distance invariant manifold for the 196 H5N1 gene sequences. In this case P = 196. Each aligned sequence consists of 583 amino acids. There are (196 × 195)/2 = 19,110 Hamming distances among these 196 sequences. The neighborhood region, r , is reduced from r = 74 toř = 1 in the algorithm. The gray ellipses show the covariance information in the manifold space. The sequence that is closest to the center is marked with a black circle. It is the isolate ACA64009. The manifold space M may have other shapes. We show an application for the manifold space that has a tree like structure. The phylogenetic tree is useful to display the interrelations among species [9, 28, 29] . Usually, the tree is constructed based on the minimization of the overall difference between all path lengths of the tree and their corresponding distances among species that stored in a distance matrix [8] . The path distance between two leaf nodes is the sum of the lengths of the branches along the path connecting these two nodes. According to the construction, the sum of all distances of all node pairs is close to that obtained from the distance matrix. Any path length in the tree should be fitted to its corresponding species relation in the matrix. Based on the LDI manifold, we rewrite the LDI energy for the tree path estimation to fine-tune its branch lengths for the H5N1 tree and SARS tree. Given a set of distance relations among species, we plan to construct the branch lengths of a given tree that meet the distance relations with its path lengths. The tree is an undirected binary tree. Its branches have no direction. Suppose there are total P species, {x p , p = 1, . . . , P}, where x p is a column vector that saves the amino acid sequence of the pth species. The tree has P leaf nodes and 2P − 2 branches. The tree structure can be saved in a matrix, A P(P−1) 2 ×(2P−2) . In A, the row indices denote the tree paths corresponding to every species pairs, x i , x j , i = 1, . . . , P; j = i + 1, . . . , P , and the column indices denote the 2P − 2 branches of the tree. The element, a i j , of the matrix A is set as a i j = 0, when the path i doesn't contain the branch j. 1, when the path i includes the branch j. . -by-1 column vector that consists of all distances between species pairs. Assume that the lengths of the 2P − 2 branches are variables, z = z 1 , . . . , z (2P−2) T . Then seek a solution for z, The method in [29] suggested using the least square method with non-negative constraint to modify the branch lengths of the tree. We rewrite the LDI energy to solve the variables z, E (r ), where t ( p, q) denotes the path length from the leaf node p to the node q. U ( p, r ) is the set that contains all neighbors x q of x p that are within the distance range r ; U = {x q ; x p − x q ≤ r }. The variables z can be solved by minimizing this energy. We adjust the branch lengths z by applying the gradient descent method to the energyÊ (r ) and restrict the value of z to be larger than or equal to zero. The value of r is reduced during the relaxation. Three datasets are used in the experiments. One is Case's data [3] that contains the immunological distances among nine frog (Rana) species. The second dataset is the influenza Fig. 11 The initial tree is obtained by UPGMA and the branch lengths are estimated by LDI manifold Fig. 12 Performance comparison, MDI (r), on the estimated branch lengths of the H5N1 tree, in Fig. 11 , obtained by the LDI algorithm and the non-negative least square algorithm by [29] A virus, H5N1 subtype, from NCBI (National Center for Biotechnology Information). The third dataset is the SARS-CoV genome sequences. We will employ the UPGMA method (Unweighted Pair Group Method with Arithmetic mean) by [30] or the neighbor-joining method by [28] to build the initial tree, A, and then use the LDI energy (11) to fine tune the branch lengths. Figure 9 shows the estimated branch lengths by the LDI algorithm. The lengths by Sattath [29] are also plotted in this figure for comparison. The LDI algorithm obtains very different branches for the subtree that contains the five species, R. aurora, R. boylii, R. cascadae, R. muscosa and R. pretiosa. After convergence, we calculate the performance using the formula, M DI (r ), for the LDI algorithm and Sattath's method. The measurement of distance invariance, MDI (r ), is The performance is plotted in Fig. 10 . In this figure, the x-axis denotes r and y-axis denotes the calculated value of M DI (r ) in (12) . The performance shows that the LDI algorithm obtains very precise length information for those small distance species. The LDI algorithm is used to construct the phylogenetic tree for the amino acid sequences in the segment one region (PB2) of bird flu, H5N1. The sequences in Influenza Virus Resource [2] are used in the construction. All redundant sequences are removed and will not be used. The tree is constructed for the P = 97 protein sequences of H5N1 recorded from 1997 to 2007 that hosted only on human. There are 4656 = (97 × 96)/2 distances for all sequence pairs. The lengths of the sequences after performing the multiple-alignment, [6] , are all 770. UPGMA uses the Hamming distances among the aligned sequences to build the tree, see Fig. 11 . The initial branch lengths of the UPGMA tree are obtained by Sattath's method. These lengths are used in the LDI algorithm as the initial setting. The neighborhood region, r , is reduced fromr = 51 toř = 1 in the algorithm. In Fig. 12 , the performance (12) shows that the LDI algorithm obtains very precise lengths for close species. Fig. 13 Branch lengths by LDI algorithm for the [24] 's SARS dataset. The four initial trees are obtained by the neighbor-joining method by [28] We now construct the phylogenetic tree for SARS. Rota et al. [26] analyzed the sample, SARS-CoV, under the accession number, AY278741, in Genbank. They applied phylogenetic analysis for the proteins from known coronaviruses and the predicted proteins produced from SARS-CoV. Marra et al. [24] studied the SARS genome, named Tor2 (AY274119.3), that consists of 29751 base pairs in Genbank. They showed that the SARS virus does not closely resemble any of the three previously known groups of coronaviruses. Figure 13 shows the results of the LDI algorithm. The four initial trees are constructed by the neighbor-join method. The results clearly show that SAR-CoV is not in the group of coronavirus. Finally, we briefly summarize several features of the LDI manifold. It use the relative distance between patterns to construct the low dimensional manifold. This manifold is invariant under the translation, rotation and scale of pattern coordinates. The constellations of cells are fixed reliably by their neighborhood patterns collectively. The constellation displays both the global and local details of the pattern structure. This manifold is useful in many applications, such as pattern recognitions [22] ; time series; chain and tree branch lengths. It is relatively difficult to display chains or tree branches in SOM. The LDI can be applied to many other invariance preservation problems, such as angular invariance and conformal invariance. The memetic tree-based genetic algorithm and its application to portfolio optimization The influenza virus resource at the national center for biotechnology information Biochemical systematics of members of the genus Rana native to western north America Global versus local methods in nonlinear dimensionality reduction Visual explorations in finance: with self-organizing maps MUSCLE: multiple sequence alignment with high accuracy and high throughput Self-organizing maps: ordering, convergence properties and energy functions Estimating phylogenetic trees from distance matrices Hierarchical clustering schemes Self-organized formation of topologically correct feature maps Self-organization and associative memory Comparison of SOM point densities based on different criteria Self-organizing maps Self-organizing maps of symbol strings Clustal W and Clustal X version 2.0 Separation of internal representations of the hidden layer Manifold construction by local neighborhood preservation Economic state indicator on neuronic map Separable cross-entropy approach to power spectrum estimation Conformal self-organization for continuity on a feature map Conformality in the self-organization network Handprinted character recognition based on spatial topology distance measurement Code vector density in topographic mappings: scalar case The genome sequence of the SARS-associated coronavirus A proposition on memes and meta-memes in computing for higher-order learning Characterization of a novel coronavirus associated with severe acute respiratory syndrome Nonlinear dimensionality reduction by locally linear embedding The neighbor-joining method: a new method for reconstructing phylogenetic trees Additive similarity trees Principles of numerical taxonomy A global geometric framework for nonlinear dimensionality reduction Multidimensional scaling I: theory and method Epidemic and pandemic alert and response Function approximation using generalized adalines Independent component analysis of correlated neuronal responses in area MT Acknowledgments This work was supported by National Science Council NSC 97-2221-E-002-207-MY3.