Abstract
Unsupervised and semi-supervised machine learning is very advantageous in data-intensive applications. Density-based hierarchical clustering obtains a detailed description of the structures of clusters and outliers in a dataset through density functions. The resulting hierarchy of these algorithms can be derived from a minimal spanning tree whose edges quantify the maximum density required for the connected data to characterize clusters, given a minimum number of objects, MinPts, in a given neighborhood. CORE-SG is a powerful spanning graph capable of deriving multiple hierarchical solutions with different densities with computational performance far superior to its predecessors. However, density-based algorithms use pairwise similarity calculations, which leads such algorithms to an asymptotic complexity of \(O(n^2)\) for n objects in the dataset, impractical in scenarios with large amounts of data. This article enables hierarchical machine learning models based on density by reducing the computational cost with the help of Data Bubbles, focusing on clustering and outlier detection. It presents a study of the impact of data summarization on the quality of unsupervised models with multiple densities and the gain in computational performance. We provide scalability for several machine learning methods based on these models to handle large volumes of data without a significant loss in the resulting quality, enabling potential new applications like density-based data stream clustering.
1 Introduction
Massive amounts of data are generated daily [3, 23]. Unsupervised machine learning is advantageous in such scenarios, as it finds significant relationships in the data and does not require labels, which can be rare or expensive to obtain in big data scenarios [2]. Among the various unsupervised techniques, data clustering algorithms are worth mentioning, especially density-based ones, as they can find clusters and outliers without any information external to the data. Furthermore, these algorithms impose no shape constraints on clusters and do not require the number of clusters to be defined a priori, unlike parametric clustering algorithms. Their applications appear in high-impact journals across various research areas, including human behavior [9], chemistry [14], genetics [19], robotics [21], and others.
Density-based algorithms consider objects of a dataset to be chain-connected if there is a minimum number of objects in their neighborhood (given by the smoothing parameter MinPts), characterizing a region dense enough to form a cluster. Otherwise, objects are considered outliers. Given a value of MinPts, it is possible to establish the minimum density and similarity necessary to connect any pair of objects in the dataset, forming a complete graph G in which each node represents an object and each edge a measure of the mutual reachability density that connects them. For example, the HDBSCAN* [6] algorithm, the state of the art in density-based clustering, derives a hierarchy of objects and their clusters from the Minimum Spanning Tree (MST) of the complete graph G. This enables the extraction of density-based clusters, the detection and evaluation of outliers, and other helpful information in various areas such as bridge approval [8], cancer research [15], shipping systems [16], among others. Later, Gertrudes et al. [10] demonstrated that several other unsupervised and semi-supervised algorithms can also be explained as instances or variations of the use of this MST over G, considering the minimum density resulting from MinPts. However, it is often necessary to find the most appropriate value for MinPts, which can be done by analyzing multiple MSTs over G, with each MST requiring an asymptotic computational complexity of \(O(n^2 \log n)\) for a dataset containing n objects.
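As an illustration of the MST-over-G view described above, the following minimal Python sketch (for exposition only, not the authors' implementation; the names `euclid`, `core_distances`, and `mst_prim` are hypothetical) computes core distances as the distance to the MinPts-th nearest neighbor and runs Prim's algorithm over the virtual complete graph of mutual reachability distances, in \(O(n^2)\) overall:

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def core_distances(data, min_pts):
    # Distance from each object to its MinPts-th nearest neighbor
    # (each object is taken as its own 0-th neighbor here).
    return [sorted(euclid(p, q) for q in data)[min_pts] for p in data]

def mst_prim(data, min_pts):
    # Prim's algorithm over the virtual complete mutual reachability
    # graph G; edge weight = max(core_u, core_v, dist(u, v)). O(n^2).
    n = len(data)
    cores = core_distances(data, min_pts)
    in_tree = [False] * n
    best = [math.inf] * n          # cheapest known connection to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                w = max(cores[u], cores[v], euclid(data[u], data[v]))
                if w < best[v]:
                    best[v], parent[v] = w, u
    return edges
```

Rerunning this for every candidate MinPts is what costs \(O(n^2 \log n)\) per value in the general case; avoiding that rescan is the motivation for the spanning graphs discussed next.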
In some cases, it is necessary to analyze different minimum densities, which can be computationally prohibitive given the need to re-execute the MST extraction over the complete graph G. Recently, Neto et al. [18] proposed replacing the graph G with a much smaller graph called CORE-SG, whose main property is that it contains all the MSTs of G for any value of MinPts below a given maximum value \(MinPts_{max}\). Extracting an MST from CORE-SG has an asymptotic complexity of \(O(n \log n)\), with \(n>> MinPts_{max}\), a very significant reduction compared to the extraction over G. CORE-SG is very advantageous, as it allows the analysis of results for multiple values of MinPts at the computational cost of obtaining a single result, since the construction of CORE-SG requires only the MST over G with density estimated at \(MinPts_{max}\).
In scenarios where the amount of generated data is massive, using CORE-SG to extract results with different densities is even more computationally advantageous. However, in these scenarios, its construction is prohibitive due to the intrinsic complexity of density estimation methods using pairwise calculations, which remains \(O(n^2)\). Because of this complexity, density-based methods might be limited to smaller datasets. Breunig et al. [4] proposed the use of Data Bubbles, a data summarization method for hierarchical density-based clustering, which makes it possible to preserve the quality of the result and, at the same time, considerably increase the computational performance of the algorithm. Furthermore, Data Bubbles are advantageous compared to other feature vectors (Clustering Features - CF), as they were explicitly designed by some of the authors of HDBSCAN* [6] to be used accurately in the calculation of density-based reachability distances.
Contributions. This work enables the construction of the CORE-SG on large datasets, which allows the direct application of a variety of unsupervised and semi-supervised algorithms in tasks like clustering, outlier detection, classification, and others [10]. Density-based algorithms are known for the quality of their results, and their computational cost is traditionally bounded by \(O(n^2)\). This work proposes summarizing the data into a significantly smaller number of Data Bubbles and, subsequently, constructing the CORE-SG over the Data Bubbles. Once built, the CORE-SG can be used to obtain hierarchical results extracted with different values of its density parameter with computational performance O(n), which enables its application to large volumes of data. As far as the authors are aware, this is the only work in the literature that allows the construction of multiple hierarchical models with different density estimates from a single model of summarized data.
The present work analyzes the impact of summarization on the quality and performance of density-based hierarchical models to measure to what extent significant results can be obtained without degenerating these models. The presented results, in addition to benefiting a variety of existing algorithms in the literature organized in [10], also enable potential new density-based applications, including scenarios with continuous data streams.
2 Related Works
Data summarization has been successfully used to enable machine learning tasks on large datasets or data streams, which are potentially infinite. For hierarchical clustering, one of the best-known algorithms is BIRCH [24], which builds and maintains a hierarchy of structures called Clustering Features (CF). Each CF stores the number of summarized objects, their linear sum, and their sum of squares. BIRCH's structure of CFs forms a tree of subclusters and is built incrementally, as objects are inserted sequentially into the CF that best represents them. Parameters define when new CFs are added to the tree, growing the hierarchy. Using CF statistics improves the algorithm's performance without significantly impacting the quality of the result.
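The CF statistics are additive, which is what makes BIRCH-style summarization incremental; a minimal illustrative sketch (a hypothetical class, not BIRCH's actual code):

```python
class CF:
    """BIRCH-style Clustering Feature: (n, LS, SS) for one subcluster."""
    def __init__(self, dim):
        self.n = 0                 # number of summarized objects
        self.ls = [0.0] * dim      # linear sum, per dimension
        self.ss = 0.0              # sum of squared norms

    def insert(self, x):
        # Inserting a point only adds to the statistics.
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)

    def merge(self, other):
        # Two CFs merge by component-wise addition (the additivity property).
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

    def centroid(self):
        return [v / self.n for v in self.ls]
```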
Traditional CFs summarize data in a way that loses information about the original placement of the objects and their nearest neighbors, creating a distortion when they are represented only by their central objects. This loss of information impacts the quality of density-based algorithms. To solve this problem, Data Bubbles (DBs) [4] were proposed, adding to CFs the information necessary to calculate density estimates in the region covered by a DB: a "representative" vector indicating the placement of the DB, its radius, and estimates of the average distances between objects within the DB, essential for density calculations. DBs have been successfully used in a variety of works [13, 20, 26] related to new summarization approaches and density calculations. Additionally, DBs were proposed for application in hierarchical algorithms, as was done with OPTICS [1] in [5], where the use of DBs achieved a large increase in computational performance while maintaining the quality of the final result. The potential of applying DBs for the analysis and visualization of large databases is also discussed in [5].
A characteristic of DBs is that they are incremental, meaning they can be updated with new data. This characteristic is explored in [17], where DBs are used to provide incrementality for the method the authors call Incremental and Effective Data Summarization (IEDS), designed for large volumes of dynamic data, where new data is continually added to a hierarchical clustering tree. The IEDS method speeds up the incremental construction of summarized data using triangle inequalities for inserting and deleting DBs. Furthermore, the algorithm analyzes the DBs to keep those with high-quality compressed data.
However, when initializing the summarization, it is necessary to define which data will be summarized in each DB. This choice can generate different and unsatisfactory results, mainly if data separated by low-density regions are summarized in the same DB. To make the performance of the algorithm less sensitive to the compression rate and to the location of the seeds that initialize the DBs, a method of data summarization and DB initialization is proposed in [25]. It uses density-based sampling to select the subset of data used in the construction of the DBs, thus avoiding assigning data separated by regions of low density to the same DB.
Recently, DBs have been used along with the MapReduce framework for data summarization and scalability [20]. They were applied to enable a scalable version of HDBSCAN* [6] by summarizing data (and even other DBs) into a hierarchy of density-based clusters. This approach is similar to the one proposed in this article, as it enables the application of a density-based hierarchical algorithm over large amounts of data. However, unlike the approach used in [20], our work considers the summarization of a structure that yields multiple hierarchies with different densities rather than a single one, which is quite advantageous when one has no prior knowledge of the fittest values of the density parameters.
3 Proposed Method
Our proposal is divided into two steps: summarizing the data and obtaining hierarchies with multiple densities. The original DB [4] uses the parameter \(\epsilon \), which is incompatible with HDBSCAN* and CORE-SG. Therefore, we propose an adaptation and new definitions to enable the joint application of DB and CORE-SG.
3.1 Density-Based Summarization
Given a dataset X with n objects of d dimensions and an integer value for the density parameter MinPts, which defines the number of neighbors needed for an object to be considered a core object, our method starts by applying a method of O(n) computational complexity to divide X into m subsets, where \(m << n\). Any O(n) clustering into m subsets can be adopted in this step, as can a guided initialization [25].
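One possible O(n) split (an illustrative assumption; the paper only requires some linear-time division, and a guided initialization [25] can be used instead) is to shuffle the data once and cut it into m equally sized chunks, each of which will seed one DB:

```python
import random

def split_into_subsets(data, m, seed=0):
    # Shuffle once (O(n)) and deal the objects round-robin into m subsets.
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    subsets = [[] for _ in range(m)]
    for pos, i in enumerate(idx):
        subsets[pos % m].append(data[i])
    return subsets
```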
Definition 1
- Data Bubbles: A data subset \(X_i \subseteq X\) is summarized in a DB, defined as a tuple \(B_i = (rep_i, n_i, extent_i, nnDist_i)\), where \(rep_i\) is the position of the representative of \(X_i\), \(n_i\) is the cardinality \(|X_i|\), \(extent_i\) is a radius around \(rep_i\) that contains most objects of \(X_i\), and \(nnDist_i\) is a function that, given an integer value \(k \le MinPts < n_i\), returns the estimated average distance to the k-th nearest neighbor within the set of objects \(X_i\). In a Euclidean space, the linear and quadratic sums of the objects in \(X_i\), \(LS_i\) and \(SS_i\) respectively, can be used to calculate the characteristics of \(B_i\). With these statistics, \(rep_i=\frac{LS_i}{n_i}\) and \(extent_i=\sqrt{\frac{2 \cdot n_i \cdot SS_i - 2 \cdot LS^2_i}{n_i \cdot (n_i - 1)}}\) are calculated, and the expected distance to the k-th nearest neighbor is \(nnDist_i(k) = (\frac{k}{n_i})^{\frac{1}{d}} \cdot extent_i\).
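Definition 1 can be sketched directly from the formulas above (a hypothetical helper for the Euclidean case, with objects given as tuples and \(n_i \ge 2\)):

```python
import math

def make_bubble(subset):
    # Build the tuple (rep, n, extent, nnDist) of Definition 1 from the
    # statistics n, LS (per-dimension linear sum) and SS (scalar sum of
    # squares) of the summarized objects. Assumes len(subset) >= 2.
    n = len(subset)
    d = len(subset[0])
    ls = [sum(x[j] for x in subset) for j in range(d)]
    ss = sum(v * v for x in subset for v in x)
    rep = [v / n for v in ls]
    ls_sq = sum(v * v for v in ls)                     # squared norm of LS
    extent = math.sqrt((2 * n * ss - 2 * ls_sq) / (n * (n - 1)))

    def nn_dist(k):
        # Expected distance to the k-th nearest neighbor inside the bubble.
        return (k / n) ** (1.0 / d) * extent

    return {"rep": rep, "n": n, "extent": extent, "nnDist": nn_dist}
```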
Definition 2
- Distance between DBs: If \(B_i\) and \(B_j\) are two DBs, the direct distance between \(B_i\) and \(B_j\) is defined as:
\(dist(B_i, B_j) = \begin{cases} 0 & \text{if } B_i = B_j \\ d(rep_i, rep_j) - (extent_i + extent_j) + nnDist_i(1) + nnDist_j(1) & \text{if } d(rep_i, rep_j) - (extent_i + extent_j) \ge 0 \\ \max(nnDist_i(1), nnDist_j(1)) & \text{otherwise} \end{cases}\)
where \(d(rep_i, rep_j)\) is the distance between the representatives of the DBs.
In other words, if \(B_i = B_j\), their distance will be zero. If they do not overlap, their distance is given by the distance between their representatives minus their radii plus their expected distances to the nearest neighbor. Finally, if the DBs overlap, the distance is the maximum between their expected nearest neighbor distances. The described cases are illustrated in Fig. 1.
Fig. 1. Distance between Data Bubbles [4].
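The three cases of Definition 2 translate into a short function (a sketch; bubbles are assumed to be dictionaries with keys rep, extent, and nnDist, as produced by a constructor like the one sketched for Definition 1):

```python
import math

def d(a, b):
    # Euclidean distance between two representative vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def bubble_dist(bi, bj):
    if bi is bj:
        return 0.0                                     # same bubble
    gap = d(bi["rep"], bj["rep"]) - (bi["extent"] + bj["extent"])
    if gap >= 0:                                       # non-overlapping
        return gap + bi["nnDist"](1) + bj["nnDist"](1)
    return max(bi["nnDist"](1), bj["nnDist"](1))       # overlapping
```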
Definition 3
- Core distance of a DB: Given a value of \(MinPts < n\), let \(NN(B_i,k)\) be the function that returns the k-th closest DB to \(B_i\) (or \(B_i\) itself if \(k=0\)), and let \(SNN(B_i,MinPts) = \arg \min _k \sum _{j=0}^k n_j \ge MinPts \,|\) \( B_j = NN(B_i,j)\) be the function that returns the minimum number of neighbors of \(B_i\) needed to summarize MinPts objects. The core distance of \(B_i\) is:
\(core_{MinPts}(B_i) = dist(B_i, B_k) + nnDist_k\left(MinPts - \sum_{j=0}^{SNN(B_i,MinPts)-1} n_j\right)\)
where \(B_k = NN(B_i,SNN(B_i,MinPts))\) is the \(SNN(B_i,MinPts)\)-th nearest neighbor of \(B_i\). The core distance is the minimum distance from a DB within which at least MinPts objects are summarized. It is important to point out that \(core_{a}(B_{i}) \ge core_{b}(B_{i}) \; \forall \; a \ge b\), i.e., increasing the value of MinPts cannot reduce the core distance of a DB. Using the distance between two DBs and their core distances, we can calculate the reachability distances used in the density estimates.
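A direct (quadratic, for exposition only) sketch of Definition 3: walk the bubbles in order of distance from \(B_i\), accumulating summarized objects until MinPts is covered; `dist` stands for any bubble distance such as the one in Definition 2, and \(B_i\) counts as its own 0-th neighbor:

```python
def bubble_core_distance(bi, bubbles, min_pts, dist):
    # Neighbors of bi sorted by distance; bi itself comes first (dist 0).
    order = sorted(bubbles, key=lambda b: dist(bi, b))
    acc = 0
    for bk in order:
        if acc + bk["n"] >= min_pts:
            remaining = min_pts - acc       # objects still needed inside bk
            return dist(bi, bk) + bk["nnDist"](remaining)
        acc += bk["n"]
    raise ValueError("MinPts exceeds the total number of summarized objects")
```

Note the degenerate case: if \(B_i\) alone summarizes at least MinPts objects, the distance term is zero and the result is simply \(nnDist_i(MinPts)\).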
Definition 4
- Mutual Reachability Distance (MRD) between DBs: Given the DBs \(B_i\) and \(B_j\) and the minimum number of objects MinPts, the mutual reachability distance between \(B_i\) and \(B_j\), i.e., the distance that makes them density-reachable from each other, is:
\(mrd_{MinPts}(B_i, B_j) = \max(core_{MinPts}(B_i), core_{MinPts}(B_j), dist(B_i, B_j))\)
3.2 CORE-SSG: CORE Summarized Spanning Graph
The MRD reflects the minimum density needed to connect two DBs. With it, we can estimate the distance that connects all pairs of DBs in the data set by density, forming a virtual graph G that represents all the density relationships between the data, given a minimum value MinPts of points needed to obtain the desired density.
Definition 5
- Mutual Reachability Graph G: Given a minimum density value MinPts, \(G_{MinPts} = (V, E)\) is a complete graph in which the set of vertices V represents the DBs and the set of edges is defined as \(E = \{e(B_i, B_j)\; | \; B_i, B_j \in V \text { with weights } w(e) = mrd_{MinPts}(B_i, B_j)\}\).
The main idea behind density-based hierarchical machine learning algorithms is to extract a minimal graph that connects all objects in the dataset by density [6, 10]. That is, directly or indirectly, these algorithms obtain the \(MST_{MinPts}\) of the graph \(G_{MinPts}\) and, with it, build a hierarchy that allows obtaining clusters, identifying outliers, and supporting other applications. However, different MinPts values tend to yield different results, which would require multiple extractions of MSTs at a high computational cost. To avoid this, given an upper bound for MinPts, defined here as \(MinPts_{Max}\), we can replace the complete graph G with \(O(n^2)\) edges by the graph CORE-SG\( _{MinPts_{Max}}\) with O(n) edges when applied to a dataset with n objects, if \(MinPts \le MinPts_{Max} << n\) [18], which is a natural assumption.
However, obtaining the CORE-SG of G over n objects using a traditional MST extraction algorithm is \(O(n^2 \log n)\), prohibitive for large values of n. In the present work, we summarize the data into DBs and consider the complete graph G over m DBs, which makes the cost to build the CORE-SG\(_{MinPts_{Max}}\) \(O(m^2 \log m)\) and the cost of each extraction of \(MST_{MinPts}\) \(O(m \log m)\). In a scenario with a large volume of data, it is natural to consider that \(MinPts_{Max}<< m<< n\) and, therefore, the computational cost of our proposal is bounded by O(n). In scenarios where this assumption does not hold, the data can be used directly, and summarization is unnecessary.
As it is applied over DBs, the summarized CORE-SG will be referred to here as the CORE Summarized Spanning Graph (CORE-SSG); it has definitions and properties different from its original version [18]. Next, we present its new concepts and properties.
Definition 6
- Minimum Nearest Neighbor Graph \(MNNG_{MinPts}\): Given a minimum density value MinPts, the graph \(MNNG_{MinPts} = (V, E)\) has the DBs generated over the data as its set of vertices V and the set of edges \(E = \{e(B_i, B_j)\; | \; B_i, B_j \in V \wedge B_j = NN(B_i,k) \ \forall k \in [0,SNN(B_i,MinPts)] \}\). This graph connects the DBs so as to guarantee a minimum number of MinPts connected summarized objects. \(MNNG_{MinPts}\) has two important properties, as shown in Lemmas 1 and 2.
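Definition 6 can be sketched as follows (a hypothetical helper; bubbles are dictionaries with at least a count n, and `dist` is a bubble distance function): each bubble is linked to its nearest neighbors until at least MinPts summarized objects are covered, mirroring the SNN count of Definition 3:

```python
def build_mnng(bubbles, min_pts, dist):
    # Undirected edge set of the MNNG; edges stored as frozensets of ids.
    edges = set()
    for bi in bubbles:
        order = sorted(bubbles, key=lambda b: dist(bi, b))  # bi comes first
        acc = 0
        for bk in order:
            if bk is not bi:
                edges.add(frozenset((id(bi), id(bk))))
            acc += bk["n"]
            if acc >= min_pts:      # MinPts summarized objects reached
                break
    return edges
```

The edge set can only grow as MinPts grows, which is exactly the monotonicity stated in Lemma 1 below.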
Lemma 1:
\(\forall \ a \le b\), \(MNNG_{a} \subseteq MNNG_{b}\).
Proof:
Since \(MNNG_{a} = (V_a, E_a)\), \(MNNG_{b} = (V_b, E_b)\) and \(V_a = V_b = V\), it suffices to prove that \(E_a \subseteq E_b\) for Lemma 1 to hold. Assuming that \(E_a \not \subseteq E_b\), there must be an edge in \(E_a\) that is not in \(E_b\). By definition, \(SNN(B_i,a) \le SNN(B_i,b) \ \forall \ a \le b, B_i \in V\). Therefore, the set of integer values in the interval \([0,SNN(B_i,a)]\) is contained in \([0,SNN(B_i,b)]\), which contradicts the assumption that there is an edge in \(E_a\) that is not in \(E_b\).
Lemma 2:
\(\forall \ a \le b: mrd_a(B_i,B_j) = max(core_a(B_i), core_a(B_j)) \Rightarrow e(B_i, B_j) \in MNNG_{b}\).
Proof:
Assuming that \(mrd_a(B_i,B_j) = \max (core_a(B_i), core_a(B_j))\), given the definition of the MRD in Definition 4, it follows that \(core_a(B_i) \ge dist(B_i,B_j)\) or \(core_a(B_j) \ge dist(B_i,B_j)\), or both. Hence, at least one of these DBs must be in the neighborhood of the other, i.e., \(B_j = NN(B_i,k), k \in [0,SNN(B_i,a)]\) and/or \(B_i = NN(B_j,k), k \in [0,SNN(B_j,a)]\) must hold, which implies that \(e(B_i,B_j) \in MNNG_a\). Since \(a \le b\), it follows by Lemma 1 that \(MNNG_{a} \subseteq MNNG_{b}\) and, accordingly, \(e(B_i,B_j) \in MNNG_{b}\).
Definition 7
- Summarized CORE-SG (CORE-SSG): Given a maximum value for the minimum density parameter \(MinPts_{Max}\), the CORE-SSG\(_{MinPts_{Max}} \) = \( MNNG_{MinPts_{Max}} \cup MST_{MinPts_{Max}} \) is an undirected graph that contains the edges of all \(MST_{MinPts}\) of \(G_{MinPts}\) for each value of \(MinPts < MinPts_{Max}\). This property allows replacing G by CORE-SSG\(_{MinPts_{Max}}\) in the process of extracting an \(MST_{MinPts}\), simply by replacing the weights of the edges \(e(B_i, B_j)\) in CORE-SSG\(_{MinPts_{Max}}\) with \(w(e) = mrd_{MinPts}(B_i, B_j)\) for \(MinPts < MinPts_{Max}\).
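The replacement licensed by Definition 7 can be sketched as a re-weighting followed by a standard MST pass over the O(m) retained edges (illustrative code; `core` and `dist` are assumed to be precomputed lookups for core distances and bubble distances, with bubbles indexed 0..m-1):

```python
def extract_mst(core_ssg_edges, m, min_pts, core, dist):
    # Re-weight every retained edge with the new mutual reachability
    # distance for this MinPts, then run Kruskal with union-find.
    weighted = sorted(
        (max(core(i, min_pts), core(j, min_pts), dist(i, j)), i, j)
        for i, j in core_ssg_edges
    )
    parent = list(range(m))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    mst = []
    for w, i, j in weighted:                # O(m log m) over O(m) edges
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            mst.append((i, j, w))
    return mst
```

Because the edge list has O(m) entries, each additional MinPts value costs only \(O(m \log m)\), which is the saving this section exploits.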
Theorem 1:
For two values p and q for the density parameter MinPts, let \(M_p\) be the set of all possible MSTs of \(G_p\), and let \(MST_q\) be any MST of \(G_q \), so \(\forall \ p < q\): \(\exists MST_p \in M_p\): \(MST_p \subseteq MST_q \cup MNNG_q\).
Proof:
Consider an edge \(e(B_i,B_j)\) in an \(MST_{p}^{*} \in M_p\) that connects two subgraphs I and J. The weight of \(e(B_i,B_j)\) is defined by \(mrd_{p}(B_i,B_j)\), which is the maximum between \(dist(B_i, B_j)\) and the core distances \(core_p(B_i)\) and \(core_p(B_j)\), according to Definition 4. By the cut property of MSTs, \(e(B_i,B_j)\) must have the lowest weight among all the edges that connect the sets I and J in \(G_p\). There are two scenarios:
- (1) \(e(B_i,B_j) \in MST_q\)
- (2) \(e(B_i,B_j) \notin MST_q\)
In scenario (1), since \(e(B_i,B_j) \in MST_q\), it trivially follows that \(e(B_i,B_j) \in MST_q \cup MNNG_q\). In scenario (2), \(MST_q\) must have a different edge \(e(C_i,C_j)\) that connects the DBs \(C_i\) and \(C_j\) that are contained in the graphs I and J, respectively, having the lowest weight among the edges that connect these graphs (MSTs are connected graphs). Thus, the following two statements must be true (i) \(mrd_p(B_i, B_j) \le \) \(mrd_p(C_i, C_j)\) (otherwise \(e(B_i, B_j)\) could not be the edge with the smallest weight connecting I and J in \(MST_p^{*}\)) and (ii) \(mrd_q(B_i, B_j) \ge \) \(mrd_q (C_i, C_j)\) (otherwise \(e(C_i, C_j)\) could not be the least weighted edge connecting I and J in \(MST_q\)). We saw in Definition 3 that by increasing the value of MinPts, the core distances \(core_{MinPts}(\cdot )\) can only grow and the distance \(dist(\cdot ,\cdot )\) between two DBs is constant. Therefore, the weight of an edge, given by \(mrd_{MinPts}(\cdot ,\cdot )\), cannot decrease when the value of MinPts increases. Consequently, the mutual reachability distances (edge weights), defined in Eq. 3, can only remain the same or increase. Therefore, when MinPts increases from p to q (\(p < q\) by assumption in the theorem), one of the following must be true:
- (a) \(mrd_p(B_i, B_j) < mrd_q(B_i, B_j)\)
- (b) \(mrd_p(B_i, B_j) = mrd_q(B_i, B_j)\)
If (a) is true, as the distance \(dist(\cdot ,\cdot )\) between two DBs does not depend on MinPts, the mutual reachability distance \(mrd_q(B_i,B_j)\) must be determined by the core distance of \(B_i\) or \(B_j\), i.e., \(mrd_q(B_i, B_j) = max(core_q(B_i), core_q(B_j))\). From Lemma 2 with \(MinPts = q\) it then follows that \(e(B_i, B_j) \in MNNG_q\) and therefore \(e(B_i, B_j) \in MST_q \cup MNNG_q\). If (b) is true, then from statements (i) and (ii) above, we have \(mrd_q (C_i, C_j) \le mrd_q(B_i, B_j)\) = \(mrd_p(B_i, B_j) \le mrd_p(C_i , C_j)\). However, since \(p < q\) and \(mrd_{MinPts}\) cannot decrease when MinPts increases, it follows that \(mrd_q(C_i, C_j) = mrd_p(B_i, B_j) = mrd_p(C_i, C_j)\). In this case, replacing \(e(B_i, B_j)\) in \(MST_p^{*}\) with \(e(C_i, C_j)\) will result in another \(MST_p{'} \in M_p\), as the total weight of \(MST_p ^{*}\) and \(MST_p{'}\) are the same. \(MST_p^{*}\) includes the edge \(e(C_i, C_j)\) and by assumption \(e(C_i, C_j) \in MST_q\), and therefore \(e(C_i, C_j) \in MST_q \cup MNNG_q\).
It follows that, given an MST of \(G_q\), \(MST_q\), and an MST of \(G_p\), \(MST_p^{*} \in M_p\), with \(q > p\), we can construct a weighted graph \(MST_p{'}\) as follows: (1) include all edges and edge weights \(e(B_i, B_j) \in MST_p^{*}\) that belong to \(MST_q \cup MNNG_q\) (according to scenarios (1) or (2)-a); (2) replace all edges \(e(B_i, B_j) \in MST_p^{*}\) that do not belong to \(MST_q \cup MNNG_q\) with edges \(e(C_i, C_j)\) that must exist in \(MST_q \cup MNNG_q\) (as in scenario (2)-b), connecting the same two subsets of points as \(e(B_i, B_j)\) and having the same edge weight as \(e(B_i, B_j)\). This graph \(MST_p{'}\) is an MST of \(G_p\), that is, \(MST_p{'} \in M_p\) and \(MST_p{'} \subseteq MST_q \cup MNNG_q\). Thus there exists an \(MST_p \in M_p\) such that \(MST_p \subseteq MST_q \cup MNNG_q\).
Theorem 1 establishes that the complete graph \(G_{MinPts_{max}}\) can be replaced by \(MST_{MinPts_{max}} \cup MNNG_{MinPts_{max}}\) when calculating additional MST results with \(MinPts < MinPts_{max}\), with potentially large savings in execution time and memory.
4 Experiments
In this section, we evaluate the use of CORE-SSG for the clustering task to illustrate its utility, although it can be used in various applications [10]. The CORE-SSG (Footnote 1) was implemented in Python/Cython, and the experiments were performed on a computer with 16 GB of RAM and 8 processing cores. The experimental evaluation is divided into three parts: an assessment of the quality of the hierarchies and partitions extracted from CORE-SSG; a comparison between the clustering solutions from CORE-SG and CORE-SSG; and an analysis of the runtime and speed-up of the compared algorithms.
4.1 Evaluation of the Quality Loss Resulted from CORE-SSG Summarization
Three bi-dimensional datasets with 38k, 43k, and 49k objects were synthesized for the quality evaluation experiments, divided into 13, 10, and 12 clusters with noise, respectively. After that, each dataset was divided into subsets of the same volume so that the number of subsets totaled n/10, i.e., 10% of the dataset's size. Then, the DBs were created over these subsets, and a CORE-SSG with \(MinPts = 200\) was built. From the CORE-SSG\(_{200}\), we extracted density-based MSTs for MinPts ranging from 200 down to 2 in decreasing steps of 2. Using the extracted MSTs, we built one hundred HDBSCAN* clustering solutions and calculated their pairwise similarity using the HAI index [12]. Based on their similarities, we chose four of the most representative clustering solutions, i.e., we selected the four solutions that are most similar to the remaining ones while being distinct from one another, according to the HAI index. These four hierarchical clustering results are presented in the clustering trees of Fig. 2, where the most prominent clusters, according to the FOSC framework [7], are circled in red. The FOSC framework uses the concept of excess of mass to extract the most prominent partition from a hierarchical clustering solution. As done in the HDBSCAN* article, we set the minimum cluster size equal to MinPts.
The clustering hierarchies presented in Fig. 2 accurately reflect the behavior of HDBSCAN* clustering for a range of MinPts values. For small MinPts values, the hierarchy is full of small, highly detailed clusters that are hard to visualize but can be analyzed using cluster validation indexes [22]. In contrast, as MinPts values increase, the hierarchies comprise a smaller number of more robust clusters that reflect denser structures. The results closely mimic what is expected from density-based clustering algorithms. The difference resides in the fact that CORE-SSG does not contain the hierarchical structure of the summarized data, i.e., within the DBs. In most cases, the summarized data has a level of detail of minor importance for large volumes of data. However, if the user requires the structure within the DBs, the CORE-SG can be applied over the summarized data of each DB independently, and the results (which are graphs) combined. If the summarized data is still extensive, the CORE-SSG can be applied recursively.
Although a hierarchy of clusters is a more complete and richer way of representing the data structure, it can be too complex and extensive for very large datasets. In such scenarios, the automatic extraction of a partition of clusters may be preferred. FOSC [7] is an excellent choice for this job. In Fig. 3, the partitions selected (circled) by FOSC in Fig. 2 are presented. Note that FOSC successfully extracted prominent clusters in the datasets and correctly separated noise. Additionally, different MinPts values resulted in different cluster selections: small values favor the selection of broader clusters that reside in the higher levels of the hierarchy, as they are more present in the hierarchy than their several smaller nested child clusters. In contrast, a higher MinPts value prevents smaller cluster structures at the bottom of the hierarchy, favoring the selection of denser clusters nested inside the broader ones. The results make explicit the importance of analyzing multiple results with different MinPts values, as all of them have the potential to be considered "the best", validating the main objective of the present work. Furthermore, summarizing the data did not impact the choice of clusters made by FOSC, as the framework detected the most prominent ones seamlessly for different MinPts values.
4.2 Comparison Between CORE-SG and CORE-SSG Partitions
As the hierarchies of CORE-SG and CORE-SSG are not directly comparable due to summarization, we compared the FOSC-extracted partitions of CORE-SG and CORE-SSG to check whether the prominent clusters are similar. We used the Adjusted Rand Index (ARI) [11] to measure the similarity between the extracted partitions, where a value of one indicates that the partitions are identical and zero corresponds to the agreement expected by chance. Figure 4 presents the ARI heatmap resulting from the pairwise comparison of partitions from CORE-SG and CORE-SSG with different MinPts values. CORE-SG partitions are presented on the horizontal axis and CORE-SSG partitions on the vertical axis. The heatmaps show that, for small MinPts values, CORE-SG produces much more detailed partitions, while CORE-SSG tends to unify small clusters, which is expected because the information within the DBs is summarized and cannot be split. That is reflected in the darker area over small MinPts values. For large datasets, small values of MinPts tend to over-fragment the clustering structure into a hard-to-interpret solution. According to the ARI, for medium and high values of MinPts, most heatmaps present bright areas that reflect a similarity over 0.85. This result shows that the main cluster structure of the data is present in the results of both graphs, sometimes for distinct values of MinPts, which also supports the importance of a multi-density analysis of the dataset, the objective of this paper.
4.3 Runtime Analysis
To assess the computational performance of CORE-SSG in relation to CORE-SG, we created datasets with a variety of sizes, dimensions, and different levels of summarization, as described in Table 1. To evaluate each characteristic separately, default values were fixed and presented in bold in Table 1. Additionally, we also considered a variety of \(MinPts_{max}\) values used to build the graphs. The runtime necessary to obtain the graphs, CORE-SG and CORE-SSG, added to the time needed for the hierarchical density solutions extraction, are presented in Fig. 5.
Figure 5 (a) shows that the parameter MinPts has a linear correlation with the processing time needed to run the experiment for both graphs, but CORE-SSG presented a speedup of close to 20 times over CORE-SG. Considering the dataset size n and \(m << n\), Fig. 5 (b) clearly shows the quadratic behavior of CORE-SG as the dataset size increases up to one million objects and how the computational cost is amortized when using DBs, i.e., for CORE-SSG. These results support the main research hypothesis of this work: data summarization enables scalability for density-based machine learning solutions, clustering algorithms in particular. Summarization also had an impact on the algorithms' total runtime when the data's dimensionality increases, as shown in Fig. 5 (c), with speedups of up to 15 times. Last but not least, the total time needed for the data summarization, presented in Fig. 5 (d), reflects the linear increase of the computational time with the proportion of DBs used, which matches the expected behavior of our proposal and shows that summarization requires a small fraction of the total runtime.
5 Conclusion and Future Work
The main goal of the present research, enabling the use of CORE-SG over larger datasets, was achieved: we showed that a single CORE-SSG permits the accurate extraction of multiple density-based hierarchical solutions for different density parameters at a subquadratic computational cost. That was shown theoretically, with the proof that CORE-SSG retains the main characteristics of CORE-SG, and empirically through experiments. Moreover, the experiments also supported the importance of analyzing multiple density-based results, as datasets can have nested density structures with different characteristics, each with its relative importance, which justifies the application of CORE-SG in a plethora of machine learning tasks [10], including clustering.
The most prominent clusters (according to the FOSC framework) present in the original CORE-SG are also present in our CORE-SSG solution, while obtaining a significant reduction in computational cost. The downside of our approach is the loss of small density-based structures summarized inside the DBs, which are rarely considered prominent by the FOSC framework, as they connect only a very small fraction of the dataset. Thus, our solution is indicated for large volumes of data. Even so, if these small structures are important for the application, we recommend applying CORE-SG (or CORE-SSG) to the data within each DB, followed by the combination of the resulting graphs.
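For context on what a DB retains of the structures it summarizes, a Data Bubble describes its objects through a few sufficient statistics from which distances can be estimated without revisiting the objects (following Breunig et al. [4]). A minimal sketch:

```python
import numpy as np

def data_bubble(points):
    """Sufficient statistics of one Data Bubble (sketch after Breunig et al. [4])."""
    pts = np.asarray(points, dtype=float)
    n, d = pts.shape
    rep = pts.mean(axis=0)                      # representative: mean of the objects
    ls, ss = pts.sum(axis=0), (pts ** 2).sum()  # linear sum and sum of squares
    # Extent: sqrt of the average squared pairwise distance inside the bubble,
    # computable from (n, ls, ss) alone.
    if n > 1:
        extent = np.sqrt(max(0.0, 2 * (n * ss - (ls ** 2).sum()) / (n * (n - 1))))
    else:
        extent = 0.0
    # k-NN distance estimate inside the bubble, assuming a uniform distribution.
    nn_dist = lambda k: extent * (k / n) ** (1.0 / d)
    return rep, extent, nn_dist
```

These statistics are what the summarized graph is built from, which also makes the limitation above concrete: any density structure finer than a bubble's extent is invisible at the graph level and can only be recovered by descending into the objects of that DB.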
In addition to the direct benefits of our findings for the machine learning applications previously referred to in [10], we believe that CORE-SSG has much potential to innovate in scenarios with continuous data streams. This is because the parameters of density-based applications for data streams usually must be set (fixed) as algorithm inputs, which may be hard to define and may require changes as the data evolves (concept drift). CORE-SSG enables the extraction of results for different density parameters, even multiple results, from a single induced model, which can be analyzed online to adapt the model to different density levels and to concept drift. Indeed, this is a future work worth exploring.
Notes
- 1.
Implementation in: https://github.com/natanaelbatista99/CORE-SSG.
References
Ankerst, M., Breunig, M.M., Kriegel, H.P., Sander, J.: OPTICS: ordering points to identify the clustering structure. SIGMOD Rec. 28(2), 49–60 (1999)
Barlow, H.: Unsupervised Learning. Neural Comput. 1(3), 295–311 (1989)
Blazquez, D., Domenech, J.: Big data sources and methods for social and economic analyses. Technol. Forecast. Soc. Chang. 130, 99–113 (2018)
Breunig, M.M., Kriegel, H.P., Kröger, P., Sander, J.: Data bubbles: quality preserving performance boosting for hierarchical clustering. In: Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, pp. 79–90 (2001)
Breunig, M.M., Kriegel, H.-P., Sander, J.: Fast hierarchical clustering based on compressed data and OPTICS. In: Zighed, D.A., Komorowski, J., Żytkow, J. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 232–242. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-45372-5_23
Campello, R.J.G.B., Moulavi, D., Zimek, A., Sander, J.: Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Trans. Knowl. Discov. Data 10(1), 1–51 (2015)
Campello, R.J., Moulavi, D., Zimek, A., Sander, J.: A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies. Data Min. Knowl. Disc. 27, 344–371 (2013)
Cheema, P., Alamdari, M.M., Chang, K., Kim, C., Sugiyama, M.: A drive-by bridge inspection framework using non-parametric clusters over projected data manifolds. Mech. Syst. Signal Process. 180, 109401 (2022)
Djonlagic, I., et al.: Macro and micro sleep architecture and cognitive performance in older adults. Nat. Hum. Behav. 5(1), 123–145 (2021)
Gertrudes, J.C., Zimek, A., Sander, J., Campello, R.J.G.B.: A unified view of density-based methods for semi-supervised clustering and classification. Data Min. Knowl. Discov. 33(6), 1894–1952 (2019)
Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2, 193–218 (1985)
Johnson, D., Xiong, C., Gao, J., Corso, J.: Comprehensive cross-hierarchy cluster agreement evaluation. ACM TKDD. 10, 1–51 (2013)
Liu, B., Shi, Y., Wang, Z., Wang, W., Shi, B.: Dynamic incremental data summarization for hierarchical clustering. In: Yu, J.X., Kitsuregawa, M., Leong, H.V. (eds.) WAIM 2006. LNCS, vol. 4016, pp. 410–421. Springer, Heidelberg (2006). https://doi.org/10.1007/11775300_35
Miccio, L.A., Schwartz, G.A.: Mapping chemical structure-glass transition temperature relationship through artificial intelligence. Macromolecules 54(4), 1811–1817 (2021)
Minussi, D.C., et al.: Breast tumours maintain a reservoir of subclonal diversity during expansion. Nature 592(7853), 302–308 (2021)
Murray, B., Perera, L.P.: An AIS-based deep learning framework for regional ship behavior prediction. Reliab. Eng. Syst. Saf. 215, 107819 (2021)
Nassar, S., Sander, J., Cheng, C.: Incremental and effective data summarization for dynamic hierarchical clustering. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pp. 467–478. SIGMOD 2004, Association for Computing Machinery, New York, NY, USA (2004)
Neto, A.C.A., Naldi, M.C., Campello, R.J.G.B., Sander, J.: Core-SG: efficient computation of multiple MSTS for density-based methods. In: 2022 IEEE 38th International Conference on Data Engineering (ICDE), pp. 951–964 (2022)
Norman, T.M., et al.: Exploring genetic interaction manifolds constructed from rich single-cell phenotypes. Science 365(6455), 786–793 (2019)
dos Santos, J.A., Syed, T.I., Naldi, M.C., Campello, R.J., Sander, J.: Hierarchical density-based clustering using MapReduce. IEEE Trans. Big Data 7(1), 102–114 (2019)
Savoie, W., et al.: A robot made of robots: emergent transport and control of a smarticle ensemble. Sci. Robot. 4(34), eaax4316 (2019)
Vendramin, L., Campello, R.J., Hruschka, E.R.: Relative clustering validity criteria: a comparative overview. Statist. Anal. Data Mining ASA Data Sci. J. 3(4), 209–235 (2010)
Zerhari, B., Lahcen, A.A., Mouline, S.: Big data clustering: Algorithms and challenges. In: Proceedings of International Conference on Big Data, Cloud and Applications (BDCA-5) (2015)
Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: an efficient data clustering method for very large databases. SIGMOD Rec. 25(2), 103–114 (1996)
Zhang, Y., Cheung, Y., Liu, Y.: Quality preserved data summarization for fast hierarchical clustering. In: 2016 International Joint Conference on Neural Networks (IJCNN), pp. 4139–4146 (2016)
Zhou, J., Sander, J.: Data bubbles for non-vector data: Speeding-up hierarchical clustering in arbitrary metric spaces. In: Freytag, J.C., Lockemann, P., Abiteboul, S., Carey, M., Selinger, P., Heuer, A. (eds.) Proc. 2003 VLDB Conf., pp. 452–463. Morgan Kaufmann, San Francisco (2003)
Acknowledgement
The authors would like to thank CNPq MAI/DAI program, CAPES, and FAPESP for their financial support. We also thank Prof. Jörg Sander for his insight and support.
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Batista, N.F.D., Nunes, B.L., Naldi, M.C. (2023). Efficient Density-Based Models for Multiple Machine Learning Solutions over Large Datasets. In: Naldi, M.C., Bianchi, R.A.C. (eds) Intelligent Systems. BRACIS 2023. Lecture Notes in Computer Science(), vol 14195. Springer, Cham. https://doi.org/10.1007/978-3-031-45368-7_4
Print ISBN: 978-3-031-45367-0
Online ISBN: 978-3-031-45368-7