1 Introduction

The world is increasingly filled with valuable data, most of which is stored daily in electronic media. As such, there is high potential for research and development of techniques for automated data retrieval, analysis, and classification [15]. Around 90% of the data produced up to 2017 were generated in 2015 and 2016, and the tendency is for this amount to double every two years [17]. The exponential increase in the size and complexity of Big Data is worthy of attention.

Latent potentialities for decision-making based on insights learned from historical data are only actually exploited if pushed into practice. A comprehensive information extraction procedure is composed of two sub-processes: data management and data analysis [15]. Management supports data acquisition, handling, storage, and retrieval for analysis [18, 29]; in contrast, data analysis refers to the evaluation and acquisition of intelligence from data.

It was pointed out by [17] that researchers of various subjects have been adopting methods from Machine Learning (ML) [5, 10] and Data Mining (DM) [31]. Many of them did so in Big Data contexts such as stock data monitoring, financial analysis, traffic monitoring, and Structural Health Monitoring (SHM) [14]. [6] demonstrated the relationships between the data management and analysis sub-processes in practice, with a specific focus on the application highlighted herein. On the one hand, this work claims that Big Data is not restricted to a computerized manipulation of massive data streams; on the other hand, it emphasizes that SHM can learn ipsis litteris with the conscientious use of ML.

The problem of data grouping (i.e., data clustering) is one of the main tasks in ML [2, 5] and DM [31], prevailing in any discipline involving multivariate data analysis [21]. It gained a prominent place in many applications lately, especially in speech recognition [11], web applications [35], image processing [3], outlier detection [23], bioinformatics [1], and SHM [9].

A wide variety of Genetic Algorithm (GA)-based clustering techniques have been proposed in recent times [25, 28, 42]. Their search ability is commonly exploited to find suitable prototypes in the feature space such that a per-chromosome measure of the clustering results is optimized in each generation. In [2], two conflicting functions were proposed and defined based on cluster cohesion and connectivity. The goal was to reach well-separated, connected, and compact clusters by means of two criteria in an efficient, multi-objective particle swarm optimization algorithm. More recently, [20] combined K-means and a GA through a differentiated arrangement of genetic operators to conglomerate different solutions, with the intervention of fast hill-climbing cycles of K-means.

An unsupervised, non-parametric, GA-based approach to support the SHM process in bridges termed Genetic Algorithm for Decision Boundary Analysis (GADBA), which was proposed by [36], aims to group data into natural clusters. The algorithm is also supported by a method based on spatial geometry to eliminate redundant clusters. Upon testing, GADBA was more efficient in the task of fitting the normal condition than its state-of-the-art counterparts in SHM contexts. However, due to the specialization of its objective function to SHM contexts, GADBA is lackluster on more general clustering scenarios.

This work aims to improve the objective function of GADBA to expand its application potential to a wider range of real-world problems. In this sense, a version of GADBA based on the Mutual Equidistant-scattering Criterion (MEC) is proposed as a general-purpose clustering approach. Four clustering algorithms are compared against the new proposal: K-means [26], Gaussian Mixture Models (GMM) [30], Linkage [22], and GADBA, of which the first three are well-known and widely explored in the literature.

The remaining sections of this paper are divided as follows. Section 2 and Sect. 3 respectively define GADBA and MEC. Section 4 discusses the performance of the new proposal under some experimental evaluations. Finally, Sect. 5 summarizes and ends the paper.

2 Genetic Algorithm for Decision Boundary Analysis

Given a minimum (\(K_{min}\)) and maximum (\(K_{max}\)) number of clusters, clustering combines a GA, which positions the cluster centroids in the M-dimensional feature space, with a method called Concentric Hypersphere (CH), which agglutinates clusters in order to choose an appropriate \(K \in [K_{min}, K_{max}]\) [37].

The initial population \(\mathbf {P}(t{=}0)\) is randomly created and each individual represents a set of K centroids, where K is randomly selected. The chromosome is then formed by concatenating K feature vectors (Fig. 1), whose values are initialized by randomly selecting K data points from the training set. A length ratio is defined as \(\gamma _{i} = K_{i} / K_{max}\), where \(K_{i}\) is the number of active centroids in individual i. The role of \(\gamma \) is to define the number of active centroids for a given candidate solution, since a single individual might have enabled/disabled centroids during the recombination process.
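The initialization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the function name `init_individual`, and the toy data are hypothetical and not taken from the GADBA implementation.

```python
import random

def init_individual(data, k_min, k_max, rng):
    """Create one individual: K centroids copied from random data points."""
    k = rng.randint(k_min, k_max)                 # K is selected at random
    centroids = [list(p) for p in rng.sample(data, k)]
    gamma = k / k_max                             # length ratio K_i / K_max
    return {"centroids": centroids, "gamma": gamma}

# Hypothetical toy data set with N = 20 points in M = 2 dimensions.
rng = random.Random(0)
data = [(float(i), float(i % 3)) for i in range(20)]
population = [init_individual(data, 2, 5, rng) for _ in range(10)]
```

Each individual stores its own number of active centroids implicitly through the length of `centroids`, which is equivalent to storing \(\gamma\) once \(K_{max}\) is fixed.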

Fig. 1. Chromosome organization in GADBA.

The parent selection is based on tournament with replacement, where R individuals are randomly selected and the fittest one is chosen to recombine with another individual chosen in the same way. The recombination is conducted in three steps for each pair of parents \(P_i\) and \(P_j\) to generate a pair of descendants:

  1. A random number \(r \in [0,1]\) is compared with \(p_{rec}\), defined a priori. If \(r \le p_{rec}\), then two cut points \(\pi _1\) and \(\pi _2\) are selected such that \(1 \le \pi _1 < \pi _2 \le \min (K_i,K_j)\). The centroids in the range are then swapped. If \(r > p_{rec}\), the parents remain untouched.

  2. Likewise, two random numbers \(r, T \in [0,1]\) are picked for each centroid position in the parents and, if \(r \le p_{pos}\), defined a priori, an arithmetic recombination is conducted as follows:

    $$\begin{aligned} F_{\mathbf {x},t}^{i} = F_{\mathbf {x},t}^{i} + (F_{\mathbf {y},t}^{j}-F_{\mathbf {x},t}^{i})T, \end{aligned}$$
    (1)
    $$\begin{aligned} F_{\mathbf {x},t}^{j} = F_{\mathbf {x},t}^{j} + (F_{\mathbf {y},t}^{i}-F_{\mathbf {x},t}^{j})T, \end{aligned}$$
    (2)

    where \(F_{\mathbf {x},t}^{i}\) and \(F_{\mathbf {x},t}^{j}\) are the values in the tth position of the \(\mathbf {x}\)th centroid from parents \(P_i\) and \(P_j\), respectively. Similarly, \(F_{\mathbf {y},t}^{i}\) and \(F_{\mathbf {y},t}^{j}\) respectively correspond to the \(\mathbf {y}\)th centroid of the ith and jth parents. A pair of parents can recombine even if they have a different number of genes.

  3. Finally, the last step consists of arithmetically recombining the parents’ length ratios to define the length ratios of the offspring individuals.
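The three recombination steps can be sketched as follows. Several details are assumptions made only for illustration: the \(\mathbf {x}\)th centroid of one parent is paired with the centroid at the same position in the other parent (i.e., \(\mathbf {y} = \mathbf {x}\)), only positions present in both parents are recombined, and the length ratios are blended with the same arithmetic rule as Eqs. 1 and 2. The function name `recombine` is hypothetical.

```python
import random

def recombine(p_i, p_j, gamma_i, gamma_j, p_rec, p_pos, rng):
    """Return two children (lists of centroids) and their length ratios."""
    c_i = [list(c) for c in p_i]          # work on copies of the parents
    c_j = [list(c) for c in p_j]
    shared = min(len(c_i), len(c_j))      # min(K_i, K_j)

    # Step 1: swap whole centroids between two cut points pi1 < pi2.
    if shared >= 2 and rng.random() <= p_rec:
        pi1 = rng.randint(1, shared - 1)
        pi2 = rng.randint(pi1 + 1, shared)
        for x in range(pi1 - 1, pi2):     # 1-based [pi1, pi2] as 0-based
            c_i[x], c_j[x] = c_j[x], c_i[x]

    # Step 2: arithmetic recombination per position (Eqs. 1 and 2),
    # assuming y = x and using the values from before the update.
    for x in range(shared):
        for t in range(len(c_i[x])):
            r, T = rng.random(), rng.random()
            if r <= p_pos:
                a, b = c_i[x][t], c_j[x][t]
                c_i[x][t] = a + (b - a) * T
                c_j[x][t] = b + (a - b) * T

    # Step 3: arithmetic recombination of the length ratios (assumed form).
    T = rng.random()
    g_i = gamma_i + (gamma_j - gamma_i) * T
    g_j = gamma_j + (gamma_i - gamma_j) * T
    return c_i, c_j, g_i, g_j
```

Under these assumptions, every operator preserves the sum of the two parents' values at each shared position, which is a convenient sanity check when experimenting with the operator.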

The mutation is the result of a personalized two-step process:

  1. Let \(T_{\mathbf {x}} = K_{max}^{-1}\) and \(T_r\) be a random number on the interval [0, 1]. The number of centroids to be enabled in an offspring individual, \(K_{new}\), is obtained from these two quantities. When \(K < K_{new} \le K_{max}\), the missing positions are filled with the information of \(K_{new}-K\) data points chosen at random.

  2. Each centroid position can be mutated with a probability \(p_{mut}\), defined a priori, in which case a Gaussian mutation is applied by using

    $$\begin{aligned} F_{\mathbf {x},t}^{j} = F_{\mathbf {x},t}^{j} + \mathcal {N}(0,1), \end{aligned}$$
    (3)

    where \(\mathcal {N}(0,1)\) is a random number from a standard Gaussian distribution and \(F_{\mathbf {x},t}^{j}\) is the value in the tth position of \(\mathbf {x}\)th centroid.
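A minimal sketch of the mutation follows, assuming that the number of centroids to enable (`k_new`) has already been computed and is simply passed in. The function name `mutate` and its signature are hypothetical; the behaviour for `k_new` not greater than the current K is left unchanged here because the text only describes the filling case.

```python
import random

def mutate(centroids, k_new, k_max, data, p_mut, rng):
    """Fill newly enabled centroids, then apply the Gaussian mutation."""
    out = [list(c) for c in centroids]    # copy, so the input is untouched
    k = len(out)

    # Step 1 (filling part only): when K < K_new <= K_max, the missing
    # positions receive K_new - K data points chosen at random.
    if k < k_new <= k_max:
        out.extend(list(p) for p in rng.sample(data, k_new - k))

    # Step 2: each centroid position is perturbed with probability p_mut
    # by a draw from the standard Gaussian distribution (Eq. 3).
    for c in out:
        for t in range(len(c)):
            if rng.random() <= p_mut:
                c[t] += rng.gauss(0.0, 1.0)
    return out
```

Because Eq. 3 adds unit-variance noise regardless of the feature scale, this sketch implicitly assumes features of comparable magnitude.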

Survivor selection is based on elitism, where the parents \(\mathbf {I}^{(t)}_p\) and offspring \(\mathbf {I}^{(t)}_c\) are concatenated into \(\mathbf {I}^{(t+1)}_p = \mathbf {I}^{(t)}_p \cup \mathbf {I}^{(t)}_c\), which is then sorted according to a fitness measure based on Pareto Front and Crowding Distance. The new population \(\mathbf {P}(t+1)\) is composed of the \(|\mathbf {P}|\) best individuals [8].

Parent selection, recombination, mutation, and survivor selection are repeated until a maximum number of iterations is reached and/or the difference of the current solution against the last one is smaller than a given threshold \(\epsilon \).

As mentioned, the CH algorithm is used to regularize the number of clusters encoded in the individuals. It is executed in each individual prior to their evaluation by determining the regions that limit each cluster in three steps:

  1. For each cluster, its centroid is moved to the mean of its data points.

  2. Each centroid is the center of a hypersphere whose radius increases while the difference in density between two consecutive inflations is positive.

  3. If more than one centroid is found inside a hypersphere, they are agglutinated into a single centroid located at their mean point.

3 A New Objective Function

3.1 Basic Notations and Definitions

The objective of clustering is to find the best way to split a given data set \(\mathcal {X} \in \mathbb {R}^{N \times M}\), with N input vectors in an M-dimensional real-valued feature space, into K mutually disjoint subsets (\(K \le N\)). Assume the vectors in \(\mathcal {X}\) have hard labels marking them as members of one cluster. A set of prototypes \(\varTheta \) is described as a function of \(\mathcal {X}\) and K as

$$\begin{aligned} \varTheta \in \mathbb {R}^{K \times M} = \varTheta (\mathcal {X}), \end{aligned}$$
(4)

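whereupon \(\varTheta \) contains K representative vectors.

Let the hard label for cluster \(\kappa \) be \(y_\kappa \in \{0,1\}^{K}\), the indicator vector whose \(\kappa \)th entry is 1 and whose remaining entries are 0.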

The prototypes are computed whereby N label vectors are organized into a partition over the vectors in \(\mathcal {X}\) such that

$$\begin{aligned} U \in \mathbb {Z}^{N \times K} = U(\mathcal {X}), \end{aligned}$$
(5)

subject to

$$\begin{aligned} U = \left[ \mu _{i \kappa }\right] _{N \times K} \in \{0,1\}, \quad i = 1,\dots ,N, \end{aligned}$$

where \(\mu _i = y_\kappa \Leftrightarrow x_i\) is in cluster \(\kappa \).

Put another way, the membership of \(x_i\) to cluster \(\kappa \), \(\mu _{i \kappa }\), is either 1 if the ith object belongs to the \(\kappa \)th cluster, or 0 otherwise. Accordingly, \(\mu _{i \kappa } = 1\) for one value of \(\kappa \) only, such that [41]

$$\begin{aligned} \sum _{\kappa =1}^{K} \mu _{i \kappa } = 1, \quad i = 1,\dots ,N. \end{aligned}$$

The resulting grouping in this structure is hard [24], since one object belonging to one cluster cannot simultaneously belong to another.

In general, data clustering involves finding \(\{U,\varTheta \}\) to partition \(\mathcal {X}\) somehow. For a given initial \(\varTheta \), the optimal set of prototypes can be represented by centroids, medians, medoids, and others, in which the optimized partition U is obtained by assigning each input vector to the cluster with the nearest prototype. Both U and \(\varTheta \) comprise a dual structure (if one of them is known, the other one will also be) named clustering solution.
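The dual structure can be made concrete with a short sketch: given \(\varTheta \), the partition U follows from a nearest-prototype assignment, and given U, centroid prototypes follow from the cluster means. The function names and the four-point example are hypothetical, and centroids are only one of the prototype choices mentioned above.

```python
def assign(points, prototypes):
    """Build U: one one-hot row per point marking its nearest prototype."""
    U = []
    for x in points:
        # Squared Euclidean distance to every prototype.
        d2 = [sum((a - b) ** 2 for a, b in zip(x, th)) for th in prototypes]
        nearest = d2.index(min(d2))
        U.append([1 if k == nearest else 0 for k in range(len(prototypes))])
    return U

def centroids(points, U):
    """Recover Theta from U as the mean vector of each cluster."""
    K, M = len(U[0]), len(points[0])
    theta = []
    for k in range(K):
        members = [x for x, row in zip(points, U) if row[k] == 1]
        if not members:
            raise ValueError("cluster %d is empty" % k)
        theta.append([sum(x[m] for x in members) / len(members)
                      for m in range(M)])
    return theta

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
U = assign(points, [(1, 1), (9, 1)])   # [[1, 0], [1, 0], [0, 1], [0, 1]]
theta = centroids(points, U)           # [[0.0, 1.0], [10.0, 1.0]]
```

Every row of `U` sums to one, which is the hard-partition constraint stated earlier in this section.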

3.2 Cluster Validation

Most researchers have some theoretical difficulty in describing what a cluster is without assuming an induction principle (i.e., a criterion) [21]. A classic definition is: “objects are grouped based on the principle of maximizing intra-class similarity and minimizing inter-class similarity”. Another definition, based on density, describes a cluster as a connected, dense component such that high-density regions are separated by low-density ones [1, 19].

In clustering algorithms, K is usually assumed to be unknown. Since clustering is an unsupervised learning procedure (i.e., there is no prior knowledge on data distribution), the significance of the defined clusters must be validated for the data [33]. Therefore, one of the most challenging aspects of clustering is the quantitative examination of clustering results [31]. This procedure is performed by Cluster Validity Indices (CVIs), sometimes called criteria, which also target hard problems such as cluster quality assessment and the degree to which a clustering scheme fits a specific data set. The most common application of CVIs is to fine-tune K. Given \(\mathcal {X}\), a specific clustering algorithm, and a range of values of K, these steps are executed [10, 43]:

  1. Successively repeat a clustering algorithm for each number of clusters in a fixed range of values defined a priori: \(K \in [K_{min},K_{max}]\);

  2. Obtain the clustering result \(\{U,\varTheta \}\) for each K in the range;

  3. Calculate the validity index score for all solutions; and

  4. Select \(K_{opt}\) for which data partitioning provides the best clustering result.
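The four steps above can be sketched as a small selection loop. To keep the example self-contained and deterministic, it uses one-dimensional data, a single-linkage stand-in that cuts the sorted values at the K-1 widest gaps, and the Calinski and Harabasz index (which is maximized). All function names and the toy data are assumptions made for illustration only.

```python
def single_linkage_1d(values, k):
    """Cut sorted 1-D data at the k-1 widest gaps (single linkage in 1-D)."""
    xs = sorted(values)
    gaps = sorted(range(1, len(xs)),
                  key=lambda i: xs[i] - xs[i - 1], reverse=True)
    bounds = [0] + sorted(gaps[:k - 1]) + [len(xs)]
    return [xs[a:b] for a, b in zip(bounds, bounds[1:])]

def calinski_harabasz(clusters):
    """(between / (K-1)) / (within / (N-K)); needs 1 < K < N, within > 0."""
    n = sum(len(c) for c in clusters)
    k = len(clusters)
    g = sum(sum(c) for c in clusters) / n
    means = [sum(c) / len(c) for c in clusters]
    between = sum(len(c) * (m - g) ** 2 for c, m in zip(clusters, means))
    within = sum((x - m) ** 2 for c, m in zip(clusters, means) for x in c)
    return (between / (k - 1)) / (within / (n - k))

def select_k(values, k_min, k_max, cluster_fn, index_fn, best=max):
    """Steps 1-4: cluster for each K, score each solution, keep the best."""
    scores = {}
    for k in range(k_min, k_max + 1):
        scores[k] = index_fn(cluster_fn(values, k))
    return best(scores, key=scores.get), scores

data = [0, 1, 2, 50, 51, 52, 100, 101, 102]   # three obvious groups
k_opt, scores = select_k(data, 2, 4, single_linkage_1d, calinski_harabasz)
```

For an index that should be minimized, the same loop applies with `best=min`.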

CVIs are considered to be independent of the clustering algorithms used [40] and usually fall into one of two categories: internal and external [27, 32]. Internal validation does not require knowledge about the problem, since it only uses information intrinsic to the data; hence, it has a practical appeal. Conversely, external validation is more accurate, but not always feasible [32]. When such knowledge is available, one can evaluate how well the achieved solution approaches a predefined structure based on previous and intuitive understanding regarding natural clusters.

3.3 Mutual Equidistant-Scattering Criterion

This work proposes the replacement of the objective function of GADBA with a CVI called MEC [12]. MEC is a non-parametric, internal validation index for crisp clustering. An immediate benefit of MEC is the absence of fine-tuning hyper-parameters, thus mitigating the user’s effort in operational terms and enabling the use of GADBA to cluster real-world data whose structure is unknown.

MEC assumes that “objects belonging to the same data cluster will tend to be more equidistantly scattered among themselves compared to data points of distinct clusters” [12]. As such, the mean absolute difference \(\mathcal {M}_\kappa \) is applied using multi-representative data in every clustering solution \(\{U,\varTheta \}\) obtained from a pre-determined K. MEC is weighted by a penalty of local restrictive nature to each cluster \(\kappa \) as well, while a global penalty is then applied a posteriori. Such penalties are a measure of intra-cluster homogeneity and inter-cluster separation.

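Mathematical Formulations. The mean absolute difference is calculated between any possible pair of intra-cluster dissimilarities, which are collected in the matrix

$$\begin{aligned} D_\kappa = \begin{bmatrix} 0 & \Vert x_1 - x_2 \Vert & \cdots & \Vert x_1 - x_{n_\kappa } \Vert \\ 0 & 0 & \cdots & \Vert x_2 - x_{n_\kappa } \Vert \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix}, \end{aligned}$$
(6)

where \(x_1,\dots ,x_{n_\kappa }\) are the \(n_\kappa \) objects within the cluster, all of which are considered as representative data in the formulation (thereby multi-representative). That is, \(D_\kappa \) is a strictly upper triangular matrix of order \(n_\kappa \).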

Nevertheless, only the pairwise distances matter in \(\mathcal {M}_\kappa \). Thus, \(\varUpsilon (\cdot )\) reshapes all elements above the main diagonal of \(D_\kappa \) (Eq. 6) into a column vector

$$\begin{aligned} \begin{bmatrix} d_{1 \kappa }\\ d_{2 \kappa }\\ \vdots \\ d_{L_\kappa \kappa } \end{bmatrix} = \varUpsilon (D_\kappa ), \end{aligned}$$
(7)

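which contains the \(L_\kappa = \frac{n_\kappa (n_\kappa -1)}{2}\) intra-cluster Euclidean distances. The key part of MEC is then defined as

$$\begin{aligned} \mathcal {M}_\kappa = \eta _\kappa ^{-1} \sum _{p=1}^{L_\kappa -1} \sum _{q=p+1}^{L_\kappa } \left| d_{q \kappa } - d_{p \kappa } \right| , \end{aligned}$$
(8)

where \(L_\kappa > 1 \Leftrightarrow n_\kappa > 2\), \(\left| \cdot \right| \) is the absolute value, and \(\eta _\kappa = \frac{L_\kappa (L_\kappa -1)}{2}\) stands for the total number of differences for a single cluster.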
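A direct computation of \(\mathcal {M}_\kappa \) is sketched below, reading it as the mean of the absolute differences over all \(\eta _\kappa \) pairs of intra-cluster Euclidean distances, as the surrounding text describes. The function names and the two example clusters are hypothetical.

```python
from itertools import combinations
from math import dist

def intra_distances(points):
    """d_kappa: the L = n(n-1)/2 pairwise Euclidean distances of a cluster."""
    return [dist(a, b) for a, b in combinations(points, 2)]

def mean_abs_difference(d):
    """M_kappa: mean of |d_p - d_q| over all eta = L(L-1)/2 pairs."""
    L = len(d)
    if L < 2:
        raise ValueError("at least two distances (n > 2 objects) are needed")
    eta = L * (L - 1) // 2
    return sum(abs(p - q) for p, q in combinations(d, 2)) / eta

# Equilateral triangle: all distances are equal, so M is (numerically) zero.
triangle = [(0.0, 0.0), (1.0, 0.0), (0.5, 3 ** 0.5 / 2)]
m_triangle = mean_abs_difference(intra_distances(triangle))

# Unit square: four sides of length 1 and two diagonals of length sqrt(2).
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
m_square = mean_abs_difference(intra_distances(square))
```

This naive form is quadratic in \(L_\kappa \), which motivates the reformulation discussed later in this section.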

An exponential-like distance measure provides a robust property based on the analysis of the influence function [39]. [12] empirically observed that it works properly, particularly when we look for \(K_{opt}\) within a hierarchical data set. Therefore, a new homogeneity measure \(\varSigma _\kappa \) of non-negative exponential type was modelled as a penalty over \(\mathcal {M}_\kappa \) as

(9)

where \(\sigma _\kappa ^2\) is the variance of \(d_\kappa \).

One can observe that the homogeneity measure gets closer to zero with the approximation of the ideal model solution, where the criterion value is zero and, therefore, the loss of information is null. Thus, we have MEC defined as

$$\begin{aligned} \mathrm {MEC}(K) = \lambda \sum _{\kappa =1}^{K} \varSigma _\kappa \times \mathcal {M}_\kappa , \end{aligned}$$
(10)

where

(11)

The measure of global separation and penalty \(\lambda \), therefore, does not depend exclusively on the \(\kappa \)th cluster, but on the greater distance between the pairs of representative points of each data cluster (e.g., centroids). In a few words, \(\lambda \) globally weights the result of the solution. The presence of K, in Eq. 11, is a simple way to avoid overfitting as a result of clustering solutions already sufficiently accommodated to the data. In addition to avoiding overfitting, an improvement over other indices is the possibility of evaluating the clustering tendency (\(K{=}1\)) without resorting to additional, external techniques [43].

Fig. 2. MEC results for a small set of twelve data points: (a) \(K = 1\); (b) \(K = 2\); (c) \(K = 4\).

To illustrate, Fig. 2 shows the MEC composition for three feasible cluster solutions, where each dotted line represents one measure of intra-cluster dissimilarity and each cluster centroid is depicted by a square. The operating mechanism of MEC, which encompasses both homogeneity (Eq. 9) and separation (Eq. 11), is visualised for \(K = 1, 2, 4\) (Figs. 2a, 2b, and 2c, respectively). The motivation is that the dissimilarity measures within each cluster should be similar to each other. In this case, Fig. 2a contains the least suitable solution among those shown, as its dissimilarity measures are more divergent in magnitude than those in Figs. 2b and 2c. The four-cluster solution (Fig. 2c) is the best within the solution set, as the distances among objects are exactly the same in each cluster.

At last, it is worth noting that Eq. 10 should be minimized,

$$\begin{aligned} \hat{K} = \arg \min \mathrm {MEC}(K), \end{aligned}$$
(12)

where \(K \in [K_{min},K_{max}]\) and \(\hat{K}\) is inferred by the variation of K which determines the lowest MEC value, regardless of the clustering algorithm.

Improving the Time-Complexity of MEC. To improve the computational efficiency of MEC, Eq. 8 can be equivalently computed in log-linear time as a function of \(L_\kappa \). To do so, \(d_\kappa \) is sorted with an algorithm of log-linear complexity (e.g., Heapsort) and an auxiliary vector is generated from the sorted values. In fact, the time complexity of MEC then depends entirely on the complexity of the chosen sorting algorithm. As such, we have a complexity of \(\mathcal {O}(L_\kappa \log {}L_\kappa )\) with Heapsort, or even \(\mathcal {O}(N^2)\), by the reformulated

$$\begin{aligned} \mathcal {M}_\kappa = \eta _\kappa ^{-1} \sum ^{\tilde{L}_\kappa }_{l=1} \left( c_{l \kappa } - \hat{d}_{l \kappa } \times \left( L_\kappa - l \right) \right) , \end{aligned}$$
(13)

where \(\hat{d_\kappa }\) is the increasing ordering of the values of \(d_\kappa \) and \(\tilde{L}_\kappa = L_\kappa - 1 = \left| c_\kappa \right| \); \(c_\kappa \) is an auxiliary variable that consists of a cumulative and naturally ordered vector of \(\hat{d_\kappa }\) defined as

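$$\begin{aligned} c_{l \kappa } = {\left\{ \begin{array}{ll} \hat{d}_{L_\kappa \kappa }, & l = \tilde{L}_\kappa , \\ c_{l+1,\kappa } + \hat{d}_{l+1,\kappa }, & l = \tilde{L}_\kappa - 1,\dots ,1. \end{array}\right. } \end{aligned}$$
(14)

Fig. 3. Element assignment of \(c_\kappa \).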

Looking at Fig. 3, each square and value between square brackets depicts some vector position (l notation). In Eq. 13, the general form (\(L_\kappa -l\)) consists of the number of subtractions (Eq. 8) represented by arrows in the figure, with \(\hat{d}_{l \kappa }\) depending on its location. The ordered \(\hat{d}_\kappa \) ensures that \(\hat{d}_{l \kappa } \le \hat{d}_{l+1,\kappa }\). By transitivity we have that, in Eq. 13,

$$\hat{d}_{l \kappa } \times \left( L_\kappa - l \right) \le c_{l \kappa }, \quad l = 1,\dots ,\tilde{L}_\kappa .$$

Hence, \(\hat{d}_\kappa \) is accessed considerably less often, thus reducing the time complexity of MEC.
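The reformulation in Eq. 13 can be checked numerically with the sketch below. It assumes that \(c_{l \kappa }\) is the sum of the sorted distances that follow position l, which is the reading under which Eq. 13 equals the pairwise definition; the function names and sample values are hypothetical.

```python
from itertools import combinations

def mean_abs_difference_naive(d):
    """Reference: mean of |d_p - d_q| over all L(L-1)/2 pairs."""
    L = len(d)
    return sum(abs(p - q) for p, q in combinations(d, 2)) / (L * (L - 1) // 2)

def mean_abs_difference_sorted(d):
    """Eq. 13: one sort plus a single backward pass over the sorted values."""
    d_hat = sorted(d)                    # increasing ordering of d_kappa
    L = len(d_hat)
    if L < 2:
        raise ValueError("at least two distances are needed")
    eta = L * (L - 1) // 2
    total = 0.0
    c = 0.0                              # running value of c_l
    for l in range(L - 1, 0, -1):        # 1-based l = L-1, ..., 1
        c += d_hat[l]                    # add the (l+1)th sorted value
        total += c - d_hat[l - 1] * (L - l)
    return total / eta

d = [3.0, 1.0, 4.0, 1.5, 9.0]
fast = mean_abs_difference_sorted(d)
slow = mean_abs_difference_naive(d)
```

The backward pass costs linear time, so the sort dominates the overall cost of this sketch.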

4 Results and Analyses

This section describes the results achieved by the five algorithms compared in this study: GADBA, its new version GADBA-MEC, K-means, GMM, and Linkage. Since none of the last three techniques automatically finds \(\hat{K}\), the Calinski and Harabasz CVI is used to optimize \(\hat{K}\) through cluster validation (Sect. 3.2). Section 4.1 presents the methodology used to generate the results; Sect. 4.2 discusses the results, highlighting the techniques that clustered the data successfully; and the statistical significance of the results is analysed in Sect. 4.3.

4.1 Applied Methodology

The accuracy of the clustering algorithms is expressed through a set of statistical indicators, such as the absolute frequency, mean, and standard deviation of \(\hat{K}\), over twenty clustering validations for each data set (i.e., \(N_r = 20\)). The Mean Absolute Percentage Error (MAPE) between the desired (\(K_{opt}\)) and optimized (\(\hat{K}\)) number of clusters is then estimated in Sect. 4.3. It expresses accuracy as a percentage and is given by

$$\begin{aligned} \mathrm {MAPE} = \frac{100\%}{N_r} \sum _{r=1}^{N_r} \left| \frac{K_{opt} - \hat{K}_r}{K_{opt}} \right| , \end{aligned}$$
(15)

where \(\hat{K}_r\) is the number of clusters obtained in the rth validation.
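As a small worked example, the sketch below computes MAPE for one data set, assuming the usual definition: the mean of \(|K_{opt} - \hat{K}| / K_{opt}\) over the validation runs, expressed as a percentage. The function name and the sample values are hypothetical and are not results from this study.

```python
def mape(k_opt, k_hats):
    """Mean absolute percentage error between k_opt and each estimate."""
    return 100.0 / len(k_hats) * sum(abs(k_opt - k) / k_opt for k in k_hats)

# Four hypothetical runs on a data set with four expected clusters.
error = mape(4, [4, 4, 5, 3])    # (0 + 0 + 0.25 + 0.25) / 4 * 100 = 12.5

# A single run that returns 3 clusters when 1 is expected gives 200%.
tendency_error = mape(1, [3])
```

The second case shows why values above 100% are possible, in particular for data sets with a small \(K_{opt}\).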

Table 1 presents data sets from different benchmarks used for performance analysis when comparing clustering algorithms. To evaluate the algorithms, ten sets were selected as archetypes of real challenges faced in cluster validation (e.g., data hierarchy, clustering tendency, different densities/sizes).

Table 1. Properties of test data sets.

The GADBA-MEC algorithm works through some previously specified hyper-parameters. Considering an oscillation of the best fitness in the order of \({1 \times 10^{-4}}\), the number of generations needed to infer the convergence of the fitness value is 50. The crossover and mutation probabilities are 0.8 and 0.03, respectively. The ring size of the tournament method for individual selection is set to 3. The population size and the maximum number of clusters are 100 and 30, respectively.

All experiments presented herein were conducted on a computer with an Intel® Core™ i5 CPU @ 3.00 GHz with 8 GB of memory running MATLAB® 2017a. Most packages used in our tests are internal to MATLAB®.

4.2 Cluster Detection Results

One approach to evaluate the performance of the clustering algorithms is to analyse how frequently \(\hat{K} = K_{opt}\). In this sense, Table 2 shows the frequency of \(K_{opt}\) with emphasis on the highest absolute frequency by algorithm in blue.

Table 2. Cluster detection results taken from the data sets.

Only GADBA is inconsistent with \(K_{opt}\) overall due to its SHM-related objective function, as evidenced by the performance of GADBA-MEC. Moreover, GADBA is the most unstable algorithm, as shown by the standard deviation values. The only highlight of GADBA was reached in Iris, although this might be explained by its tendency to settle on lower K values. Thus, a new version is justified ex post facto, attesting to the generalization potential of GADBA-MEC.

An important highlight of the proposed version is the detection of low-separation hierarchical data in H\(_1\). Virtually all other techniques tended to settle on expected sub-optimal K values. In contrast, in cases where GADBA-MEC did not reach the highest frequency (i.e., Dim-32, Hepta, and Iris), it at least approached the expected result in a stable manner, unlike GADBA.

For the rest of the algorithms, it should be noted that Linkage is deterministic, so its null standard deviation is expected. It was the second best in finding \(K_{opt}\), although it failed to assess clustering tendency. In this regard, only GADBA-MEC determined that \(\hat{K} = 1\) in GolfBall and One-G.

4.3 Statistical Significance Analyses

Friedman’s test is a non-parametric statistical test analogous to the two-way ANOVA (analysis of variance) [1, 16]. This statistical test is used to determine whether there are any statistically significant differences among algorithms from sample evidence. The samples to be considered are clustering algorithm performance results collected over the data sets, where the null hypothesis \(H_0\) to be considered is that all algorithms obtained similar results. Friedman’s test converts all results to ranks, where all algorithms are ranked for each problem according to their performance. As such, p-values can be computed for hypothesis testing. The p-value represents the probability of obtaining a result as extreme as the one observed, given \(H_0\) [7]. Specifically, given the significance level \(\alpha = 0.05\), the null hypothesis is rejected if \(p < \alpha \).
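The rank conversion and the Friedman statistic can be sketched as follows. This is a simplified illustration: ties receive average ranks but no tie correction is applied to the statistic, the p-value (from the chi-square distribution with k-1 degrees of freedom) is not computed, and the error table is hypothetical rather than taken from this study.

```python
def average_ranks(row):
    """Rank one data set's errors (1 = lowest), averaging tied positions."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # mean of 1-based positions i+1..j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def friedman_statistic(table):
    """table[n][k]: error of algorithm k on data set n (lower is better)."""
    n, k = len(table), len(table[0])
    rank_rows = [average_ranks(r) for r in table]
    mean_ranks = [sum(r[j] for r in rank_rows) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(R * R for R in mean_ranks) - k * (k + 1) ** 2 / 4)
    return mean_ranks, chi2

# Hypothetical MAPE values: 4 data sets (rows) x 3 algorithms (columns).
table = [[1.0, 20.0, 150.0],
         [0.0, 35.0, 90.0],
         [5.0, 10.0, 300.0],
         [2.0, 15.0, 40.0]]
mean_ranks, chi2 = friedman_statistic(table)
```

In this toy table the first algorithm is always the best and the last always the worst, so the statistic reaches its maximum value of n(k-1) = 8.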

Table 3. MAPE (%) taken from the data sets with emphasis on values above 100%.
Table 4. Friedman’s test on the data set results.
Table 5. Friedman’s post-hoc pairwise comparisons on the data set results, with emphasis on significant comparisons.

Since we want to know which algorithms are significantly different from each other when \(H_0\) is rejected, a post-hoc procedure is necessary to compare all possible algorithm pairs. In this work, the procedure presented in [16] is employed, in which the critical value at \(\alpha \) is compared to the absolute difference between the mean ranks of each pair of algorithms i and j, \(i \ne j\). The absolute difference must be greater than the critical value to indicate statistical significance.

In this section, we verify the significance of GADBA-MEC using Friedman’s test. For this purpose, we calculate the MAPE of each data set. Once the algorithm with the smallest error is determined, the statistical significance test is applied to verify whether the obtained difference is substantial. If this is the case, one can justify using one algorithm instead of another with more confidence.

Table 3 summarizes MAPE per data set, emphasizing values above 100%. All algorithms except GADBA-MEC had error rates above 100%; GADBA-MEC had the lowest overall value (4.99%). Table 4 focuses on these errors, for which Friedman’s test rejects the null hypothesis with an obtained p-value \(\ll \alpha = 0.05\). Accordingly, Friedman’s post-hoc test shows that there are significant improvements of the proposed version in terms of MAPE (Table 5), and that significant differences exist in virtually all algorithm pairs.

5 Conclusions and Further Work

Genetic-based clustering approaches play an important role in natural computing. In this sense, GADBA was introduced as an efficient, bioinspired approach to cluster data in SHM. Despite its competitive performance in identifying structural components, it produces poor results in more general clustering scenarios. For this reason, this study proposed the replacement of its objective function with MEC, a recently developed CVI based on mutual equidistant-scattering analysis.

GADBA-MEC outperforms conventional clustering algorithms when statistically evaluated across various data sets, attaining the expected number of clusters more often than others. The results showed that GADBA-MEC yielded better results in terms of cluster validation and MAPE errors, in particular when handling hierarchical data and data with low separation. Also, only GADBA-MEC is able to verify the clustering tendency in the data sets addressed.

As future work, we intend to expand GADBA-MEC to multi-objective optimization contexts. It is also relevant to apply GADBA-MEC to real-world problems to validate its efficiency in finding natural clusters. Finally, comparisons with other CVIs and bioinspired algorithms would be pertinent as well.