A novel variant of social spider optimization using single centroid representation and enhanced mating for data clustering

Ravichandran Thalamala, Janet Barnabas and A.V. Reddy
Department of Computer Applications, National Institute of Technology, Trichy, India

Submitted 13 February 2019; Accepted 3 June 2019; Published 22 July 2019
Corresponding author: Ravichandran Thalamala, sirichandran007@gmail.com
Academic editor: Yilun Shang
DOI 10.7717/peerj-cs.201
Copyright 2019 Thalamala et al. Distributed under Creative Commons CC-BY 4.0.

ABSTRACT
Nature-inspired algorithms are based on the concepts of self-organization and complex biological systems. They have been designed by researchers and scientists to solve complex problems in various environmental situations by observing how naturally occurring phenomena behave. The introduction of nature-inspired algorithms has led to new branches of study such as neural networks, swarm intelligence, evolutionary computation, and artificial immune systems. Particle swarm optimization (PSO), social spider optimization (SSO), and other nature-inspired algorithms have found some success in solving clustering problems, but they may converge to local optima due to the lack of balance between exploration and exploitation. In this paper, we propose a novel implementation of SSO, namely social spider optimization for data clustering using single centroid representation and enhanced mating operation (SSODCSC), in order to improve the balance between exploration and exploitation. In SSODCSC, we implemented each spider as a collection of a centroid and the data instances close to it. We allowed non-dominant male spiders to mate with female spiders by converting them into dominant males. We found that SSODCSC produces better values for the sum of intra-cluster distances, the average CPU time per iteration (in seconds), accuracy, the F-measure, and the average silhouette coefficient as compared with K-means and other nature-inspired techniques. When the proposed algorithm is compared with other nature-inspired algorithms on the Patent corpus datasets, the overall percentage increase in accuracy is approximately 13%. When it is compared with other nature-inspired algorithms on the UCI datasets, the overall percentage increase in the F-measure value is approximately 10%. For completeness, the best K cluster centroids (the best K spiders) returned by SSODCSC are specified. To show the significance of the proposed algorithm, we conducted a one-way ANOVA test on the accuracy values and the F-measure values returned by the clustering algorithms.

Subjects: Agents and Multi-Agent Systems, Artificial Intelligence, Data Mining and Machine Learning
Keywords: Social spider optimization, Nature inspired algorithms, Swarm intelligence, Single cluster approach, Data clustering, Cluster centroids

INTRODUCTION
Data clustering is one of the most popular unsupervised classification techniques in data mining.
It rearranges the given data instances into groups such that the similar data instances are placed in the same group while the dissimilar data instances are placed in separate groups (Bernábe-Loranca et al., 2014). Data clustering identifies the groups present in a data set, each of which contains related data instances. Network clustering identifies the groups present in a computer network, each of which contains highly connected computers. Network clustering returns the various topological structures present in a computer network, as shown in Fig. 1, whereas data clustering returns cluster sets of related data instances. The quality of data clustering is measured using metrics such as intra-cluster distances (ICD), inter-cluster distances, the F-measure, and accuracy. The quality of network clustering is measured using metrics such as the global clustering coefficient and the average of the local clustering coefficients. Data clustering is an NP-hard problem (Aloise et al., 2009) with the objective of minimizing the ICD within clusters and maximizing the inter-cluster distances across clusters (Steinley, 2006).

Figure 1. Topological structures present in a computer network: network clustering.

A dataset DS is a collection of data instances. Each data instance in a dataset DS can be represented by a data vector. In a dataset of text files, each text file is a data instance and can be represented as a data vector using mechanisms such as term frequency-inverse document frequency (TF-IDF) (Beil, Ester & Xu, 2002). In this paper, we use the terms "data instance" and "data vector" interchangeably and define clustering as a minimization problem that minimizes the sum of intra-cluster distances (SICD). Clustering forms a set of K clusters. Let CL be the set of K clusters for which the SICD is minimized. The mathematical model for the clustering problem can be defined as shown in Eq. (1); the clustering function F takes DS and returns CL after minimizing the SICD.

Minimize $F(DS, CL) = \sum_{i=1}^{K} \sum_{j=1}^{n_i} \mathrm{distance}(c_i, dv_{i,j})$ (1)

In Eq. (1), distance is the distance function that returns the distance between two given data vectors, $dv_{i,j}$ is the jth data vector present in the ith cluster of CL, $n_i$ is the number of data vectors present in the ith cluster of CL, and $c_i$ is the centroid of the ith cluster of CL. The classical clustering algorithms are categorized into hierarchical and partitional algorithms. The main drawback of hierarchical clustering is that the clusters formed in an iteration cannot be undone in the following iterations (Rani & Rohil, 2013). K-means is one of the simplest partitional algorithms (Mihai & Mocanu, 2015), but it has two drawbacks: the number of clusters to be formed should be specified a priori, and it generally produces locally optimal solutions due to its high dependency on the initial centroids.
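To make the objective in Eq. (1) concrete, the following minimal Java sketch computes the SICD of a candidate clustering (the experiments in this paper were implemented in Java; the class and method names here are illustrative and are not taken from the paper's implementation):

```java
import java.util.List;

// Minimal sketch of the clustering objective in Eq. (1): the sum of
// intra-cluster distances (SICD) over K clusters.
public class SicdObjective {

    // Euclidean distance between two data vectors of equal length.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // SICD: for each cluster i, add the distance of every member dv_{i,j}
    // to that cluster's centroid c_i, then sum over all K clusters.
    static double sicd(List<double[]> centroids, List<List<double[]>> clusters) {
        double total = 0.0;
        for (int i = 0; i < centroids.size(); i++) {
            for (double[] dv : clusters.get(i)) {
                total += distance(centroids.get(i), dv);
            }
        }
        return total;
    }
}
```

A lower return value of sicd indicates more compact clusters, which is exactly what the minimization in Eq. (1) seeks.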
Examples of other classical clustering algorithms are BIRCH (Zhang, Ramakrishnan & Livny, 1996), CURE (Guha, Rastogi & Shim, 1998), CLARANS (Ng & Han, 2002), and CHAMELEON (Karypis, Han & Kumar, 1999). Classical algorithms suffer from drawbacks such as convergence to local optima, sensitivity to initialization, and a high computational effort to reach the global optimum. In order to overcome these problems, nature-inspired metaheuristic algorithms are now used for data clustering. In this study, we investigated the performance of social spider optimization (SSO) for data clustering using a single centroid representation and an enhanced mating operation. The algorithm was evaluated on the Patent corpus5000 datasets and UCI datasets. Each data instance in the UCI datasets is a data vector, but the data instances in the Patent corpus5000 datasets are text files. Before we applied the proposed algorithm to these datasets, the text files were represented as data vectors using the TF-IDF mechanism. The vector representation of the ith data instance $dv_i$ present in the dataset DS can be specified using Eq. (2).

$dv_i = (w_{i,1}, w_{i,2}, w_{i,3}, \ldots, w_{i,t})$ (2)

In Eq. (2), $w_{i,j}$ is the term weight of the jth distinguishable term of the dataset DS in the ith data instance, and t is the total number of distinguishable terms in the dataset DS. The term weight $w_{i,j}$ can be computed using Eq. (3).

$w_{i,j} = TF_{i,j} \times IDF_j$ (3)

In Eq. (3), $TF_{i,j}$ (the term frequency of the jth distinguishable term of the dataset DS in the ith data instance) is the number of times that the jth distinguishable term of the dataset DS occurs in the ith data instance, and $IDF_j$ is the inverse document frequency of the jth distinguishable term of the dataset DS. $IDF_j$ can be calculated using Eq. (4).

$IDF_j = \log\left(\frac{n}{m}\right)$ (4)

In Eq. (4), n is the total number of data instances in DS and m is the number of data instances in which the jth distinguishable term of the dataset DS is present.
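A minimal sketch of the TF-IDF vectorization in Eqs. (2)-(4) is given below. It assumes raw term counts for TF and the natural logarithm for IDF (the paper does not state the logarithm base); the identifiers are illustrative.

```java
import java.util.Collections;
import java.util.List;

// Minimal sketch of TF-IDF vectorization (Eqs. (2)-(4)).
public class TfIdf {

    // docs: each document given as its list of tokens; terms: the t
    // distinguishable terms of the dataset in a fixed order.
    // Returns one weight vector per document, as in Eq. (2).
    static double[][] vectorize(List<List<String>> docs, List<String> terms) {
        int n = docs.size(), t = terms.size();
        double[][] vectors = new double[n][t];
        for (int j = 0; j < t; j++) {
            String term = terms.get(j);
            int m = 0;                              // documents containing term j
            for (List<String> doc : docs) {
                if (doc.contains(term)) m++;
            }
            double idf = Math.log((double) n / m);  // IDF_j = log(n / m), Eq. (4)
            for (int i = 0; i < n; i++) {
                int tf = Collections.frequency(docs.get(i), term); // TF_{i,j}
                vectors[i][j] = tf * idf;           // w_{i,j} = TF_{i,j} x IDF_j, Eq. (3)
            }
        }
        return vectors;
    }
}
```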
Full-size DOI: 10.7717/peerj-cs.201/fig-3 Thalamala et al. (2019), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.201 4/33 http://dx.doi.org/10.7717/peerj-cs.201/fig-2 http://dx.doi.org/10.7717/peerj-cs.201/fig-3 http://dx.doi.org/10.7717/peerj-cs.201 https://peerj.com/computer-science/ have the lowest individual ICD. Table 1 specifies the best three agents (spiders) returned by our proposed algorithm. The SICD value of the globally best solution is 40 + 30 + 65 = 135, which is less than the SICD value of the globally best solution in the K-cluster representation of the agent. Therefore, the clustering results produced by the start-of-the art algorithms that use K-centroid representation for agents may not be highly accurate. The proposed approach not only focuses on SICD but also on the individual ICD of the clusters. In our proposed algorithm, social spider optimization for data clustering using single centroid (SSODCSC), each spider is represented by a single centroid and the list of data instances close to it. This representation requires K times less memory requirements than the representation used by the other state-of-the-art nature-inspired algorithms like SSO, as shown below. Each data instance in the dataset is given an identification number. Instead of storing data instances, we stored their identification numbers (which are integer values) in spiders. Figure 4 Data instances present in Agent3. It specifies data instances present in Agent3. Full-size DOI: 10.7717/peerj-cs.201/fig-4 Figure 5 Data instances present in Agent4. It specifies data instances present in Agent4. Full-size DOI: 10.7717/peerj-cs.201/fig-5 Thalamala et al. (2019), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.201 5/33 http://dx.doi.org/10.7717/peerj-cs.201/fig-4 http://dx.doi.org/10.7717/peerj-cs.201/fig-5 http://dx.doi.org/10.7717/peerj-cs.201 https://peerj.com/computer-science/ For SSODCSC: Number of spiders used = 50 Number of iterations for best clustering results = 300 Total number of spiders to be computed = 300 � 50 = 15,000 Memory required for storing a double value = 8 bytes Memory required for storing a spider’s centroid (that consists of m dimension values) = 8 � m bytes, where m is the number of dimensions present in the dataset. Memory required for storing an integer value representing identification number of a data instance = 4 bytes. Maximum memory required for storing the list of identification numbers of data instances closer to the centroid = 4 � n bytes, where n is number of data instances present in the dataset. Maximum memory required for a spider = 8 � m + 4 � n bytes Therefore, total computational memory of SSODCSC = 15,000 � (8 � m + 4 � n) bytes. For SSO: Number of spiders used = 50 Number of iterations for best clustering results = 300 Total number of spiders to be computed = 300 � 50 = 15,000 Memory required for storing a double value = 8 bytes Memory required for storing K centroids of a spider = 8 � m � K bytes, where m is the number of dimensions present in the dataset. Memory required for storing an integer value representing identification number of a data instance = 4 bytes Table 1 The best three agents returned by SSODCSC. Agent Data vectors Intra-cluster distance Part of best solution? 
The time required for initializing the spiders is also lower in this representation. The average CPU time per iteration depends on the time required for computing the fitness values and the time required for computing the next positions of the spiders in the solution space. The fitness values and next positions of the spiders can be computed in less time with the single centroid representation, so the average CPU time per iteration decreases. The proposed algorithm returns the best K spiders such that the union of the lists of data instances present in them produces exactly all of the data instances in DS. In the basic SSO algorithm, non-dominant males are not allowed in the mating operation because of their low weight values. They do not receive any vibrations from other spiders and have no communication in the web, as communication is established through vibration only (Shukla & Nanda, 2016). Therefore, their presence in the solution space is questionable. Moreover, their next positions depend on the existing positions of the dominant male spiders (Cuevas et al., 2013). They cannot be part of the selected solution when dominant male spiders are in the solution space. In our proposed algorithm, SSODCSC, we convert them into dominant male spiders by increasing their weight values and then allow them to participate in the mating operation to produce a new spider better than the worst spider in the population. In SSO, each dominant male mates with a set of females and produces a new spider, and the weight of the new spider may or may not be greater than that of the worst spider. But as we make the weight of each non-dominant male spider greater than the average weight of the dominant male spiders in SSODCSC, the new spider produced by the non-dominant male spider is surely better than the worst spider. In other words, we not only convert non-dominant male spiders into dominant male spiders, but also make them more effective than the dominant male spiders. Therefore, each spider receives vibrations from other spiders and has a chance of becoming a part of the selected solution, unlike in SSO. At each iteration of SSODCSC, the population size stays the same, but spiders with greater weight values are introduced in place of the worst spiders. As a result, the current solution given by SSODCSC moves toward the globally best solution as the number of iterations increases. We applied SSODCSC to feature-based datasets and text datasets.
This paper is organized as follows: "Related Work" describes the recent related work on solving clustering problems using nature-inspired algorithms, "Proposed Algorithm: SSODCSC" describes SSODCSC, "Results" includes the experimental results, and we conclude the paper with future work in the section "Discussion."

RELATED WORK
Shukla & Nanda (2016) proposed two clustering algorithms based on the original version of SSO and a parallel version of SSO (P-SSO). P-SSO computes the next positions of female spiders, dominant male spiders, and non-dominant male spiders simultaneously in each iteration. They applied the two algorithms to low dimensional, high dimensional, overlapping, and non-overlapping datasets, and found that the two algorithms produce consistent clustering results as compared with other clustering algorithms. They designed a flood image segmentation application based on their proposed work and obtained two times better accuracy than K-means when applied to NASA satellite images of the flood-affected areas of Chennai. Zhou et al. (2018) used the symbiotic organism search (SOS) algorithm for solving clustering problems. The SOS algorithm mimics the interactive behavior of organisms in nature. In SOS, new solutions are generated by imitating the biological interactions between two organisms in the ecosystem. SOS implements three phases, namely, the mutualism phase, the commensalism phase, and the parasitism phase. In the mutualism phase, the organisms interact to increase their mutual survival advantage. In the commensalism phase, the interaction benefits one organism but does not impact the other. In the parasitism phase, the organism with better fitness kills the other. The algorithm produced better clustering results when compared with other algorithms on low dimensional datasets. However, the authors did not apply SOS to high dimensional datasets, and the algorithm suffers from an imbalance between exploration and exploitation due to its high randomness. Nayak et al. (2018) hybridized the Elicit Teaching learning based optimization approach with the fuzzy c-means (FCM) clustering algorithm. At each iteration of Elicit Teaching learning based optimization, the worst entities are replaced with the best entities in each cluster group. The best cluster centroids produced by Elicit Teaching learning based optimization are taken as the inputs for the FCM clustering algorithm. They found that the proposed algorithm produces clustering solutions with better fitness values when compared with other clustering algorithms. Zhou et al. (2017) proposed a simplex method based social spider optimization (SMSSO) algorithm to overcome the drawbacks of SSO, namely local optima entrapment and poor convergence rates. In their algorithm, the spider with the worst fitness is replaced by a reflected or extended alternate spider so that the global search may be improved. The largest dataset used in their experiments has only 13 dimensions. The algorithm performs well on low dimensional datasets; it may not be as effective for high dimensional datasets, as the simplex mechanism becomes expensive for those datasets.
The differences between SMSSO and our proposed algorithm SSODCSC are as follows: in SMSSO, the initial solution moves along the edges of the polytope until it reaches the optimal solution; in SSODCSC, the solution space holds spiders of better fitness values after every iteration. SMSSO supports the mating of dominant males only, whereas SSODCSC allows the mating of both dominant and non-dominant male spiders with female spiders. In SMSSO, the fitness of the new spider may or may not be better than that of the worst spider; in SSODCSC, the fitness of the new spider is always better than that of the worst spider. In SMSSO, each spider is represented as a collection of K centroids; in SSODCSC, each spider is represented as a single centroid. SMSSO returns the best spider, whereas SSODCSC returns the set of the first K best spiders. Han et al. (2017) proposed the bird flock gravitational search algorithm (BFGSA) to enable the gravitational search algorithm to escape from suboptimal solutions. The authors used a concept called the collective response of object reorientation to avoid stagnation. In that model, if the fitness of the global optimum remains the same over several subsequent iterations, the algorithm applies the collective response of object reorientation, which updates the position of each object using the mean position of its nearest seven neighbors. The simulation results indicate that BFGSA can be used for both low and high dimensional datasets. The algorithm converges within 500 iterations, which is relatively high when compared with other algorithms. Jothi, Mohanty & Ojha (2018) proposed minimum spanning tree (MST) based clustering on the partition-based nearest neighbor graph for reducing the computational overhead. Their algorithm produces a sparse local neighborhood graph (LNG), and the approximate MST is then constructed from the LNG. They showed that the algorithm outperforms the traditional algorithms by reducing both the size of the LNG and the computational time to construct it. Experiments were conducted on both synthetic and real datasets. Chen et al. (2018) proposed a novel optimum-path forest (OPF) clustering algorithm that can be used for remote sensing segmentation. They defined a probability density function using the principle that the cluster centers depend on their distances from samples with higher densities. They applied the algorithm to five remote sensing land cover images, and the clustering results show that it outperforms the original OPF approach. Nayak et al. (2014) proposed an improved firefly-based fuzzy c-means algorithm (improved FAFCM) to resolve the drawbacks of the FCM algorithm using the firefly algorithm. They used the firefly algorithm to minimize the objective function value of the FCM algorithm. The output centroids of the firefly algorithm are passed to the FCM algorithm as initial centroids so that it refines them further. They found that improved FAFCM produces better clustering results as compared to FCM, particle swarm optimization (PSO), and FAFCM. De Andrade Silva, Hruschka & Gama (2017) proposed a fast evolutionary algorithm for clustering data streams (FEAC-Stream). It is capable of estimating the number of clusters to be formed from the data in an online fashion.
The Page–Hinkley test is used by FEAC-Stream to identify eventual degradation in the induced cluster quality. The algorithm is based on the assumption that clusters provide useful information about the dynamics of the data stream. They applied it to synthetic and real-world data streams and showed that it produces better clustering results than other algorithms. Costa et al. (2015) proposed a nature-inspired approach to data clustering based on the optimum-path forest algorithm (OPFC). OPFC accepts a graph representation of the dataset. The nodes of the graph are samples, and each sample is connected to its k-nearest neighbors. Each node has a weight: its probability density function value, which is computed using its distances from its k-nearest neighbors. After the k-nn graph is constructed, OPFC finds the roots, which are the nodes with the maximum probability density function values, and propagates one optimum-path cluster from each root to the remaining nodes in the graph. Alswaitti, Albughdadi & Isa (2018) presented a novel approach to data clustering based on particle swarms. For balancing exploitation and exploration, they used the kernel density estimation technique and estimated multidimensional gravitational learning coefficients. The kernel density estimation technique is used for avoiding premature convergence. They showed that their algorithm produces better accuracy and better cluster compactness than other clustering algorithms when applied to benchmark datasets from the UCI Machine Learning Repository.

PROPOSED ALGORITHM: SSODCSC
Social spider optimization is based on the cooperative behavior of social spiders in obtaining common food. Spiders are classified into two types, namely male spiders and female spiders (Cuevas et al., 2013). Nearly 70% of the population is female. Each spider is characterized by its position, fitness, weight, and the vibrations received from other spiders (Thalamala, Reddy & Janet, 2018). In the K-centroid representation, each spider has K centroids, each associated with a list of the data instances closest to it, as shown in Fig. 6A. In SSODCSC, each spider has two components, namely a centroid and a list of the identification numbers of the data instances closest to it, as shown in Fig. 6B. The number of data instances close to the centroids of different spiders may differ, so the length of each spider may differ; however, the length of the centroid component of each spider is fixed. Therefore, we use only the centroid components of the spiders to specify their positions in the solution space. When the spiders move in the solution space, only the centroid components are moved or updated; a new list of identification numbers of data instances may then be found to be closer to the centroid, depending on its new location. We use the terms spider, position of spider, and centroid component of spider interchangeably in this article. The fitness of a spider is the sum of the distances of its data instances from its centroid. The weights of the spiders are computed based on their fitness values.

Figure 6. (A) K-cluster representation of a spider; (B) single cluster representation of a spider.
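A minimal sketch of this single-centroid representation is shown below: one centroid array plus the identification numbers of the data instances assigned to it. The class layout is illustrative, not the paper's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the single-centroid spider representation: one centroid
// plus the identification numbers of the data instances closest to it.
public class Spider {
    double[] centroid;                              // position in the solution space
    List<Integer> memberIds = new ArrayList<>();    // ids of assigned data instances
    boolean female;                                 // roughly 70% of the population

    // Fitness: the sum of the distances of the member instances from the centroid.
    double fitness(double[][] data) {
        double sum = 0.0;
        for (int id : memberIds) {
            double sq = 0.0;
            for (int d = 0; d < centroid.length; d++) {
                double diff = data[id][d] - centroid[d];
                sq += diff * diff;
            }
            sum += Math.sqrt(sq);
        }
        return sum;
    }
}
```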
In SSODCSC, the weight of a spider is inversely related to the fitness value of the spider. The spider with the largest weight (smallest fitness) is known as the globally best spider $s^1_{gbs}$. In this paper, we use the notations $s^1_{gbs}$ and $s_{gbs}$ interchangeably. In SSODCSC, $s^1_{gbs}$ has the largest weight and the lowest value for the sum of the distances of its data instances from its centroid. The spider with the smallest weight (largest fitness value) is known as the worst spider $s_{ws}$, as shown in Fig. 7. In SSODCSC, $s_{ws}$ has the smallest weight and the largest value for the sum of the distances of its data instances from its centroid. Each spider receives vibrations from the globally best spider $s_{gbs}$, its nearest better spider $s_{nbs}$, and its nearest female spider $s_{nfs}$. The male spiders are classified into two types, namely dominant males and non-dominant males. The weight of a dominant male spider is greater than or equal to the median weight of the male spiders (Cuevas & Cienfuegos, 2014), as shown in Fig. 8. The male spiders that are not dominant males are called non-dominant males. A female spider can either attract or repulse other spiders. The weight of a spider s can be computed using Eq. (5); the lower the sum of the distances (i.e., the fitness), the higher the weight of the spider in SSODCSC.

$weight[s] = \frac{fitness(s) - fitness(s_{ws})}{fitness(s_{gbs}) - fitness(s_{ws})}$ (5)

The SSODCSC algorithm returns the spiders $s^1_{gbs}, s^2_{gbs}, \ldots, s^K_{gbs}$ that have the first K largest weight values (first K smallest fitness values) such that the union of the data instances present in them gives exactly all of the data instances present in the dataset.

Figure 7. Spiders on the scale of weight values.
Figure 8. Male spiders on the scale of weight values.

We used a two-dimensional array, namely spider, to store the centroid components of all spiders. For example, in the case of the Iris dataset, the centroid component of the first spider is stored in spider[1, 1], spider[1, 2], spider[1, 3], and spider[1, 4], as Iris has four dimensions. The number of spiders returned by SSODCSC depends on the number of clusters inherently present in the dataset. For example, in the case of the Iris dataset, though we use 50 spiders, the algorithm returns only the first best, second best, and third best spiders, because the Iris dataset inherently has three clusters. In the following subsections, we explain how the spiders are initialized in the solution space, how the data instances are assigned to them, how the next positions of the spiders are found, and how the mating operation produces a new spider.
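A minimal sketch of the weight computation in Eq. (5) follows; it assumes the fitness values of all spiders have already been computed and are not all equal (otherwise the denominator in Eq. (5) would be zero).

```java
// Minimal sketch of Eq. (5): weights are a min-max normalization of fitness,
// so the globally best spider (smallest fitness) gets weight 1 and the worst
// spider (largest fitness) gets weight 0.
public class SpiderWeights {
    static double[] weights(double[] fitness) {
        double best = Double.POSITIVE_INFINITY;   // fitness(s_gbs)
        double worst = Double.NEGATIVE_INFINITY;  // fitness(s_ws)
        for (double f : fitness) {
            best = Math.min(best, f);
            worst = Math.max(worst, f);
        }
        double[] w = new double[fitness.length];
        for (int s = 0; s < fitness.length; s++) {
            w[s] = (fitness[s] - worst) / (best - worst);  // Eq. (5)
        }
        return w;
    }
}
```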
Initialization
SSODCSC starts with the initialization of the spiders in the solution space. Initially all spiders are empty: the fitness of each spider is set to 0, and the weight is set to 1. Each spider s is initialized with a random centroid using Eq. (6),

$spider[s, d] = lowerbound(d) + random(0, 1) \times (upperbound(d) - lowerbound(d))$ (6)

where spider[s, d] is the dth dimension of the centroid of spider s, and lowerbound(d) and upperbound(d) are the smallest and largest values of the dth dimension of the dataset, respectively.

Assignment of data instances
The distances of each data instance from the centroids of all spiders are calculated using the Euclidean distance function. A data instance is assigned to the spider that contains its nearest centroid.

Next positions of spiders
The spiders are moved across the solution space in each iteration of SSODCSC based on their gender. The movement of a spider in the solution space depends on the vibrations received from other spiders. The intensity of the vibrations originating from spider $s_j$ and received by spider $s_i$ can be found using Eq. (7); it depends on the distance between the two spiders and the weight of spider $s_j$.

$vibrations(s_i, s_j) = weight(s_j) \times e^{-distance(s_i, s_j)^2}$ (7)

Next positions of female spiders
The movement of a female spider $s_f$ depends on the vibrations from the globally best spider $s_{gbs}$ and its nearest better spider $s_{nbs}$, as shown in Fig. 9. To generate the next position of a female spider $s_f$, a random number is generated; if it is less than the threshold probability (TP), the female spider attracts other spiders and its position is calculated according to Eq. (8). If not, it repulses other spiders and its position is calculated according to Eq. (9). In Eqs. (8) and (9), $\alpha$, $\beta$, $\gamma$, and $\delta$ are random numbers from the interval [0, 1].

$spider[s_f, d] = spider[s_f, d] + \alpha \, (spider[s_{gbs}, d] - spider[s_f, d]) \, weight[s_{gbs}] \, e^{-distance(s_f, s_{gbs})^2} + \beta \, (spider[s_{nbs}, d] - spider[s_f, d]) \, weight[s_{nbs}] \, e^{-distance(s_f, s_{nbs})^2} + \gamma \, (\delta - 0.5)$ (8)

$spider[s_f, d] = spider[s_f, d] + \alpha \, (spider[s_{gbs}, d] - spider[s_f, d]) \, weight[s_{gbs}] \, e^{-distance(s_f, s_{gbs})^2} - \beta \, (spider[s_{nbs}, d] - spider[s_f, d]) \, weight[s_{nbs}] \, e^{-distance(s_f, s_{nbs})^2} + \gamma \, (\delta - 0.5)$ (9)

Figure 9. Next position of a female spider in SSODCSC.
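A minimal sketch of the female position update in Eqs. (8) and (9) is given below, using the vibration model of Eq. (7). The TP value is an assumed constant (the paper does not report one), and the direction of the attraction terms follows the usual SSO formulation; as with the other sketches, the identifiers are illustrative.

```java
import java.util.Random;

// Minimal sketch of the female position update (Eqs. (7)-(9)). With
// probability TP the female is attracted (Eq. (8)); otherwise the second
// term is subtracted, producing repulsion (Eq. (9)).
public class FemaleMove {
    static final double TP = 0.7;           // threshold probability (assumed value)
    static final Random rnd = new Random();

    // vibrations(s_i, s_j) = weight(s_j) * e^{-distance(s_i, s_j)^2}, Eq. (7)
    static double vibration(double weightJ, double dist) {
        return weightJ * Math.exp(-dist * dist);
    }

    // f, gbs, nbs: centroid components of the female, the globally best spider,
    // and the nearest better spider; w* and d* are their weights and distances
    // from the female.
    static void move(double[] f, double[] gbs, double[] nbs,
                     double wGbs, double wNbs, double dGbs, double dNbs) {
        double alpha = rnd.nextDouble(), beta = rnd.nextDouble();
        double gamma = rnd.nextDouble(), delta = rnd.nextDouble();
        double sign = (rnd.nextDouble() < TP) ? +1.0 : -1.0;  // attract or repulse
        for (int d = 0; d < f.length; d++) {
            f[d] += alpha * (gbs[d] - f[d]) * vibration(wGbs, dGbs)
                  + sign * beta * (nbs[d] - f[d]) * vibration(wNbs, dNbs)
                  + gamma * (delta - 0.5);
        }
    }
}
```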
Next position of male spiders
The solution space consists of female spiders and male spiders. When data instances are added to or removed from them, their fitness values and weight values change. If the current weight of a male spider is greater than or equal to the median weight of the male spiders, it is considered to be a dominant male spider; the male spiders that are not dominant male spiders are called non-dominant male spiders. The next position of a dominant male $s_{dm}$ can be calculated using Eq. (10).

$spider[s_{dm}, d] = spider[s_{dm}, d] + \alpha \, (spider[s_{nfs}, d] - spider[s_{dm}, d]) \, weight[s_{nfs}] \, e^{-distance(s_{dm}, s_{nfs})^2} + \gamma \, (\delta - 0.5)$ (10)

The position of a dominant male spider thus depends only on the vibrations received from its nearest female spider $s_{nfs}$; a pictorial representation is given in Fig. 10.

Figure 10. Next position of a dominant male spider in SSODCSC.

The weighted mean of the male population, W, can be obtained using Eq. (11). Let $N_f$ be the total number of female spiders in the spider colony and $N_m$ be the total number of male spiders. Then the female spiders can be named $s_{f1}, s_{f2}, s_{f3}, \ldots, s_{fN_f}$ and the male spiders can be named $s_{m1}, s_{m2}, s_{m3}, \ldots, s_{mN_m}$.

$W = \frac{\sum_{i=1}^{N_m} spider[s_{mi}, d] \times weight(s_{mi})}{\sum_{i=1}^{N_m} weight(s_{mi})}$ (11)

The next position of a non-dominant male spider $s_{ndm}$ can be calculated using Eq. (12) and depends on the weighted mean of the male population.

$spider[s_{ndm}, d] = spider[s_{ndm}, d] + \alpha \times W$ (12)

Mating operation
Each dominant male spider mates with the set of female spiders within the specified range of mating to produce a new spider, as shown in Fig. 11. The new spider is generated using the Roulette wheel method (Chandran, Reddy & Janet, 2018). If the weight of the new spider is better than the weight of the worst spider, the worst spider is replaced by the new spider. The range of mating r is calculated using Eq. (13),

$r = \frac{diff}{2n}$ (13)

where diff is the sum of the differences between the upper bound and the lower bound of each dimension, and n is the number of dimensions of the dataset DS.

Figure 11. Mating of a dominant male spider in SSODCSC.

In SSO, the non-dominant male spiders are not allowed to mate with female spiders, as they would produce new spiders having low weights. In SSODCSC, a non-dominant male spider is converted into a dominant male spider by making its weight greater than or equal to the average weight of the dominant male spiders, so that it participates in the mating process and produces a new spider whose weight is better than that of at least one other spider. The theoretical proof of the possibility of converting a non-dominant male spider into a dominant male spider is provided in Theorem 1. Thus, non-dominant male spiders become more powerful than dominant male spiders, as they are made to produce new spiders that surely replace the worst spiders in the population. The theoretical proof of the possibility of obtaining a new spider that is better than the worst spider, after a non-dominant male spider mates with the female spiders, is provided in Theorem 2. The following steps are used to convert a non-dominant male spider into a dominant male spider (a sketch follows the list):
Step 1: Create a list consisting of the data instances of the non-dominant male spider $s_{ndm}$ in decreasing order of their distances from its centroid.
Step 2: Delete the top-most data instance (i.e., the data instance at the greatest distance from the centroid) from the list.
Step 3: Find the weight of the non-dominant male spider $s_{ndm}$.
Step 4: If the weight of the non-dominant male is less than the average weight of the dominant male spiders, go to Step 2.
The flowchart for SSODCSC is specified in Fig. 12.
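A minimal sketch of the conversion procedure in Steps 1-4 is given below; it reuses the Spider sketch shown earlier and takes the weight function as a parameter, since Eq. (5) normalizes weights over the whole population. The helper names are illustrative.

```java
import java.util.Comparator;
import java.util.function.ToDoubleFunction;

// Minimal sketch of Steps 1-4: repeatedly remove the member instance farthest
// from the centroid of a non-dominant male until its weight reaches the
// average weight of the dominant male spiders.
public class Promote {
    static void toDominant(Spider ndm, double[][] data, double avgDominantWeight,
                           ToDoubleFunction<Spider> weightOf) {
        // Step 1: order member ids by decreasing distance from the centroid.
        ndm.memberIds.sort(Comparator.comparingDouble(
                (Integer id) -> dist(data[id], ndm.centroid)).reversed());
        // Steps 2-4: drop the farthest member until the weight is large enough.
        while (!ndm.memberIds.isEmpty()
                && weightOf.applyAsDouble(ndm) < avgDominantWeight) {
            ndm.memberIds.remove(0); // fewer members -> lower fitness -> higher weight
        }
    }

    private static double dist(double[] a, double[] b) {
        double sq = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sq += diff * diff;
        }
        return Math.sqrt(sq);
    }
}
```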
Theorem 1. A non-dominant male spider can be converted into a dominant male spider in the single centroid representation of SSO.

Proof: Let $s_{ndm}$ be the non-dominant male spider whose weight is $w_{ndm}$, and let medwgt be the median weight of the male spiders (which is always less than or equal to 1). According to the definition of a non-dominant male spider,

$w_{ndm} < medwgt$ (14)

Assume that the theorem is false.
⇒ $s_{ndm}$ cannot be converted into a dominant male spider.
⇒ During the movement of $s_{ndm}$ in the solution space,

$w_{ndm} < 1$ (15)

Let Sum be the sum of the distances of the data instances from the centroid of $s_{ndm}$. If the data instance at the greatest distance from the centroid of $s_{ndm}$ is removed from $s_{ndm}$, then:
⇒ the sum of the distances of the data instances from the centroid of $s_{ndm}$ decreases, as Sum = Sum − (distance of the removed data instance from the centroid of $s_{ndm}$);
⇒ the fitness of $s_{ndm}$ decreases, as the fitness of $s_{ndm}$ is proportional to Sum;
⇒ the weight of $s_{ndm}$ increases, as the weight of $s_{ndm}$ is inversely related to the fitness of $s_{ndm}$.
Similarly, if a data instance is added to $s_{ndm}$, then:
⇒ the sum of the distances of the data instances from the centroid of $s_{ndm}$ increases;
⇒ the fitness of $s_{ndm}$ increases;
⇒ the weight of $s_{ndm}$ decreases.
Therefore,

$w_{ndm} = 1 - \sum_{i=1}^{n} wd_i$ (16)

where 1 is the initial weight of $s_{ndm}$, n is the total number of data instances added to $s_{ndm}$, and $wd_i$ is the decrease in the weight of $s_{ndm}$ when the ith data instance was added to it. When all the data instances are removed from $s_{ndm}$,

$w_{ndm} = w_{ndm} + \sum_{i=1}^{n} wd_i = 1 - \sum_{i=1}^{n} wd_i + \sum_{i=1}^{n} wd_i = 1$ (17)

But according to Eq. (15), $w_{ndm}$ can never be 1 during the movement of $s_{ndm}$ in the solution space. Hence, our assumption is wrong, and we can conclude that a non-dominant male spider can be converted into a dominant male spider in the single centroid representation of SSO.

Theorem 2. The weight of the new spider resulting from the mating of a non-dominant male spider whose weight is greater than or equal to the average weight of the dominant male spiders will be better than that of at least one spider in the population.

Proof: Let $s_{ndm}$ be the non-dominant male spider whose weight became greater than or equal to the average weight of the dominant male spiders. Let $s_{f1}, s_{f2}, s_{f3}, \ldots, s_{fm}$ be the female spiders that participated in the mating, let $s_{new}$ be the resulting new spider of the mating operation, and let N be the total number of spiders in the colony. Assume that the theorem is false. This implies

$\sum_{i=1}^{N} \left( weight(s_i) \le weight(s_{new}) \;?\; 1 : 0 \right) = 0$ (18)

In other words, the total number of spiders whose weight is less than or equal to that of $s_{new}$ is zero. But according to the Roulette wheel method,

$s_{new} = \frac{\sum_{i=1}^{m} s_{fi} \times weight(s_{fi}) + s_{ndm} \times weight(s_{ndm})}{\sum_{i=1}^{m} weight(s_{fi}) + weight(s_{ndm})}$ (19)

⇒ $\lim_{weight(s_{fi}) \to 0,\; weight(s_{ndm}) \to 1} s_{new} = s_{ndm}$ with weight equal to 1, which is $s_{gbs}$ (since any spider whose weight is 1 is always $s_{gbs}$), and

$\lim_{weight(s_{fi}) \to 1,\; weight(s_{ndm}) \to 1} s_{new} = \frac{(m \text{ female spiders whose weight is } 1) + (s_{ndm} \text{ whose weight is } 1)}{m + 1} = \frac{(m + 1) \times s_{gbs}}{m + 1} = s_{gbs}$

So, when $weight(s_{fi})$ tends to 0 and $weight(s_{ndm})$ tends to 1, $s_{new}$ becomes $s_{gbs}$; when $weight(s_{fi})$ tends to 1 and $weight(s_{ndm})$ tends to 1, $s_{new}$ becomes $s_{gbs}$; and similarly, when $weight(s_{fi})$ tends to 1 and $weight(s_{ndm})$ tends to 0, $s_{new}$ becomes $s_{gbs}$. Substituting $s_{gbs}$ in place of $s_{new}$ in Eq. (18),

$\sum_{i=1}^{N} \left( weight(s_i) \le weight(s_{gbs}) \;?\; 1 : 0 \right) = 0$ (20)
According to Eq. (20), the number of spiders whose weight is less than or equal to the weight of $s_{gbs}$ is zero. But according to the definition of $s_{gbs}$, its weight is greater than or equal to the weights of all remaining spiders, so there are spiders whose weights are less than or equal to the weight of $s_{gbs}$. Therefore Eq. (20) is false. Hence, our assumption is wrong, and we can conclude that the weight of $s_{new}$ produced by $s_{ndm}$ is greater than that of at least one spider in the population.

RESULTS
The proposed algorithm and the algorithms used in the comparison were implemented in the Java Runtime Environment, version 1.7.0_51, and the experiments were run on an Intel Xeon E3-1270 v3 CPU with a 3.50 GHz processor and 160 GB RAM. The Windows 7 Professional operating system was used.

Applying SSODCSC on Patent datasets
At first, we applied SSODCSC to six Patent corpus datasets. The description of the datasets is given in Table 2. Patent corpus5000 contains 5,000 text documents with technical descriptions of patents that belong to 50 different classes. Each class has exactly 100 text documents. Each text document contains only a technical description of the patent. All text documents were prepared using the ASCII format.

Table 2. Description of Patent corpus5000 datasets.
                          Patent corpus1  Patent corpus2  Patent corpus3  Patent corpus4  Patent corpus5  Patent corpus6
Number of text documents  100             150             200             250             300             350
Number of clusters        6               7               9               9               8               7

As SSODCSC returned the K best spiders, the SICD of the clusters in those spiders was calculated. Table 3 specifies the clustering results of SSODCSC when applied to the Patent corpus datasets; for each dataset, the SICD value, cosine similarity value, F-measure value, and accuracy obtained are specified.

Table 3. Clustering results of SSODCSC: Patent corpus5000 datasets.
Dataset         SICD       Cosine similarity  F-measure  Accuracy
Patent corpus1  10,263.55  0.8643             0.8666     87.53
Patent corpus2  12,813.98  0.7517             0.7611     79.24
Patent corpus3  16,600.41  0.7123             0.7316     74.29
Patent corpus4  20,580.11  0.9126             0.9315     94.05
Patent corpus5  23,163.24  0.8143             0.8255     83.17
Patent corpus6  28,426.86  0.8551             0.8703     86.25

Table 4 specifies the relationship between the SICD values and the number of iterations. Lower SICD values indicate a higher clustering quality. It was found that as we increased the number of iterations, the SICD decreased and, thereby, the clustering quality increased.

Figure 12. Flowchart of SSODCSC.

To find the distance between data instances, we used the Euclidean distance function and the Manhattan distance function. Data instances having small differences were placed in the same cluster by the Euclidean distance function, as it ignores small differences.
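Tables 3 and 5 also report cosine similarity values. The text does not spell out how these values are aggregated, so the sketch below assumes a common choice: the average cosine similarity between each TF-IDF vector and the centroid of its cluster, with higher values indicating tighter clusters.

```java
// Sketch of an average cosine similarity measure; the aggregation over the
// dataset is an assumption, not a procedure stated in the paper.
public class CosineSimilarity {
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (int d = 0; d < a.length; d++) {
            dot += a[d] * b[d];
            na += a[d] * a[d];
            nb += b[d] * b[d];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // vectors[i] is the TF-IDF vector of instance i; assignment[i] is the
    // index of its cluster; centroids[c] is the centroid of cluster c.
    static double averageCosine(double[][] vectors, double[][] centroids, int[] assignment) {
        double sum = 0.0;
        for (int i = 0; i < vectors.length; i++) {
            sum += cosine(vectors[i], centroids[assignment[i]]);
        }
        return sum / vectors.length;
    }
}
```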
Table 4. Relationship between SICD and number of iterations: SSODCSC: Patent corpus5000 datasets.
Dataset         100 iterations  150 iterations  200 iterations  250 iterations  300 iterations
Patent corpus1  27,500.23       21,256.45       16,329.59       13,260.72       10,263.55
Patent corpus2  23,464.44       21,501.16       17,467.15       15,254.33       12,813.98
Patent corpus3  25,731.05       22,150.15       19,456.25       18,204.42       16,600.41
Patent corpus4  31,189.46       28,506.72       27,155.68       24,638.83       20,580.11
Patent corpus5  36,124.30       33,854.35       30,109.52       26,138.59       23,163.24
Patent corpus6  41,201.22       37,367.33       33,632.63       31,007.25       28,426.86
Note: The best values are specified in bold.

It was found that SSODCSC produced slightly better clustering results with the Euclidean distance function, as shown in Table 5.

Table 5. Comparison between distance functions: SSODCSC: Patent corpus5000 datasets.
                Euclidean distance function       Manhattan distance function
Dataset         Accuracy  Avg. cosine similarity  Accuracy  Avg. cosine similarity
Patent corpus1  87.53     0.8643                  82.05     0.8198
Patent corpus2  79.24     0.7517                  73.33     0.7344
Patent corpus3  74.29     0.7123                  68.03     0.6995
Patent corpus4  94.05     0.9126                  85.27     0.8637
Patent corpus5  83.17     0.8143                  76.49     0.7743
Patent corpus6  86.25     0.8551                  80.46     0.8142
Note: The best values are specified in bold.

Table 6 specifies the comparison between the clustering algorithms with respect to SICD values, and Table 7 specifies the comparison with respect to accuracy. SSODCSC produces better accuracy for all datasets; the overall percentage increase in the accuracy is approximately 13%.

Table 6. Comparison between clustering algorithms in terms of SICD: Patent corpus5000 datasets.
Dataset         K-means    PSO        GA         ABC        IBCO       ACO        SMSSO      BFGSA      SOS        SSO        SSODCSC
Patent corpus1  13,004.21  13,256.55  13,480.76  13,705.09  14,501.76  14,794.09  12,884.53  13,250.71  13,024.83  12,159.98  10,263.55
Patent corpus2  15,598.25  15,997.44  16,044.05  15,800.55  16,895.58  17,034.29  14,057.22  16,842.83  15,803.19  14,809.66  12,813.98
Patent corpus3  20,007.12  21,255.77  23,903.11  24,589.19  19,956.44  19,543.05  18,183.14  21,259.03  19,045.42  18,656.93  16,600.41
Patent corpus4  24,175.19  25,023.52  27,936.76  28,409.58  24,498.32  25,759.48  23,637.83  25,109.06  24,264.31  23,447.12  20,580.11
Patent corpus5  31,064.62  29,879.76  31,007.15  31,588.66  27,442.28  30,015.64  28,268.55  30,129.24  29,176.48  26,289.88  23,163.24
Patent corpus6  29,846.53  32,226.51  33,509.84  34,185.35  31,993.79  32,753.55  30,005.81  32,208.31  31,804.89  31,615.35  28,426.86
Note: The best values are specified in bold.

Table 7. Comparison of clustering algorithms in terms of accuracy: Patent corpus5000 datasets.
Dataset         K-means  PSO    GA     ABC    IBCO   ACO    SMSSO  BFGSA  SOS    SSO    SSODCSC
Patent corpus1  68.29    57.97  57.06  58.26  56.08  54.15  68.28  76.03  49.14  70.22  87.53
Patent corpus2  70.21    69.38  67.15  68.57  67.88  67.05  62.25  69.92  60.05  69.45  79.24
Patent corpus3  65.15    64.95  62.99  63.25  67.09  66.98  51.19  68.28  64.03  67.69  74.29
Patent corpus4  64.93    61.03  58.78  59.11  58.12  69.49  55.28  68.87  62.49  71.10  84.05
Patent corpus5  69.72    57.38  55.80  56.07  44.67  68.05  61.51  64.62  68.55  71.16  83.17
Patent corpus6  58.35    62.59  60.65  61.47  54.95  69.51  64.63  69.55  72.01  70.29  85.25
Note: The best values are specified in bold.

The silhouette coefficient SC of a data instance $d_i$ can be calculated using Eq. (21),

$SC = \frac{b - a}{\max(a, b)}$ (21)

where a is the average of the distances between data instance $d_i$ and the other data instances present in its containing cluster, and b is the minimum of the distances between data instance $d_i$ and the data instances present in other clusters.
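A minimal sketch of Eq. (21) for a single data instance follows. It uses the definition of b given above (the minimum distance to instances of other clusters) and assumes the instance is stored by reference in its own cluster's list.

```java
import java.util.List;

// Minimal sketch of the silhouette coefficient of one data instance, Eq. (21):
// a = mean distance to the other members of its own cluster,
// b = minimum distance to instances of any other cluster.
public class Silhouette {
    static double coefficient(double[] di, int own, List<List<double[]>> clusters) {
        double a = 0.0;
        int others = 0;
        for (double[] x : clusters.get(own)) {
            if (x != di) { a += dist(di, x); others++; }  // skip the instance itself
        }
        a = (others == 0) ? 0.0 : a / others;
        double b = Double.POSITIVE_INFINITY;
        for (int c = 0; c < clusters.size(); c++) {
            if (c == own) continue;
            for (double[] x : clusters.get(c)) {
                b = Math.min(b, dist(di, x));
            }
        }
        return (b - a) / Math.max(a, b);  // SC = (b - a) / max(a, b)
    }

    private static double dist(double[] p, double[] q) {
        double sq = 0.0;
        for (int d = 0; d < p.length; d++) {
            double diff = p[d] - q[d];
            sq += diff * diff;
        }
        return Math.sqrt(sq);
    }
}
```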
The range of the silhouette coefficient is [-1, 1]; the closer it is to 1, the better the clustering results. Table 8 specifies the comparison between the clustering algorithms with respect to the average silhouette coefficient values of the datasets. SSODCSC produces better average silhouette coefficient values for all Patent corpus datasets.

Table 8. Average silhouette coefficient value: Patent corpus datasets.
Dataset         K-means  PSO     GA      ABC     IBCO    ACO     SMSSO   BFGSA   SOS     SSO     SSODCSC
Patent corpus1  0.5043   0.5990  0.4001  0.4109  0.4844  0.3184  0.4908  0.7015  0.5804  0.7159  0.7405
Patent corpus2  0.5107   0.6220  0.5922  0.3906  0.4335  0.4577  0.5388  0.6799  0.6496  0.6884  0.7797
Patent corpus3  0.4498   0.4411  0.4804  0.4188  0.5913  0.4990  0.6588  0.6731  0.6005  0.6691  0.7551
Patent corpus4  0.3466   0.6618  0.5269  0.4401  0.4548  0.4018  0.6106  0.7177  0.5985  0.6994  0.8009
Patent corpus5  0.4082   0.3933  0.4005  0.4905  0.3997  0.4833  0.6933  0.7269  0.6208  0.7280  0.7648
Patent corpus6  0.3225   0.4119  0.5507  0.5055  0.4883  0.4397  0.7045  0.6894  0.7328  0.7448  0.8397
Note: The best values are specified in bold.

From Figs. 13 and 14, it is obvious that SSODCSC produces the largest inter-cluster distances and the smallest ICD for the Patent corpus datasets.

Figure 13. Inter-cluster distances: Patent corpus5000 datasets.
Figure 14. Intra-cluster distances: Patent corpus5000 datasets.

Applying SSODCSC on UCI datasets
We applied SSODCSC to UCI datasets as well. The description of the datasets is given in Table 9. Table 10 specifies the relationship between the SICD values and the number of iterations. As we increase the number of iterations, the SICD is reduced. For the Iris dataset, the SICD value is 125.7045 at 100 iterations, but as we increase the number of iterations, the SICD value of the clustering result decreases until it reaches 95.2579 at 300 iterations. It remains at 95.2579 after 300 iterations, which indicates that SSODCSC converges within 300 iterations. Table 11 specifies the best three spiders for the Iris dataset. We initialized a solution space with 50 spiders, among which the first 30 spiders were females and the remaining were males. Our proposed algorithm returned spider 21, spider 35, and spider 16. Spider 21 and spider 16 were females, and spider 35 was a male spider. The centroids of these spiders were (6.7026, 3.0001, 5.4820, 2.018), (5.193, 3.5821, 1.4802, 0.2402), and (5.8849, 2.8009, 4.4045, 1.4152), respectively. These centroids have four values, as the Iris dataset consists of four attributes. The sum of the distances between the 150 data instances present in the Iris dataset and their nearest centroids in Table 11 was found to be 95.2579, as shown in Table 10. Table 12 specifies the best six spiders for the Vowel dataset. Our proposed algorithm returned spider 10, spider 25, spider 42, spider 22, spider 48, and spider 5. Spider 10, spider 25, spider 22, and spider 5 were females, and spider 42 and spider 48 were male spiders.
The sum of the distances between the 871 data instances present in the Vowel dataset and their nearest centroids in Table 12 was found to be 146,859.1084, as shown in Table 10. Table 13 specifies the best three spiders for the CMC dataset. Our proposed algorithm returned spider 23, spider 38, and spider 16, among which spider 23 and spider 16 were females and spider 38 was a male spider. The centroids of these spiders are specified in Table 13; they have nine values, as the CMC dataset consists of nine attributes. The sum of the distances between the 1,473 data instances present in the CMC dataset and their nearest centroids in Table 13 was found to be 5,501.2642, as shown in Table 10. Tables 14 and 15 specify the best spiders and their centroids for the Glass and Wine datasets, respectively. The sum of the distances between the 214 data instances present in the Glass dataset and their nearest centroids in Table 14 was found to be 207.2091, as shown in Table 10. Likewise, the sum of the distances between the 178 data instances present in the Wine dataset and their nearest centroids in Table 15 is 16,270.1427, as shown in Table 10.

Table 9. Description of UCI datasets.
Dataset   Number of classes  Number of attributes  Number of instances
Iris      3                  4                     150
Wine      3                  13                    178
Glass     6                  9                     214
Vowel     6                  3                     871
Cancer    2                  9                     683
CMC       3                  9                     1,473
Haberman  2                  3                     306
Bupa      2                  6                     345

Table 10. Relationship between SICD and number of iterations: SSODCSC: UCI datasets.
Dataset  100 iterations  150 iterations  200 iterations  250 iterations  300 iterations
Iris     125.7045        118.9034        107.0844        100.3683        95.2579
Vowel    147,257.5582    147,001.1863    146,948.7469    146,893.7569    146,859.1084
CMC      6,206.8186      6,127.4439      5,986.2964      5,574.6241      5,501.2642
Glass    387.5241        340.3885        301.0084        258.3053        207.2091
Wine     17,358.0946     17,150.6084     16,998.4387     16,408.5572     16,270.1427
Note: The best values are specified in bold.

Table 11. Best three spiders of Iris dataset: SSODCSC.
Best spiders  Dimension 1  Dimension 2  Dimension 3  Dimension 4
Spider 21     6.7026       3.0001       5.482        2.018
Spider 35     5.193        3.5821       1.4802       0.2402
Spider 16     5.8849       2.8009       4.4045       1.4152

Table 12. Best six spiders of Vowel dataset: SSODCSC.
Best spiders  Dimension 1  Dimension 2  Dimension 3
Spider 10     508.4185     1,838.7035   2,558.1605
Spider 25     408.0024     1,013.0002   2,310.9836
Spider 42     624.0367     1,308.0523   2,333.8023
Spider 22     357.1078     2,292.1580   2,976.9458
Spider 48     377.2070     2,150.0418   2,678.0003
Spider 5      436.8024     993.0034     2,659.0012

Table 13. Best three spiders of CMC dataset: SSODCSC.
Best spiders  Dim 1    Dim 2   Dim 3   Dim 4   Dim 5   Dim 6   Dim 7   Dim 8   Dim 9
Spider 23     24.4001  3.0699  3.4986  1.8021  0.9303  0.8206  2.2985  2.9584  0.0271
Spider 38     43.7015  2.9929  3.4602  3.4568  0.8209  0.8330  1.8215  3.4719  3.306
Spider 16     33.4894  3.0934  3.5599  3.5844  0.8015  0.6629  2.169   3.2901  0.0704
Table 16 specifies the average CPU time per iteration (in seconds) when the clustering algorithms were applied to the CMC dataset. It was found that SSODCSC produces clustering results with the shortest average CPU time per iteration. Table 17 specifies the average CPU time per iteration (in seconds) when the clustering algorithms were applied to the Vowel dataset. It was found that K-means produces clustering results with the shortest average CPU time per iteration; SSODCSC has the second shortest average CPU time per iteration.

Table 14. Best six spiders of Glass dataset: SSODCSC.
Best spiders  Dim 1   Dim 2    Dim 3    Dim 4   Dim 5    Dim 6   Dim 7   Dim 8    Dim 9
Spider 13     1.5201  14.6023  0.06803  2.2617  73.3078  0.0094  8.7136  1.01392  0.0125
Spider 29     1.5306  13.8005  3.5613   0.9603  71.8448  0.1918  9.5572  0.0827   0.0071
Spider 35     1.5169  13.3158  3.6034   1.4236  72.7014  0.5771  8.2178  0.0076   0.0321
Spider 42     1.4138  13.0092  0.0036   3.0253  70.6672  6.2470  6.9489  0.0078   0.0004
Spider 48     1.5205  12.8409  3.4601   1.3091  73.0315  0.6178  8.5902  0.0289   0.0579
Spider 7      1.5214  13.0315  0.2703   1.5193  72.7601  0.3615  11.995  0.0472   0.0309

Table 15. Best three spiders of Wine dataset: SSODCSC.
Best spiders  Dim 1  Dim 2  Dim 3  Dim 4  Dim 5   Dim 6  Dim 7  Dim 8  Dim 9  Dim 10  Dim 11  Dim 12  Dim 13
Spider 33     12.89  2.12   2.41   19.51  98.89   2.06   1.46   0.47   1.52   5.41    0.89    2.15    686.95
Spider 4      12.68  2.45   2.41   21.31  92.41   2.13   1.62   0.45   1.14   4.92    0.82    2.71    463.71
Spider 3      13.37  2.31   2.62   17.38  105.08  2.85   3.28   0.29   2.67   5.29    1.04    3.39    1,137.5

Table 16. Comparison of clustering algorithms in terms of average CPU time per iteration (in seconds): CMC dataset.
         K-means  PSO     IBCO    ACO     SMSSO   BFGSA   SOS     SSO     SSODCSC
Best     0.0041   0.0151  0.0168  0.0186  0.0097  0.0148  0.0116  0.0065  0.0048
Average  0.0068   0.0192  0.0205  0.0215  0.0126  0.0172  0.0129  0.0082  0.0055
Worst    0.0072   0.0235  0.0245  0.0278  0.0138  0.0194  0.0144  0.0097  0.0069

Table 17. Comparison of clustering algorithms in terms of average CPU time per iteration (in seconds): Vowel dataset.
         K-means  PSO     IBCO    ACO     SMSSO   BFGSA   SOS     SSO     SSODCSC
Best     0.0125   0.1263  0.1455  0.1602  0.0188  0.0206  0.0215  0.0178  0.0145
Average  0.0136   0.1923  0.2034  0.1698  0.0204  0.0218  0.0228  0.0195  0.0172
Worst    0.0155   0.2056  0.2245  0.1893  0.0219  0.0231  0.0239  0.0227  0.0198

Table 18 specifies the F-measure values obtained by the clustering algorithms when applied to the Iris, Glass, Vowel, Wine, Cancer, and CMC datasets. It is evident that SSODCSC produces the best F-measure values; the overall percentage increase in the F-measure value is approximately 10%. We computed the ICD and inter-cluster distances of the resultant clusters of the clustering algorithms when applied to the UCI datasets. From Figs. 15 to 17, it is obvious that SSODCSC produces the smallest ICD for the UCI datasets. From Figs. 18 and 19, it can be concluded that SSODCSC produces the largest inter-cluster distances for the UCI datasets, as compared with the other clustering algorithms. Table 19 specifies the comparison between the clustering algorithms with respect to the average silhouette coefficient values of the UCI datasets. SSODCSC produces better average silhouette coefficient values for the UCI datasets as well.
The average silhouette coefficient values produced by SSODCSC are 0.7505, 0.6966, 0.7889, 0.7148, 0.8833, and 0.6264 for the Wine, Cancer, CMC, Vowel, Iris, and Glass datasets, respectively.

Table 19 Average silhouette coefficient values: UCI datasets.
Dataset  K-means  PSO     GA      ABC     IBCO    ACO     SMSSO   BFGSA   SOS     SSO     SSODCSC
Wine     0.6490   0.5629  0.5008  0.5226  0.4151  0.4488  0.6003  0.6109  0.6417  0.6885  0.7505
Cancer   0.5894   0.6228  0.5277  0.5848  0.4492  0.4852  0.5995  0.5651  0.5999  0.6107  0.6966
CMC      0.3733   0.3281  0.3162  0.3726  0.3447  0.4984  0.4805  0.4900  0.4788  0.5111  0.7889
Vowel    0.4588   0.4079  0.4277  0.4011  0.6212  0.4105  0.5822  0.6255  0.6020  0.6492  0.7148
Iris     0.7099   0.7165  0.4388  0.4736  0.6043  0.5059  0.6253  0.5796  0.6511  0.6333  0.8833
Glass    0.3661   0.2805  0.2996  0.2070  0.5466  0.2900  0.4896  0.4155  0.4011  0.4419  0.6264
Note: SSODCSC produces the best value for every dataset.

Figure 15 Intra-cluster distances: UCI datasets: Iris and Glass datasets. The figure compares the clustering algorithms based on intra-cluster distances when applied on the Iris and Glass datasets.
Figure 16 Intra-cluster distances: UCI datasets: Wine and Bupa datasets. The figure compares the intra-cluster distances of the clustering algorithms when applied on the Wine and Bupa datasets.
Figure 17 Intra-cluster distances: UCI datasets: Haberman, Cancer, and CMC datasets. The figure compares the intra-cluster distances of the clustering algorithms when applied on the Haberman, Cancer, and CMC datasets.
Figure 18 Inter-cluster distances: UCI datasets: Iris, Haberman, Cancer, and CMC datasets. The figure compares the inter-cluster distances of the clustering algorithms when applied on the Iris, Haberman, Cancer, and CMC datasets.
Figure 19 Inter-cluster distances: UCI datasets: Glass, Wine, and Bupa datasets. The figure compares the inter-cluster distances of the clustering algorithms when applied on the Glass, Wine, and Bupa datasets.

Table 18 Comparison of clustering algorithms in terms of F-measure values: UCI datasets.
Dataset  K-means  PSO    GA     ABC    IBCO   ACO    SMSSO  BFGSA  SOS    SSO    SSODCSC
Wine     82.25    78.79  70.25  72.48  63.34  64.88  60.10  67.88  63.78  78.42  94.98
Cancer   76.95    83.42  71.38  70.55  62.98  60.34  61.95  62.03  64.80  74.34  96.49
CMC      50.25    51.49  55.15  57.79  51.92  50.49  51.98  52.92  52.00  51.45  61.01
Vowel    66.10    68.11  60.69  64.74  62.12  68.13  54.00  68.68  65.56  70.85  90.46
Iris     94.43    90.95  62.41  62.58  60.43  71.95  64.43  62.47  62.43  85.81  96.95
Glass    52.88    44.94  45.01  43.72  54.66  43.36  55.48  42.21  44.46  58.54  70.92
Note: SSODCSC produces the best value for every dataset.
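The paper does not restate the exact F-measure formula it uses, so the sketch below assumes the common class-weighted best-match definition for clustering quality; the function and variable names are ours. For a labeled dataset such as Iris, true_labels are the known classes and cluster_labels are the assignments returned by a clustering algorithm.

```python
# A sketch of a commonly used clustering F-measure (an assumption; the paper
# does not spell out its exact variant): for each true class, take the best
# F score over all clusters and weight it by the class size.
import numpy as np

def clustering_f_measure(true_labels, cluster_labels):
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    n = len(true_labels)
    total = 0.0
    for cls in np.unique(true_labels):
        in_class = true_labels == cls
        best = 0.0
        for clu in np.unique(cluster_labels):
            in_cluster = cluster_labels == clu
            overlap = np.sum(in_class & in_cluster)
            if overlap == 0:
                continue
            p = overlap / np.sum(in_cluster)  # precision of cluster for class
            r = overlap / np.sum(in_class)    # recall of cluster for class
            best = max(best, 2 * p * r / (p + r))
        total += np.sum(in_class) / n * best
    return 100 * total  # expressed as a percentage, as in Table 18
```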
Statistical analysis: Patent corpus datasets
To show the significance of the proposed algorithm, we applied a one-way ANOVA test on the accuracy values shown in Table 7. The sum, sum squared, mean, and variance of the accuracy values for each clustering algorithm are specified in Table 20.

Table 20 Statistical results of the one-way ANOVA test when applied on the accuracy values returned by the clustering algorithms for the Patent corpus5000 datasets.
Algorithm  Sum      Sum squared  Mean    Variance
K-means    396.650  26,318.996   66.108  19.425
PSO        373.300  23,327.241   62.217  20.352
GA         362.430  21,979.857   60.405  17.455
ABC        366.730  22,513.033   61.122  19.577
IBCO       348.790  20,646.575   58.132  74.166
ACO        395.230  26,205.548   65.872  34.218
SMSSO      363.140  22,174.032   60.523  39.118
BFGSA      417.270  29,087.550   69.545  13.701
SOS        376.270  23,910.126   62.712  62.721
SSO        419.910  29,395.727   69.985  1.665
SSODCSC    493.530  40,708.696   82.255  22.677

Degrees of freedom: df1 = 10, df2 = 55.
Sum of squares for treatment (SSTR) = 2,761.313.
Sum of squares for error (SSE) = 1,625.378.
Total sum of squares (SST = SSE + SSTR) = 4,386.691.
Mean square treatment (MSTR = SSTR/df1) = 276.131.
Mean square error (MSE = SSE/df2) = 29.552.
F (= MSTR/MSE) = 9.344.
Probability of the calculated F = 0.0000000080.
Critical F (5%, one-tailed) = 2.008.
We can reject the null hypothesis, as the calculated F (9.344) is greater than the critical F (2.008).
Post-hoc analysis using Tukey's honestly significant difference (HSD) method: assuming a significance level of 5%, the studentized range value for df1 = 10 and df2 = 55 is 4.663, which gives an HSD of 10.349. The mean accuracy of SSODCSC differs significantly from those of K-means (difference 16.14667), PSO (20.03833), genetic algorithm (GA) (21.85000), artificial bee colony (ABC) (21.13333), improved bee colony optimization (IBCO) (24.12333), ACO (16.38333), SMSSO (21.73167), BFGSA (12.71000), SOS (19.54333), and SSO (12.27000), as each difference is greater than 10.349. Therefore, it may be concluded that SSODCSC significantly differs from the other clustering algorithms.

Statistical analysis: UCI datasets
To show the significance of the proposed algorithm, we applied a one-way ANOVA test on the F-measure values shown in Table 18.

Table 21 Statistical results of the one-way ANOVA test when applied on the F-measure values returned by the clustering algorithms for the UCI datasets.
Algorithm  Sum      Sum squared  Mean    Variance
K-means    422.860  31,293.957   70.477  298.439
PSO        417.700  30,748.459   69.617  333.915
GA         364.890  22,675.874   60.815  97.018
ABC        371.860  23,589.299   61.977  108.531
IBCO       355.450  21,172.517   59.242  23.013
ACO        359.150  22,098.159   59.858  120.008
SMSSO      347.940  20,296.988   57.990  23.990
BFGSA      356.190  21,657.069   59.365  102.370
SOS        353.030  21,143.238   58.939  74.308
SSO        419.410  30,133.245   69.902  163.157
SSODCSC    510.810  44,665.701   85.135  235.578
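As an illustration of how these statistics can be reproduced, here is a short Python sketch (our code, not the authors'). It runs the one-way ANOVA and the Tukey HSD comparison on the F-measure values of Table 18; the studentized-range value 4.663 is taken from the text, and the computed F (approximately 2.86) and HSD (approximately 22.82) agree with the figures reported below.

```python
# A sketch (ours, not the authors' code) reproducing the one-way ANOVA and
# Tukey HSD procedure on the per-dataset F-measure values of Table 18.
import numpy as np
from scipy import stats

# Columns of Table 18, in the row order Wine, Cancer, CMC, Vowel, Iris, Glass.
f_measures = {
    "K-means": [82.25, 76.95, 50.25, 66.10, 94.43, 52.88],
    "PSO":     [78.79, 83.42, 51.49, 68.11, 90.95, 44.94],
    "GA":      [70.25, 71.38, 55.15, 60.69, 62.41, 45.01],
    "ABC":     [72.48, 70.55, 57.79, 64.74, 62.58, 43.72],
    "IBCO":    [63.34, 62.98, 51.92, 62.12, 60.43, 54.66],
    "ACO":     [64.88, 60.34, 50.49, 68.13, 71.95, 43.36],
    "SMSSO":   [60.10, 61.95, 51.98, 54.00, 64.43, 55.48],
    "BFGSA":   [67.88, 62.03, 52.92, 68.68, 62.47, 42.21],
    "SOS":     [63.78, 64.80, 52.00, 65.56, 62.43, 44.46],
    "SSO":     [78.42, 74.34, 51.45, 70.85, 85.81, 58.54],
    "SSODCSC": [94.98, 96.49, 61.01, 90.46, 96.95, 70.92],
}

groups = list(f_measures.values())
f_stat, p_value = stats.f_oneway(*groups)       # df1 = 10, df2 = 55
print(f"F = {f_stat:.3f}, p = {p_value:.10f}")  # F is approximately 2.86

# Tukey HSD with the studentized-range value quoted in the text
# (q = 4.663 for df1 = 10, df2 = 55 at the 5% level); n = 6 values per group.
n = len(groups[0])
mse = np.mean([np.var(g, ddof=1) for g in groups])  # valid for equal group sizes
hsd = 4.663 * np.sqrt(mse / n)                      # approximately 22.82
mean_ssodcsc = np.mean(f_measures["SSODCSC"])
for name, values in f_measures.items():
    if name != "SSODCSC" and mean_ssodcsc - np.mean(values) > hsd:
        print(f"SSODCSC differs significantly from {name}")
```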
The sum, sum squared, mean, and variance of the F-measure values for each clustering algorithm are specified in Table 21.
Degrees of freedom: df1 = 10, df2 = 55.
Sum of squares for treatment (SSTR) = 4,113.431.
Sum of squares for error (SSE) = 7,901.638.
Total sum of squares (SST = SSE + SSTR) = 12,015.069.
Mean square treatment (MSTR = SSTR/df1) = 411.343.
Mean square error (MSE = SSE/df2) = 143.666.
F (= MSTR/MSE) = 2.863.
Probability of the calculated F = 0.0060723031.
Critical F (5%, one-tailed) = 2.008.
We can reject the null hypothesis, as the calculated F (2.863) is greater than the critical F (2.008).
Post-hoc analysis using Tukey's honestly significant difference (HSD) method: assuming a significance level of 5%, the studentized range value for df1 = 10 and df2 = 55 is 4.663, which gives an HSD of 22.819. The mean F-measure of SSODCSC differs significantly from those of GA (difference 24.32000), ABC (23.15833), IBCO (25.89333), ACO (25.27667), SMSSO (27.14500), BFGSA (25.77000), and SOS (26.29667), as each difference is greater than 22.819; the differences for K-means, PSO, and SSO fall below this threshold. Therefore, SSODCSC significantly differs from most of the other clustering algorithms when applied on the UCI datasets.

DISCUSSION
We applied our proposed algorithm on the Patent corpus datasets (PatentCorpus5000, 2012) and the UCI datasets (Lichman, 2013). If the population size is too small, the search is unlikely to reach the optimal solution; if it is too large, the chance of reaching the optimal solution improves, but at the cost of higher computational complexity. We therefore used 50 spiders to obtain good solutions without incurring excessive computational cost. Among these spiders, 70% were female and the remainder were male. The TP value was set to 0.7. We compared the clustering results of SSODCSC with those of K-means, PSO, GA, ABC optimization, ACO, IBCO (Forsati, Keikha & Shamsfard, 2015), SMSSO, BFGSA, SOS, and an SSO implementation in which each spider is a collection of K centroids, and found that SSODCSC produces better clustering results.
In order to conduct the experiments, we formed six datasets from the Patent corpus5000 data repository: Patent corpus1 (100 text documents from six classes), Patent corpus2 (150 documents from seven classes), Patent corpus3 (200 documents from nine classes), Patent corpus4 (250 documents from nine classes), Patent corpus5 (300 documents from eight classes), and Patent corpus6 (350 documents from seven classes).
The clustering quality can be validated using ICD and inter-cluster distances; a smaller intra-cluster distance and a larger inter-cluster distance are desirable for any clustering algorithm. We computed the ICD and inter-cluster distances of the resultant clusters when the clustering algorithms were applied on the Patent corpus datasets and the UCI datasets, and found that SSODCSC produces better results than the other clustering algorithms.
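Note that before any of these comparisons, the Patent corpus text files must first be represented as data vectors. The following hypothetical preprocessing sketch (our assumption of a typical pipeline, not the authors' exact code, with the directory name patent_corpus1 purely illustrative) uses scikit-learn's TfidfVectorizer for this step.

```python
# A hypothetical preprocessing step (an assumption of a typical pipeline,
# not the authors' exact code): representing patent text documents as
# TF-IDF data vectors before clustering. The directory name is illustrative.
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [p.read_text(errors="ignore")
        for p in sorted(Path("patent_corpus1").glob("*.txt"))]
vectorizer = TfidfVectorizer(stop_words="english")  # standard TF-IDF weighting
X = vectorizer.fit_transform(docs).toarray()  # one data vector per document
# X can now be passed to SSODCSC or any of the compared algorithms.
```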
We compared the clustering algorithms on the basis of the average CPU time per iteration (in seconds) and found that SSODCSC has the shortest average CPU time per iteration on most of the datasets. The reasons for this are its ability to produce a better solution space after every iteration, to initialize the solution space in less time, to compute the fitness values of the spiders in less time, and to find the next positions of the spiders in less time.
We compared the clustering algorithms on the basis of the average silhouette coefficient value and found that SSODCSC produces better average silhouette coefficient values for both the Patent corpus datasets and the UCI datasets.
We conducted a one-way ANOVA test separately on the clustering results for the Patent corpus datasets and the UCI datasets to show the superiority and applicability of the proposed method with respect to text datasets and feature-based datasets.

CONCLUSION
In this paper, we proposed a novel implementation of SSO for data clustering using a single centroid representation and enhanced mating, in which non-dominant male spiders are allowed to mate with female spiders by converting them into dominant males. As a result, the explorative power of the algorithm is increased, and the chance of reaching a global optimum is thereby improved. We compared SSODCSC with other state-of-the-art algorithms on the Patent corpus text datasets and the UCI datasets and found that it produces better clustering results. We conducted a one-way ANOVA test to show its superiority and applicability with respect to text datasets and feature-based datasets. Future work will include a study of the applicability of SSODCSC to data classification in brain-computer interfaces.

ADDITIONAL INFORMATION AND DECLARATIONS

Funding
The authors received no funding for this work.

Competing Interests
The authors declare that they have no competing interests.

Author Contributions
• Ravichandran Thalamala conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, prepared figures and/or tables, performed the computation work, authored or reviewed drafts of the paper, approved the final draft, and studied the literature.
• Janet Barnabas conceived and designed the experiments, analyzed the data, and contributed reagents/materials/analysis tools.
• A.V. Reddy conceived and designed the experiments, analyzed the data, and contributed reagents/materials/analysis tools.

Data Availability
The following information was supplied regarding data availability:
The two datasets are available in the Supplemental Files. The proposed algorithm was applied on the text data files of the PATENTCORPUS5000 datasets and the feature/attribute-based UCI datasets.

Supplemental Information
Supplemental information for this article can be found online at http://dx.doi.org/10.7717/peerj-cs.201#supplemental-information.

REFERENCES
Aloise D, Deshpande A, Hansen P, Popat P. 2009. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning 75(2):245–248 DOI 10.1007/s10994-009-5103-0.
Alswaitti M, Albughdadi M, Isa NAM. 2018. Density-based particle swarm optimization algorithm for data clustering. Expert Systems with Applications 91:170–186 DOI 10.1016/j.eswa.2017.08.050.
Beil F, Ester M, Xu X. 2002. Frequent term-based text clustering. In: Proceedings of the International Conference on Knowledge Discovery and Data Mining KDD '02, Alberta, 436–442.
Bernábe-Loranca B, Gonzalez-Velázquez R, Olivares-Benítez E, Ruiz-Vanoye J, Martínez-Flores J. 2014. Extensions to K-Medoids with balance restrictions over the cardinality of the partitions. Journal of Applied Research and Technology 12(3):396–408 DOI 10.1016/S1665-6423(14)71621-9.
Chandran TR, Reddy AV, Janet B. 2018. An effective implementation of social spider optimization for text document clustering using single cluster approach. In: 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), Coimbatore, 508–511.
Chen S, Sun T, Yang F, Sun H, Guan Y. 2018. An improved optimum-path forest clustering algorithm for remote sensing image segmentation. Computers & Geosciences 112:38–46 DOI 10.1016/j.cageo.2017.12.003.
Costa KAP, Pereira LAM, Nakamura RYM, Pereira CR, Papa JP, Falcão AX. 2015. A nature-inspired approach to speed up optimum-path forest clustering and its application to intrusion detection in computer networks. Information Sciences 294:95–108 DOI 10.1016/j.ins.2014.09.025.
Cuevas E, Cienfuegos M. 2014. A new algorithm inspired in the behavior of the social-spider for constrained optimization. Expert Systems with Applications 41(2):412–425 DOI 10.1016/j.eswa.2013.07.067.
Cuevas E, Cienfuegos M, Zaldívar D, Pérez-Cisneros M. 2013. A swarm optimization algorithm inspired in the behavior of the social-spider. Expert Systems with Applications 40(16):6374–6384 DOI 10.1016/j.eswa.2013.05.041.
De Andrade Silva J, Hruschka ER, Gama J. 2017. An evolutionary algorithm for clustering data streams with a variable number of clusters. Expert Systems with Applications 67:228–238 DOI 10.1016/j.eswa.2016.09.020.
Forsati R, Keikha A, Shamsfard M. 2015. An improved bee colony optimization algorithm with an application to document clustering. Neurocomputing 159:9–26.
Guha S, Rastogi R, Shim K. 1998. CURE: an efficient clustering algorithm for large databases. ACM SIGMOD Record 27(2):73–84 DOI 10.1145/276305.276312.
Han XH, Quan L, Xiong XY, Almeter M, Xiang J, Lan Y. 2017. A novel data clustering algorithm based on modified gravitational search algorithm. Engineering Applications of Artificial Intelligence 61:1–7 DOI 10.1016/j.engappai.2016.11.003.
Jothi R, Mohanty SK, Ojha A. 2018. Fast approximate minimum spanning tree based clustering algorithm. Neurocomputing 272:542–557 DOI 10.1016/j.neucom.2017.07.038.
Karypis G, Han E-H, Kumar V. 1999. Chameleon: hierarchical clustering using dynamic modeling.
Computer 32(8):68–75 DOI 10.1109/2.781637.
Lichman M. 2013. UCI Machine Learning Repository. Available at https://archive.ics.uci.edu/ml/datasets.php.
Mihai D, Mocanu M. 2015. Statistical considerations on the k-means algorithm. Annals of the University of Craiova, Mathematics and Computer Science Series 42(2):365–373.
Nayak J, Naik B, Kanungo DP, Behera HS. 2018. A hybrid elicit teaching learning based optimization with fuzzy c-means (ETLBO-FCM) algorithm for data clustering. Ain Shams Engineering Journal 9(3):379–393 DOI 10.1016/j.asej.2016.01.010.
Nayak J, Nanda M, Nayak K, Naik B, Behera HS. 2014. An improved firefly fuzzy C-means (FAFCM) algorithm for clustering real world data sets. In: Kumar Kundu M, Mohapatra D, Konar A, Chakraborty A, eds. Advanced Computing, Networking and Informatics, Vol. 1. Smart Innovation, Systems and Technologies, Vol. 27. Cham: Springer.
Ng RT, Han J. 2002. CLARANS: a method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering 14(5):1003–1016 DOI 10.1109/TKDE.2002.1033770.
PatentCorpus5000. 2012. Patent Corpus 5000 datasets. Available at http://www.ezcodesample.com/semanticsearchart/downloadsdatacorpus.html.
Rani Y, Rohil H. 2013. A study of hierarchical clustering algorithm. International Journal of Information and Computation Technology 3(11):1225–1232.
Shukla UP, Nanda SJ. 2016. Parallel social spider clustering algorithm for high dimensional datasets. Engineering Applications of Artificial Intelligence 56:75–90 DOI 10.1016/j.engappai.2016.08.013.
Steinley D. 2006. K-means clustering: a half-century synthesis. British Journal of Mathematical and Statistical Psychology 59(1):1–34 DOI 10.1348/000711005X48266.
Thalamala RC, Reddy AV, Janet B. 2018. A novel bio-inspired algorithm based on social spiders for improving performance and efficiency of data clustering. Journal of Intelligent Systems. Epub ahead of print 14 February 2018 DOI 10.1515/JISYS-2017-0178.
Zhang T, Ramakrishnan R, Livny M. 1996. BIRCH: an efficient data clustering method for very large databases. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, 103–114.
Zhou Y, Wu H, Luo Q, Abdel-Baset M. 2018. Automatic data clustering using nature-inspired symbiotic organism search algorithm. Knowledge-Based Systems 163:546–557 DOI 10.1016/j.knosys.2018.09.013.
Zhou Y, Zhou Y, Luo Q, Abdel-Basset M. 2017. A simplex method-based social spider optimization algorithm for clustering analysis. Engineering Applications of Artificial Intelligence 64:67–82 DOI 10.1016/j.engappai.2017.06.004.