key: cord-028657-q2ghtpd9
authors: Grass-Boada, Darian Horacio; Pérez-Suárez, Airel; Arco, Leticia; Bello, Rafael; Rosete, Alejandro
title: Overlapping Community Detection Using Multi-objective Approach and Rough Clustering
date: 2020-06-10
journal: Rough Sets
DOI: 10.1007/978-3-030-52705-1_31
sha: doc_id: 28657 cord_uid: q2ghtpd9

The detection of overlapping communities in Social Networks has been successfully applied in several contexts. Taking into account the high computational complexity of this problem, as well as the drawbacks of single-objective approaches, community detection has recently been addressed through Multi-objective Optimization Evolutionary Algorithms (MOEAs). One of the challenges is to attain a final solution from the set of non-dominated solutions obtained by the MOEAs. In this paper, an algorithm to build a covering of the network based on the principles of Rough Clustering is proposed. The experiments on synthetic networks showed that our proposal is promising and effective for overlapping community detection in social networks.

The Analysis of Social Networks has received a lot of attention due to its wide range of applications in several contexts [1]. Specifically, in Social Network Analysis, the Community Detection Problem (CDP) plays an important role [5]. Community detection in social networks aims to organize the nodes of the network into groups or communities, such that nodes belonging to the same community are densely interconnected but sparsely connected with the remaining nodes of the network [2]. Even though most community detection algorithms assume that communities are disjoint, according to Palla et al. [6], most real-world networks have an overlapping community structure, that is, a node can belong to more than one community.
On the other hand, since the community detection problem is NP-hard, most reported approaches use heuristics to search for sets of nodes that optimise an objective function capturing the intuition of community. These single-objective optimization approaches face two main difficulties: (a) the optimization of only one function confines the solution to a particular community structure, and (b) returning one single partition may not be suitable when the network has many potential structures. To overcome these problems, many community detection algorithms model the problem as a Multi-objective Optimization Problem and, specifically, use Multi-objective Optimization Evolutionary Algorithms (MOEAs) to solve it. Once the set of non-dominated solutions is obtained by a MOEA, one of the main challenges is to obtain a single final solution. Most of the proposed algorithms [5, 7-9] use internal criteria (e.g., the Modularity index [10]) or external criteria (e.g., Normalized Mutual Information (NMI) [3]) to select the final solution. The drawbacks of these approaches are that internal criteria often do not correspond to the objective functions used by the MOEAs, and external criteria require the ground truth of the network, which is not always known. Moreover, the final solutions selected by both approaches do not use the knowledge contained in the overlapping communities of the Pareto set obtained by the MOEAs.

Rough Set Theory (RST) may be used to evaluate the significance of attributes, to deal with inconsistent data, and to describe dependencies among attributes, to mention just some of its uses in machine learning and data mining [22]. The main advantage of RST in data analysis is that it does not need any preliminary or additional information about the data [17]. RST allows approximating a vague concept by a pair of exact concepts, called the lower and upper approximations.
The lower approximation is the set of objects definitely belonging to a vague concept, whereas the upper approximation is the set of objects possibly belonging to it [17]. The upper and lower approximations can also be used in a broader context such as clustering, denoted as Rough Clustering [13]. In our proposal, we focus on describing the relationship between the elements of the network (vertices) taking into consideration only their membership in the communities of the Pareto Set. Then, we use Rough Clustering to obtain a final covering of the network, which describes the communities by their lower and upper approximations. The lower approximation is the set of vertices belonging to the community without uncertainty, whereas the upper approximation is the set of vertices possibly belonging to this community and therefore located at its boundary. Hence, the selected final solution uses the knowledge of the overlapping communities (Pareto set) obtained by the MOEAs. In this paper, we propose an Overlapping Community Detection algorithm using a Multi-objective approach and Rough Clustering, denoted as MOOCD-RC. Our algorithm allows selecting the final solution based on subjective information, such as the number of vertices located in the cores or boundaries of the communities. As a consequence, it helps decision-makers (DM) incorporate their domain knowledge into the community detection process. Our main contributions are as follows:

1. We define an indiscernibility relationship between vertices of the network, based on the number of communities of the Pareto Set in which they match.
2. We use the Rough Clustering foundations to build and describe the final covering of the network through the lower and upper approximations of the communities.

This paper is organized as follows. Section 2 briefly introduces the necessary notions of the multi-objective community detection problem and Rough Clustering. In Sect. 3, we introduce our proposal.
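The lower and upper approximations described above can be illustrated with a minimal Python sketch. All names here are illustrative, not from the paper; the toy universe is partitioned by a parity relation and a "vague" concept is approximated by the union of the equivalence classes it fully contains (lower) and the classes it touches (upper).

```python
# Sketch: classical RST lower/upper approximations over equivalence classes.
# Names (equivalence_classes, lower_upper) are illustrative, not from the paper.

def equivalence_classes(universe, related):
    """Partition `universe` using the equivalence relation `related(x, y)`."""
    classes = []
    for x in universe:
        for c in classes:
            if related(x, c[0]):
                c.append(x)
                break
        else:
            classes.append([x])
    return [set(c) for c in classes]

def lower_upper(concept, classes):
    """Lower: classes fully inside the concept; upper: classes touching it."""
    lower = set().union(*(c for c in classes if c <= concept))
    upper = set().union(*(c for c in classes if c & concept))
    return lower, upper

# Toy universe whose objects are indiscernible by parity.
U = {1, 2, 3, 4, 5, 6}
classes = equivalence_classes(sorted(U), lambda a, b: a % 2 == b % 2)
X = {2, 4, 5, 6}                      # a "vague" concept
low, up = lower_upper(X, classes)
print(low)  # {2, 4, 6}  -> objects definitely in X
print(up)   # {1, 2, 3, 4, 5, 6}  -> objects possibly in X
```

The boundary region of X is then simply `up - low`, the objects whose membership is uncertain.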
Section 4 presents the experimental evaluation of our proposal, compared against related state-of-the-art algorithms over synthetic networks. Finally, Sect. 5 gives the conclusions and some ideas for future work.

This section introduces the background knowledge necessary for understanding the proposed method. First, the definition of the multi-objective community detection problem and the multi-objective algorithms of the related work are presented. Next, we give the basics of Rough Set Theory and Rough Clustering. Let G = (V, E) be a given network, where V is the set of vertices and E is the set of edges among the vertices. A multi-objective community detection problem aims to search for a partition P* of G such that:

P* = min_{P ∈ Ω} [f_1(P), f_2(P), ..., f_r(P)],   (1)

where P is a partition of G, Ω is the set of feasible partitions, r is the number of objective functions, f_i is the i-th objective function and min(·) denotes the simultaneous minimization of all the objective functions. With the introduction of multiple objective functions there is usually no single absolute optimal solution; thus, the goal is to find a set of Pareto optimal solutions [2]. A commonly used way to solve a multi-objective community detection problem is by using MOEAs [9]. The first algorithm using MOEAs for detecting overlapping communities is the Multiobjective Evolutionary Algorithm to solve the CDP (MEA CDP) [5]. MEA CDP uses an undirected representation of the solution and the classical Non-dominated Sorting Genetic Algorithm II (NSGA-II) with the reverse operator to search for the solutions optimising the average community fitness, the average community separation and the overlapping degree among communities. On the other hand, the Improved Multiobjective Evolutionary Algorithm to solve the CDP (iMEA CDP) [7] uses the same representation and optimization framework as MEA CDP, but it employs the PMX crossover operator and the simple mutation operator as evolutionary operators.
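The notion of Pareto optimality used above can be made concrete with a short sketch: under minimization, a solution is kept only if no other candidate is at least as good in every objective and strictly better in one. The candidate values below are illustrative, not taken from the paper.

```python
# Sketch: extracting the non-dominated (Pareto) set under minimization.

def dominates(a, b):
    """a dominates b if a is <= b in every objective and < in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_set(solutions):
    """Keep the solutions not dominated by any other solution."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t is not s)]

# Each tuple = (f1, f2) values of one candidate partition (illustrative).
candidates = [(0.2, 0.9), (0.4, 0.4), (0.9, 0.1), (0.5, 0.5), (0.3, 0.8)]
print(pareto_set(candidates))  # (0.5, 0.5) is dominated by (0.4, 0.4)
```

In a MOEA, this filtering is applied to the final population to produce the set of non-dominated solutions from which a single covering must then be selected.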
iMEA CDP employs the Modularity function [10] and a combination of the average community separation and the overlapping degree as its objective functions. The Overlapping Community Detection algorithm based on MOEA (MOEA-OCD) [9] uses the classical NSGA-II optimization framework and a representation based on adjacencies among the edges of the network. MOEA-OCD uses the negative fitness sum and the unfitness sum as objective functions. Unlike the previously mentioned algorithms, MOEA-OCD introduces a local expansion strategy into the initialization process to improve the quality of the initial solutions. Another algorithm is the Maximal Clique based MOEA (MCMOEA) [8], which first detects the set of maximal cliques of the network and then builds the maximal-clique graph. Starting from this transformation, MCMOEA uses a representation based on labels and the Multiobjective Evolutionary Algorithm based on Decomposition (MOEA/D) in order to detect the communities optimising the Ratio Cut (RC) and Kernel K-Means (KKM) objective functions [11]. In [16], the authors combine Granular Computing and a multi-objective optimization approach for discovering overlapping communities in social networks. This algorithm, denoted as MOGR-OV, starts by building a set of seeds that is afterwards processed for building overlapping communities, using three steps named expansion, improvement and merging. Most of the existing works focus on developing MOEAs to detect overlapping communities but do not address the problem of selecting a final solution from the set of obtained non-dominated solutions.

The main components of Rough Set Theory are an information system and an indiscernibility relation [17]. Classical RST was originally proposed using a particular type of indiscernibility relation called an equivalence relation (i.e., one that is reflexive, symmetric and transitive). Yao et al.
[19] described various generalizations of rough sets obtained by relaxing the assumption of an underlying equivalence relation. RST studies the vagueness of a concept through a pair of precise concepts, named the lower and upper approximations. The lower approximation consists of all objects which surely belong to the concept, whereas the upper approximation contains all objects which possibly belong to the concept. The boundary region of the vague concept is the difference between the upper and the lower approximations [18]. Lingras et al. [15] define another generalization of approximation sets, viewing them as interval sets. The authors propose the rough k-means algorithm, in which the concept of k-means is extended by viewing each cluster as an interval or rough set. The core idea is to separate discernible from indiscernible objects and to assign objects to the lower and upper approximations of a set X. This proposal allows overlaps between clusters [20]. The upper and lower approximation concepts must satisfy some basic rough set properties [14]:

1. An object v can be part of at most one lower approximation. This implies that any two lower approximations do not overlap.
2. An object v that is a member of the lower approximation of a set is also part of its upper approximation. This implies that the lower approximation of a set is a subset of its corresponding upper approximation.
3. If an object v is not part of any lower approximation, it belongs to two or more upper approximations. This implies that an object cannot belong to only a single boundary region.

Incorporating rough sets into k-means clustering requires adapting the calculation of the centroids and deciding whether an object is assigned to the lower or upper approximation of a cluster. First, the centroids of the clusters are calculated including the effects of both the lower and the upper approximations.
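The rough k-means membership decision can be sketched in a few lines. The Euclidean distance and the ratio threshold `epsilon` below are illustrative assumptions (Lingras-style implementations use a threshold on relative distances to decide ambiguity); the three properties listed above are preserved: an unambiguous object goes to one lower approximation, an ambiguous one to two or more upper approximations.

```python
# Sketch of the rough k-means assignment rule, assuming Euclidean distance
# and an illustrative ratio threshold `epsilon` (not the paper's notation).
import math

def assign_rough(point, centroids, epsilon=1.3):
    """Return (lower_index_or_None, upper_indices) for one object."""
    d = [math.dist(point, c) for c in centroids]
    nearest = min(range(len(d)), key=d.__getitem__)
    # clusters whose distance is close enough to the nearest one
    upper = [j for j in range(len(d)) if d[j] <= epsilon * d[nearest]]
    if len(upper) == 1:
        return nearest, [nearest]   # unambiguous: lower and upper of one cluster
    return None, upper              # ambiguous: boundary of several clusters

print(assign_rough((1, 0), [(0, 0), (10, 0)]))  # (0, [0])
print(assign_rough((5, 0), [(0, 0), (10, 0)]))  # (None, [0, 1])
```

Centroids are then recomputed by weighting lower-approximation members and boundary members differently, which is the adaptation mentioned in the text.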
Next, an object is assigned to the lower approximation of a cluster when the distance (similarity) between the object and that cluster center is clearly smaller than the distances to the remaining cluster centers [14].

The proposed algorithm obtains a final covering in two steps. It starts by building sets of indiscernible (similar) objects that form basic granules of knowledge on the network G = (V, E), where V represents the set of nodes and E represents the set of edges connecting them. Thus, a partition of the set V is obtained, allowing us to define an equivalence relation on V. From our point of view, two vertices should be related if they share many communities in the Pareto Set. Next, through the Rough Clustering foundations, specifically the ideas of the rough k-means algorithm [15], we build the final covering of the network by viewing each community as a rough set, which allows us to obtain overlapping communities.

In the first step, we build a set of granules which represents a partition of V. First, we describe the concepts applied in our proposal. Given the Pareto Set PS = {P_1, ..., P_ps}, the similarity between two vertices v_i, v_j ∈ V is defined as:

match(v_i, v_j) = mc(v_i, v_j) / (|G_{v_i}| · |G_{v_j}|),   (2)

where ps is the number of solutions in PS, mc(v_i, v_j) is the number of communities of PS shared by v_i and v_j, and G_{v_i} is the set of communities of PS to which v_i belongs. We build the thresholded similarity graph G_β = (V, E_β) based on Eq. 2 and the user-defined parameter β (β ∈ [0, 1]), in which two vertices are connected if their similarity is at least β. Let G_r = {G_r1, G_r2, ..., G_rq} be the β-connected component set. By definition, the connected component set of a graph constitutes a partition of the set of vertices. We say that a vertex v_i ∈ V is related to a vertex v_j ∈ V, denoted as v_i R_PS v_j, if and only if there exists G_ri ∈ G_r such that v_i, v_j ∈ G_ri, R_PS being an equivalence relation. The set of all vertices related to a vertex v_i forms the so-called equivalence class [v_i]_{R_PS}. Hence, G_ri is the subgraph of G = (V, E) induced by [v_i]_{R_PS}.
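The first step above, building the β-thresholded similarity graph and taking its connected components as equivalence classes, can be sketched as follows. The vertex similarity is passed in as a precomputed dictionary (in the paper it comes from the shared Pareto-set communities of Eq. 2); the values here are a toy example.

```python
# Sketch: beta-thresholded similarity graph and its connected components,
# which play the role of the equivalence classes (granules). The `sim`
# values below are illustrative, not computed from a real Pareto Set.

def components(vertices, sim, beta):
    """Connected components of the graph with edges {u, v}: sim(u, v) >= beta."""
    adj = {v: set() for v in vertices}
    for u in vertices:
        for v in vertices:
            if u != v and sim.get(frozenset((u, v)), 0.0) >= beta:
                adj[u].add(v)
                adj[v].add(u)
    seen, comps = set(), []
    for v in vertices:            # depth-first traversal over unseen vertices
        if v in seen:
            continue
        stack, comp = [v], set()
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(adj[x] - comp)
        seen |= comp
        comps.append(comp)
    return comps

sim = {frozenset(p): s
       for p, s in [(("a", "b"), 0.9), (("b", "c"), 0.8),
                    (("d", "e"), 0.95), (("c", "d"), 0.3)]}
comps = components(["a", "b", "c", "d", "e"], sim, beta=0.75)
# -> two equivalence classes: {a, b, c} and {d, e}; the c-d edge is pruned.
```

Raising β prunes more edges and therefore yields smaller granules, which is exactly the behaviour of the β parameter discussed later in the paper.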
Therefore, G_r is viewed as a set of granules of indistinguishable elements which do not share vertices. These granules constitute our initial granularity criterion [21], and we will also use them to build the final covering of the network. We take the k biggest granules G_ri ∈ G_r, according to their number of vertices, as cluster prototypes, and the remaining granules are assigned to the selected ones. The idea is thus to initially cover the network with the granules of indistinguishable vertices that provide the greatest coverage of the network. The value of k, 1 ≤ k ≤ q, is set to the median of the number of clusters over the solutions in the Pareto Set. For this purpose, we define a similarity function S_Gr between any two granules G_ri, G_rj ∈ G_r (Eq. 3). As described in Sect. 2, the use of k-means clustering in Rough Clustering requires adapting the calculation of the centroids (cluster prototypes) and deciding whether an object is assigned to the lower or upper approximation of a cluster. In our case, we selected as community prototypes the k biggest granules, according to their number of vertices. Next, the remaining granules are assigned to those selected ones. A granule G_ri is assigned to the lower approximation of a community when the similarity between G_ri and the prototype of that community, G_rj, 1 ≤ j ≤ k, is much greater than its similarity to the remaining prototypes. The similarity function defined in Eq. 3 is used for deciding whether the remaining granules are assigned to the lower or upper approximation of the selected k granules. It is worth noting that in this step the assignment process uses the granules obtained in the previous step, G_r = {G_r1, G_r2, ..., G_rq}. The selected k biggest granules represent the initial communities of the network and also their lower approximations.
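The prototype-selection step can be sketched as follows. Since Eq. 3 is not reproduced here, the inter-granule similarity below (number of edges between the two granules, normalized by the product of their sizes) is an illustrative stand-in, not the paper's S_Gr; only the "take the k largest granules" rule is from the text.

```python
# Sketch: choose the k biggest granules as community prototypes.
# `granule_sim` is an illustrative stand-in for the paper's Eq. 3.

def k_prototypes(granules, k):
    """Indices of the k largest granules (prototypes) and of the rest."""
    order = sorted(range(len(granules)),
                   key=lambda i: len(granules[i]), reverse=True)
    return order[:k], order[k:]

def granule_sim(g1, g2, edges):
    """Fraction of possible inter-granule edges actually present."""
    between = sum(1 for u, v in edges
                  if (u in g1 and v in g2) or (u in g2 and v in g1))
    return between / (len(g1) * len(g2))

granules = [{1, 2, 3}, {4, 5}, {6}, {7, 8, 9, 10}]
protos, rest = k_prototypes(granules, k=2)
print(protos, rest)  # [3, 0] [1, 2]
print(round(granule_sim(granules[0], granules[1], [(1, 4), (2, 4), (7, 1)]), 3))
```

In the paper, k itself is not user-chosen but set to the median number of communities over the Pareto-set solutions.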
The remaining granules G_ri, k < i ≤ q, become part of the lower or upper approximations of the communities according to the similarity S_Gr and the user-defined parameter γ (γ ∈ [0, 1]). The pseudocode of MOOCD-RC is shown in Algorithm 1. It is important to notice that the Pareto Set used is the result of applying the MOGR-OV algorithm [16]. In MOOCD-RC, the cover CV is initially formed by the k greatest granules in G_r, which represent the lower approximations of the communities. These k selected granules represent the prototypes of the communities to be built. Afterward, the remaining granules are included in the lower or upper approximations of the communities in CV according to S_Gr. It is worth noting that the lower approximation of each community is formed by the vertices that definitely belong to it, whereas the upper approximation is formed by the vertices located at the boundary of the community. These vertices constitute the overlapping among communities. In the first step, the building of the equivalence classes is tightly bound to the thresholded similarity graph G_β = (V, E_β), which in turn depends on the user-defined parameter β. The higher the value of β, the smaller the obtained granules, and vice versa. On the other hand, in the second step, the sizes of the lower and upper approximations of the communities depend on the user-defined parameter γ. The core assignment rule of Algorithm 1, where T denotes the set of prototype granules whose similarity to the remaining granule G_rj is within γ of that of the most similar prototype G_rmax, is: if |T| > 1, then for each G_ri ∈ T add G_rj to the upper approximation of its associated community CV_i; otherwise, add G_rj to both the lower and upper approximations of the community CV_i associated with G_rmax; finally, return CV. Depending on how γ changes, we obtain tighter or looser community boundaries. The parameters β and γ allow decision-makers to obtain a final covering of the network by adjusting the cores or boundaries of the communities. In our experiments, we set β = 0.75 and γ = 0.1; we chose these values according to the related works [13, 14, 20].
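The assignment step described above can be sketched as follows. This is a reading of Algorithm 1's rule, assuming a given granule-similarity function `sim(i, j)` (the paper's S_Gr) and interpreting T as the prototypes whose similarity to the remaining granule is within γ of the best one; all names are illustrative.

```python
# Sketch of the MOOCD-RC assignment step: a remaining granule near-tied
# (within gamma) with several prototypes goes to all their upper
# approximations; otherwise it goes to the lower and upper approximation
# of its single most similar prototype. Granules are referred to by index.

def assign_granules(protos, rest, sim, gamma):
    lower = {p: {p} for p in protos}   # each prototype seeds its own lower approx.
    upper = {p: {p} for p in protos}
    for g in rest:
        s = {p: sim(g, p) for p in protos}
        best = max(s, key=s.get)                      # the G_rmax prototype
        T = [p for p in protos if s[best] - s[p] <= gamma]  # near-ties
        if len(T) > 1:
            for p in T:
                upper[p].add(g)        # boundary granule: several upper approx.
        else:
            lower[best].add(g)         # core granule: one lower (and upper) approx.
            upper[best].add(g)
    return lower, upper

# Illustrative similarities: granule 2 clearly belongs to prototype 0,
# granule 3 is near-tied between prototypes 0 and 1.
table = {(2, 0): 0.9, (2, 1): 0.2, (3, 0): 0.5, (3, 1): 0.45}
lower, upper = assign_granules([0, 1], [2, 3],
                               lambda g, p: table[(g, p)], gamma=0.1)
print(lower)  # {0: {0, 2}, 1: {1}}
print(upper)  # {0: {0, 2, 3}, 1: {1, 3}}
```

With γ = 0, only exact ties reach several upper approximations; widening γ widens the boundaries, matching the behaviour of the parameter described in the text.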
In this section, we conduct several experiments to evaluate the effectiveness of our proposal. Since the built-in communities of the benchmark networks are already known, we use the Normalized Mutual Information (NMI) external evaluation measure to test the performance of the different community detection algorithms. Hence, the experiments focused on evaluating the accuracy attained by our proposal in terms of the NMI value. Our algorithm was applied to synthetic networks generated with the Lancichinetti-Fortunato-Radicchi (LFR) benchmark [4]. Its performance was compared against that attained by the MEA CDP [5], iMEA CDP [7], MCMOEA [8] and MOEA-OCD [9] algorithms, described in Sect. 2. The algorithms of the related work do not build a final covering from the communities of the Pareto Set; thus, we choose the best solution in the Pareto Set according to the NMI and compare this solution with the ones obtained by our algorithm. The NMI takes values in [0, 1] and evaluates a set of communities based on how much it resembles a set of communities manually labeled by experts, where 1 means identical results and 0 completely different results. In LFR benchmark networks, both the node degrees and the community sizes follow power-law distributions, regulated by the parameters τ_1 and τ_2. Besides, the significance of the community structure is controlled by a mixing parameter μ, which denotes the average fraction of edges each vertex shares with vertices from other communities of the network. The smaller the value of μ, the more significant the community structure of the LFR benchmark network. The parameter O_n is specifically defined for controlling the overlapping rate of the communities in the network: O_n is the number of overlapping nodes, measuring the overlapping density among communities. Similar to μ, the higher the value of O_n, the more ambiguous the community structure.
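For intuition, standard NMI between two disjoint labelings can be computed in a few lines. The paper uses an extension of NMI that handles overlapping covers [3]; the sketch below covers only the disjoint case and uses the common geometric-mean normalization (an implementation choice, not from the paper), but it conveys the scale: 1 for identical groupings, near 0 for independent ones.

```python
# Sketch: standard NMI between two disjoint labelings (geometric-mean
# normalization). The paper's experiments use an overlapping-aware variant.
import math
from collections import Counter

def nmi(a, b):
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    def h(counts):  # entropy of a labeling, in nats
        return -sum((m / n) * math.log(m / n) for m in counts.values())
    # mutual information from the joint label counts
    mi = sum((m / n) * math.log(n * m / (ca[x] * cb[y]))
             for (x, y), m in cab.items())
    denom = math.sqrt(h(ca) * h(cb)) or 1.0   # guard the single-cluster case
    return mi / denom

print(round(nmi([0, 0, 1, 1], [1, 1, 0, 0]), 6))  # 1.0 (same partition, relabeled)
print(round(nmi([0, 1, 0, 1], [0, 0, 1, 1]), 6))  # 0.0 (independent partitions)
```

Note that NMI is invariant under relabeling, which is why the benchmark's ground-truth labels can be compared directly against any algorithm's output.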
In the first part of the experiment, we set the network size to N = 1000, τ_1 = 2 and τ_2 = 1; the node degrees lie in [0, 50] with an average value of 20, whilst the community sizes vary from 10 to 50 elements. Using the previous parameter values, we vary μ from 0.1 to 0.6 with an increment of 0.05. Afterward, we set μ = 0.1 and μ = 0.5, and we vary the percentage of overlapping nodes in the network (parameter O_n of the LFR benchmark) from 0.1N to 0.5N with an increment of 0.1; the other parameters remain the same as in the first experiment. The average NMI value attained by each algorithm over the LFR benchmark when μ varies from 0.1 to 0.6 with an increment of 0.05 is shown in Fig. 1. As the value of μ increases, the performance of each algorithm deteriorates, with MOEA-OCD and MOOCD-RC performing the best. As the mixing parameter μ exceeds 0.5, the performance of the MOEA-OCD algorithm begins to decline and it is outperformed by MOOCD-RC. Figure 1 shows the good performance of our method. To summarize the above results, we evaluated the statistical significance of the NMI values using the Friedman test, a non-parametric statistical procedure included in the KEEL software tool, together with the Holm and Finner post hoc methods. Table 1 shows the average ranks obtained by each method in the Friedman test. Our method ranks second; however, Table 2 shows the overall performance of MOEA-OCD with respect to the remaining algorithms. As shown in Fig. 2, our proposal and MOEA-OCD have an almost stable performance, independently of the number of overlapping nodes in the network, with MOEA-OCD performing the best. On the other hand, when the structure of the communities is uncertain, the performance of the MOEA-OCD algorithm drops off as the overlapping in the network increases, our proposal being the one that performs better, as shown in Fig. 3. Similar to the previous experiment, we evaluated the statistical significance of the NMI values.
Table 3 shows the average ranks obtained by each algorithm in the Friedman test. The Friedman statistic, distributed according to chi-square with three degrees of freedom, is 25.92, and the p-value computed by the Friedman test is 0.00001. Our algorithm ranks second; however, as in the previous experiment, Table 4 shows the overall performance of MOEA-OCD with respect to the remaining algorithms, where there is no statistically significant difference between our proposal and MOEA-OCD. From the above experimental results, we can conclude that MOEA-OCD and our proposal have outstanding performances on LFR benchmark networks in most cases. However, our algorithm employs the information contained in the communities of the Pareto Set to build a final covering of the network. Even if the solutions of the Pareto Set do not contain overlapping communities, our proposal does not depend on this property for building the final communities; thus, our algorithm can be used with multi-objective evolutionary algorithms that build either disjoint or overlapping community structures. It should be noted that our proposal depends on the obtained non-dominated solutions; in these experiments, we used the MOGR-OV algorithm [16] to generate the Pareto Set. On the other hand, the settings of β and γ strongly influence the obtained final covering, as we briefly describe next. In the above experiments, the parameters β and γ were fixed to 0.75 and 0.1, respectively. Depending on how those parameters change, the resulting community boundaries are tighter or looser. Hence, both of them allow decision-makers to analyze the network according to the domain problem. Using the synthetic network generated above with the parameter values μ = 0.1 and O_n = 0.1N, we show the overlapping communities at different lower and upper approximation scales. For that, we change the γ parameter and keep the same β value used in the experiments.
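The Friedman statistic reported above can be reproduced from a results matrix with the classic chi-square approximation χ² = 12n/(k(k+1)) · Σ_j R_j² − 3n(k+1), where n is the number of datasets, k the number of algorithms and R_j the average rank of algorithm j. The sketch below assumes higher scores are better (as with NMI) and ignores rank ties; the input values are illustrative, not the paper's.

```python
# Sketch: Friedman test statistic from a scores matrix
# (rows = datasets/networks, columns = algorithms). Ties are not handled.

def friedman_chi2(scores):
    n, k = len(scores), len(scores[0])
    ranks = [[0.0] * k for _ in range(n)]
    for i, row in enumerate(scores):
        order = sorted(range(k), key=lambda j: row[j], reverse=True)  # rank 1 = best
        for r, j in enumerate(order, start=1):
            ranks[i][j] = float(r)
    R = [sum(ranks[i][j] for i in range(n)) / n for j in range(k)]     # average ranks
    chi2 = 12 * n / (k * (k + 1)) * sum(r * r for r in R) - 3 * n * (k + 1)
    return R, chi2

# Illustrative NMI-like scores: the first algorithm always wins.
R, chi2 = friedman_chi2([[0.9, 0.5, 0.3],
                         [0.8, 0.6, 0.4]])
print(R, chi2)  # [1.0, 2.0, 3.0] 4.0
```

The statistic is then compared against a chi-square distribution with k − 1 degrees of freedom, and post hoc procedures such as Holm or Finner adjust the pairwise comparisons against the best-ranked method.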
The parameter γ allows tuning the boundaries of the communities: the higher the value of γ, the wider the boundaries, and vice versa, which means that there will be more or fewer overlapping vertices, respectively. Accordingly, we build two coverings of the obtained synthetic network by considering γ = 0.1 and γ = 0.25. For a better comprehension of the studied network, we used the graph analysis tool Gephi, which employs both the network properties (e.g., vertex degree) and the identified communities in the visualization process. Figures 4 and 5, shown next, were obtained using the ForceAtlas2 [23] layout method of Gephi. As shown in Figs. 4 and 5, the covering obtained using γ = 0.25 shows wider community boundaries than the covering obtained with γ = 0.1. Thus, the communities shown in Fig. 5 have more overlapping vertices than the communities shown in Fig. 4. The overlapping vertices are drawn larger than the others and are placed at the boundaries of the communities. As described before, the parameter γ allows the DM, based on their own knowledge, to tighten or widen the boundaries of the communities; in this way, the decision-maker has a mechanism to weigh the importance of the lower and upper approximations in the obtained communities. Moreover, the adjustment of β and γ directly controls the final covering. It is worth noting that our algorithm builds the final covering using only the information about the communities of the Pareto Set.

In this paper, we proposed a new algorithm, named MOOCD-RC, for discovering overlapping communities through a combination of a multi-objective approach and Rough Clustering. It is composed of two steps: (a) building the granules of indiscernible objects, and (b) building the final covering of the network. In the first step, MOOCD-RC defined an equivalence relation between each pair of vertices of the network through the thresholded similarity graph.
The obtained equivalence classes under the indiscernibility relation induce a granule set, which constitutes our initial granularity criterion; we also use them to build the final covering of the network. Afterward, in the second step, the algorithm built the resulting communities through Rough Clustering, taking the k greatest granules as prototypes of the communities; these also represent the lower approximations of their own communities. The MOOCD-RC algorithm was evaluated over synthetic networks in terms of its accuracy and was compared against four algorithms of the related work. From the above experimental results, we can draw the conclusion that MOEA-OCD and our algorithm have outstanding performances on LFR benchmark networks in most cases. Moreover, this evaluation showed that MOOCD-RC is promising and effective for overlapping community detection in complex networks. As future work, we would like to develop a more automatic adjustment of the β and γ parameters.

References

[1] A survey of tools for community detection and mining in social networks
[2] Multi-objective community detection in complex networks
[3] Detecting the overlapping and hierarchical community structure of complex networks
[4] Benchmark graphs for testing community detection algorithms
[5] Separated and overlapping community detection in complex networks using multiobjective evolutionary algorithms
[6] Uncovering the overlapping community structure of complex networks in nature and society
[7] An improved multi-objective evolutionary algorithm for simultaneously detecting separated and overlapping communities
[8] A maximal clique based multiobjective evolutionary algorithm for overlapping community detection
[9] Overlapping community detection in complex networks using multi-objective evolutionary algorithm
[10] Detect overlapping and hierarchical community structure in networks
[11] Complex network clustering by multiobjective discrete particle swarm optimisation based on decomposition
[12] Rough-fuzzy collaborative clustering
[13] Qualitative and quantitative combinations of crisp and rough clustering schemes using dominance relations
[14] Applying rough set concepts to clustering
[15] Interval set clustering of Web users with rough k-means
[16] Multiobjective overlapping community detection algorithms using granular computing
[17] Rough Sets: Theoretical Aspects of Reasoning About Data
[18] Rough sets: some extensions
[19] Generalization of rough sets using modal logic
[20] An evolutionary rough partitive clustering
[21] Granular computing: basic issues and possible solutions
[22] A generalized definition of rough approximations based on similarity
[23] ForceAtlas2, a continuous graph layout algorithm for handy network visualization designed for the Gephi software