DenGraph-HO: A Density-based Hierarchical Graph Clustering Algorithm

Nico Schlitter, Tanja Falkowski and Jörg Lässig

Abstract
DenGraph-HO is an extension of the density-based graph clustering algorithm DenGraph. It is able to detect dense groups of nodes in a given graph and produces a hierarchy of clusters which can be computed efficiently. The generated hierarchy can be used to investigate the structure and the characteristics of social networks. Each hierarchy level provides a different level of detail and can be used as the basis for interactive visual social network analysis. After a short introduction of the original DenGraph algorithm we present DenGraph-HO and its top-down and bottom-up approaches. We describe the data structures and memory requirements and analyse the run time complexity. Finally, we apply the DenGraph-HO algorithm to real-world datasets obtained from the online music platform Last.fm and from the former U.S. company Enron.

1. Introduction
In 2011, we proposed DenGraph-HO in order to fulfill the special needs of social network analysts (Schlitter, Falkowski & Lässig 2011). In most cases, the visual inspection of a network is the first step of the analytical process and helps to determine the basic graph characteristics and further actions. DenGraph-HO supports this early stage by providing a quick visual analysis of the network structure. It provides the ability to zoom into network clusterings and has proven its usefulness in our practical work.
Our approach differs from traditional hierarchical clustering methods in that DenGraph-HO is a non-partitioning clustering algorithm: not all nodes are necessarily members of clusters. In addition, the proposed hierarchy is not strictly built up by the classic divisive or agglomerative approach known from the literature. We generalize these methods and propose a top-down approach and a bottom-up approach by extending the hierarchy paradigms.
The proposed hierarchy supports superordinate clusters that contain subclusters. Each level of the hierarchy represents a clustering that fulfills the original DenGraph paradigms, which will be presented in Section 2.1. The levels, respectively the clusterings, differ in the density that is required to form a cluster. While lower-level clusterings aggregate nodes with a lower similarity, higher-level clusterings require a higher similarity between nodes. The density-based cluster criteria are controlled by the parameters η and ε, which are iteratively increased or decreased to obtain the different levels of the hierarchy. An existing clustering is used to compute the clustering of the next level. The efficiency of our algorithm is based on this iterative sequence of cluster adaptations instead of a completely new clustering for each level.
The remainder of this article is organized as follows. Section 2 discusses related work and introduces the original DenGraph algorithm and its variations DenGraph-O and DenGraph-IO. Section 3 covers the proposed top-down and bottom-up approaches of DenGraph-HO as well as the used data structures, their memory requirements and the algorithm's run time complexity. Its usability is demonstrated in Section 4 by applying DenGraph-HO to two real-world datasets. Finally, a conclusion and an outlook are given in Section 5.

2. Related and Previous Work
Clustering is a data mining method which is used to detect hidden structures in huge data sets.
Its purpose is to bundle single objects into groups in such a way that objects of the same group are more similar to each other than to objects of other groups. K-means (MacQueen 1967) is a commonly used and well-studied clustering algorithm for spatial data. The algorithm strictly groups all data points into clusters. The number of clusters has to be predefined and stays constant during the clustering process. Each data point belongs to exactly one cluster. This can be seen as one major drawback of the algorithm since it does not deal with the concept of outliers.
Density-based approaches like DBSCAN (Ester, Kriegel, Sander & Xu 1996) require no previous knowledge of the number of clusters hidden in the data. Clusters are defined as regions that have a high density of data points and that are surrounded by regions of data points of lower density. Since data points may not belong to any cluster, DBSCAN is a non-partitioning clustering method, which provides the ability to handle outliers. The major drawback of DBSCAN is its inability to find nested or overlapping clusters. Therefore, the authors extended their original approach to overcome this problem. The algorithms OPTICS (Ankerst, Breunig, Kriegel & Sander 1999), HiSC (Achtert, Böhm, Kriegel, Kröger, Müller-Gorman & Zimek 2006), HiCO (Achtert, Böhm, Kröger & Zimek 2006) and DiSH (Achtert, Böhm, Kriegel, Kröger, Müller-Gorman & Zimek 2007) were designed to generate cluster hierarchies.
Since density-based clustering approaches provide significant advantages they have also been used for graph clustering purposes. DenShrink (Huang, Sun, Han & Feng 2011), a parameter-free algorithm for community detection, applies modularity optimization to reveal the embedded hierarchical structures with various densities in large-scale weighted undirected networks. DenGraph (Falkowski, Barth & Spiliopoulou 2007) and SCAN (Xu, Yuruk, Feng & Schweiger 2007) were independently proposed as extensions of DBSCAN. Both methods operate on undirected graphs, use a local cluster criterion and provide the ability to detect communities in social networks. However, a comparison of both methods (Kriegel, Kröger, Ntoutsi & Zimek 2011) pointed out that they differ significantly in the applied similarity function. DenGraph operates on weighted graphs and considers the weight of an edge between two nodes as the similarity of these nodes. In contrast, SCAN uses a similarity function that is based on the topology of the underlying graph: the similarity of two nodes correlates with the number of neighbors they share. Later, SCAN's similarity function was also used for the divisive hierarchical clustering algorithm DHSCAN (Yuruk, Mete, Xu & Schweiger 2007) and the agglomerative hierarchical clustering algorithm AHSCAN (Yuruk, Mete, Xu & Schweiger 2009).
Since 2007, the development of DenGraph has continued as well. We proposed the extensions DenGraph-O, which allows overlapping clusters, and DenGraph-IO, which tracks community evolution over time (Falkowski, Barth & Spiliopoulou 2008). In the following, we briefly introduce the original DenGraph algorithm and its extensions. In Section 3 we then propose the density-based hierarchical algorithm DenGraph-HO.

2.1. DenGraph
Given a graph G = (V,E) consisting of a set of nodes V and a set of weighted, undirected edges E, the DenGraph algorithm produces a clustering ζ = {C1,...,Ck} where each cluster Ci (i = 1,...,k) consists of nodes VCi ⊆ V.
Since DenGraph is a non-partitioning clustering algorithm there can be noise nodes VN = {u ∈ V | u ∉ Ci} that are not part of any cluster Ci. The remaining non-noise nodes are members of exactly one cluster and are either core nodes or border nodes.

Definition The ε-neighborhood Nε(u) of a node u ∈ V is defined as the set of nodes that are connected to u and have a distance less than or equal to ε: Nε(u) = {v ∈ V | (u,v) ∈ E ∧ dist(u,v) ≤ ε}, where dist(u,v) is the distance between u and v.

A node u ∈ V is considered a core node if it has an ε-neighborhood of at least η neighbor nodes (|Nε(u)| ≥ η). Nodes which are in the ε-neighborhood of a core node but do not have a sufficiently large ε-neighborhood of their own are called border nodes.
According to (Falkowski 2009), the actual cluster criterion is based on the concepts directly density-reachable, density-reachable and density-connected, which are defined below and illustrated in Figure 1.

Definition Let u,v ∈ V be two nodes. Node u is directly density-reachable from v within V with respect to ε and η if and only if v is a core node and u is in its ε-neighborhood, i.e. u ∈ Nε(v).

Definition Let u,v ∈ V be two nodes. Node u is density-reachable from v within V with respect to ε and η if there is a chain of nodes p1,...,pn such that p1 = v, pn = u and for each i = 2,...,n it holds that pi is directly density-reachable from pi−1 within V with respect to ε and η.

Definition Let u,v ∈ V be two nodes. Node u is density-connected to v within V with respect to ε and η if and only if there is a node m ∈ V such that u is density-reachable from m and v is density-reachable from m.

Figure 1: Concepts of Connectivity (Falkowski 2009)

Definition Let G(V,E) be an undirected, weighted graph. A non-empty set C ⊆ V is denoted as a cluster with respect to ε and η if and only if:
• For all u,v ∈ V: if u ∈ C and v is density-reachable from u, then v ∈ C.
• For all u,v ∈ C: u is density-connected to v within V with respect to ε and η.

The complete DenGraph procedure is described in Algorithm 1. It uses a stack in order to process the graph nodes. In a first step, all nodes in V are marked as noise. Afterwards, each so far unprocessed node v is visited and its ε-neighborhood is checked. If the neighborhood contains at least η nodes (|Nε(v)| ≥ η), the node v is marked as core and a new cluster is created. Each of v's neighbors within the ε-neighborhood is marked as border, becomes a member of the new cluster and is pushed onto the stack. After processing all neighbors, each node u popped from the stack is checked for the core condition and marked correspondingly. If u becomes a core node, all of its neighbors are marked as border and pushed onto the stack. This procedure is repeated until all nodes of the graph are processed.
According to Algorithm 1, the time complexity of the procedure mainly depends on the number of nodes and the number of edges. Each node is visited once and each edge is processed at most twice, once from each of its end-nodes. Consequently, the overall run time complexity of the DenGraph algorithm is O(|V| + |E|), where |V| is the number of nodes and |E| the number of edges (Falkowski 2009).
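To make the node-state check concrete, the following Java sketch illustrates the ε-neighborhood computation and the core-node condition that Algorithm 1 evaluates for every visited node. The adjacency representation and the class name are our own simplifying assumptions and not taken from the original implementation.

import java.util.*;

// Sketch of the epsilon-neighborhood and the core-node condition (|N_eps(u)| >= eta).
// The adjacency map (node -> neighbor -> distance) is a simplified assumption.
class NodeStateCheck {

    static Set<String> epsNeighborhood(Map<String, Map<String, Double>> adj,
                                       String u, double eps) {
        Set<String> neighborhood = new HashSet<>();
        for (Map.Entry<String, Double> edge : adj.getOrDefault(u, Collections.emptyMap()).entrySet()) {
            if (edge.getValue() <= eps) {      // dist(u,v) <= eps
                neighborhood.add(edge.getKey());
            }
        }
        return neighborhood;
    }

    // A node is a core node if its eps-neighborhood contains at least eta nodes.
    static boolean isCore(Map<String, Map<String, Double>> adj,
                          String u, double eps, int eta) {
        return epsNeighborhood(adj, u, eps).size() >= eta;
    }
}

A node that is not core but lies in the ε-neighborhood of a core node would then be classified as border, all remaining nodes as noise.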
Figure 2 shows the clustering of an exemplary interaction graph that was clustered by applying the original DenGraph. Core nodes are blue, border nodes are green and noise nodes are drawn in red. Each cluster contains one or more core nodes which are connected with each other. Following the DenGraph paradigm, the total number of nodes per cluster is greater than η. Nodes that are not members of any cluster are considered noise nodes.

Figure 2: DenGraph Clustering (Falkowski 2009)

Algorithm 1: DenGraph
input : Graph, Clustering ζ, η, ε
output: Clustering ζ
begin
    foreach (u ∈ V | u.state = noise) do
        if (|Nε(u)| ≥ η) then
            Cluster = CreateNewCluster();
            Cluster.addNode(u);
            u.state = core;
            foreach n ∈ Nε(u) do
                Cluster.addNode(n);
                n.state = border;
                stack.push(n);
            repeat
                v = stack.pop();
                if (|Nε(v)| ≥ η) then
                    v.state = core;
                    foreach n ∈ Nε(v) | n.state ≠ core do
                        Cluster.addNode(n);
                        n.state = border;
                        stack.push(n);
            until stack is empty;
    return ζ;

Figure 3: DenGraph-O Clustering (Falkowski 2009)

2.2. DenGraph-O
Practical work with DenGraph in the field of social network analysis revealed a minor drawback: while in real-world applications nodes, respectively human beings, might be part of more than one community, the original DenGraph algorithm does not allow clusters to overlap. The affected nodes were exclusively assigned to only one of the cluster candidates. This issue was addressed in (Falkowski et al. 2008). The extended version, called DenGraph-O, allows border nodes to be members of multiple clusters. As a consequence, several clusters of a graph might overlap. An example of such overlapping clusters is illustrated in Figure 3. Cluster 1 overlaps with Cluster 2 and Cluster 5, while Cluster 3 overlaps with Cluster 5 and Cluster 4.

2.3. DenGraph-IO
In 2007, Falkowski et al. proposed DenGraph-IO (Falkowski & Barth 2007, Falkowski et al. 2008) to analyse the dynamics of communities over time. The authors compared the changes between clusterings that were obtained at different points in time. For this, it is necessary to iteratively compute the graph clusterings that are subsequently observed over time. A huge computational effort would be necessary if the original DenGraph were used to process multiple consecutive snapshots of social networks. However, as social structures often change slowly over time, the graphs Gt and Gt+1 differ only slightly. Therefore, a complete re-clustering, as the use of the original DenGraph algorithm would demand, is quite inefficient. The incremental clustering algorithm DenGraph-IO addresses this issue and updates an existing clustering based on the changes of the underlying graph. The graph changes either by adding or removing nodes, adding or removing edges, or by changing the weight of an existing edge. As a result of the graph updates, new clusters may appear or existing clusters may be removed, merged or split. Since the DenGraph-IO algorithm deals exclusively with the parts of the graph that changed from one point in time to the next, the computational complexity is dramatically reduced and even huge networks can be processed in reasonable time. Our experiments using a real-world social network obtained from Last.fm show that handling 2,500 graph updates using DenGraph-IO is about 400 times faster than a re-clustering with DenGraph-O (Schlitter & Falkowski 2009, Falkowski 2009).

3. DenGraph-HO
One challenge of the DenGraph algorithm is the choice of the parameters ε and η. Several heuristics were investigated (Falkowski 2009); however, the "right" parameter combination mainly depends on the aim of the analysis.
If the analyst is, for example, interested in observing strongly connected nodes rather than in clusterings that show the overall structure, the parameters need to be chosen accordingly. DenGraph-HO addresses this issue by allowing a quick visual inspection of clusterings for a given set of parameter combinations. The process of zooming into or out of the network can be interactively controlled by the analyst according to his needs.
The proposed algorithm returns a hierarchical clustering that describes the structure of the underlying network. The resulting hierarchy provides multiple views of the network structure at different levels of detail. Consequently, the cluster hierarchy is an ideal basis for an efficient zooming implementation. Zooming in is done by stepping one level up in the hierarchy; it provides a more detailed view of the current cluster by presenting its subclusters. A higher level of abstraction is reached by zooming out, which is equivalent to merging similar clusters into superordinate clusters.

Figure 4: Visualization of a Hierarchical DenGraph-HO Clustering: (a) Hierarchical Clusterings, (b) Hierarchy

Figure 4 shows an exemplary graph clustering and the corresponding hierarchy as a tree of clusters. For the sake of clarity we removed the graph edges. The root of the hierarchy tree represents the whole graph and children represent subclusters of their superordinate cluster. Following this definition, the leaves of the tree correspond to the smallest clusters that have no subclusters.
The proposed DenGraph-HO hierarchy is based on the concepts of the DenGraph algorithm. Each level of the tree (besides the root) represents a valid clustering that fulfills the DenGraph-O paradigms. The hierarchy can be built by repeatedly applying DenGraph-O while using specific parameter settings for each level of the tree. The choice of the parameters η and ε is limited by constraints in order to ensure that the clusters of level l + 1 are subclusters of their superordinate clusters at level l.
Let us assume that the clustering ζl forms level l of the hierarchy and is computed by applying DenGraph-O with the parameters εl and ηl. Level l + 1 represents a clustering that is based on εl+1 and ηl+1 and guarantees a higher similarity of the nodes within a cluster. According to the description above, ζl+1 has to contain subclusters of clusters that are elements of ζl. In order to preserve this parent-child relationship we have to ensure the following constraints:
1) The parameter εl that is used to generate the clustering ζl has to be greater than or equal to the parameter εl+1 which is used to compute the clustering ζl+1: εl ≥ εl+1.
2) The parameter ηl that is used to generate the clustering ζl has to be lower than or equal to the parameter ηl+1 which is used to compute the clustering ζl+1: ηl ≤ ηl+1.
Increasing ε might lead to a transition of a node state from noise or border to core or from noise to border; a core node cannot lose its state when ε is increased. This explains why increasing ε might create a new cluster or expand an existing one and why it never causes cluster reductions or removals. The same argument holds for decreasing η and shows why the demanded cluster-subcluster relation can be guaranteed by the given constraints.
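As a small illustration of these constraints, the following Java sketch (a hypothetical helper, not taken from the original implementation) checks whether a list of (ε, η) settings, ordered from level 1 to level lmax, satisfies εl ≥ εl+1 and ηl ≤ ηl+1.

// Sketch: validate a parameter list against the monotonicity constraints above.
// eps[i] and eta[i] hold the settings of level i+1 (array index 0 = level 1).
static boolean isValidParameterList(double[] eps, int[] eta) {
    for (int i = 0; i + 1 < eps.length; i++) {
        if (eps[i] < eps[i + 1] || eta[i] > eta[i + 1]) {
            return false;   // constraint violated between two consecutive levels
        }
    }
    return true;
}

For example, the Enron settings of Table 1 (ε = 1, 0.988, 0.975, 0.963, 0.95 and η = 2, 3, 4, 5, 6) satisfy both constraints.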
In the following, we discuss how the proposed cluster hierarchy can be efficiently generated based on a list of parameter settings that fulfill the discussed constraints. An obvious approach would be to perform multiple re-clusterings using DenGraph-O until each parameter setting is processed. However, this is very inefficient and would cause a huge computational effort. The proposed DenGraph-HO algorithm addresses this issue and uses incremental parameter changes to generate the cluster hierarchy. Instead of computing a completely new clustering for each level, an existing clustering is reused and adapted.
More precisely, we discuss how an existing clustering of level l can be used to compute the clusterings of levels l + 1 and l − 1. We propose a bottom-up and a top-down approach and analyse their efficiency depending on the graph structure. The input for both approaches is a graph G = (V,E) and an existing clustering ζl = {Cl,1,...,Cl,k} that fulfills the DenGraph paradigms.

3.1. Top-down Approach: Cluster Reduction, Split or Removal
The top-down approach performs a zoom-in operation and generates a new clustering for level l + 1 of the hierarchy. Clusters of level l might be reduced, split or removed. By decreasing ε and increasing η the state of nodes within a cluster might change. A former border node might become noise (Cluster Reduction). Former core nodes might get border state (possible Cluster Splitting) or noise state (Cluster Reduction, possible Cluster Splitting or Removal).
Due to the DenGraph paradigms, it is guaranteed that noise nodes cannot reach border or core state by decreasing ε or increasing η. Thus, noise nodes will not change their state and do not need to be processed. Consequently, the top-down approach traverses just border and core nodes and performs a re-clustering for each existing cluster. For this purpose, we use the modified DenGraph-O procedure shown in Algorithm 3. Each cluster C of level l is re-clustered by applying the parameters of level l + 1.
Algorithm 2 shows how the modified DenGraph-O algorithm is used to create a complete hierarchy. After generating a parameter list of (ε,η)-pairs, an initial DenGraph-O clustering for level l = 1 is iteratively adapted to generate the clusterings of the subsequent levels.

Algorithm 2: Top-down Approach
input : Graph, lmax
output: Clustering ζ
PL = CreateParameterSettingList(lmax);
ζ1 = DenGraph(Graph, PL[1].η, PL[1].ε);
for l = 2 to lmax do
    ζl = Top-down Step(ζl−1, PL[l].η, PL[l].ε);
return ζ;

Obviously, the characteristics of the cluster hierarchy depend on the values specified in the parameter list. For a given social network analysis task it is quite hard to find appropriate parameters that lead to a useful hierarchy, especially if there is no prior knowledge about the given social network. We approach this problem by choosing parameter values which are equally distributed over the parameter search space. On this basis, parameter ranges that lead to useful hierarchical structures can be discovered. Later on, they can be explored in more detail by re-applying the clustering method with new parameter settings.

Algorithm 3: Top-down Step
input : Graph, ζ, ε, η
output: Clustering ζ
foreach (C ∈ ζ) do
    foreach r ∈ C do
        r.state = noise;
    foreach (u ∈ C | u.state = noise) do
        if (|Nε(u)| ≥ η) then
            Cluster = CreateNewCluster();
            C.addCluster(Cluster);
            Cluster.addNode(u);
            u.state = core;
            foreach n ∈ Nε(u) do
                Cluster.addNode(n);
                n.state = border;
                stack.push(n);
            while stack is not empty do
                v = stack.pop();
                if (|Nε(v)| ≥ η) then
                    v.state = core;
                    foreach n ∈ Nε(v) | n.state ≠ core do
                        Cluster.addNode(n);
                        n.state = border;
                        stack.push(n);
return ζ;
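The CreateParameterSettingList step used in Algorithm 2 is not specified further in the text. The following Java sketch shows one possible implementation under the assumption of linear spacing between user-given bounds, which matches the idea of parameter values that are equally distributed over the search space; the method signature is our own assumption.

// Sketch: create lMax (eps, eta) pairs, evenly spaced between the coarsest setting
// (epsMax, etaMin) for level 1 and the finest setting (epsMin, etaMax) for level lMax.
// The linear spacing is an assumption; any list satisfying eps_l >= eps_{l+1} and
// eta_l <= eta_{l+1} would be admissible.
static double[][] createParameterSettingList(int lMax, double epsMax, double epsMin,
                                             int etaMin, int etaMax) {
    double[][] settings = new double[lMax][2];
    for (int l = 0; l < lMax; l++) {
        double t = (lMax == 1) ? 0.0 : (double) l / (lMax - 1);
        settings[l][0] = epsMax - t * (epsMax - epsMin);             // eps decreases per level
        settings[l][1] = Math.round(etaMin + t * (etaMax - etaMin)); // eta increases per level
    }
    return settings;
}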
3.2. Bottom-up Approach: Cluster Creation, Absorption and Merging
The bottom-up approach performs a zoom-out operation and generates a new clustering for level l − 1 of the cluster hierarchy. Existing clusters of level l might grow through the absorption of nodes or through merging with other clusters. Our approach deals with changes of η only, because changing ε as well would make it necessary to process all graph nodes; in that case, our approach would not be more efficient than the original DenGraph algorithm.
Algorithm 4 shows how we expand the existing clusters by iteratively decreasing η. Doing so, the state of a core node remains unchanged. A former noise node may become a core node (cluster creation) or a border node (absorption). A former border node may become a core node (absorption). In case a former border node is a member of multiple overlapping clusters, its transition to core state leads to a merge of those clusters.

Algorithm 4: Bottom-up Approach
input : Graph, lmax, ηmax, ε
output: Clustering ζ
η = ηmax;
ζlmax = DenGraph(Graph, η, ε);
for l = lmax − 1 downto 1 do
    η = η − 1;
    ζl = Bottom-up Step(ζl+1, η, ε);
return ζ;

Algorithm 5 describes the procedure that performs the bottom-up step by processing the change of η. Since core nodes keep their state, there is no need to consider them in the bottom-up approach. Consequently, the proposed procedure traverses only noise and border nodes in order to determine their new state and to adapt the clustering accordingly. Following this argumentation, the procedure's efficiency is based on the time saved by not processing core nodes, which the original DenGraph-O would have to do.
First, the algorithm iterates over all existing clusters in order to expand them. For each cluster it traverses all border nodes and updates their state according to the parameters η and ε. In case a former border node becomes core, this new core node is pushed onto the stack for further processing. After dealing with all border nodes, the procedure Expand checks whether the new core nodes absorb their neighbors into the cluster. In case an absorbed node has fewer than η neighbors in its ε-neighborhood it gets border state, otherwise it becomes a core node. These newly discovered core nodes are pushed onto the stack and the procedure is repeated until no further nodes are absorbed into the cluster. Due to the possibility of overlapping clusters, a new core node might have been a member of multiple clusters before. Consequently, its new core state leads to a merge of the affected clusters into a superordinate cluster. After dealing with all border nodes the existing clusters are maximally expanded with respect to the changed η and the constant ε.

Algorithm 5: Bottom-up Step
input : Graph, Clustering ζ, ε, η
output: Clustering ζ
foreach (C ∈ ζ) do
    Cluster = CreateNewCluster();
    Cluster.addSubCluster(C);
    foreach (u ∈ C | u.state = border) do
        if (|Nε(u)| ≥ η) then
            u.state = core;
            stack.push(u);
    Expand(stack, Cluster, ε, η);
ζ = DenGraph(Graph, ζ, ε, η);
return ζ;

Algorithm 6: Expand
input : Graph, Clustering, Cluster, ε, η
output: Clustering
while stack is not empty do
    u = stack.pop();
    if (u is member of multiple clusters) then
        foreach (C ∈ ζ | u ∈ C) do
            Cluster.addSubCluster(C);
            foreach (p ∈ C | p.state = border) do
                if (|Nε(p)| ≥ η) then
                    p.state = core;
                    stack.push(p);
                else
                    p.state = border;
    foreach (n ∈ Nε(u) | n.state ∈ {noise, border}) do
        Cluster.addNode(n);
        if (|Nε(n)| ≥ η) then
            n.state = core;
            stack.push(n);
        else
            n.state = border;
return Clustering;
The remaining noise nodes are then processed to check whether their state has changed. In case a former noise node becomes core, a new cluster is created. To search for new clusters we use the original DenGraph algorithm, which processes exclusively noise nodes. Please note that during the handling of these noise nodes a newly created cluster cannot merge with an existing one: if the new cluster were within ε-distance of another cluster, its nodes would already have been absorbed by this existing cluster during its expansion phase.

3.3. Data Structures
In the last decade, the number of social networks and their participating users has rapidly increased. Following this trend, DenGraph-HO was developed to efficiently process even huge graphs like Facebook or Twitter that have millions of nodes and edges. To achieve this challenging aim, the used data structures need to be optimized for the most frequent operations of the proposed clustering algorithm. In the following, we briefly describe the used data structures and their memory requirements.
The DenGraph-HO algorithm is implemented in Java. All data structures, including the graph G = (V,E), the nodes V and the edges E, are modeled as separate classes. The corresponding objects are instantiated during the graph build-up phase and each object is materialized in memory only once, even if there are several references to it.
According to Algorithms 2 and 4, the top-down approach as well as the bottom-up approach need a valid clustering as a starting point, which is generated by the original DenGraph algorithm. During this initial application of the DenGraph procedure, the algorithm traverses all noise nodes in order to update the node states based on the cardinality of their ε-neighborhood. For an efficient implementation, a list structure is needed that contains references to the noise nodes, allows rapid node traversal and also provides efficient insert and remove operations for the current item. The class LinkedList provided by Java fulfills these requirements: traversing the list takes linear time O(n), while the insert and remove operations on the current item are handled in constant time O(1). Because it is basically a modified DenGraph which performs the top-down step, Algorithm 3 benefits from the list of noise nodes as well. We implemented a similar list of border nodes for each cluster because the bottom-up step, implemented in Algorithm 5, needs to traverse all border nodes. Therefore, traversing nodes is efficiently done for all procedures of the algorithm.
In order to update the state of a node, the cardinality of the node's ε-neighborhood must be determined by counting the edges that have a weight of at most ε. Due to the fact that this operation is frequently used, we decided to implement a binning mechanism for each node which assigns, depending on the weight, the edges of each node to a specific bin. In case the parameter εl for each level l is specified in advance, the ranges of the binning mechanism can be set accordingly. This can be done, for example, while initializing the graph and takes linear time O(n), where n denotes the number of nodes in the graph. For example, choosing εl=1 = 0.7 and εl=2 = 0.25 to generate a hierarchy with two levels, the ranges for the binning mechanism should be set to the same values. By this, all edges with a weight of at most 0.25 are put in bin 1, edges with a weight between 0.25 and 0.7 are stored in bin 2, and the remaining edges are put into bin 3. According to Algorithm 2, for each non-noise node u the cardinality of N0.25(u) and N0.7(u) must be determined. Due to the binning mechanism, which counts the number of edges in each bin, this task can be performed in O(l), where l denotes the number of bins.
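The following Java sketch illustrates the described binning idea for a single node. The concrete class layout is our own simplification; only the principle (one bin per level boundary, neighborhood cardinality obtained by summing bin counters) follows the description above.

import java.util.Arrays;

// Sketch of the per-node edge binning used to answer |N_eps(u)| quickly.
// boundaries correspond to the eps values of the levels, e.g. {0.25, 0.7} creates
// bin 1 (w <= 0.25), bin 2 (0.25 < w <= 0.7) and bin 3 (w > 0.7).
class EdgeBins {
    private final double[] boundaries;
    private final int[] counts;

    EdgeBins(double[] boundaries) {
        this.boundaries = boundaries.clone();
        Arrays.sort(this.boundaries);
        this.counts = new int[boundaries.length + 1];
    }

    // Register one incident edge of this node with the given weight (distance).
    void addEdge(double weight) {
        int bin = 0;
        while (bin < boundaries.length && weight > boundaries[bin]) {
            bin++;
        }
        counts[bin]++;
    }

    // |N_eps(u)| for an eps that equals one of the boundaries: sum of all bins up to it.
    int neighborhoodSize(double eps) {
        int size = 0;
        for (int bin = 0; bin < boundaries.length && boundaries[bin] <= eps; bin++) {
            size += counts[bin];
        }
        return size;
    }
}

With the two boundaries from the example above, neighborhoodSize(0.25) and neighborhoodSize(0.7) return |N0.25(u)| and |N0.7(u)| in time proportional to the number of bins, as stated in the text.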
3.4. Run Time Complexity
The run time complexity of the original DenGraph applied to a graph G(V,E) is O(|V| + |E|), where |V| is the number of nodes and |E| the number of edges (Falkowski 2009). By applying DenGraph k times with k different (ε,η)-pairs, we would be able to produce a hierarchy with k levels similar to the ones generated by our DenGraph-HO algorithm. However, this would be quite inefficient since in each iteration every node is traversed even if the node's state definitely will not change. As described before, DenGraph-HO overcomes this problem and traverses only the relevant nodes. In fact, for each level of the hierarchy, DenGraph-HO deals only with a subgraph G′(V′,E′) where V′ ⊆ V and E′ ⊆ E. In the worst-case scenario, if the original graph G equals G′, DenGraph-HO has the same run time as applying the original DenGraph algorithm k times. However, our practical work has shown that, depending on the parameters ε and η, the subgraph G′ is up to 50% smaller than the original graph G. This leads to a significant run time reduction, which has a huge impact on practical work. However, since this improvement reduces the run time only by a constant factor, the complexity in terms of the O-notation does not change and is still O(|V| + |E|).

3.5. Memory Requirements
In the following, the memory requirement of the data structures is briefly described. It depends on the number of nodes |V|, the number of edges |E| and the number of hierarchy levels lmax. Assuming a 32-bit environment, each pointer to a single object requires 4 bytes.
Each edge contains references to the two nodes that are connected by this edge. In addition, the weight of this connection is stored. Consequently, each edge needs 2×4 + 4 = 12 bytes. The required memory to store all edges is |E|×12 bytes.
For each node, its identifier, the current state (noise, border or core), the cluster(s) to which the node belongs and a reference to the edge binning structure that holds information about the node's edges are stored. Due to memory requirements of 4 + 4 + (4×lmax) + 4 = 12 + 4×lmax bytes per node, the whole set of nodes needs (12 + 4×lmax)×|V| bytes.
The graph structure stores a list of references to all edges, a list containing the border and core nodes and a list of all noise nodes. Since we use a LinkedList there is a need for a forward pointer, a backward pointer and the reference to the actual node. In total, the requirement of these lists is 12×|E| + 12×|V| bytes.
The binning structure for the edges contains lmax bins that are implemented as a LinkedList of edge references. Since each edge is listed in two binning structures, we calculate a total memory need of 2×|E|×12 bytes for all binning structures of the graph. This already includes the forward and backward pointers of our list implementation. The additional storage of the binning ranges takes |V|×lmax×4 bytes.
Each cluster contains references to the nodes of this cluster. For each hierarchy level, we assume that each node is a member of one cluster. As a result, the clustering demands a total of |V|×lmax×12 bytes. During processing, additional memory for the stack structure is required. In the worst-case scenario, this stack holds all nodes and is then limited to |V|×4 bytes.
In total, we calculate a memory requirement of |E|×48 + |V|×(28 + 20×lmax) bytes to store all data structures of the DenGraph-HO algorithm. Consequently, a 3-level clustering of a graph that has 1200 nodes and 12600 edges requires about 710 KB.
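The total above can be checked with a few lines of Java; the constants simply mirror the per-structure estimates given in this section.

// Worked example of the memory estimate |E|*48 + |V|*(28 + 20*l_max) bytes.
public class MemoryEstimate {
    static long estimateBytes(long nodes, long edges, long levels) {
        return edges * 48 + nodes * (28 + 20 * levels);
    }

    public static void main(String[] args) {
        // The 3-level example from the text: 1200 nodes, 12600 edges.
        long bytes = estimateBytes(1200, 12600, 3);
        System.out.println(bytes + " bytes, i.e. about " + (bytes / 1000) + " KB"); // 710400 bytes
    }
}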
4. Applications
In the following, we apply the DenGraph-HO algorithm to different datasets and show its ability to support explorative visual social network analyses. The presented algorithm was implemented as a plugin for the open-source data mining framework RapidMiner (Mierswa, Wurst, Klinkenberg, Scholz & Euler 2006), which was formerly known as YALE. For graph visualization we used the prefuse toolkit developed by Jeffrey Heer (Heer, Card & Landay 2005).
The first case study demonstrates the algorithm's usefulness and analyses the email communication of the former U.S. company Enron. This data was published by the Federal Energy Regulatory Commission during the investigation of the biggest bankruptcy case in U.S. history. For our second analysis, we use information about users' music listening behavior provided by the online music platform Last.fm.

4.1. Enron Case Study
The collapse of Enron, a U.S. company honored for six consecutive years by "Fortune" as "America's Most Innovative Company", caused one of the biggest bankruptcy cases in U.S. history. To investigate the case, a data set of approximately 1.5 million e-mails sent or received by Enron employees was published by the Federal Energy Regulatory Commission.
Because of the public interest and the rareness of available real-world e-mail data, many studies have been carried out on the Enron e-mail dataset. The original dataset had some major integrity problems and was cleaned up by Melinda Gervasio at the independent, nonprofit research institute SRI. The revised dataset was used by Klimt and Yang when they presented their introduction of the Enron corpus (Klimt & Yang 2004). Bekkerman et al. use the dataset for experiments in e-mail foldering and classification (Bekkerman, McCallum & Huang 2004). Diesner et al. analysed the evolution of the structure and communication behavior of the employees on different organizational levels (Diesner, Frantz & Carley 2005). Shetty and Adibi published a subset of the original data containing approximately 250,000 e-mails from/to 151 Enron employees which were sent between 1998 and 2002 (Shetty & Adibi 2004). We previously used this dataset to analyse the evolution of communities over time (Falkowski et al. 2008, Falkowski 2009). For our experiments with the DenGraph-HO clustering algorithm, we use a subset of this dataset containing only messages sent from Enron employees to Enron employees.
Traditionally, clustering is based on the distance between the objects to be clustered. On a graph of interactions, we model the distance between two actors based on the number of their interactions. To deal with outliers we chose a value z such that about 1.5 percent of all edges have an edge weight larger than z. Afterwards, we ensure that the weight of these edges is capped at z.

Definition The distance between two actors u and v is defined as
distance(u,v) = 1 − (min{u→v, v→u, z} − 1) / (z − 1),   (1)
where u→v is the number of messages sent from u to v, v→u the number of messages sent from v to u and z is the value specified for handling outliers. The distance function ranges in [0,1] and is symmetric. If only one reciprocated interaction exists between u and v, then their distance is one.
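A direct Java transcription of this distance function could look as follows; the method name and the integer message counts are our own assumptions.

// Sketch of the interaction distance of Equation (1).
// sentUV / sentVU: number of messages sent from u to v and from v to u; z: outlier cap.
static double distance(int sentUV, int sentVU, int z) {
    int min = Math.min(Math.min(sentUV, sentVU), z);
    return 1.0 - (min - 1.0) / (z - 1.0);
}

With z = 10, for instance, a pair that exchanged only one reciprocated message gets distance 1.0, while a pair with at least z messages in both directions gets distance 0.0.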
For our experiments we used a graph consisting of 57 nodes and 81 edges that represents the internal Enron communication between 2000/12/04 and 2000/12/10. We applied the presented top-down approach and retrieved the cluster hierarchy shown in Figure 6. For each hierarchy level, the used parameters η and ε, the number of resulting clusters and their sizes are listed in Table 1. Figure 5 shows the graphical representation of the interaction graph and the computed clusters.

Figure 5: Hierarchical Clustering of the Enron Graph
Figure 6: Hierarchy of the Enron Clustering

Table 1: Hierarchy of the Enron Clustering
Level  ε      η   Labels       # Nodes
0      -      -   Graph        57
1      1      2   Cluster 1    53
                  Cluster 2     4
2      0.988  3   Cluster 3    15
                  Cluster 4    11
                  Cluster 5     5
3      0.975  4   Cluster 6     8
                  Cluster 7     5
4      0.963  5   No changes
5      0.95   6   Cluster 13    9

Due to the fact that the four members of Cluster 2 did not communicate with the other employees, there is no connection between Cluster 2 and the bigger Cluster 1. Within Cluster 1, the three subclusters Cluster 3, Cluster 4 and Cluster 5 appear. A reason might be that the respective cluster members work together in a team or in a project. At level three, Cluster 3 splits into Cluster 6 and Cluster 7. Since the parameter combination of η and ε at level four leads to no cluster changes, Cluster 13 is finally formed within Cluster 5 at level five.

4.2. Last.fm Case Study
Last.fm is a music community with over 20 million active users based in more than 200 countries. After a user signs up, a media player plugin is installed and all tracks the user listens to are submitted to a database. Last.fm records, among other things, all artists a user listens to and provides lists of the most frequently listened-to artists for each week over the lifetime of a user. In addition, Last.fm provides for each user a list of the most frequently listened-to artists and the number of times each artist was played. We obtained the listening behavior of 1,209 users over a period of 130 weeks (from March 2005 to May 2008) and use this information to build user profiles by extracting the genres of the most listened-to artists.
For each artist, Last.fm provides the tags that are used by users to describe the artist. For example, the band "ABBA" has 41 tags; the most often used tags are "pop", "disco", "swedish" and "70s". The singer "Amy Winehouse" has 34 tags; the most often used tags are "soul", "jazz", "female vocalists" and "british".
Much work has been done using the collaborative music tagging data from Last.fm. Chen et al. studied the automatic classification of music genres (Chen, Wright & Nejdl 2009). The authors demonstrate the benefits of a classification technique that uses the tags supplied by the users for accurate music genre classification. Konstas et al. created a collaborative recommendation system for Last.fm based on both the social annotations and the friendships between users (Konstas, Stathopoulos & Jose 2009). In 2009, we applied DenGraph-IO to the dataset to analyse the dynamics of music communities (Schlitter & Falkowski 2009, Falkowski 2009).

Figure 7: Hierarchy of the Last.fm Clustering
Figure 8: Hierarchical Clustering of the Last.fm Graph

Table 2: Hierarchy of the Last.fm Clustering
Level  ε      η    Cluster Labels                     # Nodes
0      -      -    Entire Graph                       1209
1      0.05   22   hip-hop                              35
                   indie, rock, alternative, punk      631
                   death metal, metal, heavy metal      85
2      0.037  24   hip-hop                              31
                   punk, rock, indie                    86
                   punk, rock                           33
                   indie, rock, alternative            316
                   death metal, metal, heavy metal      37
3      0.025  26   indie, indie rock, rock             120
Geleijnse et al. used the data provided by Last.fm to investigate whether the tagging of artists is consistent with the artist similarities found with collaborative filtering techniques (Geleijnse, Schedl & Knees 2007). Since the authors found the data both consistent and descriptive enough to create a ground truth for artist tagging and artist similarity, we follow their approach. We use the tags provided by Last.fm for each artist to describe the artist as a genre vector ~ai.

Definition An artist ai is defined as a genre vector ~ai = (w1,w2,...,wk) of the k most-used genre tags wi ∈ W.

Definition For each user u a user profile is defined as
~u = ∑_{i=1}^{m} ~ai · ci ,   (2)
where m is the number of artists the user listened to. The value ci is provided by Last.fm and describes how often the artist ai was listened to by user u.

Based on the user profiles we calculated the pairwise similarity of user music preferences and generated a graph in which the nodes represent the users and the edges encode the distance between users based on the similarity of their music listening behavior (Schlitter & Falkowski 2009).

Definition The similarity of music preferences between two users u and v is defined as
sim(u,v) = (∑_{i=1}^{m} ui·vi) / √(∑_{i=1}^{m} ui² · ∑_{i=1}^{m} vi²)   (3)
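A small Java sketch of the profile similarity of Equation (3); the dense double[] representation of the genre vectors is our own simplification.

// Sketch: cosine similarity between two user profiles given as genre-weight vectors
// of equal length. Returns a value in [0,1] for non-negative profile entries.
static double similarity(double[] u, double[] v) {
    double dot = 0.0, normU = 0.0, normV = 0.0;
    for (int i = 0; i < u.length; i++) {
        dot += u[i] * v[i];
        normU += u[i] * u[i];
        normV += v[i] * v[i];
    }
    return dot / Math.sqrt(normU * normV);
}

How this similarity is turned into the edge distance used by DenGraph-HO is not spelled out here; a common choice such as 1 − sim(u,v) is one possibility.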
For our analysis we used a Last.fm graph consisting of 1,209 nodes and 12,612 edges. We applied DenGraph-HO and obtained the cluster hierarchy shown in Figure 7. The calculated cluster labels are based on the user profiles of the cluster members.
Table 2 gives more details about the clustering of each hierarchy level. Shown are the parameters ε and η, the cluster label and the number of nodes for each cluster. The semantic plausibility of the parent-child relationship between a superordinate cluster and its subclusters can be demonstrated using the (indie, rock, alternative, punk)-cluster. This cluster is formed at level l = 1 and splits into the subclusters (punk, rock, indie), (punk, rock) and (indie, rock, alternative) at level l = 2. As l increases, the size of the observable clusters decreases and their semantic meaning becomes more specific. While traversing the hierarchy tree from the leaves to the root, the number of nodes per cluster increases. Due to the parameter settings of ε and η, the clusters grow either through the merging of clusters or through the absorption of nodes. These properties of DenGraph-HO and the efficient calculation of the cluster hierarchy are the basis for the proposed zooming functionality.
Figure 8 shows the entire graph and the clusters found by DenGraph-HO. For the sake of clarity graph edges are not drawn. Since we limited the number of hierarchy levels in our example and due to the small number of nodes, the graph can easily be understood. However, graphs with millions of nodes and a hierarchy depth greater than ten call for appropriate tools. As a result, the ability to zoom through the graph enriches our tool set and is an important step for studying the structure of huge graphs.

5. Conclusion and Outlook
In this article we proposed the hierarchical density-based graph clustering algorithm DenGraph-HO. The algorithm computes a hierarchy of clusters where each level of the hierarchy fulfills the DenGraph paradigms. The advantage of our method is based on the iterative change of clustering parameters, which allows an efficient extension of a given clustering.
We demonstrated the algorithm's practical use for explorative visual network analysis by applying the algorithm both to a communication graph based on the e-mail correspondence within the former U.S. company Enron and to a social network obtained from the online music platform Last.fm.
Applied to the Last.fm data, our algorithm forms clusters by grouping users that have similar music listening preferences. For each cluster we determined a cluster label that represents music genres and which is based on the music listening behavior of the cluster members. The resulting cluster hierarchy shows a plausible semantic relationship between superordinate clusters and subclusters. Furthermore, clusters that represent similar music genres are located closely together both in the graphical representation and in the cluster hierarchy.
Since DenGraph-HO has proven its usefulness in our practical work, our next step is to extend our algorithm towards a scalable version that is suitable for a high-performance computing environment. We will look into MapReduce and grid computing technologies in order to parallelize the algorithm execution. Thus, we will be able to apply our algorithm to huge social networks like Twitter or Facebook and to retrieve the corresponding cluster hierarchy in acceptable time. In addition, we will investigate how to integrate the incremental approach known from the DenGraph-IO algorithm. The result will be a hierarchical incremental density-based graph clustering algorithm which is able to track the evolution of cluster hierarchies over time.

6. Acknowledgment
This study was supported by the members of the grid computing project distributedDataMining (http://www.distributedDataMining.org), which provided the necessary computational power for our graph clustering experiments.

References
Achtert, E., Böhm, C., Kriegel, H.-P., Kröger, P., Müller-Gorman, I. & Zimek, A. (2006), Finding hierarchies of subspace clusters, in J. Fürnkranz, T. Scheffer & M. Spiliopoulou, eds, 'PKDD', Vol. 4213 of Lecture Notes in Computer Science, Springer, pp. 446–453.
Achtert, E., Böhm, C., Kriegel, H.-P., Kröger, P., Müller-Gorman, I. & Zimek, A. (2007), Detection and visualization of subspace cluster hierarchies, in K. Ramamohanarao, P. R. Krishna, M. K. Mohania & E. Nantajeewarawat, eds, 'DASFAA', Vol. 4443 of Lecture Notes in Computer Science, Springer, pp. 152–163.
Achtert, E., Böhm, C., Kröger, P. & Zimek, A. (2006), Mining hierarchies of correlation clusters, in 'SSDBM', IEEE Computer Society, pp. 119–128.
Ankerst, M., Breunig, M. M., Kriegel, H.-P. & Sander, J. (1999), 'OPTICS: ordering points to identify the clustering structure', SIGMOD Rec. 28(2), 49–60.
Bekkerman, R., McCallum, A. & Huang, G. (2004), 'Automatic categorization of email into folders: Benchmark experiments on Enron and SRI corpora', Center for Intelligent Information Retrieval, Technical Report IR 418.
Chen, L., Wright, P. & Nejdl, W. (2009), Improving music genre classification using collaborative tagging data, in 'Proceedings of the Second ACM International Conference on Web Search and Data Mining', WSDM '09, ACM, New York, NY, USA, pp. 84–93.
Diesner, J., Frantz, T. L. & Carley, K. M. (2005), 'Communication networks from the Enron email corpus "It's always about the people. Enron is no different"', Comput. Math. Organ. Theory 11(3), 201–228.
Ester, M., Kriegel, H.-P., Sander, J. & Xu, X. (1996), A density-based algorithm for discovering clusters in large spatial databases with noise, in 'Proc. of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96)', pp. 226–231.
Falkowski, T. (2009), Community Analysis in Dynamic Social Networks, Sierke Verlag, Göttingen.
Falkowski, T. & Barth, A. (2007), 'Density-based temporal graph clustering for subgroup detection in social networks', Presented at Conference on Applications of Social Network Analysis.
Falkowski, T., Barth, A. & Spiliopoulou, M. (2007), DenGraph: A density-based community detection algorithm, in 'Proc. of the 2007 IEEE / WIC / ACM International Conference on Web Intelligence', IEEE Computer Society, Washington, DC, USA, pp. 112–115.
Falkowski, T., Barth, A. & Spiliopoulou, M. (2008), Studying community dynamics with an incremental graph mining algorithm, in 'Proc. of the 14th Americas Conference on Information Systems (AMCIS 2008)', Toronto, Canada.
Geleijnse, G., Schedl, M. & Knees, P. (2007), The quest for ground truth in musical artist tagging in the social web era, in S. Dixon, D. Bainbridge & R. Typke, eds, 'ISMIR', Austrian Computer Society, pp. 525–530.
Heer, J., Card, S. K. & Landay, J. A. (2005), prefuse: a toolkit for interactive information visualization, in 'Proceedings of the SIGCHI conference on Human factors in computing systems', CHI '05, ACM, New York, NY, USA, pp. 421–430.
Huang, J., Sun, H., Han, J. & Feng, B. (2011), 'Density-based shrinkage for revealing hierarchical and overlapping community structure in networks', Physica A: Statistical Mechanics and its Applications 390, 2160–2171.
Klimt, B. & Yang, Y. (2004), Introducing the Enron corpus, in 'First Conference on Email and Anti-Spam (CEAS)', Mountain View, CA.
Konstas, I., Stathopoulos, V. & Jose, J. M. (2009), On social networks and collaborative recommendation, in 'Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval', SIGIR '09, ACM, New York, NY, USA, pp. 195–202.
Kriegel, H.-P., Kröger, P., Ntoutsi, I. & Zimek, A. (2011), Density based subspace clustering over dynamic data, in J. B. Cushing, J. C. French & S. Bowers, eds, 'SSDBM', Vol. 6809 of Lecture Notes in Computer Science, Springer, pp. 387–404.
MacQueen, J. B. (1967), Some methods for classification and analysis of multivariate observations, in L. M. L. Cam & J. Neyman, eds, 'Proc. of the fifth Berkeley Symposium on Mathematical Statistics and Probability', Vol. 1, University of California Press, pp. 281–297.
Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M. & Euler, T. (2006), Yale: Rapid prototyping for complex data mining tasks, in 'KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining', ACM, New York, NY, USA, pp. 935–940.
Schlitter, N. & Falkowski, T. (2009), Mining the dynamics of music preferences from a social networking site, in 'Proceedings of the 2009 International Conference on Advances in Social Network Analysis and Mining', IEEE Computer Society, Washington, DC, USA, pp. 243–248.
Schlitter, N., Falkowski, T. & Lässig, J. (2011), DenGraph-HO: Density-based hierarchical community detection for explorative visual network analysis, in 'Research and Development in Intelligent Systems XXVIII Incorporating Applications and Innovations in Intelligent Systems XIX', Springer, London, pp. 283–296.
Shetty, J. & Adibi, J. (2004), 'The Enron email dataset database schema and brief statistical report'. URL: http://www.cs.cmu.edu/~enron/
Xu, X., Yuruk, N., Feng, Z. & Schweiger, T. A. J. (2007), SCAN: a structural clustering algorithm for networks, in P. Berkhin, R. Caruana & X. Wu, eds, 'KDD', ACM, pp. 824–833.
Yuruk, N., Mete, M., Xu, X. & Schweiger, T. A. J. (2007), A divisive hierarchical structural clustering algorithm for networks, in 'ICDM Workshops', IEEE Computer Society, pp. 441–448.
Yuruk, N., Mete, M., Xu, X. & Schweiger, T. A. J. (2009), AHSCAN: Agglomerative hierarchical structural clustering algorithm for networks, in N. Memon & R. Alhajj, eds, 'ASONAM', IEEE Computer Society, pp. 72–77.

7. The authors

7.1. Nico Schlitter
Nico Schlitter is currently working as head of the bwLSDF project at the Steinbuch Centre for Computing, Karlsruhe Institute of Technology. Much of his current work is related to distributed storage solutions for the state of Baden-Wuerttemberg. He is also head of the multidisciplinary grid computing project www.distributeddatamining.org, where he works in the fields of social network analysis and time series analysis. After receiving his degree in computer science from Chemnitz University of Technology in 2006, he worked at the University of Magdeburg in the field of RFID-based data analysis and at the University of Applied Sciences Zittau/Görlitz in the area of simulation and optimization.

7.2. Tanja Falkowski
Tanja Falkowski is currently working as head of international relations at the University of Göttingen. In 2009, she received her Ph.D. in computer science for her research on the analysis of community dynamics in social networks. She developed algorithms and methods to efficiently analyse the temporal dynamics of group structures in social networks. Tanja Falkowski studied information systems at Technical University Braunschweig and wrote her diploma thesis at the Haas School of Business, University of California at Berkeley.

7.3. Jörg Lässig
Jörg Lässig is a Full Professor at the Faculty of Electrical Engineering and Computer Science at the University of Applied Sciences Zittau/Görlitz. He holds degrees in computer science and computational physics and received a Ph.D. in computer science for his research on efficient algorithms and models for the generation and control of competence networks at Chemnitz University of Technology. He has been working in various research and industrial projects and is currently focusing on sustainable IT technologies. His research interests include efficient algorithms, bio-inspired methods, and green information systems.