A genetic search of patterns of behaviour in OSS communities


Expert Systems with Applications 39 (2012) 13182–13192

Expert Systems with Applications

Journal homepage: www.elsevier.com/locate/eswa

M.R. Martínez-Torres ⇑
University of Seville, Facultad de Turismo y Finanzas, Avda. San Francisco Javier, s/n 41018 Sevilla, Spain

Article info
Keywords:
Open source software
Virtual communities
Social network analysis
Genetic algorithm
Factor analysis
0957-4174/$ - see front matter © 2012 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.eswa.2012.05.083

⇑ Tel.: +34 954 55 43 10; fax: +34 954 55 69 89.
E-mail address: rmtorres@us.es
Abstract

This paper proposes identifying patterns of behaviour in open source software (OSS)
communities by applying factor analysis to their social network analysis (SNA) features.
An OSS community can be modelled as a social network in which nodes represent the
community members and arcs represent the social interactions among them, and factor
analysis can provide the factors that explain the latent patterns of behaviour. Because of
the complexity of the problem and the large number of SNA features that can be extracted,
this paper proposes a genetic search for an optimum subset of indicators leading to a
group of latent patterns of behaviour that maximizes both the explained data variance and
the interpretability of the factors. The results illustrate the feasibility of the proposed
framework for extracting relevant information from a large set of data.

© 2012 Elsevier Ltd. All rights reserved.
1. Introduction

OSS projects differ greatly from commercial software development models in several
respects. For instance, commercial software companies prevent outside developers and
customers from accessing the source code of their products, while OSS allows source code
to be freely modified and redistributed under "open source" licenses (Feller & Fitzgerald,
2002). Moreover, OSS projects are typically developed in a distributed and decentralized
way, in contrast to proprietary software, which relies on closed and formal structures.
Indeed, one of the most distinctive characteristics of OSS projects is that they are
written, developed, and debugged largely by volunteers around the world, who in most cases
are connected and collaborate solely through the Internet. Therefore, the community behind
the development of a project plays an essential role in its success (Deek & McHugh, 2008).

Several authors have described OSS development teams as hav-
ing a hierarchical or onion-like structure (Crowston & Howison,
2005), with a central core of highly active individuals, surrounded
by other layers of progressively less active individuals. One exam-
ple of this is presented in the study by Ye, Nakakoji, et al. (2005)
where the central core is composed of the project leaders and core
members, with five outer layers containing active developers,
peripheral developers, bug reporters, passive users, and stakehold-
ers, respectively. It has been demonstrated that much of the OSS
development is realized by a small percentage of individuals de-
spite the fact that there are tens of thousands of available develop-
ers. Such concentration is called ‘‘participation inequality’’ (Kuk,
2006), and it can be explained by the different user profiles of open
source communities. Participation inequality allows the categori-
zation of OSS community members in three groups (Mockus,
Fielding, & Herbsleb, 2002; Xu, Gao, Christley, & Madey, 2005).
Core members are responsible for guiding and coordinating the
development of an OSS project. They are usually involved with
the project over a long period of time and have made significant
contributions to the development and evolution of the system.
Moderators and leaders are included in this group. Active developers
are those community members who regularly make contributions
to the project. Finally, peripheral developers occasionally
contribute new features to the existing system. Their contribution
is irregular, and their period of involvement is short and sporadic.
Free riders (people who are just seeking answers without
making any contributions) are also included in this group.

The social structure of OSS teams directly influences participation and the
decision-making process, and thus the overall performance of the project. Therefore, an
important research question is how to extract the different patterns of behaviour in OSS
communities. The patterns that lead to successful projects are of great interest to both
autonomous and sponsored communities, which share the common aim of attracting and
retaining participants (West & O’Mahony, 2008). However, the structure of a community can
only be derived from the participation activity of its members. For this purpose, OSS
communities have frequently been modelled as social networks in which the nodes are the
community members and the arcs represent the flow of interactions among users (Toral,
Martínez-Torres, & Barrero, 2009a). These networks are then analyzed using Social Network
Analysis (SNA) techniques to obtain a set of SNA features. For instance, previous studies
have considered the size and out-degree of nodes (Valverde, Theraulaz, Gautrais,
Fourcassie, & Sole, 2006), closeness centrality (Panchal, 2009) and betweenness
centrality (Hossain, Wu, & Chung, 2006; Toral, Martínez-Torres, Barrero, & Cortés, 2009b),
the clustering coefficient (Kwon, Oh, & Jeon, 2007; Singh, 2010), structural holes (Okoli &
Oh, 2007) or the brokerage role of nodes (Sowe, Stamelos, & Angelis, 2006; Barcellini,
Détienne, & Burkhardt, 2009; Toral, Martínez Torres, & Barrero, 2010), among other SNA
features. Moreover, each of these measurements can be computed for the whole network or
for specific sub-networks, such as the one formed by active developers or by the core
group of the community. Because of the large number of SNA indicators that can be
obtained, previous studies have focused on a small number of social network features to
characterize participation and obtain OSS patterns of behaviour. In contrast, this study
analyzes participation using the whole set of SNA features that can be obtained from OSS
communities. For this purpose, a genetic search for the optimum subset of indicators
capable of providing the main patterns of behaviour in OSS communities is proposed.
Therefore, the main contribution of this paper is the possibility of dealing with all the
SNA features of the social networks modelling OSS communities through an evolutionary
computation technique, namely Genetic Algorithms.

The remainder of the paper is structured as follows. First, previ-
ous studies related to social network structures in OSS communities
are reviewed in Section 2. In Section 3 the problem is formulated
and the proposed approach described. Section 4 describes the
Genetic Algorithm implementation. Obtained results and discus-
sion are included in Section 5. Finally, conclusions are detailed in
Section 6.
2. Related work

Social network theory is based on the idea that social interac-
tion patterns reflect the behaviour of individuals (Freeman,
2004). Therefore, the analysis of the social interactions can provide
information about how individuals behave or how groups are orga-
nized. This is why social network analysis focuses on the relation-
ships between people, instead of on characteristics of people. By
mapping these relationships, network analysis helps to uncover
the emergent and informal communication patterns present in
an organization, which in turn can be used to explain several orga-
nizational phenomena.

A social network can be modelled as a graph with nodes repre-
senting people or groups, and links representing relationships or
information flows between them. OSS communities constitute
clear examples of dynamic social networks, as they change over
time. Social networks begin when developers join a project, work
with others, and form co-working relationships (Xu, Christley, &
Madey, 2006). These relationships are important because the sense
of belonging to a group is encouraged through social network
bonding. The aim of OSS communities is attracting and retaining
people, as this will benefit the underlying software. As new mem-
bers join the community, different user profiles emerge. Not all the
users are interested in participating the same way. Some of them, a
small percentage, intensely contribute to the development of the
project, while the rest of them make regular, occasional or even
no contribution at all. This variety of user profiles leads to a
core/periphery structure, where the core group of developers is located
at the center of the community and the rest lie farther away
from the center depending on their contributions. The core group
members are strongly connected to each other while the periphery
contains members who are usually weakly connected to each other
as well as to the core members (Long & Siau, 2007).

Mathematically, a social network can be represented as a graph
$G = (V, E)$, where $V$ denotes a finite set of vertices and $E$ denotes a
finite set of edges such that $E \subseteq V \times V$. Some network analysis
methods are easier to understand when graphs are conceptualized
as matrices (Nooy, Mrvar, & Batagelj, 2005), as shown in Eq. (1).

\[
M = (m_{i,j})_{n \times n}, \quad \text{where } n = |V|, \quad
m_{i,j} =
\begin{cases}
1 & \text{if } (v_i, v_j) \in E \\
0 & \text{otherwise}
\end{cases}
\tag{1}
\]

In the case of a valued graph, a real-valued weight function $w(e)$ is
defined on the set of edges, i.e., $w : E \to \mathbb{R}$, and the matrix is then
defined as given by Eq. (2).

\[
m_{i,j} =
\begin{cases}
w(e) & \text{if } e = (v_i, v_j) \in E \\
0 & \text{otherwise}
\end{cases}
\tag{2}
\]

In the context of OSS communities, V is given by all the commu-
nity members and E is given by the interactions among them. The
resulting network is a directed network in the sense that edges are
actually arcs from one community member to another one, and the
direction of the arcs shows the flow of information between them.
It is also a valued network, as multiple interactions are possible
between the same pair of community members.
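
As an illustration of Eqs. (1) and (2), the following minimal Python sketch builds the valued adjacency matrix of a small directed network. The member names, the list of (sender, receiver) pairs and the choice of taking w(e) as the number of interactions on each arc are assumptions made for the example; the paper does not specify how arc values are computed.

```python
from collections import Counter

def valued_adjacency(members, interactions):
    """Build the n x n matrix of Eq. (2) from (sender, receiver) pairs."""
    index = {member: i for i, member in enumerate(members)}
    # Assumption: w(e) is the number of times the arc (sender, receiver) occurs.
    weights = Counter(interactions)
    n = len(members)
    matrix = [[0] * n for _ in range(n)]
    for (src, dst), w in weights.items():
        matrix[index[src]][index[dst]] = w
    return matrix

# Hypothetical community of three members and four interactions.
members = ["ana", "bo", "cy"]
interactions = [("ana", "bo"), ("ana", "bo"), ("bo", "cy"), ("cy", "ana")]
M = valued_adjacency(members, interactions)
```

Replacing every positive entry by 1 gives the binary matrix of Eq. (1).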

Networks can be partitioned using some discrete characteristics
of vertices. For instance, several classes of vertices can be obtained
using the function w(e), that is, the strength of arcs. In the case of
OSS projects, these kinds of partitions should highlight the core/
periphery (C/P) structure of the community. A C/P structure divides
vertices into three distinct subgroups: vertices in the core, densely
connected with each other, and vertices on the periphery, which
in turn are divided into active and peripheral vertices depending
on their level of interaction.

The following list summarizes the characteristics that can be
computed for the whole network or each of the mentioned sub-
networks as well as the main previous studies focused on them.

Size and connectivity: the size of the community is the number
of its members, and the number of arcs represents
the connectivity among these members. Both indicators
have frequently been used as measures of the successful development
of an OSS project. Success in a virtual community can be
manifested through the level of participation, which can be understood
as the number of participants (Preece, 2001) or as the level of
community activity and the quantity of community work output
(Hinds & Lee, 2008). Several topological indices based on connectivity
can be computed, such as the Zagreb group index, the Randić
connectivity index or the Platt index, all of them defined in terms
of the connections of nodes (Devillers & Balaban, 1999).

Density: it is defined as the number of lines in a simple network,
expressed as a proportion of the maximum possible number of
lines. The main problems with this definition are that it does not take
into account line values higher than 1 and that it depends on the
network size. A different measure of density is based on the
degree of a node, which is the number of lines incident with it
(Toral et al., 2009a). A higher degree of nodes yields a denser network,
because nodes maintain more ties. The advantage of the average
degree is that it is a measure of density that does not depend on
network size. As OSS communities are directed networks, several
statistical measures of the out-degree distribution can be considered.
Finally, density can alternatively be measured from an egocentric point
of view: the egocentric density of a node is the density of ties among its
neighbours (Nooy et al., 2005). In previous works, density has been
used to study the coordination performance of OSS projects. Several
studies conclude that density has a negative impact on
the quality of the project and its coordination (Hossain & Zhu,
2009; Feczak & Hossain, 2011).
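
The two density measures described above can be sketched in a few lines of Python. This is only an illustration on a hypothetical three-node directed network stored as a matrix; the function names are not taken from the paper, and arc values are deliberately ignored, as in the first definition.

```python
def arcs_present(matrix):
    """Count ordered pairs (i, j), i != j, joined by an arc (arc value ignored)."""
    n = len(matrix)
    return sum(1 for i in range(n) for j in range(n) if i != j and matrix[i][j] > 0)

def density(matrix):
    """Arcs present divided by the maximum number of arcs in a simple directed network."""
    n = len(matrix)
    return arcs_present(matrix) / (n * (n - 1))

def out_degrees(matrix):
    """Number of outgoing arcs of each node."""
    return [sum(1 for j, w in enumerate(row) if j != i and w > 0)
            for i, row in enumerate(matrix)]

# Hypothetical valued network with three nodes and three arcs.
M = [[0, 2, 0],
     [0, 0, 1],
     [1, 0, 0]]
d = density(M)
degs = out_degrees(M)
average_out_degree = sum(degs) / len(degs)
```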

Components: A strong component is a maximal strongly con-
nected subnetwork. A network is said to be strongly connected if
each pair of vertices is connected by a path, taking into account
the direction of arcs (Nooy et al., 2005). In the context of this study,
components allow the identification of connected substructures in
the OSS community.

K-cores: a k-core is a maximal sub-network in which each node has degree
at least k within that sub-network. The core with the highest k is the
central core of the network, identifying the set of nodes on which the
network rests (Toral et al., 2010).
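
A k-core can be found by repeatedly discarding nodes whose degree inside the remaining sub-network falls below k. The sketch below assumes the usual "degree at least k" reading and an undirected view of the network stored as a dictionary of neighbour sets; both the representation and the small example graph are assumptions for illustration.

```python
def k_core(adj, k):
    """Return the node set of the k-core of an undirected graph {node: set(neighbours)}."""
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            # Degree is counted only among nodes that are still in the sub-network.
            if len(adj[v] & alive) < k:
                alive.discard(v)
                changed = True
    return alive

# Hypothetical graph: a triangle a-b-c plus a pendant node d attached to a.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
two_core = k_core(graph, 2)
three_core = k_core(graph, 3)
```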

Distance: it is defined as the number of steps in the shortest
path that connects two vertices. Distance between members in
the OSS community can affect how ideas and discussions can
spread over the community. Short distances mean information
only has to travel a few links to reach anybody in the network
(Xu et al., 2006).

Closeness centralization: it is an index of centrality based on the
concept of distance. The closeness centrality of a node is calculated
considering the total distance between one node and all other
nodes, where larger distances yield lower closeness centrality
scores. The closeness centralization is an index defined for the
whole network, and it is calculated as the variation in the closeness
centrality of vertices divided by the maximum variation in close-
ness centrality scores possible in a network of the same size (Toral
et al., 2009b). It has been found that closeness centralization is
positively correlated with coordination mechanisms in OSS projects
(Pereira & Soares, 2007; Feczak & Hossain, 2011).

Betweenness centrality: it is a measure of centrality that rests on
the idea that a person is more central if he or she is more important
as an intermediary in the communication network (Nooy et al.,
2005). The centrality of a node depends on the extent to which this
node is needed as a link to facilitate the connection of nodes within
the network. If a geodesic is defined as the shortest path between
two nodes, the betweenness centrality of a vertex is the proportion
of all geodesics between pairs of other vertices that include this
vertex, and betweenness centralization of the network is the vari-
ation in the betweenness centrality of vertices divided by the max-
imum variation in betweenness centrality scores possible in a
network of the same size. It has been used to study the hierarchy
and centralization in free and open source software team commu-
nications (Crowston & Howison, 2006).

Brokerage roles: A broker is the middle node in a directed triad (a
set of three vertices and the lines among them). Different types of
brokerage roles can be distinguished depending on whether the broker
mediates between members of the same group or of different groups. In the
context of OSS communities, these groups are given by active and
peripheral members. Therefore, two possibilities of mediation can be
considered, as shown in Fig. 1.

The role of knowledge brokers has been highlighted in mailing
lists as community facilitators, helping answer those questions
knowledge seekers posted (Sowe et al., 2006). This role has also
been assigned to the core team of the OSS project, acting as inter-
mediary between expert software developers and peripheral users
and helping OSS projects to engage in a discourse and co-learning
experience with their user communities (Toral et al., 2010).

Fig. 1. Brokerage roles. (Diagram not reproduced: two triads, each labelled with a broker,
an active member and a peripheral member.)

Clustering coefficient: It measures whether first degree neighbours
of a particular vertex interact with each other. Basically,
clustering coefficient is a measure of local cohesiveness through
the neighbour interactions of a vertex (Durugbo, 2012). The clustering
coefficient of a vertex is defined as the ratio of the number
of links to the total possible number of links among its neighbours,
and the clustering coefficient of a social network is the average of
all the clustering coefficients of the vertices. It has been used to
characterize the small-world phenomena in networks (Xu et al.,
2006). Besides, highly clustered networks provide better information
propagation (Gao & Madey, 2007).
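
The definition above translates directly into code. The following sketch treats the network as undirected and stores it as a dictionary of neighbour sets; the representation, the convention of returning 0 for vertices with fewer than two neighbours, and the example graph are assumptions made for illustration.

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """Links among the neighbours of v divided by the number of possible links."""
    neighbours = adj[v]
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention assumed here: no pair of neighbours exists
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return links / (k * (k - 1) / 2)

def network_clustering(adj):
    """Average of the clustering coefficients of all vertices."""
    return sum(clustering_coefficient(adj, v) for v in adj) / len(adj)

# Hypothetical graph: a triangle a-b-c plus a pendant node d attached to a.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
cc_a = clustering_coefficient(graph, "a")
cc_network = network_clustering(graph)
```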

Proximity prestige: It is defined in terms of the output domain of
a node, which is the number of nodes for which there is a path to
that node. Following the definition of Nooy et al. (2005), the proximity
prestige of a node is the proportion of all nodes in the network
(excluding itself) that are in its output domain divided by the mean
distance from all nodes in its output domain. Therefore, a proximity
prestige of zero means that the node is isolated.

3. Formulation of the problem and proposed framework

The problem consists of finding a set of SNA indicators able to
explain the different structural patterns of OSS communities. Fac-
tor Analysis is a multivariate statistical technique usually em-
ployed for the identification of latent dimensions or factors on a
dataset. These factors are not directly observable and segment
the dataset into relatively homogeneous segments (Rencher,
2002). It is assumed each variable is dependent on a linear combi-
nation of the common factors, and the coefficients are known as
loadings (Toral & Martínez Torres, 2009c). Mathematically, the factor
analysis model expresses each variable as a linear combination
of underlying common factors $f_1, f_2, \ldots, f_m$, with an accompanying
error term to account for that part of the variable that is unique
(not in common with the other variables). For $y_1, y_2, \ldots, y_p$ in
any observation vector $\mathbf{y}$, the model is as follows:

\[
\begin{aligned}
y_1 - \mu_1 &= \lambda_{11} f_1 + \lambda_{12} f_2 + \cdots + \lambda_{1m} f_m + \varepsilon_1 \\
y_2 - \mu_2 &= \lambda_{21} f_1 + \lambda_{22} f_2 + \cdots + \lambda_{2m} f_m + \varepsilon_2 \\
&\;\;\vdots \\
y_p - \mu_p &= \lambda_{p1} f_1 + \lambda_{p2} f_2 + \cdots + \lambda_{pm} f_m + \varepsilon_p
\end{aligned}
\tag{3}
\]

Model (3) can be written in matrix notation as in Eq. (4), where $\Lambda$ is
the factor loadings matrix.

\[
\mathbf{y} - \boldsymbol{\mu} = \Lambda \mathbf{f} + \boldsymbol{\varepsilon} \tag{4}
\]

The coefficients $\lambda_{ij}$ are called loadings and serve as weights,
showing how each $y_i$ individually depends on the underlying factors
(Lee & Lee, 2011). With appropriate assumptions, $\lambda_{ij}$ indicates
the importance of the $j$th factor $f_j$ to the $i$th variable $y_i$ and
can be used in the interpretation of $f_j$. It is expected that the loadings will
partition the variables into groups corresponding to factors.

Ideally, the number of factors m should be substantially smaller
than the problem dimension p; otherwise we have not achieved a
parsimonious description of the variables as functions of a few
underlying factors. In exploratory factor analysis, the
lack of theoretical background means that the number of factors
to be extracted is unknown a priori. Therefore, factors must be
selected according to the homogeneity of their indicators.

The main problem with using factor analysis on a
large set of indicators is that the final result is conditioned by
the number of variables included in the analysis. The described
model tries to fit the data to a set of factors, and the inclusion of
inappropriate variables may distort the latent factors obtained.
However, it is not easy to decide which variables are appropriate,
especially in situations where the theoretical background
cannot guide this process. This paper proposes a
computational framework in which the selection of indicators is
performed using Genetic Algorithms (GA). Each element of the population
considered by the GA (a chromosome, in typical GA notation)
represents a subset of all the possible SNA indicators. Fig. 2
shows a brief scheme of the proposed framework.


Fig. 2. Proposed framework.

Fig. 3. Outline of the genetic algorithm implementation.

The developed algorithm consists of two loops. The inner loop
consists of the evaluation of the fitness function for each element
of the population using factor analysis. The outer loop generates
a new generation using genetic operators like reproduction, cross-
over and mutation, using a selection based on the individual fitness
of each element of the population. The framework stops working
when a selected stopping criterion is reached.
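
The two-loop structure described above can be sketched as a generic Python skeleton. The function and parameter names (`run_ga`, `evaluate`, `next_generation`, `should_stop`) are hypothetical, and the toy fitness (number of 1 bits) and toy transition rule merely stand in for the factor analysis evaluation and the genetic operators detailed in Section 4; this is not the author's implementation.

```python
import random

def run_ga(population, evaluate, next_generation, should_stop, max_generations=50):
    """Outer loop over generations; inner loop evaluates every chromosome."""
    history = []  # average fitness of each evaluated generation
    for _ in range(max_generations):
        scores = [evaluate(c) for c in population]        # inner loop
        history.append(sum(scores) / len(scores))
        if should_stop(history):
            break
        population = next_generation(population, scores)  # outer loop step
    return population, history

# Toy stand-ins (assumptions): fitness is the number of 1 bits, and the
# transition keeps the best half and adds one-bit mutants of that half.
def count_ones(chromosome):
    return sum(chromosome)

def toy_next_generation(population, scores):
    ranked = [c for _, c in sorted(zip(scores, population),
                                   key=lambda pair: pair[0], reverse=True)]
    elite = ranked[: len(ranked) // 2]
    mutants = []
    for parent in elite:
        child = parent[:]
        pos = random.randrange(len(child))
        child[pos] = 1 - child[pos]
        mutants.append(child)
    return elite + mutants

def plateau(history, tol=1e-9):
    return len(history) >= 2 and abs(history[-1] - history[-2]) <= tol

random.seed(0)
start = [[random.randint(0, 1) for _ in range(8)] for _ in range(10)]
best_start = max(count_ones(c) for c in start)
final, history = run_ga(start, count_ones, toy_next_generation, plateau)
best_final = max(count_ones(c) for c in final)
```

Because the best chromosomes are always carried over, the best fitness in the final population is never lower than in the initial one.
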
4. Genetic algorithm implementation

Genetic Algorithms are a family of computational models inspired by evolution (Holland,
1975; Goldberg, 1989). These algorithms encode a potential solution to a specific problem
on a simple chromosome-like data structure and apply genetic operators to these structures
in order to preserve critical information (Martínez-Torres & Toral-Marín, 2010). An
initial population $P_i$ composed of $N_t$ chromosomes is considered. Goldberg (1989)
studied the optimum number of chromosomes for a population according to the chromosome’s
length. His main conclusion was that the optimum population size increases with the
chromosome’s length. This initial population is generated randomly in order to preserve
diversity in the population, and the fitness function is calculated to evaluate the
goodness of each chromosome. The mechanism for generating the subsequent generations is
based on the selection scheme of the $(\mu + \lambda)$ evolution strategy (Michalewicz,
1996; Reina, Toral, Johnson, & Barrero, 2012). The $\mu$ best chromosomes are included
directly in the next generation. The crossover and mutation operations are responsible for
generating the $\lambda$ remaining chromosomes of the new population. Crossover uses two
members of a population $P_j$ to generate two new members of the next population $P_{j+1}$
by crossing their genetic information, so the new chromosomes contain genetic information
from their predecessors. Mutation changes the genetic information of a chromosome in
$P_j$ to generate a new chromosome of $P_{j+1}$. Fig. 3 illustrates the implemented genetic
algorithm.

4.1. Chromosome encoding

A chromosome $C_i$ represents the subset of indicators that will be considered to perform
factor analysis. The total number of indicators extracted for the OSS communities
according to the described SNA features is 60 (see appendix). Therefore, each chromosome
is a binary string of length 60, where a value of 1 means that the corresponding variable
is part of the considered subset of indicators and a value of 0 means that it is excluded.
Notice that the space of possible solutions contains $2^{60} \approx 1.1529 \times 10^{18}$
possibilities, so $2^{60}$ different factor analyses would be needed to explore it
completely. In this kind of problem, a GA can perform a guided search for the optimum
solution at a lower computational cost than exploring all the possibilities one by one.
Representing chromosomes as binary strings has the advantage that the encoding is complete
and valid: complete means that the whole space of possible solutions can be represented,
and valid means that all of them can be computed. The composition of a chromosome is shown
in Fig. 4.
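
A minimal Python sketch of this encoding follows. The indicator labels I1 to I60 follow Fig. 4, while the function names and the random seed are assumptions. The last lines check the size of the search space and the exhaustive-search time implied by the 12.9 ms per factor analysis reported in Section 5.

```python
import random

N_INDICATORS = 60  # chromosome length stated in the text

def random_chromosome(rng, length=N_INDICATORS):
    """One bit per indicator: 1 = included in the subset, 0 = excluded."""
    return [rng.randint(0, 1) for _ in range(length)]

def selected_indicators(chromosome, names):
    """Decode a chromosome into the list of indicator names it selects."""
    return [name for bit, name in zip(chromosome, names) if bit == 1]

names = [f"I{i}" for i in range(1, N_INDICATORS + 1)]
rng = random.Random(42)  # arbitrary seed, for repeatability only
chromosome = random_chromosome(rng)
subset = selected_indicators(chromosome, names)

# Size of the search space and cost of exhaustive exploration.
search_space = 2 ** N_INDICATORS
seconds_per_analysis = 12.9e-3  # figure reported in the results section
years = search_space * seconds_per_analysis / (365.25 * 24 * 3600)
```

With these figures, `years` comes out slightly above 470 million, in line with the estimate given in the results.
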
4.2. Evaluation function

The fitness function quantifies the suitability of each chromo-
some as a solution. Genetic operators make selections based on
individual fitness. That means that chromosomes with high fitness
value have more chance of being selected, passing their genetic
material (via reproduction, crossover or mutation) to the next gen-
eration. As a result, the fitness function provides the pressure for
evolution towards a new generation with chromosomes of higher
fitness than the previous ones. In this case, the fitness function
should measure how well factor analysis can identify latent factors.
However, its capacity to perform such identification depends on
the following parameters:

• Explained variance: It refers to the percentage of the total sample
  variance explained by the considered factors.
• Correlations between variables: It is the average of the sum of
  the squared correlation coefficients between indicators. Considered
  indicators must be correlated as the factor analysis is based
  on the interrelationships among variables.
• Interpretability of factors: A factor is well defined if it is
  explained by at least three variables (Rencher, 2002). That is
  to say, at least three different factor loadings should be maximized
  for each considered factor.

As a result, the fitness function must be defined as a multi-objective
fitness function that combines the aforementioned parameters:

\[
F = C_1\,\mathrm{Var} + C_2\,\frac{1}{n}\sum_{i=1}^{k} r_i^2 + C_3\,\mathrm{Interp} \tag{5}
\]

Explained variance and correlations among variables exert
opposite effects on the evolution of the GA. Explained variance drives
the GA towards a minimum number of indicators, as it is
easier to explain the variance of the dataset when fewer
variables are considered. In contrast, correlations
among variables drive the GA towards a higher number of
variables, as this parameter is maximized by including as many
variables as possible. However, the third parameter, interpretability
of factors, is the most important one, because it
guarantees that the factors are well defined.

Bit:        1   0   1   0   1   0   ...  1    0
Indicator:  I1  I2  I3  I4  I5  I6  ...  I59  I60

Fig. 4. Chromosome’s composition.

The coefficients C1, C2 and C3 in Eq. (5) adjust the relative
importance of the three parts of the fitness function. Each of
them lies in the range [0, 1], subject to the restriction
C1 + C2 + C3 = 1.
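
The weighted combination of Eq. (5) can be sketched as follows. The sketch assumes that the three components have already been obtained from a factor analysis and are expressed on a 0 to 1 scale, and that n is the number of correlation terms being averaged; neither point is stated explicitly in the text. The default weights are the ones selected in Section 5, and the squared correlations in the example are invented.

```python
def fitness(explained_variance, squared_correlations, interpretability,
            n, weights=(0.1, 0.1, 0.8)):
    """Weighted sum of the three components of Eq. (5)."""
    c1, c2, c3 = weights
    if min(weights) < 0 or abs(c1 + c2 + c3 - 1.0) > 1e-9:
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    correlation_term = sum(squared_correlations) / n
    return c1 * explained_variance + c2 * correlation_term + c3 * interpretability

# Hypothetical chromosome: 70.93% explained variance, three made-up squared
# correlations, and fully interpretable factors.
F = fitness(0.7093, [0.25, 0.36, 0.49], 1.0, n=3)
```

With C3 = 0.8, a chromosome whose factors are not all interpretable loses most of its fitness regardless of the other two terms.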

4.3. Stopping criteria

The population’s average fitness has been chosen as
the stopping criterion of the genetic algorithm. If $P_{av,j}$ denotes the
average fitness of population $j$, the stopping criterion $S_c$ can be
formulated as:

\[
S_c \rightarrow P_{av,j+1} \approx P_{av,j} \tag{6}
\]
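
In practice, a criterion of this kind needs a tolerance to decide when two consecutive averages are close enough. The sketch below is one possible reading of Eq. (6); the tolerance value and the function name are assumptions, since the paper does not report them.

```python
def should_stop(avg_fitness_history, tol=1e-4):
    """Stop when the average fitness of two consecutive populations is nearly equal."""
    if len(avg_fitness_history) < 2:
        return False
    return abs(avg_fitness_history[-1] - avg_fitness_history[-2]) <= tol

still_improving = should_stop([0.50, 0.60])
converged = should_stop([0.50, 0.60, 0.60005])
```
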
4.4. Procedure of transition

The procedure used to generate a new population $P_{j+1}$ from the
previous population $P_j$ is as follows:

• The best 20% of the chromosomes are copied from $P_j$ to $P_{j+1}$. This
  ensures that the best individuals of each population are
  included in the next generation. As a consequence, the likelihood
  of using a good chromosome for reproduction operations
  increases.
• The remaining 80% of the new chromosomes are generated by
  crossover (75%) and mutation (5%) operations. This aims to favour
  the diversity of the chromosomes. $P_{C_i}$ denotes the probability
  that a chromosome $C_i$ takes part in a crossover operation, and
  it is calculated as:

\[
P_{C_i} = \frac{ff(C_i)}{\sum_{j=1}^{n} ff(C_j)} \tag{7}
\]

The term ff(Ci) stands for the evaluation of the fitness function
for the chromosome Ci. Consequently, the best chromosomes are
more likely to be selected. The crossover operation is illustrated
in Fig. 5.
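
Eq. (7) corresponds to fitness-proportional (roulette-wheel) selection. A minimal sketch using Python's standard library follows; the helper names and the example fitness values are hypothetical, and drawing parents with replacement is an assumption.

```python
import random

def selection_probabilities(fitnesses):
    """Eq. (7): each chromosome's share of the total fitness."""
    total = sum(fitnesses)
    return [f / total for f in fitnesses]

def select_parents(population, fitnesses, rng, k=2):
    """Draw k parents with probability proportional to fitness (with replacement)."""
    return rng.choices(population, weights=fitnesses, k=k)

# Hypothetical population of three chromosomes and their fitness values.
population = [[1, 0, 1], [0, 1, 1], [1, 1, 1]]
fitnesses = [1.0, 3.0, 4.0]
probabilities = selection_probabilities(fitnesses)
parents = select_parents(population, fitnesses, random.Random(1))
```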

A one-point crossover operation has been implemented. The
crossover point is denoted by $p_k$, where $0 \le k \le l$ and $l$ is the size
of the chromosome. The value of $k$ is randomly chosen for each
crossover operation. This point divides each chromosome into
two parts, $RG_j$ and $LG_j$. The two new chromosomes are then
obtained by swapping $LG_{j,1}$ with $LG_{j,2}$.
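
One-point crossover can be sketched as follows. The function names are hypothetical, and the left part is taken to be the first k bits of each chromosome, which is an assumption about how Fig. 5 lays out the two parts.

```python
import random

def one_point_crossover(parent_a, parent_b, k):
    """Swap the left parts (first k bits) of two equally long chromosomes."""
    child_a = parent_b[:k] + parent_a[k:]
    child_b = parent_a[:k] + parent_b[k:]
    return child_a, child_b

def random_crossover(parent_a, parent_b, rng):
    """Choose the crossover point uniformly with 0 <= k <= l, as in the text."""
    k = rng.randint(0, len(parent_a))
    return one_point_crossover(parent_a, parent_b, k)

a = [1, 1, 1, 1, 1, 1]
b = [0, 0, 0, 0, 0, 0]
child_a, child_b = one_point_crossover(a, b, 2)
random_children = random_crossover(a, b, random.Random(7))
```

Note that k = 0 and k = l return copies of the parents, so those two points produce no new genetic material.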

Similarly, $P_m$ is the probability that a chromosome takes part in
a mutation operation. The purpose of mutation is to make small
changes in the chromosomes. These changes consist of modifying
one bit of the chromosome. Following De Jong (1975), $P_m$ has been
calculated as $l^{-1}$. The mutation operation is illustrated in Fig. 6.
The position of the mutated bit is denoted by $p_m$, where
$0 \le m \le l$. The value of $p_m$ is randomly chosen for each mutation
operation.
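
A single-bit mutation can be sketched as below. The function name is hypothetical, and positions are drawn from 0 to l - 1 because those are the valid indices of a Python list of length l; this is an implementation assumption rather than something stated in the text.

```python
import random

def mutate(chromosome, rng):
    """Flip one randomly chosen bit and return the new chromosome and the position."""
    position = rng.randrange(len(chromosome))  # valid indices: 0 .. l - 1
    mutated = chromosome[:]                    # leave the parent untouched
    mutated[position] = 1 - mutated[position]
    return mutated, position

l = 60
mutation_probability = 1 / l  # P_m = l^-1, following the text

parent = [0, 1] * 30
child, position = mutate(parent, random.Random(3))
```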

5. Results

The proposed approach has been applied to twelve virtual
communities listed in Table 1. They correspond to Linux Debian
ports to different processor architectures. The Debian Project is
an association of individuals who have made common cause to
create a free operating system called Debian GNU/Linux, or simply
Debian for short (Robles, Gonzalez-Barahona, & Michlmayr, 2005;
Mateos-Garcia & Steinmueller, 2008).

Each community was analyzed from the year in which each
community started its activity to 2010. For each year and commu-
nity, a social network based on interactions among participants has
been extracted. As a result, a total of 134 social networks have been
analyzed, extracting the set of data detailed in the appendix.


Fig. 5. Crossover operation.

Fig. 6. Mutation operation.

M.R. Martínez-Torres / Expert Systems with Applications 39 (2012) 13182–13192 13187
Once this large dataset has been obtained, GA has been applied to
obtain an optimum subset of indicators able to identify OSS commu-
nities profiles according to their topological participation structure.
Table 1
Analyzed Linux Debian communities.

| Community | URL | Description | Period |
|---|---|---|---|
| Debian port to m68k (Debian-68k) | http://lists.debian.org/debian-68k/ | Motorola 68k port of Debian GNU/Linux. Debian currently runs on the 68020, 68030, 68040 and 68060 processors | 1998–2010 |
| Debian port to ARM (Debian-ARM) | http://lists.debian.org/debian-arm/ | ARM port for Debian GNU/Linux. Debian fully supports a port to little-endian ARM | 1999–2010 |
| Debian port to Intel IA-64 (Debian-ia64) | http://lists.debian.org/debian-ia64/ | Discussions on the Intel IA64 (aka Itanium, Merced) port of Debian GNU/Linux | 2001–2010 |
| Debian port to Alpha (Debian-alpha) | http://lists.debian.org/debian-alpha/ | The purpose of this project is to assist developers and others interested with the ongoing project to port the Debian distribution of Linux to the Alpha family of processors | 1998–2010 |
| Debian port to AMD64 (Debian-amd) | http://lists.debian.org/debian-amd64/ | Porting Debian to AMD x86-64 architecture | 2004–2010 |
| Debian port to BSD (Debian-bsd) | http://lists.debian.org/debian-bsd/ | This is a port of the Debian operating system, complete with apt, dpkg, and GNU userland, to the NetBSD kernel | 2001–2010 |
| Debian port to HPPA (Debian-hppa) | http://lists.debian.org/debian-hppa/ | This is a port to Hewlett-Packard’s PA-RISC architecture | 2001–2010 |
| Debian port to Hurd (Debian-hurd) | http://lists.debian.org/debian-hurd/ | The GNU Hurd is a totally new operating system being put together by the GNU group | 1999–2010 |
| Debian port to MIPS (Debian-mips) | http://lists.debian.org/debian-mips/ | MIPS port of Debian GNU/Linux, able to run at both endiannesses | 1999–2010 |
| Debian port to PowerPC (Debian-ppc) | http://lists.debian.org/debian-powerpc/ | PowerPC port of Debian GNU/Linux. The PowerPC architecture allows both 64-bit and 32-bit implementations | 1999–2010 |
| Debian port to IBM S/390 (Debian-s390) | http://lists.debian.org/debian-s390/ | Discussions on the IBM S/390 port of Debian GNU/Linux | 2001–2010 |
| Debian port to SPARC (Debian-sparc) | http://lists.debian.org/debian-sparc/ | This port runs on the Sun SPARCstation series of workstations, as well as some of their successors in the sun4 architectures | 1998–2010 |

A population size of 10000 chromosomes will be considered (Alander, 1992). That means that the 2000 best chromosomes are directly included in the next population, 7500 chromosomes are generated by the crossover operation and 500 chromosomes are generated by the mutation operation. Factor analysis is used for evaluating each chromosome. The cost function follows the general structure defined in Eq. (5). The coefficients C1, C2 and C3 represent the relative weight of each part of the fitness function. The GA has been run for different values of the three coefficients, obtaining the results shown in Table 2. It can be observed that interpretability only reaches a value of 1 when C3 is heavily weighted, while the explained variance varies depending on the value of C1. That is why the set of coefficients C1 = 0.1, C2 = 0.1 and C3 = 0.8 has been chosen (this set of values is shown in italics in Table 2). An interpretability of factors equal to 1 guarantees that all the latent factors are interpretable. Using this set of values, the GA converged after 30 generations, with an explained variance of 70.93% and 23 indicators grouped in four factors. The time required by the genetic algorithm execution was 2873.6 s (47.89 min). This value is much smaller than the alternative option of exploring the whole solution space. According to the chosen encoding, the size of a chromosome is $l = 60$ bits, so the space of possible solutions is $2^{l} = 2^{60} \approx 1.1529 \times 10^{18}$. Finding the optimum solution by exploring all the possibilities is therefore unattainable: the time necessary to carry out a single factor analysis is 12.9 ms, so it would take more than 470 million years to explore the whole space of possible solutions. The genetic algorithm implementation is able to speed up the search for an optimal solution.
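The comparison between the GA run time and an exhaustive search can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text (60-bit chromosomes, 12.9 ms per factor analysis, 2873.6 s for the GA run); the variable names and the 365.25-day year are illustrative assumptions.

```python
# Figures taken from the text.
CHROMOSOME_BITS = 60            # l, one bit per candidate indicator
SECONDS_PER_EVALUATION = 12.9e-3  # one factor analysis
GA_RUNTIME_SECONDS = 2873.6

# Assumption: a year of 365.25 days is used for the conversion.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

search_space = 2 ** CHROMOSOME_BITS
exhaustive_seconds = search_space * SECONDS_PER_EVALUATION
exhaustive_years = exhaustive_seconds / SECONDS_PER_YEAR
ga_minutes = GA_RUNTIME_SECONDS / 60

print(f"search space: {search_space:.4e} chromosomes")
print(f"exhaustive search: {exhaustive_years:.3e} years")
print(f"GA run: {ga_minutes:.2f} min")
```

Under these assumptions the exhaustive search comes out at roughly 4.7 × 10^8 years, in line with the "more than 470 million years" stated above.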

Table 2
GA results for different values of coefficients c1, c2 and c3.

| Parameters (c1/c2/c3) | Explained variance (%) | Indicators | Number of factors | Interpretability |
|---|---|---|---|---|
| 1.00/0.00/0.00 | 77.73 | 20 | 7 | 0.00 |
| 0.80/0.10/0.10 | 79.80 | 49 | 12 | 0.00 |
| 0.60/0.20/0.20 | 78.03 | 46 | 10 | 0.30 |
| 0.40/0.30/0.30 | 73.87 | 53 | 10 | 0.18 |
| 0.20/0.40/0.40 | 75.70 | 49 | 10 | 0.30 |
| 0.00/0.50/0.50 | 72.40 | 48 | 9 | 0.44 |
| 0.00/0.00/1.00 | 59.65 | 14 | 2 | 1.00 |
| *0.10/0.10/0.80* | *70.93* | *23* | *4* | *1.00* |
| 0.20/0.20/0.60 | 71.60 | 36 | 5 | 0.60 |
| 0.30/0.30/0.40 | 74.94 | 49 | 10 | 0.30 |
| 0.40/0.40/0.20 | 77.28 | 51 | 11 | 0.27 |
| 0.50/0.50/0.00 | 75.16 | 52 | 11 | 0.08 |
| 0.00/1.00/0.00 | 72.53 | 55 | 11 | 0.09 |
| 0.10/0.80/0.10 | 72.52 | 54 | 11 | 0.18 |
| 0.20/0.60/0.20 | 75.37 | 54 | 12 | 0.17 |
| 0.30/0.40/0.30 | 76.79 | 53 | 12 | 0.16 |
| 0.40/0.20/0.40 | 75.31 | 43 | 8 | 0.50 |
| 0.50/0.00/0.50 | 73.24 | 22 | 5 | 0.80 |
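The choice of coefficients can be reproduced from Table 2 with a simple rule. The sketch below assumes one reading of the selection criterion (keep only the runs whose interpretability equals 1, then prefer the highest explained variance); the paper motivates the choice in prose and does not state this rule explicitly. The data are copied from Table 2, and the field order is an illustrative choice.

```python
# Rows of Table 2: ((c1, c2, c3), explained variance %, indicators, factors, interpretability)
TABLE_2 = [
    ((1.00, 0.00, 0.00), 77.73, 20, 7, 0.00),
    ((0.80, 0.10, 0.10), 79.80, 49, 12, 0.00),
    ((0.60, 0.20, 0.20), 78.03, 46, 10, 0.30),
    ((0.40, 0.30, 0.30), 73.87, 53, 10, 0.18),
    ((0.20, 0.40, 0.40), 75.70, 49, 10, 0.30),
    ((0.00, 0.50, 0.50), 72.40, 48, 9, 0.44),
    ((0.00, 0.00, 1.00), 59.65, 14, 2, 1.00),
    ((0.10, 0.10, 0.80), 70.93, 23, 4, 1.00),
    ((0.20, 0.20, 0.60), 71.60, 36, 5, 0.60),
    ((0.30, 0.30, 0.40), 74.94, 49, 10, 0.30),
    ((0.40, 0.40, 0.20), 77.28, 51, 11, 0.27),
    ((0.50, 0.50, 0.00), 75.16, 52, 11, 0.08),
    ((0.00, 1.00, 0.00), 72.53, 55, 11, 0.09),
    ((0.10, 0.80, 0.10), 72.52, 54, 11, 0.18),
    ((0.20, 0.60, 0.20), 75.37, 54, 12, 0.17),
    ((0.30, 0.40, 0.30), 76.79, 53, 12, 0.16),
    ((0.40, 0.20, 0.40), 75.31, 43, 8, 0.50),
    ((0.50, 0.00, 0.50), 73.24, 22, 5, 0.80),
]

# Assumed rule: all factors must be interpretable, then maximize explained variance.
fully_interpretable = [row for row in TABLE_2 if row[4] == 1.00]
chosen = max(fully_interpretable, key=lambda row: row[1])

print("chosen coefficients:", chosen[0])
print("explained variance:", chosen[1], "% with", chosen[2], "indicators in", chosen[3], "factors")
```

Only two runs reach an interpretability of 1, and the rule picks the one with coefficients 0.10/0.10/0.80, which is the set chosen in the text.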

Fig. 7. Fitness distribution over 30 generations of the genetic algorithm.
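The transition between generations described in this section (the best chromosomes copied unchanged, the rest produced by crossover and mutation) can be sketched as follows. This is a minimal illustration, not the author's implementation: single-point crossover, single bit-flip mutation, drawing parents from the elite, and the toy fitness `placeholder_fitness` are all assumptions, since the real fitness is the factor-analysis-based function of Eq. (5). A population of 100 with a 20/75/5 split is used so the example runs quickly; it keeps the same proportions as the 2000/7500/500 split of the text.

```python
import random

CHROMOSOME_BITS = 60  # one bit per candidate indicator (from the text)


def random_chromosome(rng):
    """Random bit string; a 1 means the corresponding indicator is selected."""
    return tuple(rng.randint(0, 1) for _ in range(CHROMOSOME_BITS))


def placeholder_fitness(chromosome):
    """Hypothetical stand-in for the factor-analysis fitness: counts selected bits."""
    return sum(chromosome)


def next_generation(population, fitness, n_elite, n_crossover, n_mutation, rng):
    """Build the next population from elite copies, crossover children and mutants."""
    ranked = sorted(population, key=fitness, reverse=True)
    elite = ranked[:n_elite]

    children = []
    for _ in range(n_crossover):
        mother, father = rng.sample(elite, 2)       # assumption: parents come from the elite
        cut = rng.randrange(1, CHROMOSOME_BITS)     # assumption: single-point crossover
        children.append(mother[:cut] + father[cut:])

    mutants = []
    for _ in range(n_mutation):
        genes = list(rng.choice(elite))
        genes[rng.randrange(CHROMOSOME_BITS)] ^= 1  # assumption: flip one random bit
        mutants.append(tuple(genes))

    return elite + children + mutants


rng = random.Random(0)
population = [random_chromosome(rng) for _ in range(100)]
best_per_generation = []
for _ in range(30):
    population = next_generation(population, placeholder_fitness, 20, 75, 5, rng)
    best_per_generation.append(max(placeholder_fitness(c) for c in population))

print(best_per_generation)
```

Because the elite is carried over unchanged, the best fitness in the population can never decrease from one generation to the next, which is the behaviour summarized in Fig. 7.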

The evolution of the genetic algorithm is detailed in Fig. 7. The initial population (generation 0) has a low fitness value, which indicates that the individuals of the population are far from the optimum. As the number of generations increases, the fitness of individuals within the population also increases, as the genetic algorithm is biased towards the survival of genetic material contained within the individuals with high fitness function values.

The optimum subset of indicators provided by the GA is listed in Table 3. In particular, the description of each indicator and the network over which it has been calculated are detailed.

Table 3
Selected set of indicators.

| Indicator | Description | Network/subnetwork |
|---|---|---|
| VAR02 | Number of interactions | Complete network |
| VAR03 | Number of repeated interactions | Complete network |
| VAR04 | Density | Complete network |
| VAR06 | The Zagreb group index | Complete network |
| VAR09 | Free riders | Complete network |
| VAR10 | Active members | Complete network |
| VAR11 | Members responsible of more than 50% of contributions | Complete network |
| VAR17 | Normalized size of output-domain (standard deviation) | Complete network |
| VAR18 | Proximity prestige (average value) | Complete network |
| VAR26 | Network betweenness centralization | Complete network |
| VAR29 | Number of vertices with betweenness centrality > 0 | Complete network |
| VAR30 | Egocentric density (average value) | Complete network |
| VAR31 | Egocentric density (standard deviation) | Complete network |
| VAR33 | Number of developed brokerage roles among active members | Complete network |
| VAR34 | Number of vertices developing a brokerage role among active members and free riders | Complete network |
| VAR35 | Number of developed brokerage roles among active members and free riders | Complete network |
| VAR41 | Density | Active members network |
| VAR42 | Average degree | Active members network |
| VAR45 | Average out-degree | Active members network |
| VAR48 | Normalized size of output-domain (average value) | Active members network |
| VAR50 | Proximity prestige (average value) | Active members network |
| VAR53 | Closeness centrality (average value) | Active members network |
| VAR59 | Egocentric density (average value) | Active members network |

The results from factor analysis using the set of variables selected by the genetic algorithm are detailed in Table 4. Usually, a number of factors equal to the number of eigenvalues higher than 1 is selected (Rencher, 2002). Consequently, up to four latent factors can be distinguished as a result of factor analysis.

Table 4
Explained variance of resulting factor analysis.

| Factor | Eigenvalue | Variance (%) | Cumulative (%) |
|---|---|---|---|
| 1 | 11.881 | 47.526 | 47.526 |
| 2 | 4.388 | 17.553 | 65.078 |
| 3 | 2.309 | 9.234 | 74.313 |
| 4 | 1.455 | 5.818 | 80.131 |
| 5 | 0.914 | 3.658 | 83.789 |
| 6 | 0.799 | 3.195 | 86.984 |
| 7 | 0.652 | 2.606 | 89.590 |
| ... | ... | ... | ... |
| 23 | 0.000 | 0.001 | 100.000 |

The indicators associated with each factor are obtained from the factor loadings using a Varimax rotation. All the indicators associated in this way with the same factor are hypothesized to share a common meaning that the analyst should discover. Table 5 shows which indicators are associated with each factor and their corresponding factor loadings.

Table 5
Identified factors.

| Factor | Indicator | Description | Loading |
|---|---|---|---|
| F1 | VAR02 | Number of interactions | 0.921 |
| F1 | VAR03 | Number of repeated interactions | 0.900 |
| F1 | VAR06 | The Zagreb group index | 0.836 |
| F1 | VAR09 | Free riders | 0.800 |
| F1 | VAR10 | Active members | 0.931 |
| F1 | VAR11 | Members responsible of more than 50% of contributions | 0.703 |
| F1 | VAR29 | Number of vertices with betweenness centrality > 0 | 0.945 |
| F1 | VAR33 | Number of developed brokerage roles among active members | 0.929 |
| F1 | VAR34 | Number of vertices developing a brokerage role among active members and free riders | 0.925 |
| F1 | VAR35 | Number of developed brokerage roles among active members and free riders | 0.910 |
| F2 | VAR17 | Normalized size of output-domain (standard deviation) | 0.854 |
| F2 | VAR18 | Proximity prestige (average value) | 0.847 |
| F2 | VAR26 | Network betweenness centralization | 0.764 |
| F2 | VAR30 | Egocentric density (average value) | 0.881 |
| F2 | VAR31 | Egocentric density (standard deviation) | 0.719 |
| F3 | VAR48 | Normalized size of output-domain (average value) | 0.717 |
| F3 | VAR50 | Proximity prestige (average value) | 0.901 |
| F3 | VAR53 | Closeness centrality (average value) | 0.899 |
| F3 | VAR59 | Egocentric density (average value) | 0.740 |
| F4 | VAR04 | Density (complete network) | 0.705 |
| F4 | VAR41 | Density (active members network) | 0.897 |
| F4 | VAR42 | Average degree | 0.799 |
| F4 | VAR45 | Average out-degree | 0.703 |

Table 6
Statistical significance of ANOVA.

| Indicator | F | Sig. |
|---|---|---|
| VAR02 | 41.837 | 0.000 |
| VAR03 | 36.386 | 0.002 |
| VAR04 | 7.237 | 0.003 |
| VAR06 | 24.322 | 0.000 |
| VAR09 | 50.931 | 0.000 |
| VAR10 | 72.496 | 0.000 |
| VAR11 | 23.912 | 0.000 |
| VAR17 | 53.598 | 0.000 |
| VAR18 | 40.664 | 0.000 |
| VAR26 | 31.039 | 0.000 |
| VAR29 | 73.973 | 0.000 |
| VAR30 | 37.410 | 0.000 |
| VAR31 | 16.644 | 0.000 |
| VAR33 | 41.910 | 0.000 |
| VAR34 | 86.377 | 0.000 |
| VAR35 | 46.100 | 0.000 |
| VAR41 | 8.018 | 0.000 |
| VAR42 | 10.950 | 0.000 |
| VAR45 | 6.489 | 0.000 |
| VAR48 | 26.019 | 0.000 |
| VAR50 | 16.445 | 0.000 |
| VAR53 | 14.434 | 0.000 |
| VAR59 | 7.488 | 0.000 |

Table 7
Mean values of selected indicators.

| Indicator | F1 | F2 | F3 | F4 |
|---|---|---|---|---|
| VAR02 | 11096.3846 | 2264.8140 | 1935.4688 | 1092.2500 |
| VAR03 | 8102.5385 | 1553.1860 | 1464.1250 | 900.3333 |
| VAR04 | 0.0201 | 0.0558 | 0.0268 | 0.0663 |
| VAR06 | 1.9473E7 | 1.9473E7 | 1.9473E7 | 1.9473E7 |
| VAR09 | 353.6154 | 353.6154 | 353.6154 | 353.6154 |
| VAR10 | 377.9231 | 377.9231 | 377.9231 | 377.9231 |
| VAR11 | 8.1538 | 8.1538 | 8.1538 | 8.1538 |
| VAR17 | 0.2353 | 0.2353 | 0.2353 | 0.2353 |
| VAR18 | 0.0894 | 0.0894 | 0.0894 | 0.0894 |
| VAR26 | 0.1048 | 0.1048 | 0.1048 | 0.1048 |
| VAR29 | 198.3462 | 198.3462 | 198.3462 | 198.3462 |
| VAR30 | 0.2611 | 0.2611 | 0.2611 | 0.2611 |
| VAR31 | 0.3163 | 0.3163 | 0.3163 | 0.3163 |
| VAR33 | 60910.6538 | 60910.6538 | 60910.6538 | 60910.6538 |
| VAR34 | 55.8846 | 55.8846 | 55.8846 | 55.8846 |
| VAR35 | 4590.7692 | 4590.7692 | 4590.7692 | 4590.7692 |
| VAR41 | 0.0742 | 0.0742 | 0.0742 | 0.0742 |
| VAR42 | 383725.4850 | 383725.4850 | 383725.4850 | 383725.4850 |
| VAR45 | 152903.9688 | 152903.9688 | 152903.9688 | 152903.9688 |
| VAR48 | 0.7834 | 0.7834 | 0.7834 | 0.7834 |
| VAR50 | 0.2969 | 0.2969 | 0.2969 | 0.2969 |
| VAR53 | 0.2972 | 0.2972 | 0.2972 | 0.2972 |
| VAR59 | 0.4730 | 0.4730 | 0.4730 | 0.4730 |

On the other hand, factor scores are used to categorize the original sample of OSS communities, each of which can be approximated to one of the identified latent factors. Consequently, the original sample of OSS communities can be categorized in four groups. An analysis of variance (ANOVA) has been applied to the categorization of the original sample in the four groups obtained from factor analysis. The aim of this analysis is to check the null hypothesis of equal population means. Table 6 details the F statistic, the ratio of two different estimators of population variance, which appears together with its corresponding critical level or observed significance. The result is that the null hypothesis has been rejected in all the cases, with a significance value below 0.05. That means the categorization obtained from factor analysis is well defined.
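The F statistic reported in Table 6 is the ratio of the between-group to the within-group estimate of variance. A minimal sketch of that computation for a single indicator is given below; the four groups of values are hypothetical and chosen only so the arithmetic is easy to follow, and the function name `one_way_anova_f` is illustrative.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA: between-group over within-group mean square."""
    k = len(groups)                               # number of groups
    n = sum(len(g) for g in groups)               # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)             # estimator based on group means
    ms_within = ss_within / (n - k)               # estimator based on spread inside groups
    return ms_between / ms_within


# Hypothetical values of one indicator for communities assigned to F1..F4.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7], [8, 9, 10]]
f_statistic = one_way_anova_f(groups)
print(f_statistic)
```

A large F value means the group means differ by much more than the spread inside each group would explain, which is the situation in which the null hypothesis of equal population means is rejected.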

Table 7 details the mean value of the 23 selected indicators for each of the distinguished groups. Using this information as well as the factor loadings of Table 5, the following patterns of community structure can be distinguished:

Factor 1: This factor corresponds to large communities characterized by a high number of interactions and a clear hierarchy among their members. The core group is located at the top of this hierarchy and is responsible for playing a brokerage role among the rest of the community members, such as active members and free riders. Factor 1 includes those communities with the highest number of members belonging to the core group.

Factor 2: Factor 2 refers to those communities with a high local connectivity. Communities following this pattern exhibit the highest average value of egocentric density and proximity prestige, which means their nodes are highly connected with their neighbours. Although these communities are not as large as those of Factor 1, they also exhibit a considerable size with a high number of interactions. However, the level of hierarchy is considerably lower. The role of the core group is overshadowed by the high activity of the rest of the community. In fact, F2 communities are the ones with the lowest number of free riders.

Factor 3: Factor 3 corresponds to small and centralized networks, where active members exhibit a high involvement and the core group also performs an important brokerage role. This factor includes communities with a strong hierarchy among their members but with a flatter structure compared to F1 communities.

Factor 4: Factor 4 corresponds to the smallest communities, which nevertheless have a high density, meaning that active members are highly interconnected. Despite being smaller than F3 communities, they exhibit a higher average degree value. In contrast, the activity of the core group is overshadowed by the active members of the community, so the hierarchy is much weaker than in F3 communities.

Fig. 8. Interpretation of identified factors.

Fig. 9. Patterns of behaviour of Debian Linux ports communities. [Plot of the communities on two axes: Size (horizontal, Small to Large) and Hierarchy (vertical, Low to High). Plotted communities: Debian-68k, Debian-ARM, Debian-ia64, Debian-alpha, Debian-amd64, Debian-BSD, Debian-hppa, Debian-hurd, Debian-mips, Debian-ppc, Debian-s390, Debian-sparc.]
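The factor loadings used above to interpret each factor come from retaining the factors with eigenvalue higher than 1 and applying a Varimax rotation. The sketch below illustrates both steps with NumPy. It is not the author's code: principal-component extraction from the correlation matrix is an assumption (the text does not name the extraction method), the data are synthetic, and the function names are illustrative. The only figures taken from the paper are the leading eigenvalues of Table 4.

```python
import numpy as np


def kaiser_loadings(data):
    """Loadings of the components whose correlation-matrix eigenvalue exceeds 1."""
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]                 # sort in decreasing order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0                              # eigenvalue-greater-than-1 rule
    return eigvecs[:, keep] * np.sqrt(eigvals[keep])


def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonal Varimax rotation; returns rotated loadings and the rotation matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        target = rotated ** 3 - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        if s.sum() < criterion * (1 + tol):
            break
        criterion = s.sum()
    return loadings @ rotation, rotation


# Leading eigenvalues reported in Table 4: four of them exceed 1.
table4_eigenvalues = [11.881, 4.388, 2.309, 1.455, 0.914, 0.799, 0.652]
n_factors_table4 = sum(e > 1.0 for e in table4_eigenvalues)

# Synthetic data: 7 indicators driven by 2 hypothetical latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
pattern = np.array([0, 0, 0, 0, 1, 1, 1])
data = latent[:, pattern] + 0.3 * rng.normal(size=(300, 7))

loadings = kaiser_loadings(data)
rotated, rotation = varimax(loadings)
assignment = np.abs(rotated).argmax(axis=1)           # factor with the largest loading
print(n_factors_table4, assignment)
```

Each indicator is then attached to the factor on which its rotated loading is largest in absolute value, which is how a table such as Table 5 can be read.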

The obtained factors can be interpreted in terms of the size and hierarchy of communities, as represented in Fig. 8, which visualizes the four identified patterns of behaviour of OSS communities. Size represents how many users the OSS project is able to attract. Therefore, it can be interpreted as a measurement of how successful the underlying software is. Hierarchy refers to the internal organization of the community.

A high hierarchy means that the core group exerts a meaningful influence over the rest of the community, while a low hierarchy means there is no dominant group. Fig. 9 details the position of the considered Linux Debian communities in terms of size and hierarchy. The general trend is to be located on the right part of this figure, as communities try to attract as many members as possible. In contrast, they follow different strategies regarding their internal hierarchy level. In general, they tend to maintain a certain hierarchy, and only one community has a clearly low hierarchy level.
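The reading of the four factors in terms of size and hierarchy can be written down as a small lookup. The sketch below is illustrative only: the quadrant labels are taken from the factor descriptions given earlier, the rule of assigning a community to the factor with the highest factor score is an assumption about how the categorization is done, and the scores are hypothetical.

```python
# Size/hierarchy reading of each factor, taken from the factor descriptions.
FACTOR_PROFILE = {
    "F1": ("large", "high"),
    "F2": ("large", "low"),
    "F3": ("small", "high"),
    "F4": ("small", "low"),
}


def categorize(factor_scores):
    """Assumed rule: a community belongs to the factor with its highest score."""
    return max(factor_scores, key=factor_scores.get)


# Hypothetical factor scores for one community.
scores = {"F1": -0.2, "F2": 1.3, "F3": 0.1, "F4": -0.9}
factor = categorize(scores)
size, hierarchy = FACTOR_PROFILE[factor]
print(factor, size, hierarchy)
```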

One of the main implications of these results is that size and hierarchy are independent dimensions, so a large size is not incompatible with a strong hierarchy. In fact, the growth in size impels the creation of hierarchical layers. In general, projects gain stability as more or less formal hierarchies emerge. Hierarchy mainly depends on the activity of the core group. It is their responsibility to channel the discussion, to provide solutions or alternatives to the posted problems and to facilitate the incorporation of qualified active members as part of this core group. Hierarchy also raises the entry barriers to the core group. Outsiders can only acquire authority inside the project through the legitimate peripheral participation procedure. Therefore, hierarchy guarantees that the project is under the control of the core group of developers.
6. Conclusions

This paper describes a procedure for the extraction of the patterns of behaviour of open source communities using a genetic search over a large set of Social Network Analysis indicators. In contrast to previous studies, which focused only on a small set of SNA features, the proposed evolutionary computation technique combined with factor analysis makes it possible to consider all the possible metrics able to characterize interactions among community members. The obtained results show four patterns of behaviour that have been extracted as latent factors from the whole data set. These patterns can in turn be classified in terms of the size and hierarchy of communities. Analyzed communities can be mapped to a point in a two-dimensional space described by these two dimensions. In general, OSS communities tend to gain as many members as possible, and their internal organization facilitates the emergence of a certain hierarchy. Hierarchy can be understood as a stabilizing factor of the community that assigns authority to a reduced group of core members. Access to the inner circle of the community can only be achieved through a process of participation and interaction with the rest of the community.

Appendix A

| Indicator | Description | Network |
|---|---|---|
| VAR01 | Number of community members | Complete |
| VAR02 | Number of interactions | Complete |
| VAR03 | Number of repeated interactions | Complete |
| VAR04 | Density | Complete |
| VAR05 | Average degree | Complete |
| VAR06 | The Zagreb group index | Complete |
| VAR07 | The Randic connectivity index | Complete |
| VAR08 | Index of relinking | Complete |
| VAR09 | Free riders | Complete |
| VAR10 | Active members | Complete |
| VAR11 | Members responsible of more than 50% of contributions | Complete |
| VAR12 | Average out-degree | Complete |
| VAR13 | Standard deviation out-degree | Complete |
| VAR14 | Number of components with more than 1 user | Complete |
| VAR15 | Per cent of users included in components | Complete |
| VAR16 | Normalized size of output-domain (average value) | Complete |
| VAR17 | Normalized size of output-domain (standard deviation) | Complete |
| VAR18 | Proximity prestige (average value) | Complete |
| VAR19 | Proximity prestige (standard deviation) | Complete |
| VAR20 | Maximum k-core (value of k) | Complete |
| VAR21 | Number of users included in maximum k-core | Complete |
| VAR22 | Number of k-cores with more than 1 user (k > 0) | Complete |
| VAR23 | Per cent of users included in k-cores (k > 0) | Complete |
| VAR24 | Closeness centrality (average value) | Complete |
| VAR25 | Closeness centrality (standard deviation) | Complete |
| VAR26 | Network betweenness centralization | Complete |
| VAR27 | Average value of vertices betweenness centrality | Complete |
| VAR28 | Standard deviation of vertices betweenness centrality | Complete |
| VAR29 | Number of vertices with betweenness centrality > 0 | Complete |
| VAR30 | Egocentric density (average value) | Complete |
| VAR31 | Egocentric density (standard deviation) | Complete |
| VAR32 | Number of vertices developing a brokerage role among active members | Complete |
| VAR33 | Number of developed brokerage roles among active members | Complete |
| VAR34 | Number of vertices developing a brokerage role among active members and free riders | Complete |
| VAR35 | Number of developed brokerage roles among active members and free riders | Complete |
| VAR36 | Average brokerage role of core group with active members | Complete |
| VAR37 | Average brokerage role of core group with free riders | Complete |
| VAR38 | Average number of core group neighbours in the first level | Complete |
| VAR39 | Average number of core group neighbours in all levels | Complete |
| VAR40 | Average number of levels in which users are accessed by the core group | Complete |
| VAR41 | Density | Active members |
| VAR42 | Average degree | Active members |
| VAR43 | The Randic connectivity index | Active members |
| VAR44 | Index of relinking | Active members |
| VAR45 | Average out-degree | Active members |
| VAR46 | Standard deviation out-degree | Active members |
| VAR47 | Per cent of users included in components | Active members |
| VAR48 | Normalized size of output-domain (average value) | Active members |
| VAR49 | Normalized size of output-domain (standard deviation) | Active members |
| VAR50 | Proximity prestige (average value) | Active members |
| VAR51 | Proximity prestige (standard deviation) | Active members |
| VAR52 | Per cent of users included in k-cores (k > 0) | Active members |
| VAR53 | Closeness centrality (average value) | Active members |
| VAR54 | Closeness centrality (standard deviation) | Active members |
| VAR55 | Number of vertices with closeness centrality > 0 | Active members |
| VAR56 | Network betweenness centralization | Active members |
| VAR57 | Average value of vertices betweenness centrality | Active members |
| VAR58 | Standard deviation of vertices betweenness centrality | Active members |
| VAR59 | Egocentric density (average value) | Active members |
| VAR60 | Egocentric density (standard deviation) | Active members |
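The 60 indicators of this appendix correspond to the 60 bits of a chromosome. A minimal sketch of that correspondence is shown below, assuming that bit i is set when indicator VARi is selected (the exact bit ordering is defined in the chromosome encoding section and is an assumption here). The selected set is the one listed in Table 3; the function names are illustrative.

```python
N_INDICATORS = 60

# Indicator numbers of the subset selected by the GA (Table 3).
SELECTED = [2, 3, 4, 6, 9, 10, 11, 17, 18, 26, 29, 30, 31,
            33, 34, 35, 41, 42, 45, 48, 50, 53, 59]


def encode(selected):
    """Bit list of length 60; position i-1 is 1 when VARi is selected (assumed ordering)."""
    chosen = set(selected)
    return [1 if i in chosen else 0 for i in range(1, N_INDICATORS + 1)]


def decode(chromosome):
    """Names of the indicators whose bit is set."""
    return [f"VAR{i:02d}" for i, bit in enumerate(chromosome, start=1) if bit]


chromosome = encode(SELECTED)
names = decode(chromosome)
print("".join(map(str, chromosome)))
print(len(names), names[:3])
```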
References

Alander, J. T. (1992). On optimal population size of genetic algorithm. In Proceedings
CompEuro 1992. Computer systems and software engineering, 6th annual European
computer conference (pp 65–70).

Barcellini, F., Détienne, F., & Burkhardt, J.-M. (2009). Participation in online
interaction spaces: Design-use mediation in an open source software
community. International Journal of Industrial Ergonomics, 39(3), 533–540.

Crowston, K., & Howison, J. (2005). The social structure of free and open source
software development. First Monday, 10(2).

Crowston, K., & Howison, J. (2006). Hierarchy and centralization in free and open
source software team communications. Knowledge, Technology, & Policy, 18(4),
65–85.

De Jong, K. A. (1975). An analysis of behaviour of a class of genetic adaptive systems.
Thesis. University of Michigan.

Deek, F. P., & McHugh, J. A. M. (2008). Open source: Technology and policy. NY:
Cambridge University Press.

Devillers, J., & Balaban, A. T. (1999). Topological indices and related descriptors in QSAR
and QSPR. The Netherlands: Gordon and Breach Science Publishers.

Durugbo, C. (2012). Modelling user participation in organisations as networks.
Expert Systems with Applications, 39(10), 9230–9245.

Feczak, S., & Hossain, L. (2011). Exploring computer supported collaborative
coordination through social networks. The Journal of High Technology
Management Research, 22(2), 121–140.

Feller, J., & Fitzgerald, B. (2002). Understanding open source software development.
London, UK: Addison-Wesley.

Freeman, L. C. (2004). The development of social network analysis: A study in the
sociology of science. Vancouver, Canada: Empirical Press.

Gao, Y., & Madey, G. (2007). Network Analysis of the SourceForge.net Community. In
J. Feller, B. Fitzgerald, W. Scacchi, A. Sillitti (Eds.), IFIP International Federation
for Information Processing, vol. 234, Open Source Development, Adoption and
Innovation, Boston, Springer, pp. 187–200.

Goldberg, D. E. (1989). Genetic algorithm in search, optimization and machine learning.
Reading, MA: Addison-Wesley.

Hinds, D., & Lee, R. M. (2008). Social network structure as a critical success condition
for virtual communities. In Proceedings of the 41st Hawaii international
conference on system sciences (pp. 323–333).

Holland, J. (1975). Adaptation in Natural and Artificial Systems. Ann Arbor, MI:
University of Michigan Press.

Hossain, L., Wu, A., & Chung, K. (2006). Actor centrality correlates to project based
coordination. In P. Hinds, D. Martin (Eds.), Proceedings of the CSCW’06
conference, Banff, Canada (pp. 363–372).

Hossain, L., & Zhu, D. (2009). Social networks and coordination performance of
distributed software development teams. The Journal of High Technology
Management Research, 20(1), 52–61.

Kuk, G. (2006). Strategic Interaction and Knowledge Sharing in the KDE Developer
Mailing List. Management Science, 52(7), 1031–1042.

Kwon, D., Oh, W., & Jeon, S. (2007). Broken Ties: The Impact of Organizational
Restructuring on the Stability of Information-Processing Networks. Journal of
Management Information Systems, 24(1), 201–231.

Lee, Y., & Lee, H. (2011). Application of factor analysis for service R&D classification:
A case study on the Korean ICT industry. Expert Systems with Applications, 38(3),
2119–2124.

Long, Y., & Siau, K. (2007). Social network structures in open source software
development teams. Journal of Database Management, 18(2), 25–40.

Martínez-Torres, M. R., & Toral-Marín, S. L. (2010). Strategic group identification
using evolutionary computation. Expert Systems with Applications, 37(7),
4948–4954.

Mateos-Garcia, J., & Steinmueller, W. E. (2008). The institutions of open source
software: Examining the Debian community. Information Economics and Policy,
20, 333–344.

Michalewicz, Z. (1996). Genetic algorithm + data structures = evolution programs (3rd
ed.). Berlin, Germany: Springer-Verlag.

Mockus, A., Fielding, T., & Herbsleb, D. (2002). Two case studies of open source
software development: Apache and Mozilla. ACM Transactions on Software
Engineering and Methodology, 11(3), 309–346.

Nooy, W., Mrvar, A., & Batagelj, V. (2005). Exploratory network analysis with Pajek.
New York: Cambridge University Press.

Okoli, C., & Oh, W. (2007). Investigating recognition-based performance in an open
content community: A social capital perspective. Information & Management,
44(3), 240–252.

Panchal, J. H. (2009). Co-evolution of products and communities in mass-
collaborative product development – a computational exploration. In
Proceedings of international conference on engineering design (ICED’09) (p. 147).

Pereira, C. S., & Soares, A. L. (2007). Improving the quality of collaboration
requirements for information management through social networks analysis.
International Journal of Information Management, 27(2), 86–103.

Preece, J. (2001). Sociability and usability in online communities: determining and
measuring success. Behaviour & Information Technology, 20(5), 347–356.

Reina, D. G., Toral, S. L., Johnson, P., & Barrero, F. (2012). An evolutionary
computation approach for designing mobile ad hoc networks. Expert Systems
with Applications, 39(8), 6838–6845.

Robles, G., Gonzalez-Barahona, J. M., & Michlmayr, M. (2005). Evolution of volunteer
participation in libre software projects: Evidence from Debian. In Proceedings of
the first international conference on open source systems, Genova (pp. 100–107).

Rencher, A. C. (2002). Methods of Multivariate Analysis. Wiley Series in Probability and
Statistics (2nd ed.). New York: Wiley.

Singh, P. V. (2010). The small-world effect: The influence of macro-level properties
of developer collaboration networks on open-source project success. ACM
Transactions on Software Engineering and Methodology, 20(2), 1–27.

Sowe, S., Stamelos, I., & Angelis, L. (2006). Identifying knowledge brokers that yield
software engineering knowledge in OSS projects. Information and Software
Technology, 48(11), 1025–1033.

Toral, S. L., Martínez-Torres, M. R., & Barrero, F. (2009a). Virtual communities as a
resource for the development of OSS projects: the case of Linux ports to
embedded processors. Behaviour and Information Technology, 28(5), 405–419.

Toral, S. L., Martínez-Torres, M. R., Barrero, F., & Cortés, F. (2009b). An empirical
study of the driving forces behind online communities. Internet Research, 19(4),
378–392.

Toral, S. L., & Martínez Torres, M. R. (2009). International comparison of R&D
investment by European, US and Japanese Companies. International Journal of
Technology Management, 49(1/2/3), 107–122.

Toral, S. L., Martínez Torres, M. R., & Barrero, F. (2010). Analysis of virtual
communities supporting OSS projects using social network analysis. Information
and Software Technology, 52(3), 296–303.

Valverde, S., Theraulaz, G., Gautrais, J., Fourcassie, V., & Sole, R. V. (2006). Self-
organization patterns in wasp and open source communities. IEEE Intelligent
Systems, 21(2), 36–40.

West, J., & O’mahony, S. (2008). The role of participation architecture in growing
sponsored open source communities. Industry & Innovation, 15(2), 145–168.

Xu, J., Gao, Y., Christley, S. & Madey, G. (2005). A topological analysis of the open
source software development community. In Proceedings of the 38th annual
Hawaii international conference on system sciences. HICSS ‘05 (pp. 188–198).

Xu, J., Christley, S., & Madey, G. (2006). Application of social network analysis to the
study of open source software. In Jürgen Bitzer & Philipp J. H. Schröder (Eds.),
The economics of open source software development. Elsevier Press.

Ye, Y., Nakakoji, K., et al. (2005). The co-evolution of systems and communities in
free and open source software development. In S. Koch (Ed.), Free/open source software
development (pp. 59–82). Hershey, PA: Idea Group Inc. (IGI).
