Symbiotic filtering for spam email detection

Clotilde Lopes a Paulo Cortez a,∗ Pedro Sousa b Miguel Rocha b
Miguel Rio c

aDep. of Information Systems/Algoritmi, University of Minho, 4800-058
Guimarães, Portugal

bDep. of Informatics/CCTC, University of Minho, 4710-059 Braga, Portugal
cDep. of Electrical Engineering, University College London, WC1E 7JE UK

Abstract

This paper presents a novel spam filtering technique called Symbiotic Filtering (SF)
that aggregates distinct local filters from several users to improve the overall perfor-
mance of spam detection. SF is an hybrid approach combining some features from
both Collaborative (CF) and Content-Based Filtering (CBF). It allows for the use
of social networks to personalize and tailor the set of filters that serve as input to
the filtering. A comparison is performed against the commonly used Naive Bayes
CBF algorithm. Several experiments were held with the well-known Enron data,
under both fixed and incremental symbiotic groups. We show that our system is
competitive in performance and is robust against both dictionary and focused con-
tamination attacks. Moreover, it can be implemented and deployed with few effort
and low communication costs, while assuring privacy.

Key words: Anti-spam filtering, Naive bayes, Collaborative filtering,
Content-based filtering, Word attacks

1 Introduction

Unsolicited bulk email, widely known as spam, has become a serious prob-
lem for network administrators and for Internet users in general. According to
MAAWG (2009), it accounts for 89% to 92% of all email messages sent and it
consumes resources (i.e. time spent reading messages, bandwidth, CPU and
disk) that are far away from negligible. Spam is also an intrusion of privacy

∗ Corresponding author. E-mail pcortez@dsi.uminho.pt; tel.: +351 253510313; fax:
+351 253510300.

Preprint submitted to Elsevier 27 November 2009


and used to spread malicious content (e.g. online fraud or viruses). The cost
of sending these emails is, however, very close to zero. Internet connections
are cheap and with the advent of botnets (e.g. Bobax worms), criminal orga-
nizations have access to potentially millions of infected computers and thus
send emails from what has to be regarded as legitimate users (Ramachandran
and Feamster, 2006; Kanich et al., 2008).

In the last years, the most popular anti-spam solutions have been based in
Content-Based Filtering (CBF) (in particular, Bayesian filtering) (Garriss
et al., 2006; Guzella and Caminhas, 2009). These class of algorithms use mes-
sage features (e.g. word frequencies) for statistically discriminating email into
legitimate (ham) and illegitimate (spam) messages. However, CBF presents
some drawbacks. Often, there is a large gap between the high-level concept
(e.g. spam image) and the low-level message features (e.g. bit colors). Also,
CBF tends to give weak performances for new users, as it requires a large
number of representative examples. Moreover, spammers can mix spam with
normal words (often not visible to the final user), in what is known as dictio-
nary or focused attacks (Nelson et al., 2008). When users flag these messages as
spam, their training set is contaminated and the CBF performance is heavily
reduced. As an alternative, Collaborative Filtering (CF) is a distinct anti-
spam strategy, where information (e.g. IP or message fingerprints) is shared
about spam messages (Zhong et al., 2008). Yet, pure CF often suffers from
several issues (e.g. first-rater, sparsity of data and privacy, see Section 2).

To solve the CBF drawbacks, we propose a novel distributed approach, termed
Symbiotic Filtering (SF), that combines features from CBF and CF. Symbiosis
is a close interaction among different entities and this phenomenon is present
not only in biological species but also in business enterprises (Alocilja, 2000).
We prefer the term symbiotic rather than collaborative or cooperative, since
in this case each individual may have a distinct goal, as spam by definition is a
personal concept. The idea is to promote a cooperation among distinct entities
(e.g. email users) interested on personalized filtering (e.g. spam detection).
Rather than exchanging messages, these entities will share information about
what each local filter has learned (e.g. Bayesian model). The aim of SF is to
foster mutual relationships, where all or most members benefit. Under SF, a
given user is interested in improving filtering at a personal level. The Internet
is used to gather collaborators among these (high number of) users. High group
dynamics are expected, as members may join or leave the collaboration, and
also there are privacy issues regarding what can be shared. SF is different
from the centralized CBF-CF works (e.g. (Yu et al., 2003)) since SF data
and models are distributed through different entities. Hence, there are issues
of user management (e.g. adding or removing a user), privacy, security and
motivation (e.g. each user should benefit from the collaboration).

The main contributions of this paper are: i) we propose the new SF concept

2


that combines filters from distinct entities in order to improve local filtering
while assuring privacy; ii) we apply SF to spam detection and compare it
with a local CBF filter (i.e. Naive Bayes); iii) the spam detection performance
is measured under distinct scenarios, to test the effect of using fixed and
incremental symbiotic groups and also to access the robustness to dictionary
and focused attacks. This paper is structured as follows. Section 2 presents
the related work. Next, we introduce the individual and symbiotic filtering
methods (Section 3). The results are presented in Section 4. Finally, closing
conclusions are drawn (Section 5).

2 Related Work

Several solutions have been proposed to fight spam, which can fall into three
main categories (Garriss et al., 2006; Méndez et al., 2008): designing New Mail
Protocol Systems (NMPS), Collaborative Filtering (CF) and Content-Based
Filtering (CBF). Examples of the first approach are: digitally signing mail,
where recipients authenticate the sender’s address and mail content (Wong,
2005); requiring the sender to “pay” (e.g. give a small fee or solve a com-
putational puzzle) for each message sent (Loder et al., 2004) or limiting the
amount of email any sender may send (Walfish et al., 2006). Yet, none of these
solutions is currently adopted in a massive fashion and we are still far from a
worldwide acceptance of a NMPS. Also, the proposal of payment systems did
not take into account the effect spamming botnets, where a large number ma-
chines are controlled for malicious messaging (Ramachandran and Feamster,
2006; Kanich et al., 2008).

CF is based on sharing information about spam messages and it can be
based on lists (e.g. blacklists with IP addresses of known spammers), or di-
gest/fingerprints extracted from spam messages. Often, DNS-based Blackhole
Lists (DNSBLs) work in a centralized fashion, being vulnerable to Denial-of-
Service (DoS) attacks. Also, they may blacklist legitimate users, with a false
negative rate of about 50%, and spammers that use BGP spectrum agility
techniques are rarely listed in DNSBLs (Ramachandran and Feamster, 2006).
Several CF systems based on social networks have also been proposed. For
example, Kong et al. (2005) suggest a system where users manually identify
spam and then publish a digest through her/his social network. Garriss et al.
(2006) propose the propagation of whitelists among socially connected users.
Zhong et al. (2008) introduce a large-scale privacy CF based on digests.

CBF filters use a text classifier, such as the popular Naive Bayes algorithm
(used by the Thunderbird client), that learns to discriminate spam from mes-
sage features (e.g. common spam words). At the present time, CBF is the
most used anti-spam solution (Garriss et al., 2006). Current research relies

3


mainly on improving individual classifier performance, by a better preprocess-
ing (Méndez et al., 2008) or enhancement of the learning algorithm (Chang
et al., 2008; Guzella and Caminhas, 2009). Ensembles that combine distinct
spam classifiers have also been proposed (Hershkop and Stolfo, 2005).

Both CF and CBF have drawbacks. CF often suffers from first-rater, sparsity
of data and privacy problems. The first issue is due to the difficulty of classify-
ing emails that have not been rated before, the second problem is present when
users rate few messages and the last problem depends on what is shared. For
example, while presenting a better privacy digest protocol (when compared
with previous CF solutions), the approach of Zhong et al. (2008) is still vul-
nerable to privacy breaches. In addition, people have personal views of what is
spam and CF often discards this issue (Gray and Haahr, 2004). On the other
hand, CBF frequently suffers from lack of sufficient training messages and
is highly vulnerable to contamination attacks (as explained in the previous
section).

By fusing the CF and CBF views there is a potential for a better personalized
filtering. However, the number of studies that unify CBF and CF is scarce
and mainly focused towards recommendation systems that run at centralized
systems (Yu et al., 2003). Garg et al. (2006) describe a spam strategy based
on sharing of email filters among collaborating users. The authors point out
that exchanging filters requires less communication than CF systems, as the
need to share filters is relatively rare when compared with exchanging digests
each time an email is received. Yet, they fail to acknowledge motivational (i.e.
each user should benefit from the collaboration), temporal (i.e. how to syn-
chronize several distinct filters), privacy and security issues regarding filter
sharing. More recently, Lai et al. (2009) presented a collaborative approach to
exchange spam rules. However, this approach was designed for rule sharing at
the server level, demanding secure channels between all these servers. Also, the
authors only explored simple attributes (e.g. message length) and not word
content. Moreover, explicit (and simple) human understandable rules are re-
quired, thus approaches such as Neural Networks or Support Vector Machines
are not suitable. In contrast, our SF approach is more flexible, since it can
address any type of CBF filter. Also, it is suited for the Web 2.0 paradigm,
where users can exchange filters from their social networks. The SF approach
should also be more robust to contamination when compared to pure CBF,
since it aggregates responses from several (possible unknown) users and tar-
geting a specific victim may be easy but contaminating the whole symbiotic
group is not.

4


3 Filtering Methods

3.1 Naive Bayes filtering

We will address only textual content (i.e. word frequencies) of email mes-
sages. This popular approach (e.g. Thunderbird filter) has the advantage of
being generalizable to wider contexts, such as spam instant messaging (spim)
detection. While different data mining algorithms can be adopted for spam
filtering, such as Support Vector Machines (Cheng and Li, 2006), we will use
the simpler Naive Bayes (NB), which is widely adopted by anti-spam filtering
tools (Metsis et al., 2006; Guzella and Caminhas, 2009). As both individual
and symbiotic strategies will be compared using the same learning algorithm,
we believe that most of the results presented in this paper can be extended
to other text classifiers. We will also adopt the preprocessing proposed in
(Kosmopoulos et al., 2008):

(1) The word frequencies are extracted from the subject and body message
(with the HTML tags previously removed). Each message j is encoded
into a vector xj = (x1j, . . . ,xmj), where xij is the number of occurrences
of token Xi in the text.

(2) The feature selection is applied, which consists in ignoring any words
when xij < 5 in the training set and then selecting up to the 3000 most
relevant features according to the Mutual Information (MI(Xi)) crite-
rion:

MI(Xi) =
∑

c∈{s,¬s}
p(Xi|c) log(

p(Xi|c)
p(Xi)p(c)

) (1)

where c is the message class (s - spam or ¬s - ham), p(Xi|c) is the
probability of finding token Xi in emails from class c, p(Xi) and p(c) are
the proportions of Xi terms and c class examples present in the data.

(3) Each xij value is transformed into: x
′
ij = log(xij + 1) (TF transform),

x′′ij = x
′
ij ·log(k/

∑
k δik) (IDF transform) and x

′′′
ij = x

′′
ij/

√∑
l(xlj)

2 (length

normalization), where δik is 1 if the token i exists in the message k and
0 otherwise.

The NB computes the probability that a document j is spam (s) for a filter
trained over Du email data from user u, according to:

p(s|xj,Du) = α ·p(s|Du)
m∏
i

p(Xi|s,Du) (2)

5


where α is normalization constant that ensures that p(s|x,Du)+p(¬s|x,Du) =
1, p(s|Du) is the p(s) of dataset Du. The p(Xi|s,Du) estimation depends on
the NB version. In this work, we will use the multi-variate Gauss NB (as
implemented in the R tool, see Section 4) (Metsis et al., 2006):

p(Xi|c,Du) =
1

σi,c
√

2π
exp(−

(x′′′ij −µi,c)2

2σ2i,c
) (3)

where µi,s and σi,s are the mean and standard deviation estimated from the
c = s or c = ¬s messages of Du.

In (Nelson et al., 2008), it has been shown that local spam filters are vulnerable
to dictionary and focused contamination attacks. The former attack is used to
reduce the CBF efficiency, leading the victim to read spam, while the latter can
be used to prevent the victim from reading an important email. Both attacks
can be achieved by sending spam messages mixed with normal words. Once
the victim labels these messages as spam, the training set is contaminated and
the filter will be affected the next time it is retrained. A dictionary aggression
consists in sending a large amount of normal words, while the focused assault
assumes that the attacker has some knowledge of a specific message that the
victim will receive in the future (e.g. a competing offer for a given contract).

3.2 Symbiotic filtering

In our proposed SF, the individual predictions can be combined by using
a collaborative ensemble of the local filters. To tackle the concept drift (i.e.
the learning tasks changes through time) nature of spam (Fdez-Riverola et al.,
2007), the ideal symbiotic combination function should be dynamic. To achieve
this we propose a hierarchical learning, where the outputs of the distinct fil-
ters are used as the inputs of another (meta-level) learner. Hence, each user
has a local meta-learner that is responsible for aggregating the distinct filter
responses. This meta-learner is dynamically trained to get a high accuracy on
the user past data, thus it assigns different weights to the CBF filters through
time (see Figure 6). While several algorithms could be used for this hierarchi-
cal learning (e.g. SVM), we will adopt the same NB described in the previous
section. The rationale is that NB is commonly adopted by anti-spam solutions,
thus incorporating SF into these tools would be simpler by reuse of code.

We assume that each user u trains a local filter θu,t over her/his Du training
data. Filters can be trained asynchronously and L filters will be available for
each user at time t: {θ1,t, . . . ,θL,t}. The Symbiotic NB (SNB) meta-model

6


spam probability is given by:

p(s|xj,D′u) = α ·p(s|D′u)
∏L
i=1 p(θi,t|s,D′u)

p(θi,t|c,D′u) =
1

σi,c
√

2π
exp(−(p(s|xj,θi,t,D

′
u)−µi,c)2

2σ2
i,c

)
(4)

where D′u is the SNB training set and p(s|xj,θi,t,D′u) is the probability given
by the filter θi,t, as computed in Equation 2. To reduce memory and compu-
tational requirements, we allow that D′u ⊆ Du, where M = |D′u| denotes the
most recent messages from u mailbox. It should be noted that any token from
xj that is not considered by θi,t will simply be discarded by the filter from
user i. Similarly, any input attribute from θi,t that is not included in xj will
be set to 0.

While sharing models is less sensitive than exchanging email messages, there
are still privacy issues to be considered. For instance, if user A has access to
the filter of user B, then A may feed a given token (or set of tokens) into the
model and thus know some probability that such token was classified by B as
spam or ham. Our privacy solution resides in an anonymous exchange of the
filters, which can occur under a centralized server or a Peer-to-Peer (P2P)-like
application.

Under the first option, all users register into a centralized and secure service.
This service could be implemented by large companies or email providers (e.g.
Gmail or Hotmail), when all emails are stored at a given server. For scalabil-
ity, user profiles could be defined (e.g. country or profession) and clustering
algorithms could be used to group users with similar interests. Another vari-
ant would be the definition of social networks, where users could choose their
“friends”. These systems could, for example, be implemented by social net-
working websites (e.g. Facebook or MySpace). Alternatively, when the mes-
sages are stored locally at the client side, the server would be responsible for
a blind exchange of the filters, using secure transfers (left of Figure 1). To
exchange the filters, a standard format should be adopted, such as the Pre-
dictive Model Markup Language (PMML) (Grossman et al., 2002), which is
compatible with a large number of data mining tools. In should be noted that
exchanging filters requires less communication costs. For example, a filter built
from a millions of emails can be described by a few hundreds or thousands
of bytes (depending on the filter algorithm used) (Garg et al., 2006). When
a new email is received, the user can easily compute the SF, as a copy of all
filters is available locally.

The above systems have the disadvantage of having to depend and trust in
a centralized service. As an alternative, the use of a P2P-like distribution
scheme is also possible to be adopted (right of Figure 1). Under this solution
all peers may donate, store and fetch filters among each other. This approach

7


anonymousanonymous
filter filter

user A

user C

user B

anonymous
filter filter

anonymous

filter A filter B

user C

user Buser A server
secure

Fig. 1. Anonymous exchange of filters by using a secure server (left) or a P2P-like
application (right)

could be implemented as a trusted and secure plug-in of an email client (e.g.
Thunderbird). The filter sharing process among the peers could work similarly
to the explained in the secure server scheme (i.e. using PMML). Further pri-
vacy increase could be achieved if each individual does not know the symbiotic
group composition. However, for some scenarios it may also be attractive that
the symbiotic group composition could be assessed by all the participants in
a given group. Yet, even in such scenarios it might be very difficult to “guess”
who created each particular model, as in SF there will be typically a large
number of users that dynamically may join or leave the collaboration.

4 Experiments and Results

4.1 Spam data

To evaluate SF, ideally there should be real mailboxes collected from distinct
users (possibly from a social network) during a given time period. Yet, due
to logistic and privacy issues, it is quite difficult to obtain such data (in par-
ticular personal messages) and make it public. Hence, we will use a synthetic
mixture of real spam and ham messages, in a strategy similar to what has
been proposed in (Metsis et al., 2006; Zhong et al., 2008). The ham messages
will come from the Enron email collection, which was originally used for a
global evaluation of filters by merging all Enron user messages into a single
corpus. In particular, we will use the cleaned-up form provided by (Becker-
mann et al., 2004) and we will select the five Enron employees with the largest
mailboxes collected during the same time period: kaminski-v (kam), farmer-d
(far), beck-s (bec), lokay-m (lok) and kitchen-l (kit). Since these employees
worked at the same organization, it is reasonable to assume that they would
know each other, i.e. belong to a social network. We will also use the spam
collection of Bruce Guenter (http://untroubled.org/spam/), which is based in
spam traps (i.e. fake emails published in the Web), during the years of 2006
and 2007 (our dataset was built in 2008). Only messages with Latin character
sets were selected, because the ham messages use this type of character coding
and non-Latin mails would be easy to detect. Also, since this collection con-

8


tains several copies of the same messages (due to the use of multiple traps),
we removed duplicates by comparing MD5 signatures of the body messages.

We propose a mixture algorithm that is based on the time that each message
was received (date field, using the GMT time zone). By preserving the tem-
poral order of the emails, we believe that a more realistic mixture is achieved
than the sampling procedures adopted in (Metsis et al., 2006) or (Zhong et al.,
2008). Since the Enron data is from a previous period (see Table 1), we first
added 6 years to the date field of all ham messages. Let St denote a spam mes-
sage received at time t, Si,f = (Sti,Sti+1, . . . ,Stf ) the time ordered sequence
of the Bruce Guenter spam, Hu,i,f and Su,i,f the sequences of ham and spam
messages for user u from time ti to tf . For a given time period t ∈ (ti, . . . , tf ),
the algorithm randomly selects |S′i,j| spam messages from Si,j. Then, Su,i,j is
set by sampling messages from S′i,j with a probability of P for each message
selection. The size of S′i,j (cardinality) is given by:

|S′i,j| =
R·

∑L
i=1
|Hu,i,j|

P ·L
(5)

where L denotes the total number of users available at the time period and
R is the overall (i.e. including all user and time data) spam/ham ratio. Since
the time periods are different for each user (Table 1), four time sequences (i.e.
ti and tj values) were used by the algorithm (Figure 2).

Table 1
The S-Enron corpus main characteristics

user ham spam time spam

size size period /ham

kam 4363 2827 [12/05,05/07] 0.6

far 3294 2844 [12/05,05/07] 0.9

bec 1965 2763 [01/06,05/07] 1.4

lok 1455 2202 [06/06,05/07] 1.5

kit 789 623 [02/07,05/07] 0.8

The mixture is affected by the R and P parameters. Since a high number of ex-
periments is addressed in this work, we will fix these parameters to reasonable
values. While the global spam/ham ratio is R = 1, the individual ratios range
from 0.6 to 1.5. Also, the spam/ham ratios fluctuate through time (as shown
in Figure 3). On the other hand, the probability of spam selection affects the
percentage of common spam between users. If two users have similar profiles
(e.g. email exposure), then they should receive similar spam. We assume that
this scenario is expected for the Enron employees and thus set P = 0.5. Under
this setup and for a given time period, any 2 users will receive around 50% of

9


Date (Year)
2006 2007

kam

far

bec

lok

kit

L=2 L=3 L=4 L=5

A
B

C

Fig. 2. Time view of the S-Enron mailboxes

similar spam, 3 users will share around 25% of spam and so on. The result-
ing corpus is named S-Enron and it is publicly available in its raw form at:
http://www3.dsi.uminho.pt/pcortez/S-Enron.

0 10 20 30 40 50 60

0
.5

1
.0

1
.5

2
.0

batch (x100 emails)

●

●

●

●

●
●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●
●

●

●

●

●

● ●

●

●

●

● ●

●

●
●

●

●

●

●

●

●

●

●

●

●

●

●
●

●
●

●

●
●

●

●

●

●

Fig. 3. Evolution of the spam/ham ratio for the user far and scenario C

4.2 Evaluation

As explained in Section 3.2, spam detection suffers from concept drift (Fdez-
Riverola et al., 2007). Features such as the amount of spam received, the
ham/spam ratio and even the content itself evolve through time. Hence, to
evaluate spam filters, we will adopt the more realistic incremental retraining
evaluation procedure, which periodically trains and tests filters. Under this
procedure, a mailbox is split into batches b1, . . . ,bn of K adjacent messages
(|bn| may be less than K) (Metsis et al., 2006). For i ∈ {1, . . . ,n − 1}, the
filter is trained with Du = b1 ∪ . . .∪bi and tested with the messages from bi+1
(Figure 4). This procedure is more realistic than the simple 50% train/test
split adopted in (Zhong et al., 2008).

The predicted class for a probabilistic filter is given by s if p(s|xj,Du) > D,
where D ∈ [0.0, 1.0] is a decision threshold. For a given D and test set, it is

10


K
|b  |

1

2

3
test set

test set

test set

training set

tr. set

training set

n−1
training set ts. set

......

runs

b b b
1 2 3

batch
b

n
...

mailbox

KKK n

Fig. 4. The incremental retraining procedure

possible to compute the true (TPR) and false (FPR) positive rates:

TPR = TP/(TP + FN)

FPR = FP/(TN + FP)
(6)

where TP , FP , TN and FN denote the number of true positives, false posi-
tives, true negatives and false negatives. The receiver operating characteristic
(ROC) curve shows the performance of a two class classifier across the range
of possible threshold (D) values, plotting FPR (x-axis) versus TPR (y-axis)
(Fawcett, 2006). The global accuracy is given by the area under the curve
(AUC=

∫ 1
0 ROCdD). A random classifier will have an AUC of 0.5, while the

ideal value would be 1.0. Since the cost of losing normal email (FP) is much
higher than receiving spam (FN), D is usually set to favor points in the low
false-positive region of the ROC. Thus, we will also adopt the Normalized
AUC (NAUC), which is the AUC area in the section FPR ≤ r, divided by r
(Chang et al., 2008). Typically, the target FPR rate (r) is close to 0.0. We
will also compute the relative Gain of the symbiotic performance over the local
filter: Gain= ξSNB/ξNB − 1, where ξ is the evaluation metric (i.e. AUC or
NAUC) and SNB and NB are the symbiotic and individual filters. With the
incremental retraining procedure, one ROC will be computed for each bi+1
batch and the overall results will be presented by adopting the vertical aver-
aging ROC (i.e. according to the FPR axis) algorithm presented in (Fawcett,
2006). Statistical confidence will be given by a paired t-student test, at the
95% confidence level (Flexer, 1996).

4.3 Experimental setup

All experiments were conducted in the R environment, an open source and
high-level programming language for data analysis (R Development Core Team,
2008). In particular, the NB algorithm described in Section 3.1 is implemented
by the naiveBayes function of the e1071 R package, while the text prepro-
cessing uses several functions from the tm package (Feinerer et al., 2008).

11


During the all experiments, we set K = 100 (a reasonable value also adopted
in (Metsis et al., 2006)). For the SNB, we used a similar number for the
hierarchical training set size, i.e. M = 100. This small value has the advantage
of reducing memory requirements (the user only needs 100 messages in his
mailbox) and some initial experiments with larger values of M revealed no
gain in performance. The AUC, NAUC and Gain values will be shown in
percentage. All NAUC values will be computed with r = 0.01 (1%).

4.4 Fixed symbiotic group

Two distinct scenarios will be tested, according to the time periods A and B
of Figure 2. Given the S-ENRON corpus characteristics, in this work we will
explore a small number of fixed symbiotic users: L = 5 for A and L = 3 for B.

The incremental retraining method (Section 4.2) was applied to both sce-
narios, by considering all messages within the corresponding time period.
Thus, the number of kam, far and bec batches (n) will be different for A
and B. The obtained results are summarized as the mean of all test sets (bi+1,
i ∈ {1, . . . ,n − 1}) and shown in Table 2 and Figure 5. The best values are
in bold, while underline denotes a statistical significance (i.e. p-value<0.05).
In Figure 5, bars denote 95% t-student confidence intervals and only the most
interesting region of FPR is shown for the ROC curves.

For the first scenario (A), the symbiotic strategy outperforms the local filter
for all users and metrics, except for lok and NAUC. A similar behavior occurs
for the B setting, where SNB is better than NB except for kam and AUC.
As false positives have higher costs in spam detection, the NAUC results are
particularly important. Thus, it is interesting to notice that there is a high
AUC improvement (i.e. Gain) given by the symbiotic method in several cases
(kam, far and kit for A and bec for B).

To demonstrate the SNB dynamics, Figure 6 shows the first two consecutive
graphs of the SNB input importances under scenario II. Each edge represents
the influence (in %) of the NB filter (the origin) in the symbiotic model (the
destination), as measured by applying a sensitivity analysis procedure (Kewley
et al., 2000). The text in bold (e.g. b2) denotes the last batch used to train the
NB classifier. For example, the first SNB model of user far (left graph) uses a
NB filter from kam that was trained using 200 messages (Dkam = b1 ∪ b2).

12


Table 2
The results for scenarios A and B
scenario A AUC NAUC

user n NB SNB Gain NB SNB Gain

kam 16 62.1 95.6 54 0.8 74.4 9804

far 13 93.5 95.1 2 15.6 58.6 276

bec 9 91.5 94.0 3 53.5 65.8 23

lok 9 91.4 95.2 4 79.9 75.6 -3

kit 15 74.6 95.3 28 18.2 71.7 294

scenario B AUC NAUC

user n NB SNB Gain NB SNB Gain

kam 70 94.7 94.3 -0.4 54.0 73.0 35

far 60 89.4 91.7 2.6 54.3 66.9 23

bec 48 83.5 93.4 11.9 23.8 74.3 212

4.5 Incremental symbiotic group

A more realistic scheme is adopted for the time period C, where users join the
symbiotic group in an incremental fashion, at different time stages according
to Figure 2. Thus, L will grow from 2 to 5. The results are presented in Table
3. As expected, the symbiotic strategy (SNB) clearly favors newcomers, which
have small mailboxes and thus benefit from the collaboration. In effect, the
NAUC differences are quite large, such as in bec and lok for L = 4 and L = 5;
and kit for L = 5. For demonstration purposes, the ROC curves are plotted
for kam, far and bec, when L = 3 and L = 5 (Figure 7). However, the results
show that even “veteran” users benefit from the symbiotic relation when the
number of users grow. For instance, the kam and far NAUC results for L = 5
improve, with a gains of 28% and 16%, respectively.

4.6 Contamination attacks

We will repeat the experiments of Section 4.4, by considering only scenario
A and user bec to test the effects of mailbox contamination. The dictionary
assault is simulated by replacing the first 10 spam emails at batch 4 from
bec by the GNU aspell (http://aspell.net/) English dictionary (version 6.0,
with 138599 tokens). In Figure 8, gray lines denote the behavior of NB and
SNB without the attack (i.e. results of Section 4.4), black lines show the per-
formance under the attack and the dot-dashed vertical line shows when the

13


0.00 0.02 0.04 0.06 0.08 0.10

0
.0

0
.2

0
.4

0
.6

0
.8

1
.0

kam (A)

●
●

● ●
● ●

● ●
● ●

●

● SNB
NB

0.00 0.02 0.04 0.06 0.08 0.10

0
.0

0
.2

0
.4

0
.6

0
.8

1
.0

far (A)

●

●

●

●

● ●
●

●
● ●

●

0.00 0.02 0.04 0.06 0.08 0.10

0
.0

0
.2

0
.4

0
.6

0
.8

1
.0

bec (A)

●
●

●
●

●

●
●

●

●
●

●

0.00 0.02 0.04 0.06 0.08 0.10

0
.0

0
.2

0
.4

0
.6

0
.8

1
.0

lok (A)

●
●

●
●

● ●
●

●
● ●

●

0.00 0.02 0.04 0.06 0.08 0.10

0
.0

0
.2

0
.4

0
.6

0
.8

1
.0

kit (A)

●
●

●

●
●

●
●

● ●
● ●

0.00 0.02 0.04 0.06 0.08 0.10

0
.0

0
.2

0
.4

0
.6

0
.8

1
.0

kam (B)

●

●

●
●

●
● ●

● ● ●
●

0.00 0.02 0.04 0.06 0.08 0.10

0
.0

0
.2

0
.4

0
.6

0
.8

1
.0

far (B)

●

●

●
●

● ●
● ●

● ● ●

0.00 0.02 0.04 0.06 0.08 0.10

0
.0

0
.2

0
.4

0
.6

0
.8

1
.0

bec (B)

● ●
●

●
● ●

● ● ●
● ●

Fig. 5. ROC curves for scenarios A and B

bec

kam

far

bec

kam

far

b
1

b
1

b
1

b
2

b
3

b
3

b
2

b
2

b
2

b
2

 31% 33%  33%

 31%

41%

33%

 26%

100%

 32%

 34%

 34%

 33%

33%
 33%

 33%

b

 38%

1

b
2

b
2

b
1

b
2

b
1

Fig. 6. Examples of SNB input importances for B

attack starts. The filters of bec are not trained with contaminated messages at
batch 4, yet these messages appear in the test set and thus the NB and SNB
performances suffer a moderate decay. The true effect of the attack is only
visible at batch 5, where local CBF is highly affected. Only 10 messages were
replaced and yet the filter detection capability is reduced to a random classi-
fier (since AUC=0.5) through all remaining batches. In contrast, the symbiotic

14


0.00 0.02 0.04 0.06 0.08 0.10

0
.0

0
.2

0
.4

0
.6

0
.8

1
.0

kam (L=3)

●
●

●
●

●
●

●
● ● ●

●

0.00 0.02 0.04 0.06 0.08 0.10

0
.0

0
.2

0
.4

0
.6

0
.8

1
.0

kam (L=5)

●

●

●

● ●
●

●
● ● ●

●

0.00 0.02 0.04 0.06 0.08 0.10

0
.0

0
.2

0
.4

0
.6

0
.8

1
.0

far (L=3)

●
●

●
●

●
●

● ● ●
● ●

0.00 0.02 0.04 0.06 0.08 0.10

0
.0

0
.2

0
.4

0
.6

0
.8

1
.0

far (L=5)

●

●

●
● ●

● ●
● ●

● ●

0.00 0.02 0.04 0.06 0.08 0.10

0
.0

0
.2

0
.4

0
.6

0
.8

1
.0

bec (L=3)

●
●

●
●

● ●
● ● ●

● ●

0.00 0.02 0.04 0.06 0.08 0.10

0
.0

0
.2

0
.4

0
.6

0
.8

1
.0

bec (L=5)

●

●

●

● ● ●

● ● ● ●
●

Fig. 7. ROC curves for users kam, far and bec (L=3 at left and L=5 at right)

method is only initially affected, since as time goes by the performance gets
closer to the no attack scenario. Also, the remaining symbiotic users maintain
their spam detection capabilities, as shown by the far results, which is a rep-
resentative example. This behavior is explained by the SNB algorithm, which
simply discards a given filter if it does not help to predict the recent past M
messages of the user. Hence, this experiment shows that SF is robust also to
saboteurs, i.e. if a particular user intentionally feeds the group with a random
or bad filter then this filter will be simply ignored.

0
.5

0
.6

0
.7

0
.8

0
.9

1
.0

bec

test set batch (x100 e−mails)

A
U

C

2 3 4 5 6 7 8 9

●
●

●

●
●

●

●

●

●
●

●

●

●

●

●

●

●

●

SNB
NB
SNB (attack)
NB (attack)

0
.5

0
.6

0
.7

0
.8

0
.9

1
.0

far

test set batch (x100 e−mails)

A
U

C

4 5 6 7 8 9 10 11

●

●

●

●

● ●
●

●

●

● ●

●

● ●
●

●

●

●

SNB
SNB (attack)

Fig. 8. The effect of the dictionary attack

The dictionary attack can be solved by performing a rollback (i.e. returning

15


Table 3
The results for scenario C
L=2 AUC NAUC

user n NB SNB Gain NB SNB Gain

kam 4 94.4 88.1 -3 30.1 48.4 61

far 3 89.9 82.1 -9 60.7 54.3 -11

L=3 AUC NAUC

user n NB SNB Gain NB SNB Gain

kam 17 91.5 87.4 -5 47.5 59.5 25

far 16 86.6 87.0 0.4 44.6 59.1 33

bec 14 80.4 87.6 9 58.2 65.8 13

L=4 AUC NAUC

user n NB SNB Gain NB SNB Gain

kam 39 95.4 96.8 1.5 75.0 75.6 0.8

far 32 88.7 94.7 6.8 53.1 72.5 37

bec 28 84.5 96.9 15 13.0 78.6 504

lok 29 73.1 96.2 32 6.1 71.5 1076

L=5 AUC NAUC

user n NB SNB Gain NB SNB Gain

kam 15 98.4 98.0 -0.5 61.3 78.2 28

far 13 89.6 95.5 6.5 61.8 71.9 16

bec 8 85.2 97.4 14 1.8 70.3 3848

lok 9 63.6 97.3 53 0.7 78.3 10992

kit 15 74.6 97.9 31 18.2 71.8 295

to the previous filter) or using the RONI defense (Nelson et al., 2008), which
rejects training examples that have a large negative impact in spam detection.
Yet, focused assaults are more difficult to prevent and finding a defense is
still an open problem (Nelson et al., 2008). We believe SF is an interesting
solution due to the same rationale presented for the dictionary aggression, i.e.
the combination of multiple filters should overcome the limitations of a single
model contamination.

A new set of experiments was devised, using again scenario A and bec mailbox.
During a given run, a legitimate message was randomly selected, from batches
6 to 9, as the target text. We assume that the attacker is confident about the

16


target content and thus can guess 50% of the target words. At batch 4, 10
spam emails were replaced by the contaminated messages. We repeated this
procedure during 20 runs. The effect of this attack on spam is minimal and thus
we will only show the effect on the target ham emails. Figure 9 plots the filter
spam probability (y-axis) for each target message. The obtained probability
for each run is plotted along the x-axis (total of 20 runs). Since all target
messages are ham, a robust filter should present low spam probabilities, near
the zero horizontal axis. The results show that local filter (NB) is much more
vulnerable to focused attacks than the symbiotic strategy (SNB). The spam
probability mean values of NB and SNB are 0.69 and 0.32 (the differences
are statistically significant). For example, when using a decision threshold of
D = 0.5, 14 (of 20) messages are classified by NB as spam, while this number
lowers to 6 for SNB. Even if D is raised to 0.999, NB predicts 13 spam emails
and SNB only detects 5.

0
.0

0
.2

0
.4

0
.6

0
.8

1
.0

runs

sp
a

m
 p

ro
b

a
b

ili
ty

 f
o

r 
ta

rg
e

t 
te

xt
 (

h
a

m
)

1 5 10 15 20

● ●

●

● ● ●

● ●

● ● ●

●

●

●

●

● ●

●

● ●

● SNB
NB

Fig. 9. The effect of the focused attack

5 Conclusions

Email has long been one of the most important and widely used Internet ap-
plications. However, spam emerged quickly after email itself and nowadays it
accounts for the majority of the email traffic. Thus, several anti-spam tech-
niques were developed. This paper proposes a novel Symbiotic Filtering (SF)
approach, which combines several features from Collaborative Filtering (CF)
and Content-Based filtering (CBF). It takes the advantage of use of social
networks, where users with the same or related interests have the opportunity
to form mutually beneficial alliances with the aim of enhancing spam detec-

17


tion techniques. Instead of sharing messages or digests, the idea is to share
filters. To combine the individual probabilities, the proposed solution uses the
concept of hierarchical learning, where a meta-learner is dynamically trained
to improve its accuracy.

After describing the proposed model, we compared the effectiveness of the SF
versus CBF, using for that purpose a realistic mixture of real spam and ham
messages. Promising results were obtained by the SF, which outperformed the
local filtering for a small number of users (from 3 to 5). Moreover, we have
shown that SF is more robust to word attacks (e.g. dictionary or focused as-
saults). Furthermore, we proposed several deployment scenarios for SF, under
a secure server or P2P settings.

There is a continuous race between spammers and anti-spammers and a local
classifier that is currently perfect will be eventually defeated. We believe that
a stronger protection is achieved by adopting a dynamic cooperation of filters
from distinct users. In future work, we intent to apply SF to other personalized
filtering scenarios, such as Web page blocking (e.g. sensitive content). Also,
we will explore scalability issues. Under a large group, this could be achieved
by adopting user selection algorithms (e.g. clustering user profiles).

Acknowledgments

This work is supported by FCT grant PTDC/EIA/64541/2006.

References

Alocilja, E. (2000). Principles of Biosystems Engineering. Erudition Books,
Massachusetts, USA.

Beckermann, R., McCallum, A., and Huang, G. (2004). Automatic catego-
rization of email into folders: benchmark experiments on Enron and SRI
corpora. Ir-418, University of Massachusetts Amherst.

Chang, M., Yih, W., and Meek, C. (2008). Partitioned Logistic Regression
for Spam Filtering. In 14th ACM SIGKDD int. conference on Knowledge
discovery and data mining, pages 97–105.

Cheng, V. and Li, C. (2006). Personalized Spam Filtering with Semi-
supervised Classifier Ensemble. In IEEE/WIC/ACM International Con-
ference on Web Intelligence.

Fawcett, T. (2006). An introduction to ROC analysis. Pattern recognition
letters, 27(8):861–874.

Fdez-Riverola, F., Iglesias, E. L., Dı́az, F., Méndez, J. R., and Corchado, J. M.

18


(2007). Applying lazy learning algorithms to tackle concept drift in spam
filtering. Expert Systems with Applications, 33(1):36–48.

Feinerer, I., Hornik, K., and Meyer, D. (2008). Text Mining Infrastructure in
R. Journal of Statistical Software, 25(1-54).

Flexer, A. (1996). Statistical Evaluation of Neural Networks Experiments:
Minimum Requirements and Current Practice. In Proceedings of the 13th
European Meeting on Cybernetics and Systems Research, volume 2, pages
1005–1008, Vienna, Austria.

Garg, A., Battiti, R., and Cascella, R. (2006). May I borrow your filter?
Exchanging filters to combat spam in a community. In Advanced Infor-
mation Networking and Applications, 2006. AINA 2006. 20th International
Conference on, volume 2.

Garriss, S., Kaminsky, M., Freedman, M., Karp, B., Mazières, D., and Yu,
H. (2006). RE: reliable email. In Proceedings of the 3rd conference on
Networked Systems Design and Implementation (NSDI), pages 297–310, San
Jose, CA. USENIX Association Berkeley, CA, USA.

Gray, A. and Haahr, M. (2004). Personalised, Collaborative Spam Filtering.
In 1st Conference on E-Mail and Anti-Spam CEAS.

Grossman, R., Hornick, M., and Meyer, G. (2002). Data Mining Standards
Initiatives. Communications of ACM, 45(8):59–61.

Guzella, T. and Caminhas, W. (2009). A review of machine learning ap-
proaches to Spam filtering. Expert Systems with Applications, 36:10206–
10222.

Hershkop, S. and Stolfo, S. (2005). Combining Email Models for False Positive
Reduction. In 11th ACM SIGKDD int. conference on Knowledge discovery
and data mining, pages 21–24.

Kanich, C., Kreibich, C., Levchenko, K., Enright, B., Voelker, G., Paxson, V.,
and Savage, S. (2008). Spamalytics: An Empirical Analysis of Spam Mar-
keting Conversion. In Computer and Communications Security Conference
(CCS’08), pages 27–31. ACM.

Kewley, R., Embrechts, M., and Breneman, C. (2000). Data Strip Mining for
the Virtual Design of Pharmaceuticals with Neural Networks. IEEE Trans
Neural Networks, 11(3):668–679.

Kong, J., Boykin, P., Rezaei, B., Sarshar, N., Roychowdhury, V., Rothenstein,
B., Damian, I., Paramonov, P., Lyuksyutov, S., Barrat, A., et al. (2005).
Let your cyberalter ego share information and manage spam. eprint arXiv:
physics/0504026.

Kosmopoulos, A., Paliouras, G., and Androutsopoulos, I. (2008). Adaptive
Spam Filtering Using Only Naive Bayes Text Classifiers. In CEAS 2008 -
Fifth Conference on Email and Anti-Spam.

Lai, G., Chen, C., Laih, C., and Chen, T. (2009). A collaborative anti-spam
system. Expert Systems with Applications, 36:6645–6653.

Loder, T., Van Alstyne, M., and Wash, R. (2004). An economic answer to
unsolicited communication. In proceedings of the 5th ACM conference on
Electronic Commerce, pages 40–50. ACM New York, NY, USA.

19


MAAWG (2009). Email Metrics Program: The Network Operators’ Perspec-
tive. Report #10 – third and fourth quarter 2008, Messaging Anti-Abuse
Working Group, S. Francisco CA, USA.

Méndez, J., Cid, I., Glez-Peña, D., Rocha, M., and Fdez-Riverola, F. (2008).
A Comparative Impact Study of Attribute Selection Techniques on Näıve
Bayes Spam Filters. In Springer, editor, 8th Industrial Conference on Data
Mining, volume LNAI 5077, pages 213–227.

Metsis, V., Androutsopoulos, I., and Paliouras, G. (2006). Spam Filtering
with Naive Bayes – Which Naive Bayes? In Third Conference on Email and
Anti-Spam (CEAS), pages 125–134.

Nelson, B., Barreno, M., Chi, F., Joseph, A., Rubinstein, B., Saini, U., Sutton,
C., Tygar, J., and Xia, K. (2008). Exploiting Machine Learning to Subvert
Your Spam Filter. In 1st Usenix Workshop on Large-Scale Exploits and
Emergent Threats, pages 1–9. ACM Press.

R Development Core Team (2008). R: A language and environment for statis-
tical computing. R Foundation for Statistical Computing, Vienna, Austria,
ISBN 3-900051-00-3, http://www.R-project.org.

Ramachandran, A. and Feamster, N. (2006). Understanding the Network-
Level Behavior of Spammers. In ACM, editor, SIGCOMM’06, pages 291–
302.

Walfish, M., Zamfirescu, J., Balakrishnan, H., Karger, D., and Shenker, S.
(2006). Distributed Quota Enforcement for Spam Control. In Proceedings
of the 3rd conf. on Networked Systems Design and Implementation (NSDI),
pages 281–296, San Jose, CA. USENIX Association Berkeley.

Wong, M. (2005). Sender authentication: What to do. White Paper, July.
Yu, K., Schwaighofer, A., Tresp, V., Ma, W., and Zhang, H. (2003). Collab-

orative Ensemble Learning: Combining Collaborative and Content-Based
Information Filtering via Hierarchical Bayes. In 19th Int. Conf. on Uncer-
tainty in Artificial Intelligence (UAI), pages 353–360. ACM.

Zhong, Z., Ramaswamy, L., and Li, K. (2008). ALPACAS: A Large-scale
Privacy-Aware Collaborative Anti-spam System. In IEEE INFOCOM,
pages 556–564.

20