Estimating translation probabilities for social tag suggestion

Xinxiong Chen, Zhiyuan Liu, Maosong Sun
Department of Computer Science and Technology, State Key Lab on Intelligent Technology and Systems, National Lab for Information Science and Technology, Tsinghua University, Beijing 100084, China

Article history: Available online 12 October 2014.
Keywords: Natural language processing; Tag suggestion; Translation model; Word alignment model; Pointwise mutual information

Abstract

The task of social tag suggestion is to recommend tags automatically for a user when he or she wants to annotate an online resource. In this study, we focus on how to make use of the text description of a resource to suggest tags. It is intuitive to select significant words from the text description of a resource as the suggested tags. However, since users can annotate a resource with arbitrary tags, tag suggestion suffers from the vocabulary gap issue: the appropriate tags of a resource may be statistically insignificant in, or even absent from, the corresponding description. To solve the vocabulary gap issue, in this paper we present a new perspective on social tag suggestion. By considering both the description and the tags as summaries of a given resource composed in two languages, tag suggestion can be regarded as a translation from description to tags. We propose two methods to estimate the translation probabilities between words in descriptions and tags. Based on the translation probabilities estimated from a large collection of description–tags pairs, we can suggest tags according to the words in a resource description. Experiments on real-world datasets indicate that our methods outperform other methods in precision, recall and F-measure. Moreover, our methods are relatively simple and efficient, which makes them practical for Web applications.

1. Introduction

In Web 2.0, Web users often use tags to collect and share online resources such as Web pages, photos, videos, movies and books. As an example, we consider a social tagging system for books. Table 1 presents a book entry annotated with several tags by multiple users. (The original record was obtained from the book review website Douban, www.douban.com, in Chinese; we translate it into English here for readability.) At the top of Table 1 we list the title and a short introduction of the book "The Count of Monte Cristo". The bottom of Table 1 shows the annotated tags, each followed by a number in brackets, which is the total number of users who used the tag to annotate the book.
Table 1. An example of social tagging. The number in brackets after each tag is the total count of users that annotated the book with the tag.
  Description — Title: The Count of Monte Cristo. Intro: The Count of Monte Cristo is one of the most popular fictions by Alexandre Dumas. The writing of the work was completed in 1844. ...
  Annotation — Dumas (2748), Count of Monte Cristo (2716), foreign literature (1813), novel (1345), France (1096), classic (1062), revenge (913), famous book (759), ...

As the tags of online resources are annotated collaboratively by multiple users, we also refer to these tags as social tags. For a resource, we refer to the additional information, such as the title and the introduction of a book, as the description, and to the user-annotated social tags as the annotation. The task of social tag suggestion is to automatically recommend tags for a user when he or she wants to annotate a resource. Social tag suggestion, as a crucial component of social tagging systems, can help users annotate resources. Moreover, social tag suggestion is usually considered equivalent to modeling social tagging behavior, which plays an increasingly important role in social computing and information retrieval.

Most online resources have descriptions, which usually contain abundant information about the resources (Liu, Chen, & Sun, 2011). For example, on a book review website, each book entry contains a title, the author(s) and an introduction of the book. Thus, a number of researchers (Liu et al., 2011; Katakis, Tsoumakas, & Vlahavas, 2008; Mishne, 2006; Xu, Fu, Mao, & Su, 2006) propose to automatically suggest tags based on resource descriptions, which is collectively known as the content-based approach (Xu et al., 2006). In this study, we focus on how to make use of the text description of a resource to suggest tags. Note that besides descriptions, online resources may also have multimedia data (e.g., images, videos and audio files); a survey of multimedia tagging can be found in Wang, Ni, Hua, and Chua (2012).

One may think to suggest tags by selecting important words from descriptions. This approach is far from sufficient because descriptions and annotations use diverse vocabularies, which is typically referred to as the vocabulary gap problem (Liu et al., 2011). The vocabulary gap is usually reflected in two primary issues:

1. A portion of the tags in the annotation do appear in the corresponding description, but they may not be statistically significant.
2. A portion of the tags may not appear in the description at all.

Taking the book entry in Table 1 as an example, the tag "classic" was annotated by 1062 users but does not appear in the description; another appropriate tag, "famous book", also does not appear in the description. Many approaches have been proposed to reduce the vocabulary gap and find the semantic correspondence between descriptions and annotations.
Several researchers regard social tag suggestion as a classification problem by considering each tag as a category label (Fujimura, Fujimura, & Okuda, 2008; Heymann, Ramage, & Garcia-Molina, 2008; Katakis et al., 2008; Lee & Chun, 2007; Mishne, 2006; Ohkura, Kiyota, & Nakagawa, 2006). Various classifiers such as Naive Bayes, kNN and SVM have been explored, with words used as features. Some researchers propose to use the topic information shared between words and tags to suggest tags (Iwata, Yamada, & Ueda, 2009; Si, Liu, & Sun, 2010).

In this paper, we propose a new perspective on social tag suggestion to solve the vocabulary gap problem. By regarding both the description and the annotation as parallel summaries of a resource, we build a translation model to estimate the translation probabilities between the words in descriptions and the tags in annotations. The translation probabilities are able to capture the semantic relation between words and tags. After obtaining the translation probabilities, the tagging behavior associated with a resource can be regarded as a word translation process:

1. A user reads the resource description and understands its substance according to the important words in the description.
2. Triggered by the important words in the description, the user translates these words into corresponding tags and annotates the resource with these tags.

In Fig. 1, we provide a simple example to demonstrate the basic idea of using word translation for tag suggestion. In this figure, some words in the first sentence of the book description are translated to tags in the annotation. The translation is denoted with arrows from words or phrases in the description to tags in the annotation. For example, the phrase "Count of Monte Cristo" in the description is translated to two tags, Dumas and Count of Monte Cristo, and the word "fictions" is translated to the tag novel.

[Fig. 1. An example of the word translation method for suggesting tags from a given description.]

In this paper, we propose two methods to estimate translation probabilities between words in descriptions and tags in annotations. One method is the word alignment model (WAM) used in statistical machine translation (SMT) and the other is mutual information (MI) (Lin, 1998). It is straightforward to use WAM since it is the basic model in SMT for estimating translation probabilities. For training, WAM requires a collection of parallel documents, where the two sides of each document pair should have comparable length. In this paper we propose a sampling method to prepare length-balanced description–annotation pairs for WAM. Moreover, we propose a second method, MI, to estimate the translation probabilities. Mutual information is a popular measure that uses co-occurrence information to quantify the semantic similarity between two words (Lin, 1998).

Our model can solve the vocabulary gap problem because the translation probabilities estimated by WAM and MI are able to capture the semantic relation between words and tags. Thus we can suggest tags that are not statistically significant in, or do not even appear in, the descriptions, based on the translation probabilities.

We hypothesize that our approach is better than the methods mentioned above and conduct experiments to investigate the performance of our model on the task of tag suggestion. Experiments on real-world datasets indicate that our method outperforms other methods in precision, recall and F-measure.
Moreover, our method is relatively simple and efficient, as shown by its computational complexity, which makes it practical for Web applications.

The remainder of this paper is organized as follows. In Section 2 we briefly introduce some of the most commonly used methods for tag suggestion. Sections 3 and 4 introduce the details of our approach. Section 5 presents the experimental evaluation of our approach compared with other existing techniques. Finally, Section 6 concludes the paper.

2. Related work

Many researchers have built social tag suggestion systems based on collaborative filtering (CF) (Herlocker, Konstan, Borchers, & Riedl, 1999; Herlocker, Konstan, Terveen, & Riedl, 2004). CF is a widely used technique in recommender systems (Resnick & Varian, 1997). The collaboration-based methods typically base their suggestions on the tagging history of the given resource and user, without considering the resource description. Matrix Factorization (Rendle, Balby Marinho, Nanopoulos, & Schmidt-Thieme, 2009) and FolkRank (Jaschke, Marinho, Hotho, Schmidt-Thieme, & Stumme, 2008) are representative CF methods for social tag suggestion. Most of these methods suffer from the cold-start problem (Lam, Vu, Le, & Duong, 2008), i.e., they are not able to provide effective suggestions for resources that no one has annotated yet. The content-based approach for social tag suggestion ameliorates the cold-start problem of the collaboration-based approach by suggesting tags according to resource descriptions. Therefore, the content-based approach plays an important role in social tag suggestion, especially for new resources and new tagging systems without tagging history.

Several researchers regard social tag suggestion as a classification problem by considering each tag as a category label (Fujimura et al., 2008; Heymann et al., 2008; Katakis et al., 2008; Lee & Chun, 2007; Mishne, 2006; Ohkura et al., 2006). Various classifiers such as Naive Bayes, kNN, SVM and neural networks have been explored to solve the social tag suggestion problem. Two issues emerge from the classification-based methods: (1) the annotations provided by users are noisy, and the classification-based methods cannot handle this issue well (Liu et al., 2011); (2) the training cost and classification cost of many classification-based methods are usually proportional to the number of classification labels (Si et al., 2010). Thus, these methods may be inefficient for a real-world social tagging system, where hundreds of thousands of unique tags must be considered as classification labels.

Inspired by the popularity of latent topic models such as Latent Dirichlet Allocation (LDA) (Blei, Ng, & Jordan, 2003), various methods have been proposed to model tags using generative latent topic models. One intuitive approach assumes that both tags and words are generated from the same set of latent topics. By representing both tags and descriptions as distributions over latent topics, this approach suggests tags according to their likelihood given the description (Krestel, Fankhauser, & Nejdl, 2009; Si & Sun, 2009). Bundschus et al. (2009) proposed a joint latent topic model of users, words and tags. Iwata et al. (2009) proposed an LDA-based topic model, the Content Relevance Model (CRM), which aims to identify content-related tags for suggestion.
Empirical experiments revealed that CRM outperformed both classification methods and Corr-LDA (Blei & Jordan, 2003), a generative topic model for contents and annotations.

Most latent topic models have to pre-specify the number of topics before training. We can either use cross-validation to determine the optimal number of topics or employ infinite models that automatically adjust the number of topics during training. Both solutions are usually computationally expensive. More importantly, topic-based methods suggest tags by measuring the topical relevance between tags and resource descriptions. The latent topics are at the concept level (Liu, Huang, Zheng, & Sun, 2010), which is usually too coarse-grained to precisely suggest fine-grained tags such as named entities, e.g., the tags "Dumas" and "Count of Monte Cristo" in Table 1. To remedy this problem, Si et al. (2010) proposed a generative model, the Tag Allocation Model (TAM), which considers the words in descriptions as possible topics for generating tags.

Our model is also a content-based approach, so we compare it with several of the content-based approaches mentioned above (kNN, Naive Bayes, CRM and TAM) to investigate its performance.

3. Learning translation probabilities

Given a resource description, the ranking score of a tag can be calculated from two probabilities: (1) the translation probabilities between words and tags, and (2) the probabilities of words given the description. In this section, we present how two different methods, the word alignment model and mutual information, can be used to estimate the translation probabilities between words in descriptions and tags in annotations. We introduce how to calculate the probability of a word given the description and how to perform tag suggestion in Section 4.

First we give formal definitions of descriptions and tags. In this paper, the description of a resource is the textual information of the resource, including the title and the short introduction. The description is treated as a bag of words. We do not use stemming or lemmatization to preprocess the description; we only remove stop-words. The tags of a resource are labels, with count information, that describe the resource.

Before introducing the methods in detail, we introduce the notation. A resource is denoted as $r \in R$, where $R$ is the set of resources. Each resource in the training set contains a description and an annotation containing a set of tags. The description $d_r$ of resource $r$ can be regarded as a bag of words $\mathbf{w}_r = \{(w_i, c_{w_i})\}_{i=1}^{N_r}$, where $c_{w_i}$ is the count of word $w_i$ and $N_r$ is the number of unique words in $r$. The annotation $a_r$ of resource $r$ is represented as $\mathbf{t}_r = \{(t_i, c_{t_i})\}_{i=1}^{M_r}$, where $c_{t_i}$ is the count of tag $t_i$ and $M_r$ is the number of unique tags for $r$.

3.1. Word alignment model (WAM)-based approach

WAM, as a traditional machine translation method, requires a parallel training dataset consisting of a number of aligned sentence pairs. We assume the description and the annotation of a resource are written in two distinct languages. Thus, we prepare our parallel training dataset by pairing descriptions and annotations. Accordingly, the WAM-based approach contains two steps. First, given a collection of annotated resources, we prepare description–annotation pairs for the word alignment model.
Second, given a collection of description–annotation pairs, we adopt IBM Model-1, a widely used word alignment model, to learn the translation probabilities between words in descriptions and tags in annotations. We introduce the two steps separately below.

3.1.1. Preparing description–annotation pairs for WAM

In a typical tag suggestion system, the length of a resource description is usually limited to hundreds of words. In addition, it is common for popular resources to be annotated by multiple users with thousands of tags. For example, the tag Dumas was annotated by 2,748 users for the book in Table 1. At the other extreme, a resource may be annotated with only a few tags. We have to address the length imbalance between a resource description and its corresponding annotation for two reasons. (1) When the number of annotated tags is large, it is impractical to list all annotated tags on the annotation side of a description–annotation pair; moreover, the performance of word alignment models suffers from unbalanced lengths of aligned pairs in the parallel training dataset (Och & Ney, 2003). (2) The annotated tags may have different importance for the resource, and it would be unfair to treat these tags without distinction.

In this study, we propose a sampling method to prepare length-balanced description–annotation pairs for word alignment. The basic idea is to sample a bag of tags from the annotation according to tag weights, such that the generated bag of tags has a length comparable to the words in the description. For example, the length of the description in Table 1 is 54 and the number of unique tags is 21. If we list 54 words on one side and 21 tags on the other side, we will get a sentence pair with unbalanced length. Thus we propose to sample a bag of tags with a length comparable to the words, for example, 54 tags.

We consider two parameters when sampling tags. First, we have to select a tag weighting type for sampling. In this paper, we investigate two straightforward weighting types: tag frequency (TF_t) within the annotation, which captures local importance, and tag-frequency inverse-document-frequency (TF-IDF_t), which additionally considers global specificity. Given a resource r, TF_t and TF-IDF_t of the tag t are defined as

$$\mathrm{TF}_t = \frac{c_t}{\sum_{t'} c_{t'}}, \qquad \mathrm{TF\text{-}IDF}_t = \frac{c_t}{\sum_{t'} c_{t'}} \cdot \log\frac{|R|}{|\{r \in R : c_t > 0\}|} \qquad (1)$$

where $|\{r \in R : c_t > 0\}|$ is the number of resources that have been annotated with the tag t.

The other parameter is the length ratio between the description and the sampled annotation. We denote the ratio as $\delta = |\mathbf{w}_r| / |\mathbf{t}_r|$, where $|\mathbf{w}_r|$ is the number of words in the description and $|\mathbf{t}_r|$ is the number of tags in the sampled annotation. Taking the book in Table 1 as an example again, if the length of the description is 54 and the length ratio is 10/5, then we will sample 27 tags from the annotation.

3.1.2. Learning translation probabilities for WAM

After preparing aligned description–annotation pairs for WAM, we choose an appropriate word alignment model to obtain the translation probabilities between words in descriptions and tags in annotations. Note that the annotated tags $\mathbf{t}_r$ form a bag of labels with no position information; thus, we select IBM Model-1 (Brown, Pietra, Pietra, & Mercer, 1993) for training, which does not take word position information into account on either side of each aligned pair.
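Before turning to the alignment model itself, the pair-preparation step of Section 3.1.1 can be made concrete with the following minimal sketch. It is our illustration rather than the authors' implementation: the function names are hypothetical, and sampling with replacement is our assumption (the paper does not specify the sampling scheme).

import math
import random
from collections import Counter

def tag_tfidf(tag_counts, doc_freq, num_resources):
    """TF-IDF_t weights for the tags of one resource, following Eq. (1)."""
    total = sum(tag_counts.values())
    return {t: (c / total) * math.log(num_resources / doc_freq[t])
            for t, c in tag_counts.items()}

def sample_tag_side(description_words, tag_counts, doc_freq, num_resources,
                    delta=1.0, seed=0):
    """Sample a bag of |words| / delta tags, drawn with probability
    proportional to their TF-IDF_t weight (replacement is an assumption)."""
    weights = tag_tfidf(tag_counts, doc_freq, num_resources)
    tags, w = zip(*weights.items())
    k = max(1, round(len(description_words) / delta))
    return random.Random(seed).choices(tags, weights=w, k=k)

# Toy usage: a 10-word description paired with three annotated tags.
words = "the count of monte cristo is a novel about revenge".split()
tag_counts = Counter({"Dumas": 5, "novel": 3, "revenge": 2})
doc_freq = {"Dumas": 2, "novel": 40, "revenge": 5}  # resources annotated with each tag
print(sample_tag_side(words, tag_counts, doc_freq, num_resources=100, delta=2.0))

With delta = 2 and a 10-word description, the sketch draws 5 tags, mirroring the 54-word / 27-tag example above.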
Suppose the source language is the description and the target language is the annotation. We use word alignment models to learn the translation probabilities between words in descriptions and tags in annotations. In IBM Model-1, the relationship between the source sentence $\mathbf{w} = w_1^J$ and the target sentence $\mathbf{t} = t_1^I$ is connected via a hidden variable describing an alignment mapping from source position $j$ to target position $a_j$:

$$\Pr(w_1^J \mid t_1^I) = \sum_{a_1^J} \Pr(w_1^J, a_1^J \mid t_1^I). \qquad (2)$$

The alignment $a_1^J$ may also contain empty-word alignments $a_j = 0$, which align source words to the empty word. IBM Model-1 can be trained with the Expectation–Maximization (EM) algorithm in an unsupervised fashion, and yields the translation probabilities between the two vocabularies, i.e., $\Pr(w \mid t)$, where t is a tag and w is a word.

IBM Model-1 only produces one-to-many alignments from the source language to the target language. The learned model is thus asymmetric, i.e., the model learned from description–annotation pairs is different from the model learned from annotation–description pairs. We therefore train translation models in two directions: one regards descriptions as the source language and annotations as the target language, and the other reverses the direction of the pairs. We denote the first model as $\Pr_{d2a}$ and the latter as $\Pr_{a2d}$. We further define $\Pr(t \mid w)$ as the harmonic mean of the two models:

$$\Pr(t \mid w) \propto \left( \frac{\lambda}{\Pr_{d2a}(t \mid w)} + \frac{1 - \lambda}{\Pr_{a2d}(t \mid w)} \right)^{-1}, \qquad (3)$$

where $\lambda$ is the harmonic factor for combining the two models. When $\lambda = 1$ or $\lambda = 0$, it simply reduces to the single model $\Pr_{d2a}$ or $\Pr_{a2d}$, respectively. Finally, we obtain the translation probabilities $\Pr(t \mid w)$ from the WAM model, which can be regarded as the semantic relatedness between words and tags.

3.2. Mutual information (MI)-based approach

From Section 3.1.1, we can see that WAM has to use a sampling technique to prepare description–annotation pairs. Unlike WAM, MI only needs the co-occurrence information between words in descriptions and tags in annotations, and therefore does not require sampling.

We obtain translation probabilities using MI as follows. First, for each pair of a word w in the descriptions and a tag t in the annotations, we compute their mutual information score. Informally, mutual information divides the probability of observing w and t together in the same resource by the probabilities of observing w and t independently. The mutual information between a word w and a tag t is calculated as follows:

$$I(w; t) = \sum_{X_w \in \{0,1\}} \sum_{X_t \in \{0,1\}} \Pr(X_w, X_t) \log\frac{\Pr(X_w, X_t)}{\Pr(X_w)\Pr(X_t)} \qquad (4)$$

where $X_w$ and $X_t$ are binary variables indicating whether w or t is present or absent, respectively. The estimation of the probabilities $\Pr(X_w)$, $\Pr(X_t)$ and $\Pr(X_w, X_t)$ follows Karimzadehgan and Zhai (2010).

For a word w, we set a tag co-occurrence threshold $\gamma$ and discard the mutual information between word w and tag t if $c(X_w = 1, X_t = 1) \le \gamma$, where $c(X_w = 1, X_t = 1)$ is the number of resources that contain both w and t. We set the co-occurrence threshold $\gamma$ for two reasons. (1) We are usually less confident in translation probabilities estimated from infrequent word–tag pairs, which tend to be noisy and unimportant; with the threshold, we can filter out much of this noise. (2) Moreover, we can largely reduce the computational cost of estimating the mutual information scores. Of course, when $\gamma = 0$, no word–tag pairs are removed.
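As an illustration of Eq. (4) together with the co-occurrence threshold, the following sketch estimates I(w; t) from binary presence/absence counts over the resource collection. The simple relative-frequency probability estimates are our assumption (the paper follows Karimzadehgan and Zhai (2010) for the estimation details), and the function name is hypothetical.

import math
from collections import Counter

def mutual_information(pairs, gamma=0):
    """Estimate I(w; t) of Eq. (4) for each word-tag pair.

    `pairs` is a list of (set_of_words, set_of_tags), one per resource.
    Word-tag pairs co-occurring in at most `gamma` resources are dropped."""
    n = len(pairs)
    c_w, c_t, c_wt = Counter(), Counter(), Counter()
    for words, tags in pairs:
        c_w.update(words)
        c_t.update(tags)
        c_wt.update((w, t) for w in words for t in tags)

    mi = {}
    for (w, t), c11 in c_wt.items():
        if c11 <= gamma:
            continue  # apply the co-occurrence threshold
        score = 0.0
        for xw in (0, 1):
            for xt in (0, 1):
                # joint count of (X_w = xw, X_t = xt) over the n resources
                if xw and xt:
                    joint = c11
                elif xw:
                    joint = c_w[w] - c11
                elif xt:
                    joint = c_t[t] - c11
                else:
                    joint = n - c_w[w] - c_t[t] + c11
                if joint == 0:
                    continue  # a zero-probability cell contributes nothing
                p_joint = joint / n
                p_w = (c_w[w] if xw else n - c_w[w]) / n
                p_t = (c_t[t] if xt else n - c_t[t]) / n
                score += p_joint * math.log(p_joint / (p_w * p_t))
        mi[(w, t)] = score
    return mi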
Then, we normalize the mutual information scores to obtain a translation probability $\Pr(t \mid w)$ between a word w and a tag t:

$$\Pr(t \mid w) = \frac{I(w; t)}{\sum_{t'} I(w; t')} \qquad (5)$$

Intuitively, the probability is higher if the word w and the tag t are more likely to co-occur. For example, from the MI model we obtain the probabilities Pr{revenge | revenge} = 0.127 and Pr{martial arts | revenge} = 0.088, indicating that the word "revenge" is more likely to co-occur with the tag "revenge" than with the tag "martial arts".

3.3. Emphasizing the self-translation probability

Since a word does not always appear as a tag in an annotation, the approaches described in Sections 3.1 and 3.2 may underestimate the self-translation probabilities, i.e., it is possible that $\Pr(t \neq w \mid w) > \Pr(t = w \mid w)$. We propose to emphasize the self-translation probability for two reasons. (1) Underestimating self-translation probabilities may lead to a situation where a proper tag t that also appears frequently in the description receives a lower recommendation score (i.e., $\Pr(t \mid w = t)$) than other tags $t'$ (i.e., $\Pr(t' \mid w = t)$). (2) In some real tag suggestion systems, the tags that appear in the resource description are more likely to be selected by users for annotation. Hence, we introduce a parameter $\alpha$ to emphasize self-translation probabilities. This idea can be applied to adjust the translation probabilities from any translation model:

$$\Pr(t \mid w) = \begin{cases} \alpha + (1 - \alpha)\Pr(t = w \mid w) & t = w \\ (1 - \alpha)\Pr(t \mid w) & t \neq w \end{cases} \qquad (6)$$

When $\alpha = 1.0$, the method suggests tags simply according to their importance scores in the description, whereas when $\alpha = 0$, it does not emphasize the tags that appear in the description and suggests tags only according to the translation probabilities. For example, if $\alpha = 0.5$ and Pr{revenge | revenge} = 0.127, then we obtain the emphasized probability Pr{revenge | revenge} = 0.5635.

Table 2. Statistics of the two datasets. D, W, T, N̄_d and N̄_a are the number of resources, the vocabulary size of descriptions, the vocabulary size of tags, the average number of words per description and the average number of tags per resource, respectively.
  Data     D        W        T       N̄_d    N̄_a
  BOOK     70,000   174,748  46,150  211.6   3.5
  BIBTEX   158,924  91,277   50,847  5.8     2.7

4. Suggesting tags with translation probabilities

Having estimated the translation probabilities $\Pr(t \mid w)$ between words and tags in Section 3, we now show how to suggest tags. Given a resource description $d_r$, our model for tag suggestion is a three-step process:

1. Measure the importance score $\Pr(w \mid d_r)$ of each word w in the description $d_r$.
2. Compute the ranking score of tag t by

$$\Pr(t \mid d_r = \mathbf{w}_r) = \sum_{w \in \mathbf{w}_r} \Pr(t \mid w)\Pr(w \mid d_r) \qquad (7)$$

3. According to $\Pr(t \mid d_r)$, suggest the top-ranked tags to users.
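Putting the pieces together, the following is a minimal sketch (ours, not the authors' implementation) of the self-translation adjustment in Eq. (6) and the ranking score in Eq. (7). Here prob_t_given_w stands for the translation probabilities estimated by WAM or MI, and word_importance plays the role of Pr(w | d_r); both names are hypothetical.

from collections import defaultdict

def emphasize_self_translation(prob_t_given_w, alpha):
    """Adjust translation probabilities with the self-translation weight alpha (Eq. (6))."""
    adjusted = {}
    for w, dist in prob_t_given_w.items():
        new_dist = {t: (1 - alpha) * p for t, p in dist.items()}
        new_dist[w] = alpha + (1 - alpha) * dist.get(w, 0.0)  # the t = w case
        adjusted[w] = new_dist
    return adjusted

def rank_tags(word_importance, prob_t_given_w, top_k=10):
    """Score each tag by Eq. (7): Pr(t | d_r) = sum_w Pr(t | w) * Pr(w | d_r)."""
    scores = defaultdict(float)
    for w, p_w in word_importance.items():
        for t, p_t in prob_t_given_w.get(w, {}).items():
            scores[t] += p_t * p_w
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]

# Toy usage with the probabilities quoted in Section 3.2:
trans = {"revenge": {"revenge": 0.127, "martial arts": 0.088}}
trans = emphasize_self_translation(trans, alpha=0.5)  # Pr{revenge | revenge} becomes 0.5635
print(rank_tags({"revenge": 1.0}, trans, top_k=3))

With alpha = 0.5, the toy call reproduces the emphasized probability Pr{revenge | revenge} = 0.5635 quoted above.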
We also use the product of TF-IDFw and TextRank to weight terms, which poten- tially takes both global information and term relations into account. Finally we get the ranking of tags according to Eq. (7). Because WAM and MI can estimate the translation probabilities between words and tags, which can be regarded as the semantic relatedness between words and tags, we hypothesize that our methods are bet- ter than the baseline methods mentioned in Section 2 (kNN, Naive Bayes, CRM and TAM). In next section we run experiments to val- idate this hypothesis. 5. Experiments 5.1. Datasets and evaluation metrics Datasets In our experiments, we select two real-world datasets of social tagging systems that have diverse properties to evaluate our methods. In Table 2 we show the detailed statistical informa- tion of the two datasets. The first dataset, denoted as BOOK, was obtained from a popular Chinese book review website, www.douban.com, which contains the descriptions of books and the tags collaboratively annotated by users. The second dataset, denoted as BIBTEX, was obtained from an English online bibliography website, www.bibsonomy.org.2 The dataset contains the descriptions for academic papers and the tags annotated by users. As shown in Table 2, the average length of descriptions in the BIBTEX dataset is much shorter than the BOOK dataset. Moreover, the BIBTEX dataset does not provide how many times each label is used in a resource annotation. 3 In more detail, the training phase of WAM contains preparing parallel training dataset with OðD �NaÞ and learning translation probabilities using word alignment models with OðID �Nd �NaÞ, where I is the number of iterations for learning trigger probabilities, and �Na is the average number of tags for each description after 5.1.1. Evaluation metrics We use precision that measures exactness, recall that measures completeness and F-measure, which is the harmonic mean of pre- cision and recall, to evaluate the performance of tag suggestion methods. For a resource, we denote the original tags (gold stan- dard) as T a, the suggested tags as T s, and the correctly suggested tags as T s \ T a. Precision, recall and F-measure are defined as follows: p ¼ jT s \ T aj jT sj ; r ¼ jT s \ T aj jT aj ; F ¼ 2pr p þ r : ð8Þ The final evaluation scores are computed by micro-averaging (i.e., averaging on resources of test set). We performed 5-fold cross- validation for each method for both datasets. In the experiments, the number of suggested tags Mr ranges from 1 to 10. 2 The dataset can be obtained from http://www.kde.cs.uni-kassel.de/bibsonomy/ dumps. 5.2. Comparing results 5.2.1. Baseline methods We select four algorithms as the baselines for comparison: Naive Bayes (NB) (Manning, Raghavan, & Schtze, 2008), k nearest neighborhood (kNN) (Manning et al., 2008), Content Relevance (CRM) Model (Iwata et al., 2009) and Tag Allocation Model (TAM) (Si et al., 2010). Our methods are denoted as WAM and MI. NB and kNN are two representative classification methods. NB is a simple generative model, which models the probability of each tag t given a description d as PrðtjdÞ/ PrðtÞ Y w2d PrðwjtÞ: ð9Þ PrðtÞ is estimated by the frequency of the documents annotated with the tag t. PrðwjtÞ is estimated by the frequency of the word w in the resource descriptions annotated with the tag t. 
kNN is a widely used classification method for tag suggestion, which annotates a resource with tags according to the annotated tags of similar resources, where similarity is measured using vector space models (Manning et al., 2008).

CRM and TAM are selected to represent topic-based methods for tag suggestion. CRM is an LDA-based generative model, and the number of latent topics K is its key parameter. In the experiments, we evaluated the performance of CRM with different values of K, and here we only report the best result, obtained by setting K = 1,024. TAM is also a generative model, which considers the words in descriptions as topics that further generate tags for the resource. We set the parameters for TAM as in Si et al. (2010).

We also compare the complexity of these methods. We denote the number of training iterations in CRM, TAM and WAM as I, and the number of topics in CRM as K. For the training phase, the complexity of NB is O(D N̄_d N̄_a); kNN is O(1); TAM is O(I D N̄_d N̄_a); CRM is O(I K D N̄_d N̄_a); WAM is O(I D N̄_d N̄_a); and MI is O(D N̄_d N̄_a). (In more detail, the training phase of WAM consists of preparing the parallel training dataset, with complexity O(D N̄_a), and learning the translation probabilities with the word alignment model, with complexity O(I D N̄_d N̄_a), where I is the number of training iterations and N̄_a is the average number of tags per description after sampling.) When suggesting tags for a given resource description of length N_d, the complexity of NB is O(N_d T); kNN is O(D N̄_d N̄_a); CRM is O(I K N_d T); TAM is O(I N_d T); WAM is O(N_d T); and MI is O(N_d T). From this analysis, we can see that WAM and MI are relatively simple methods for both training and suggestion. This is especially valuable because WAM and MI also show good effectiveness for tag suggestion compared with the other methods, as we will show later.

5.2.2. Parameter settings

For WAM, we use GIZA++ (Och & Ney, 2003) with IBM Model-1 to estimate translation probabilities from the description–annotation pairs. (GIZA++ is freely available from code.google.com/p/giza-pp; the toolkit is widely used for word alignment in SMT. In this paper, we use its default parameter settings for training.) The experimental results of WAM are obtained by setting the parameters as follows: tag weighting type TF-IDF_t, length ratio δ = 1, harmonic factor λ = 0.5, and TF-IDF_w as the word importance score. For MI, we set the tag co-occurrence threshold γ = 0 and TF-IDF_w as the word importance score. These values are used as defaults, chosen by maximizing the F-measure on a development set of 1000 instances from the website where the BOOK dataset was obtained (not included in the BOOK dataset). The influence of the parameters on WAM and MI is analyzed in Section 5.3.

[Fig. 2. Performance comparison between NB, kNN, CRM, TAM, WAM and MI for the two datasets: (a) BOOK, (b) BIBTEX. Each panel shows precision (x-axis) versus recall (y-axis).]

Table 3. Comparison of NB, kNN, CRM, TAM, WAM and MI results for the BOOK dataset when suggesting M = 3 tags. A t-test confirms that the differences between the other results and the best result in each column (the MI row) are statistically significant at p < 0.05.
  Method  Precision  Recall  F-measure
  NB      0.271      0.302   0.247 ± 0.004
  kNN     0.280      0.314   0.258 ± 0.002
  CRM     0.292      0.323   0.266 ± 0.004
  TAM     0.310      0.344   0.283 ± 0.001
  WAM     0.368      0.452   0.355 ± 0.002
  MI      0.422      0.493   0.397 ± 0.002
5.2.3. Experiment results and analysis

In Fig. 2 we present the precision–recall curves of NB, kNN, CRM, TAM, WAM and MI on the two datasets. Each point on a precision–recall curve corresponds to a different number of suggested tags, from M = 1 (bottom right, with higher precision and lower recall) to M = 10 (upper left, with higher recall but lower precision). The closer a curve is to the upper right, the better the overall performance of the method. From Fig. 2, we observe the following:

1. The method based on MI consistently performs the best on both datasets, and the method based on WAM achieves the second-best performance on both datasets. These results indicate that our method is robust and effective for social tag suggestion.
2. The advantage of our WAM-based method over the baseline methods is more pronounced on the BOOK dataset. The reason is that WAM exploits the tag count information better than the baseline methods.
3. Although WAM benefits from the count information on the BOOK dataset, MI is still better than WAM. The reason is that MI tends to translate a word to more tags, whereas the translation probability mass of a word in WAM always concentrates on only one or two tags. As a result, the tags suggested by MI have better coverage than those suggested by WAM. We present the translation probability tables of some words in the next subsection.
4. The average length of resource descriptions in BIBTEX is short, which makes it difficult to determine the importance scores of words; however, even on the BIBTEX dataset, which has no tag count information, our method still outperforms the other methods.

To further illustrate the performance of our word translation method and the baseline methods, Table 3 shows the precision, recall and F-measure of NB, kNN, CRM, TAM, WAM and MI on the BOOK dataset when suggesting M = 3 tags. (We chose this number because it is close to the average number of tags per resource in the BOOK dataset.) Here we also show the variance of the F-measure. In fact, MI achieves its best performance when M = 2, where its F-measure is 0.399, outperforming both CRM (F = 0.263) and TAM (F = 0.277) by more than 10 points.

5.2.4. An example

In Table 4, we show the top 10 tags suggested by NB, CRM, TAM, WAM and MI for the book in Table 1. The number in brackets after the name of each method is the count of correctly suggested tags, and the correctly suggested tags are marked in bold face. We elected not to show the kNN results because the tags suggested by kNN are completely unrelated to the book, due to the failure to find sufficiently close nearest neighbors.

From Table 4, we observe that NB, CRM and TAM, as generative models, tend to suggest coarse-grained tags, such as "novel", "literature", "classic" and "France", and fail to suggest fine-grained tags such as "Alexandre Dumas", "Count of Monte Cristo", "revenge" and "suspense". In contrast, WAM and MI succeed in suggesting both the coarse-grained and the fine-grained tags related to the book.

To see how our model suggests these fine-grained tags, we list four important words of the description (using TF-IDF_w as the weighting metric) and their corresponding tags with the highest translation probabilities in Tables 5 and 6. The values in brackets are the probabilities Pr(t | w) of tag t given word w. For each word, we omit the tags with a probability less than 0.05.
Table 4. Top 10 tags suggested by NB, CRM, TAM, WAM and MI for the book in Table 1. The number in brackets after each method name is the count of correctly suggested tags.
  NB (+6): novel, foreign literature, literature, history, Japan, classic, France, philosophy, America, biography
  CRM (+5): novel, foreign literature, literature, biography, philosophy, culture, France, British, comic, history
  TAM (+5): novel, sociology, finance, foreign literature, France, literature, biography, France literature, comic, China
  WAM (+7): novel, Alexandre Dumas, history, Count of Monte Cristo, foreign literature, biography, suspense, comic, America, France
  MI (+7): Alexandre Dumas, novel, Count of Monte Cristo, foreign literature, France, revenge, French literature, Liang Yusheng, martial arts, Comedie Humaine

Table 5. Four important words in the book description in Table 1 and their corresponding tags with the highest translation probabilities in WAM.
  Count of Monte Cristo: Count of Monte Cristo (0.728), Alexandre Dumas (0.270), ...
  Alexandre Dumas: Alexandre Dumas (0.966), ...
  Revenge: foreign literature (0.168), classic (0.130), martial arts (0.123), Alexandre Dumas (0.122), ...
  France: France (0.99), ...

Table 6. Four important words in the book description in Table 1 and their corresponding tags with the highest translation probabilities in MI.
  Count of Monte Cristo: Count of Monte Cristo (0.274), Alexandre Dumas (0.244), revenge (0.093), French literature (0.069), France (0.057), ...
  Alexandre Dumas: Alexandre Dumas (0.352), France (0.105), French literature (0.069), Count of Monte Cristo (0.067), foreign literature (0.053), revenge (0.052), ...
  Revenge: Liang Yusheng (0.154), revenge (0.127), martial arts (0.088), ...
  France: France (0.309), France literature (0.069), ...

We can see that the translation probabilities map the words in descriptions to their semantically corresponding tags in annotations. Take the word "Count of Monte Cristo" in Table 5 as an example: besides the tag identical to itself, it has a high translation probability to the tag "Alexandre Dumas", which indicates that the tag "Alexandre Dumas" is highly related to the word "Count of Monte Cristo". In fact, the word "Count of Monte Cristo" appears in 19 books (12 of them are different editions of "The Count of Monte Cristo" and the others are novels written by Alexandre Dumas), and 16 of them are labeled with the tag "Alexandre Dumas". This confirms that our model can capture the semantic relation between words and tags.

Note that "Count of Monte Cristo" and "Alexandre Dumas" correspond to the title and the author of the book in Table 1, so they might easily be derived from other metadata of the book (although this is not the case in our dataset). It is therefore more interesting to see that our model can also suggest fine-grained tags like "revenge". From Table 6, we can see that each of the words "Count of Monte Cristo", "Alexandre Dumas" and "revenge" has a nonzero translation probability to the tag "revenge".
The tag "revenge" is suggested jointly by combining the scores from these important words in the description. This ability to suggest a tag jointly enables our model to suggest tags that are not statistically significant in, or do not even appear in, the descriptions. Thus our model can solve the vocabulary gap problem.

5.3. Parameter influences

5.3.1. Parameter influences for WAM

We explore the influence of the parameters on WAM for tag suggestion. The parameters include the harmonic factor, the length ratio, the tag weighting type, and the method for computing word importance scores. When investigating one parameter, we set the other parameters to the values inducing the best performance, as given in Section 5.2. Finally, we also investigate the influence of the training data size on performance. In the experiments we found that WAM shows similar trends on the BOOK and BIBTEX datasets; thus, we only report the experimental results on the BOOK dataset.

5.3.1.1. Harmonic factor. In Fig. 3 we investigate the influence of the harmonic factor via the curves of F-measure of WAM versus the number of suggested tags on the BOOK dataset, with the harmonic factor λ ranging from 0.0 to 1.0. As described in Section 3.1.2, the harmonic factor λ controls the proportion between the models Pr_d2a and Pr_a2d. From Fig. 3, we observe that neither the single model Pr_d2a (λ = 1.0) nor Pr_a2d (λ = 0.0) achieves the best performance. When the two models are combined by the harmonic mean, the performance is consistently better, especially when λ ranges from 0.2 to 0.6. This is reasonable because IBM Model-1 only allows a term in the source language to be aligned to multiple terms in the target language, which makes the translation probabilities learned by a single model asymmetric.

[Fig. 3. F-measure of WAM versus the number of suggested tags for the BOOK dataset when the harmonic factor λ ranges from 0.0 to 1.0.]

5.3.1.2. Length ratio. Fig. 4 shows the influence of the length ratio for WAM on the BOOK dataset. From the figure, we observe that the performance of tag suggestion is robust as the length ratio varies, except when the ratio breaks the default restriction of GIZA++ (i.e., δ = 10). (GIZA++ restricts the length ratio of aligned pairs to the range [1/9, 9] by setting the parameter maxfertility = 10; from Fig. 4, we can see that when δ = 10 the performance becomes much worse, as GIZA++ cuts off sentences that fall out of range.)

[Fig. 4. F-measure of WAM versus the number of suggested tags for the BOOK dataset when the length ratio δ ranges from 10/1 to 1/5.]

5.3.1.3. Tag weighting types. The influence of the two weighting types, TF_t and TF-IDF_t, on tag suggestion when M = 3 on the BOOK dataset is shown in Table 7. TF-IDF_t tends to select tags more specific to
the resource, whereas TF_t tends to select the most popular tags, because the latter does not consider global information (the IDF_t part). Table 7 confirms this analysis: TF-IDF_t is slightly better than TF_t.

Table 7. Evaluation results for different tag weighting types for WAM when M = 3 on the BOOK dataset.
  Weighting  Precision  Recall  F-measure
  TF_t       0.356      0.437   0.342 ± 0.002
  TF-IDF_t   0.368      0.452   0.355 ± 0.002

5.3.1.4. Methods for computing word importance scores. In Table 8, we show the performance of WAM on the BOOK dataset with different methods for computing word importance scores. From the table, we can see that there is no significant difference between TF-IDF_w and the product of TF-IDF_w and TextRank, and that TextRank performs the worst. This indicates that TextRank is less competitive for measuring word importance scores, as it does not take global information into consideration.

Table 8. Evaluation results for different methods for computing word importance scores for WAM when M = 3 on the BOOK dataset.
  Weighting  Precision  Recall  F-measure
  TF-IDF_w   0.368      0.452   0.355 ± 0.002
  TextRank   0.345      0.424   0.332 ± 0.002
  Product    0.368      0.451   0.354 ± 0.002

5.3.1.5. Training data size. We also investigated the influence of the training data size on WAM. As shown in Fig. 5, we increased the training data size from 8000 to 56,000 in steps of 8000, and performed evaluation on 4000 resources. The figure shows that: (1) when the training data size is small (e.g., 8000), WAM can still achieve good performance; and (2) when the training data size increases, the performance improves, but the improvement slows down as the training data size grows. This indicates that WAM does not require a huge dataset to achieve good performance.

[Fig. 5. Precision–recall curves of WAM when the training data size increases from 8000 to 56,000 on the BOOK dataset.]

5.3.2. Parameter influences for MI

The parameters of the MI-based method include the tag co-occurrence threshold and the method for computing word importance scores. When investigating one parameter, we set the other parameters to the values inducing the best performance, as given in Section 5.2. As with WAM, we only present the experimental results on the BOOK dataset.

5.3.2.1. Tag co-occurrence threshold. Fig. 6 shows the influence of the tag co-occurrence threshold for MI on the BOOK dataset, where the threshold γ is set to different values. The figure shows that the MI-based method achieves the best performance on the BOOK dataset when γ = 0. This indicates that the set of tags that have low co-occurrence counts with a word contains not only noisy tags but also proper tags that need to be suggested by the word.

[Fig. 6. Precision–recall curves of MI when the tag co-occurrence threshold γ increases from 0 to 10 on the BOOK dataset.]

5.3.2.2. Methods for computing word importance scores. In Table 9, we show the performance of MI on the BOOK dataset with different methods for computing word importance scores. From the table, we can see that for MI, TF-IDF_w performs the best and TextRank performs the worst. This is similar to WAM, and these
results indicate that TextRank is less competitive for measuring word importance scores, as it does not take global information into consideration.

Table 9. Evaluation results for different methods for computing word importance scores with MI when M = 3 on the BOOK dataset.
  Weighting  Precision  Recall  F-measure
  TF-IDF_w   0.422      0.493   0.397 ± 0.002
  TextRank   0.393      0.461   0.370 ± 0.002
  Product    0.407      0.475   0.382 ± 0.002

By analyzing the influence of the parameters on WAM and MI, we find that the word translation model is robust to parameter variations.

5.4. Performance of emphasizing the self-translation probability

In Fig. 7 we investigate the influence of the self-translation parameter via the curves of F-measure of MI versus the number of suggested tags, with the self-translation parameter α ranging from 0.0 to 0.9. As described in Section 3.3, the parameter α controls the self-translation probabilities. We observe that MI achieves the best performance when α = 0.2 on the BOOK dataset and when α = 0.4 on the BIBTEX dataset. These results indicate that, on both datasets, the translation probability of a word to itself needs to be emphasized. This is reasonable because without self-translation emphasis, an important tag that appears in the description may not be suggested, since the word does not translate to itself with sufficient probability. We also see that the self-translation parameter is not constant across datasets; it varies with how strongly the words in the current document need to be emphasized.

[Fig. 7. F-measure of MI versus the number of suggested tags for the BOOK dataset when the self-translation parameter α ranges from 0.0 to 0.9.]

Finally, we tested the performance of emphasizing the self-translation probability for WAM with different methods for computing word importance scores on the BOOK dataset. As shown in Table 10, emphasizing the self-translation probability improves the performance of WAM (cf. Table 8) on the BOOK dataset when using TF-IDF_w or the product as the method for computing word importance scores, but degrades it when using TextRank. This result confirms that TF-IDF_w is the best method for measuring word importance scores for WAM, and indicates that emphasizing the tags appearing in the descriptions may enhance the performance of the word translation method.

Table 10. Evaluation results for emphasizing the self-translation probability in WAM with different methods for computing word importance scores when M = 3 on the BOOK dataset.
  Weighting  Precision  Recall  F-measure
  TF-IDF_w   0.385      0.472   0.371 ± 0.001
  TextRank   0.344      0.423   0.332 ± 0.002
  Product    0.374      0.457   0.360 ± 0.001

However, on the BIBTEX dataset the performance of emphasizing the self-translation probability degrades considerably compared with WAM: the F-measure with emphasis is only F = 0.229, compared with F = 0.267 for WAM. The main reason for this degradation is that the average length of descriptions in the BIBTEX dataset is too short to provide sufficient information for precisely emphasizing tags, so the emphasis often promotes wrong tags and drops correct ones. The experimental results for emphasizing the self-translation probability suggest that we have to analyze the characteristics of a tag suggestion system to decide whether to emphasize the tags that appear in the corresponding descriptions. It is also worth investigating this problem in combination with collaboration-based methods for social tag suggestion.

6. Conclusions

In this paper, we present a new perspective on social tag suggestion and propose two methods to estimate translation probabilities between words in descriptions and tags: one is the word alignment model from statistical machine translation and the other is mutual information. Based on the translation probabilities between words and tags, we propose the word translation method for tag suggestion. The experiments revealed that our method is effective and efficient for social tag suggestion compared with other baseline methods.

There are several open issues for further investigation:

1. Our model focuses on suggesting social tags according to the resource descriptions.
We will take advantage of more social information, such as user information, to improve the performance of social tag suggestion.
2. Other metadata of resources (author, title, images and videos) can also be taken into account to improve the performance of social tag suggestion.
3. Our model is a supervised model which requires a large collection of annotated resources. We will explore using large-scale unlabeled text corpora to estimate the translation probabilities between words, which could be used to enhance the estimation of the translation probabilities between words in descriptions and tags in annotations.
4. In this paper we suggest each tag in isolation, without considering the correlations between tags. We will investigate the hierarchical structure and the semantic relatedness between tags to regularize the granularity of suggested tags.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (NSFC) under Grant Nos. 61170196 and 61202140. The authors would like to thank Peng Li for his insightful suggestions.

References

Blei, D., & Jordan, M. (2003). Modeling annotated data. In Proceedings of SIGIR (pp. 127–134).
Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. JMLR, 3, 993–1022.
Brown, P., Pietra, V., Pietra, S., & Mercer, R. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), 263–311.
Bundschus, M., Yu, S., Tresp, V., Rettinger, A., Dejori, M., & Kriegel, H. (2009). Hierarchical Bayesian models for collaborative tagging systems. In Proceedings of ICDM (pp. 728–733).
Fujimura, S., Fujimura, K., & Okuda, H. (2008). Blogosonomy: Autotagging any text using bloggers' knowledge. In Proceedings of WI (pp. 205–212).
Herlocker, J., Konstan, J., Borchers, A., & Riedl, J. (1999). An algorithmic framework for performing collaborative filtering. In Proceedings of SIGIR (pp. 230–237).
Herlocker, J., Konstan, J., Terveen, L., & Riedl, J. (2004). Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22(1), 5–53.
Heymann, P., Ramage, D., & Garcia-Molina, H. (2008). Social tag prediction. In Proceedings of SIGIR (pp. 531–538).
Iwata, T., Yamada, T., & Ueda, N. (2009). Modeling social annotation data with content relevance using a topic model. In Proceedings of NIPS (pp. 835–843).
Jaschke, R., Marinho, L., Hotho, A., Schmidt-Thieme, L., & Stumme, G. (2008). Tag recommendations in social bookmarking systems. AI Communications, 21(4), 231–247.
Karimzadehgan, M., & Zhai, C. (2010). Estimation of statistical translation models based on mutual information for ad hoc information retrieval. In Proceedings of SIGIR (pp. 323–330).
Katakis, I., Tsoumakas, G., & Vlahavas, I. (2008). Multilabel text classification for automated tag suggestion. In ECML PKDD Discovery Challenge 2008 (p. 75).
Krestel, R., Fankhauser, P., & Nejdl, W. (2009). Latent Dirichlet allocation for tag recommendation. In Proceedings of ACM RecSys (pp. 61–68).
Lam, X. N., Vu, T., Le, T. D., & Duong, A. D. (2008). Addressing cold-start problem in recommendation systems. In Proceedings of the 2nd international conference on ubiquitous information management and communication (pp. 208–211). ACM.
Lee, S., & Chun, A. (2007). Automatic tag recommendation for the web 2.0 blogosphere using collaborative tagging and hybrid ANN semantic structures.
In Proceedings of WSEAS (pp. 88–93).
Lin, D. (1998). An information-theoretic definition of similarity. In Proceedings of ICML (Vol. 98, pp. 296–304).
Liu, Z., Chen, X., & Sun, M. (2011). A simple word trigger method for social tag suggestion. In Proceedings of the conference on empirical methods in natural language processing (pp. 1577–1588). Association for Computational Linguistics.
Liu, Z., Huang, W., Zheng, Y., & Sun, M. (2010). Automatic keyphrase extraction via topic decomposition. In Proceedings of the 2010 conference on empirical methods in natural language processing (pp. 366–376). Association for Computational Linguistics.
Manning, C., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. New York, NY, USA: Cambridge University Press.
Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing order into texts. In Proceedings of EMNLP (pp. 404–411).
Mishne, G. (2006). AutoTag: A collaborative approach to automated tag assignment for weblog posts. In Proceedings of WWW (pp. 953–954).
Och, F., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51.
Ohkura, T., Kiyota, Y., & Nakagawa, H. (2006). Browsing system for weblog articles based on automated folksonomy. In Proceedings of WWW.
Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project.
Rendle, S., Balby Marinho, L., Nanopoulos, A., & Schmidt-Thieme, L. (2009). Learning optimal ranking with tensor factorization for tag recommendation. In Proceedings of KDD (pp. 727–736).
Resnick, P., & Varian, H. (1997). Recommender systems. Communications of the ACM, 40(3), 56–58.
Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513–523.
Si, X., Liu, Z., & Sun, M. (2010). Modeling social annotations via latent reason identification. IEEE Intelligent Systems.
Si, X., & Sun, M. (2009). Tag-LDA for scalable real-time tag recommendation. Journal of Computational Information Systems, 6(1), 23–31.
Wang, M., Ni, B., Hua, X.-S., & Chua, T.-S. (2012). Assistive tagging: A survey of multimedia tagging with human–computer joint exploration. ACM Computing Surveys (CSUR), 44(4), 25.
Xu, Z., Fu, Y., Mao, J., & Su, D. (2006). Towards the semantic web: Collaborative tag suggestions. In Collaborative Web Tagging Workshop at WWW2006.