Extracting Lexically Divergent Paraphrases from Twitter

Wei Xu (1), Alan Ritter (2), Chris Callison-Burch (1), William B. Dolan (3) and Yangfeng Ji (4)
(1) University of Pennsylvania, Philadelphia, PA, USA {xwe, ccb}@cis.upenn.edu
(2) The Ohio State University, Columbus, OH, USA ritter.1492@osu.edu
(3) Microsoft Research, Redmond, WA, USA billdol@microsoft.com
(4) Georgia Institute of Technology, Atlanta, GA, USA jiyfeng@gatech.edu

Abstract

We present MULTIP (Multi-instance Learning Paraphrase Model), a new model suited to identifying paraphrases within the short messages on Twitter. We jointly model paraphrase relations between word and sentence pairs and assume only sentence-level annotations during learning. Using this principled latent variable model alone, we achieve performance competitive with a state-of-the-art method which combines a latent space model with a feature-based supervised classifier. Our model also captures lexically divergent paraphrases that differ from yet complement previous methods; combining our model with previous work significantly outperforms the state-of-the-art. In addition, we present a novel annotation methodology that has allowed us to crowdsource a paraphrase corpus from Twitter. We make this new dataset available to the research community.

1 Introduction

Paraphrases are alternative linguistic expressions of the same or similar meaning (Bhagat and Hovy, 2013). Twitter engages millions of users, who naturally talk about the same topics simultaneously and frequently convey similar meaning using diverse linguistic expressions. The unique characteristics of this user-generated text present new challenges and opportunities for paraphrase research (Xu et al., 2013b; Wang et al., 2013). For many applications, like automatic summarization, first story detection (Petrović et al., 2012) and search (Zanzotto et al., 2011), it is crucial to resolve redundancy in tweets (e.g. oscar nom'd doc ↔ Oscar-nominated documentary).

In this paper, we investigate the task of determining whether two tweets are paraphrases. Previous work has exploited a pair of shared named entities to locate semantically equivalent patterns in related news articles (Shinyama et al., 2002; Sekine, 2005; Zhang and Weld, 2013). But short sentences in Twitter do not often mention two named entities (Ritter et al., 2012), and the task requires nontrivial generalization from named entities to other words. For example, consider the following two sentences about basketball player Brook Lopez from Twitter:

◦ That boy Brook Lopez with a deep 3
◦ brook lopez hit a 3 and i missed it

Although these sentences do not have many words in common, the identical word "3" is a strong indicator that the two sentences are paraphrases. We therefore propose a novel joint word-sentence approach, incorporating a multi-instance learning assumption (Dietterich et al., 1997) that two sentences under the same topic (we highlight topics in bold) are paraphrases if they contain at least one word pair (we call it an anchor and highlight it with underscores; the words in the anchor pair need not be identical) that is indicative of sentential paraphrase. This at-least-one-anchor assumption might be ineffective for long or randomly paired sentences, but holds up better for short sentences that are temporally and topically related on Twitter.
Moreover, our model design (see Figure 1) allows exploitation of arbitrary features and linguistic resources, such as part-of-speech features and a normalization lexicon, to discriminatively determine whether word pairs are paraphrastic anchors or not. Our graphical model is a major departure from popular surface- or latent-similarity methods (Wan et al., 2006; Guo and Diab, 2012; Ji and Eisenstein, 2013, and others).

[Figure 1: (a) a plate representation of the MULTIP model; (b) an example instantiation of MULTIP for the pair of sentences "Manti bout to be the next Junior Seau" and "Teo is the little new Junior Seau", in which a new American football player, Manti Te'o, was being compared to a famous former player, Junior Seau. Only 4 out of the total 6×5 word pairs, z1–z30, are shown here.]

Our approach to extracting paraphrases from Twitter is general and can be combined with various topic detection solutions. As a demonstration, we use Twitter's own trending topic service (see https://support.twitter.com/articles/101125-faqs-about-twitter-s-trends) to collect data and conduct experiments. While having a principled and extensible design, our model alone achieves performance on par with a state-of-the-art ensemble approach that involves both latent semantic modeling and supervised classification. The proposed model also captures radically different paraphrases from previous approaches; a combined system shows significant improvement over the state-of-the-art.

This paper makes the following contributions:

1) We present a novel latent variable model for paraphrase identification that specifically accommodates the very short context and divergent wording in Twitter data. We experimentally compare several representative approaches and show that our proposed method yields state-of-the-art results and identifies paraphrases that are complementary to previous methods.

2) We develop an efficient crowdsourcing method and construct a Twitter Paraphrase Corpus of about 18,000 sentence pairs, as a first common testbed for the development and comparison of paraphrase identification and semantic similarity systems. We make this dataset available to the research community (the dataset and code are available through the SemEval-2015 shared task, http://alt.qcri.org/semeval2015/task1/, and at https://github.com/cocoxu/twitterparaphrase/).

2 Joint Word-Sentence Paraphrase Model

We present a new latent variable model that jointly captures paraphrase relations between sentence pairs and word pairs. It is very different from previous approaches in that its primary design goal and motivation is targeted towards short, lexically diverse text on the social web.

2.1 At-least-one-anchor Assumption

Much previous work on paraphrase identification has been developed and evaluated on a specific benchmark dataset, the Microsoft Research Paraphrase Corpus (Dolan et al., 2004), which is derived from news articles. Twitter data is very different, as shown in Table 1.
News (Dolan and Brockett, 2005):
◦ Revenue in the first quarter of the year dropped 15 percent from the same period a year earlier.
◦ With the scandal hanging over Stewart's company, revenue in the first quarter of the year dropped 15 percent from the same period a year earlier.

◦ The Senate Select Committee on Intelligence is preparing a blistering report on prewar intelligence on Iraq.
◦ American intelligence leading up to the war on Iraq will be criticized by a powerful US Congressional committee due to report soon, officials said today.

Twitter (This Work):
◦ Can Klay Thompson wake up
◦ Cmon Klay need u to get it going

◦ Ezekiel Ansah wearing 3D glasses wout the lens
◦ Wait Ezekiel ansah is wearing 3d movie glasses with the lenses knocked out

◦ Marriage equality law passed in Rhode Island
◦ Congrats to Rhode Island becoming the 10th state to enact marriage equality

Table 1: Representative examples from paraphrase corpora. The average sentence length is 11.9 words in Twitter vs. 18.6 in the news corpus.

We observe that among tweets posted around the same time about the same topic (e.g. a named entity), sentential paraphrases are short and can often be "anchored" by lexical paraphrases. This intuition leads to the at-least-one-anchor assumption we stated in the introduction. The anchor could be a word the two sentences share in common. It could also be a pair of different words. For example, consider the word pair "next ‖ new" in two tweets comparing the new player Manti Te'o to the famous former American football player Junior Seau:

◦ Manti bout to be the next Junior Seau
◦ Teo is the little new Junior Seau

Note, further, that not every word pair of similar meaning indicates a sentence-level paraphrase. For example, the word "3", shared by two sentences about the movie "Iron Man" and referring to the third installment of the series, is not a paraphrastic anchor:

◦ Iron Man 3 was brilliant fun
◦ Iron Man 3 tonight see what this is like

Therefore, we use a discriminative model at the word level to incorporate various features, such as part-of-speech features, to determine how probable it is that a word pair is a paraphrase anchor.

2.2 Multi-instance Learning Paraphrase Model (MULTIP)

The at-least-one-anchor assumption naturally leads to a multi-instance learning problem (Dietterich et al., 1997), where the learner only observes labels on bags of instances (i.e. sentence-level paraphrases in this case) instead of labels on each individual instance (i.e. word pairs).

We formally define an undirected graphical model of multi-instance learning for paraphrase identification, MULTIP. Figure 1 shows the proposed model in plate form and gives an example instantiation. The model has two layers, which allows joint reasoning between sentence-level and word-level components.

For each pair of sentences $s_i = (s_{i1}, s_{i2})$, there is an aggregate binary variable $y_i$ that represents whether they are paraphrases, and which is observed in the labeled training data. Let $W(s_{ik})$ be the set of words in the sentence $s_{ik}$, excluding the topic names. For each word pair $w_j = (w_{j1}, w_{j2}) \in W(s_{i1}) \times W(s_{i2})$, there exists a latent variable $z_j$ which denotes whether the word pair is a paraphrase anchor. In total there are $m = |W(s_{i1})| \times |W(s_{i2})|$ word pairs, and thus $\mathbf{z}_i = (z_1, z_2, \ldots, z_j, \ldots, z_m)$. Our at-least-one-anchor assumption is realized by a deterministic-or function; that is, if there exists at least one $j$ such that $z_j = 1$, then the sentence pair is a paraphrase.
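To make the at-least-one-anchor aggregation concrete, here is a minimal Python sketch of how word-pair anchor decisions are combined into a sentence-level label by deterministic-or. It is purely illustrative: the toy word lists and the hard-coded anchor decision stand in for the word-level classifier described below.

```python
from itertools import product

def deterministic_or(anchor_flags):
    """Sentence pair is a paraphrase iff at least one word pair is an anchor (z_j = 1)."""
    return any(anchor_flags)

# Hypothetical example: word pairs from two tweets about the same topic.
# The anchor decisions z_j would normally come from the word-level classifier.
words1 = ["manti", "bout", "to", "be", "the", "next"]
words2 = ["teo", "is", "the", "little", "new"]
word_pairs = list(product(words1, words2))            # m = |W(s1)| x |W(s2)| pairs
z = [pair == ("next", "new") for pair in word_pairs]  # only "next || new" fires here
print(deterministic_or(z))  # True -> the sentence pair is labeled a paraphrase
```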
Our conditional paraphrase identification model is defined as follows:

$$P(\mathbf{z}_i, y_i \mid \mathbf{w}_i; \theta) = \prod_{j=1}^{m} \phi(z_j, w_j; \theta) \times \sigma(\mathbf{z}_i, y_i) = \prod_{j=1}^{m} \exp\big(\theta \cdot f(z_j, w_j)\big) \times \sigma(\mathbf{z}_i, y_i) \quad (1)$$

where $f(z_j, w_j)$ is a vector of features extracted for the word pair $w_j$, $\theta$ is the parameter vector, and $\sigma$ is the factor that corresponds to the deterministic-or constraint:

$$\sigma(\mathbf{z}_i, y_i) = \begin{cases} 1 & \text{if } y_i = \text{true} \wedge \exists j : z_j = 1 \\ 1 & \text{if } y_i = \text{false} \wedge \forall j : z_j = 0 \\ 0 & \text{otherwise} \end{cases} \quad (2)$$

2.3 Learning

To learn the parameters of the word-level paraphrase anchor classifier, $\theta$, we maximize the likelihood over the sentence-level annotations in our paraphrase corpus:

$$\theta^{*} = \arg\max_{\theta} P(\mathbf{y} \mid \mathbf{w}; \theta) = \arg\max_{\theta} \prod_{i} \sum_{\mathbf{z}_i} P(\mathbf{z}_i, y_i \mid \mathbf{w}_i; \theta) \quad (3)$$

An iterative gradient-ascent approach is used to estimate $\theta$ using perceptron-style additive updates (Collins, 2002; Liang et al., 2006; Zettlemoyer and Collins, 2007; Hoffmann et al., 2011). We define an update based on the gradient of the conditional log likelihood using a Viterbi approximation, as follows:

$$\frac{\partial \log P(\mathbf{y} \mid \mathbf{w}; \theta)}{\partial \theta} = E_{P(\mathbf{z} \mid \mathbf{w}, \mathbf{y}; \theta)}\Big[\sum_i f(\mathbf{z}_i, \mathbf{w}_i)\Big] - E_{P(\mathbf{z}, \mathbf{y} \mid \mathbf{w}; \theta)}\Big[\sum_i f(\mathbf{z}_i, \mathbf{w}_i)\Big] \approx \sum_i f(\mathbf{z}^{*}_i, \mathbf{w}_i) - \sum_i f(\mathbf{z}'_i, \mathbf{w}_i) \quad (4)$$

where we define the feature sum for each sentence pair as $f(\mathbf{z}_i, \mathbf{w}_i) = \sum_j f(z_j, w_j)$ over all word pairs. These two expectations are approximated by solving two simple inference problems as maximizations:

$$\mathbf{z}^{*} = \arg\max_{\mathbf{z}} P(\mathbf{z} \mid \mathbf{w}, \mathbf{y}; \theta), \qquad \mathbf{y}', \mathbf{z}' = \arg\max_{\mathbf{y}, \mathbf{z}} P(\mathbf{z}, \mathbf{y} \mid \mathbf{w}; \theta) \quad (5)$$

Computing both $\mathbf{z}'$ and $\mathbf{z}^{*}$ is rather straightforward under the structure of our model and can be done in time linear in the number of word pairs. The dependencies between $\mathbf{z}$ and $\mathbf{y}$ are defined as deterministic-or factors $\sigma(\mathbf{z}_i, y_i)$, which, when satisfied, do not affect the overall probability of the solution. Each sentence pair is independent conditioned on the parameters. For $\mathbf{z}'$, it is sufficient to compute the most likely assignment $z'_j$ for each word pair independently, ignoring the deterministic dependencies; $y'_i$ is then set by aggregating all the $z'_j$ through the deterministic-or operation. Similarly, we can find the exact solution for $\mathbf{z}^{*}$, the most likely assignment that respects the sentence-level training label $y$. For a positive training instance, we simply find its highest-scoring word pair $w_\tau$ according to the word-level classifier, then set $z^{*}_\tau = 1$ and $z^{*}_j = \arg\max_{x \in \{0,1\}} \phi(x, w_j; \theta)$ for all $j \neq \tau$; for a negative example, we set $\mathbf{z}^{*}_i = \mathbf{0}$. The time complexity of both inferences for one sentence pair is $O(|W(s)|^2)$, where $|W(s)|^2$ is the number of word pairs.

In practice, we use online learning instead of optimizing the full objective. The detailed learning algorithm is presented in Figure 2. Following Hoffmann et al. (2011), we use 50 iterations in the experiments.

Input: a training set {(s_i, y_i) | i = 1...n}, where i is an index corresponding to a particular sentence pair s_i, and y_i is the training label.
1: initialize parameter vector θ ← 0
2: for i ← 1 to n do
3:   extract all possible word pairs w_i = (w_1, w_2, ..., w_m) and their features from the sentence pair s_i
4: end for
5: for l ← 1 to maximum iterations do
6:   for i ← 1 to n do
7:     (y'_i, z'_i) ← arg max_{y_i, z_i} P(z_i, y_i | w_i; θ)
8:     if y'_i ≠ y_i then
9:       z*_i ← arg max_{z_i} P(z_i | w_i, y_i; θ)
10:      θ ← θ + f(z*_i, w_i) − f(z'_i, w_i)
11:    end if
12:  end for
13: end for
14: return model parameters θ

Figure 2: MULTIP learning algorithm.
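For readers who prefer code to pseudocode, the following is a minimal Python sketch of the perceptron-style learning loop in Figure 2, under simplifying assumptions: features are dictionaries that fire only when z_j = 1, and all helper names are hypothetical. It illustrates the update in Equation 4, not the authors' released implementation.

```python
from collections import defaultdict

def score(theta, feats):
    """theta . f(z=1, w_j): score for labeling word pair j as an anchor."""
    return sum(theta[k] * v for k, v in feats.items())

def train_multip(data, iterations=50):
    """data: list of (pair_feats, y), where pair_feats is a list of feature dicts,
    one per word pair, and y is the sentence-level paraphrase label."""
    theta = defaultdict(float)
    for _ in range(iterations):
        for pair_feats, y in data:
            scores = [score(theta, f) for f in pair_feats]
            z_pred = [s > 0 for s in scores]   # independent word-level decisions (z')
            y_pred = any(z_pred)               # deterministic-or aggregation (y')
            if y_pred == y:
                continue                       # no update when the sentence label is correct
            if y:   # positive pair: force the highest-scoring word pair to be an anchor (z*)
                z_star = list(z_pred)
                z_star[max(range(len(scores)), key=scores.__getitem__)] = True
            else:   # negative pair: z* is all zeros
                z_star = [False] * len(scores)
            # theta <- theta + f(z*, w) - f(z', w); only word pairs with z = 1 contribute
            for f, zs, zp in zip(pair_feats, z_star, z_pred):
                for k, v in f.items():
                    theta[k] += (v if zs else 0.0) - (v if zp else 0.0)
    return theta
```

The update only fires when the deterministic-or prediction disagrees with the training label, mirroring lines 8-11 of Figure 2.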
2.4 Feature Design

At the word level, our discriminative model allows the use of arbitrary features, similar to those in monolingual word alignment models (MacCartney et al., 2008; Thadani and McKeown, 2011; Yao et al., 2013a,b). But unlike discriminative monolingual word alignment, we only use sentence-level training labels instead of word-level alignment annotation. For every word pair, we extract the following features:

String Features indicate whether the two words, their stemmed forms and their normalized forms are the same, similar or dissimilar. We used the Morpha stemmer (Minnen et al., 2001; https://github.com/knowitall/morpha), Jaro-Winkler string similarity (Winkler, 1999) and the Twitter normalization lexicon of Han et al. (2012).

POS Features are based on the part-of-speech tags of the two words in the pair, specifying whether the two words have the same or different POS tags and what the specific tags are. We use the Twitter part-of-speech tagger developed by Derczynski et al. (2013). We add new fine-grained tags for variations of eight words: "a", "be", "do", "have", "get", "go", "follow" and "please". For example, we use a tag HA for the words "have", "has" and "had".

Topical Features relate to the strength of a word's association with the topic. These features identify the popular words in each topic, e.g. "3" in tweets about a basketball game or "RIP" in tweets about a celebrity's death. We use the G² log-likelihood-ratio statistic, which has been frequently used in NLP as a measure of word association (Dunning, 1993; Moore, 2004). The significance scores are computed for each trend over an average of about 1500 sentences and converted to binary features for every word pair, indicating whether the two words are both significant or not. Our topical features are novel and were not used in previous work.

Following Riedel et al. (2010) and Hoffmann et al. (2011), we also incorporate conjunction features into our system for better accuracy, namely Word+POS, Word+Topical and Word+POS+Topical features.
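As an illustration of Section 2.4, the sketch below extracts a simplified version of these word-pair features. The string-similarity function (difflib, standing in for Jaro-Winkler), the normalization lexicon, the POS tags and the topical-significance flags are all placeholders supplied by the caller, not the resources used in the paper, and the 0.8 similarity cutoff is an assumed value.

```python
from difflib import SequenceMatcher

def word_pair_features(w1, w2, pos1, pos2, sig1, sig2, norm_lex=None):
    """Simplified string / POS / topical features for one word pair (w1, w2).
    pos1, pos2: POS tags; sig1, sig2: binary topical-significance flags."""
    norm_lex = norm_lex or {}
    n1, n2 = norm_lex.get(w1, w1), norm_lex.get(w2, w2)
    sim = SequenceMatcher(None, n1, n2).ratio()   # stand-in for Jaro-Winkler similarity
    feats = {
        "str_same": float(n1 == n2),
        "str_similar": float(n1 != n2 and sim >= 0.8),
        "str_dissimilar": float(sim < 0.8),
        "pos_same": float(pos1 == pos2),
        "pos_pair=%s_%s" % (pos1, pos2): 1.0,
        "topical_both_significant": float(sig1 and sig2),
    }
    # A conjunction feature in the spirit of the Word+POS combinations
    feats["same_word+pos=%d|%s_%s" % (int(feats["str_same"]), pos1, pos2)] = 1.0
    return feats

print(word_pair_features("nom'd", "nominated", "V", "V", True, True,
                         norm_lex={"nom'd": "nominated"}))
```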
3 Experiments

3.1 Data

It is nontrivial to efficiently gather a gold-standard dataset of naturally occurring paraphrases and non-paraphrases from Twitter, since this requires pairwise comparison of tweets and faces a very large search space. To make the annotation task tractable, we design a novel and efficient crowdsourcing method using Amazon Mechanical Turk. Our entire data collection process is detailed in Section 4, with several experiments that demonstrate annotation quality and efficiency.

In total, we constructed a Twitter Paraphrase Corpus of 18,762 sentence pairs and 19,946 unique sentences. The training and development set consists of 17,790 sentence pairs posted between April 24th and May 3rd, 2014 from 500+ trending topics (excluding hashtags). Our paraphrase model and data collection approach is general and can be combined with various Twitter topic detection solutions (Diao et al., 2012; Ritter et al., 2012). As a demonstration, we use Twitter's own trends service since it is easily available. Twitter trending topics are determined by an unpublished algorithm which finds words, phrases and hashtags that have had a sharp increase in popularity, as opposed to overall volume. We use case-insensitive exact matching to locate topic names in the sentences.

Each sentence pair was annotated by 5 different crowdsourcing workers. For the test set, we obtained both crowdsourced and expert labels on 972 sentence pairs from 20 randomly sampled Twitter trending topics between May 13th and June 10th. Our dataset is more realistic and balanced, containing 79% non-paraphrases vs. 34% in the benchmark Microsoft Paraphrase Corpus of news data. As noted by Das and Smith (2009), the lack of natural non-paraphrases in the MSR corpus creates a bias towards certain models.

3.2 Baselines

We use four baselines to compare with our proposed approach on the sentential paraphrase identification task. For the first baseline, we choose a supervised logistic regression (LR) baseline used by Das and Smith (2009). It uses simple n-gram overlap features (also in stemmed form) but shows very competitive performance on the MSR corpus.

The second baseline is a state-of-the-art unsupervised method, Weighted Textual Matrix Factorization (WTMF), which is specially developed for short sentences by modeling the semantic space of both the words that are present in and absent from the sentences (Guo and Diab, 2012); the source code and data for WTMF are available at http://www.cs.columbia.edu/~weiwei/code.html. The original model was learned from WordNet (Fellbaum, 2010), OntoNotes (Hovy et al., 2006), Wiktionary, and the Brown corpus (Francis and Kucera, 1979). We enhance the model with 1.6 million sentences from Twitter, as suggested by Guo et al. (2013).

Ji and Eisenstein (2013) presented a state-of-the-art ensemble system, which we call LEXDISCRIM (we removed its parsing feature because it was not helpful on our Twitter dataset). It directly combines both discriminatively-tuned latent features and surface lexical features in an SVM classifier. Specifically, the latent representation of a pair of sentences $\vec{v}_1$ and $\vec{v}_2$ is converted into a feature vector $[\vec{v}_1 + \vec{v}_2, |\vec{v}_1 - \vec{v}_2|]$ by concatenating the element-wise sum $\vec{v}_1 + \vec{v}_2$ and the absolute difference $|\vec{v}_1 - \vec{v}_2|$.

We also introduce a new baseline, LEXLATENT, which is a simplified version of LEXDISCRIM and easy to reproduce. It uses the same method to combine latent features and surface features, but instead combines the open-sourced WTMF latent space model and the logistic regression model from above. It achieves performance similar to LEXDISCRIM on our dataset (Table 2).

3.3 System Performance

For the evaluation of the different systems, we compute precision-recall curves and report the highest F1 measure of any point on the curve, on the test dataset of 972 sentence pairs against the expert labels. Table 2 shows the performance of the different systems. Our proposed MULTIP, a principled latent variable model alone, achieves results competitive with the state-of-the-art system that combines discriminative training and latent semantics.

Method                                   F1     Precision  Recall
Random                                   0.294  0.208      0.500
WTMF (Guo and Diab, 2012)*               0.583  0.525      0.655
LR (Das and Smith, 2009)**               0.630  0.629      0.632
LEXLATENT                                0.641  0.663      0.621
LEXDISCRIM (Ji and Eisenstein, 2013)     0.645  0.664      0.628
MULTIP                                   0.724  0.722      0.726
Human Upperbound                         0.823  0.752      0.908

Table 2: Performance of different paraphrase identification approaches on Twitter data. *An enhanced version that uses an additional 1.6 million sentences from Twitter. **Reimplementation of a strong baseline used by Das and Smith (2009).

In Table 2, we also show the agreement level of labels derived from 5 non-expert annotations on Mechanical Turk, which can be considered an upper bound for the automatic paraphrase recognition task on this dataset. The annotation quality of our corpus is surprisingly good given that the definition of paraphrase is rather inexact (Bhagat and Hovy, 2013); the inter-rater agreement between expert annotators on news data is only 0.83, as reported by Dolan et al. (2004).
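To make the evaluation protocol of Section 3.3 concrete, here is a small Python sketch that sweeps a threshold over model scores and reports the highest F1 of any point on the precision-recall curve. It is a hypothetical helper for illustration, not the official scorer used in the experiments.

```python
def max_f1(scores, gold):
    """scores: model confidence per sentence pair; gold: boolean expert labels.
    Returns the best (F1, precision, recall) over all score thresholds."""
    best = (0.0, 0.0, 0.0)
    for t in sorted(set(scores)):
        pred = [s >= t for s in scores]
        tp = sum(p and g for p, g in zip(pred, gold))
        fp = sum(p and not g for p, g in zip(pred, gold))
        fn = sum((not p) and g for p, g in zip(pred, gold))
        if tp == 0:
            continue
        prec, rec = tp / (tp + fp), tp / (tp + fn)
        f1 = 2 * prec * rec / (prec + rec)
        if f1 > best[0]:
            best = (f1, prec, rec)
    return best

# Toy example with four sentence pairs
print(max_f1([0.9, 0.8, 0.4, 0.2], [True, False, True, False]))  # (0.8, 0.667, 1.0)
```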
To assess the impact of different features on the model's performance, we conduct feature ablation experiments, removing one group of features at a time. The results are shown in Table 3. Both string and POS features are essential for system performance, while topical features are helpful but not as crucial.

                     F1     Precision  Recall
MULTIP               0.724  0.722      0.726
 - String features   0.509  0.448      0.589
 - POS features      0.496  0.350      0.851
 - Topical features  0.715  0.694      0.737

Table 3: Feature ablation, removing each individual feature group from the full set.

Figure 3 presents precision-recall curves and shows the sensitivity and specificity of each model in comparison. In the first half of the curve (recall < 0.5), the MULTIP model makes bolder and less accurate decisions than LEXLATENT. However, the curve for MULTIP is flatter and shows consistently better precision in the second half (recall > 0.5), as well as a higher maximum F1 score. This result reflects the design concept of MULTIP, which is intended to aggressively pick up sentential paraphrases with more divergent wording. LEXLATENT, as a combined system, considers sentence features in both surface and latent space and is more conservative. Table 4 further illustrates this difference with some example system outputs.

[Figure 3: Precision and recall curves. Our MULTIP model alone achieves competitive performance with the LEXLATENT system, which combines a latent space model and a feature-based supervised classifier. The two approaches have complementary strengths, and achieve a significant improvement when combined (MULTIP-PE).]

YES (MULTIP rank=12, LEXLATENT rank=266)
◦ The new Ciroc flavor has arrived
◦ Ciroc got a new flavor comin out

YES (MULTIP rank=64, LEXLATENT rank=452)
◦ Roberto Mancini gets the boot from Man City
◦ Roberto Mancini has been sacked by Manchester City with the Blues saying

YES (MULTIP rank=136, LEXLATENT rank=11)
◦ I want to watch the purge tonight
◦ I want to go see The Purge who wants to come with

NO (MULTIP rank=8, LEXLATENT rank=54)
◦ Somebody took the Marlins to 20 innings
◦ Anyone who stayed 20 innings for the marlins

NO (MULTIP rank=167, LEXLATENT rank=9)
◦ WORLD OF JENKS IS ON AT 11
◦ World of Jenks is my favorite show on tv

Table 4: Example system outputs; rank is the position in the list of all candidate paraphrase pairs in the test set, ordered by model score. MULTIP discovers lexically divergent paraphrases while LEXLATENT prefers more overall sentence similarity. Underlining marks the word pair(s) with the highest estimated probability of being paraphrastic anchors for each sentence pair.

3.4 Product of Experts (MULTIP-PE)

Our MULTIP model and previous similarity-based approaches have complementary strengths, so we experiment with combining MULTIP ($P_m$) and LEXLATENT ($P_l$) through a product of experts (Hinton, 2002):

$$P(y \mid s_1, s_2) = \frac{P_m(y \mid s_1, s_2) \times P_l(y \mid s_1, s_2)}{\sum_{y} P_m(y \mid s_1, s_2) \times P_l(y \mid s_1, s_2)} \quad (6)$$

The resulting system, MULTIP-PE, provides consistently better precision and recall than the LEXLATENT model, as shown on the right in Figure 3. The MULTIP-PE system significantly outperforms LEXLATENT according to a paired t-test with p < 0.05. Our proposed MULTIP takes advantage of Twitter's specific properties and provides complementary information to previous approaches.
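A minimal sketch of the product-of-experts combination in Equation 6, assuming each expert returns its probability for the positive (paraphrase) class; it is illustrative only, not the experiment code.

```python
def product_of_experts(p_multip, p_lexlatent):
    """Combine two binary classifiers' P(y = paraphrase) by a normalized product (Eq. 6)."""
    pos = p_multip * p_lexlatent
    neg = (1.0 - p_multip) * (1.0 - p_lexlatent)
    return pos / (pos + neg)

# Example: MULTIP is fairly confident, LEXLATENT is lukewarm.
print(product_of_experts(0.9, 0.6))  # ~0.93: agreement sharpens the combined estimate
```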
Previously, Das and Smith (2009) also used a product of experts to combine a lexical and a syntax-based model.

4 Constructing Twitter Paraphrase Corpus

We now turn to describing our data collection and annotation methodology. Our goal is to construct a high-quality dataset that contains representative examples of paraphrases and non-paraphrases in Twitter. Since Twitter users are free to talk about anything regarding any topic, a random pair of sentences about the same topic has a low chance (less than 8%) of expressing the same meaning. This causes two problems: a) it is expensive to obtain paraphrases via manual annotation; b) non-expert annotators tend to loosen the criteria and are more likely to make false positive errors. To address these challenges, we design a simple annotation task and introduce two selection mechanisms to select sentences which are more likely to be paraphrases, while preserving diversity and representativeness.

4.1 Raw Data from Twitter

We crawl Twitter's trending topics and their associated tweets using public APIs (see https://dev.twitter.com/docs/api/1.1/overview). According to Twitter, trends are determined by an algorithm which identifies topics that are immediately popular, rather than those that have been popular for longer periods of time or which trend on a daily basis. We tokenize and split each tweet into sentences using the toolkit developed by O'Connor et al. (2010), https://github.com/brendano/tweetmotif.

4.2 Task Design on Mechanical Turk

We show the annotator an original sentence, then ask them to pick sentences with the same meaning from 10 candidate sentences. The original and candidate sentences are randomly sampled from the same topic. For each such 1 vs. 10 question, we obtain binary judgements from 5 different annotators, paying each annotator $0.02 per question. On average, each question takes one annotator about 30 to 45 seconds to answer.

4.3 Annotation Quality

We remove problematic annotators by checking their Cohen's Kappa agreement (Artstein and Poesio, 2008) with other annotators. We also compute inter-annotator agreement with an expert annotator on 971 sentence pairs.
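As a concrete illustration of this quality check, the following sketch computes Cohen's Kappa between two annotators' binary judgements using the standard formula; the example labels and the pruning threshold mentioned in the comment are hypothetical, not values taken from the paper.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators' binary judgements on the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a = sum(labels_a) / n          # rate of positive judgements, annotator A
    p_b = sum(labels_b) / n          # rate of positive judgements, annotator B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)   # chance agreement
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

a = [1, 1, 0, 0, 1, 0, 0, 0]
b = [1, 0, 0, 0, 1, 0, 1, 0]
print(round(cohens_kappa(a, b), 3))
# A hypothetical filter: drop workers whose mean kappa with the others is very low, e.g. < 0.2.
```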
In the expert annotation, we adopt a 5-point Likert scale to measure the degree of semantic similarity between sentences, defined by Agirre et al. (2012) as follows:

5: Completely equivalent, as they mean the same thing;
4: Mostly equivalent, but some unimportant details differ;
3: Roughly equivalent, but some important information differs or is missing;
2: Not equivalent, but share some details;
1: Not equivalent, but are on the same topic;
0: On different topics.

Although the two scales of expert and crowdsourcing annotation are defined differently, their Pearson correlation coefficient reaches 0.735 (two-tailed significance 0.001). Figure 4 shows a heat-map representing the detailed overlap between the two annotations. It suggests that the graded similarity annotation task could be reduced to a binary choice in a crowdsourcing setup.

[Figure 4: A heat-map showing overlap between expert and crowdsourcing annotation. The intensity along the diagonal indicates good reliability of crowdsourcing workers for this particular task; the shift above the diagonal reflects the difference between the two annotation schemas. For crowdsourcing (turk), the numbers indicate how many annotators out of 5 picked the sentence pair as paraphrases; 0 and 1 are considered non-paraphrases, and 3, 4 and 5 paraphrases. For expert annotation, 0, 1 and 2 are non-paraphrases; 4 and 5 are paraphrases. Medium-scored cases are discarded in training and testing in our experiments.]

4.4 Automatic Summarization Inspired Sentence Filtering

We filter the sentences within each topic to select more probable paraphrases for annotation. Our method is inspired by a typical problem in extractive summarization: the salient sentences are likely to be redundant (paraphrases) and need to be removed from the output summaries. We employ the scoring method used in SumBasic (Nenkova and Vanderwende, 2005; Vanderwende et al., 2007), a simple but powerful summarization system, to find salient sentences. For each topic, we compute the probability of each word $P(w_i)$ by simply dividing its frequency by the total number of words in all sentences. Each sentence $s$ is scored as the average of the probabilities of the words in it, i.e.

$$\mathrm{Salience}(s) = \frac{\sum_{w_i \in s} P(w_i)}{|\{w_i \mid w_i \in s\}|} \quad (7)$$

We then rank the sentences and pick the original sentence randomly from the top 10% most salient sentences, and the candidate sentences from the top 50%, to present to the annotators (a short code sketch of this scoring and filtering is given at the end of this subsection).

In a trial experiment on 20 topics, the filtering technique doubles the yield of paraphrases, from 152 to 329 out of 2000 sentence pairs, over naïve random sampling (Figure 5 and Figure 6). We also use PINC (Chen and Dolan, 2011) to measure the quality of the paraphrases collected (Figure 7). PINC was designed to measure n-gram dissimilarity between two sentences, and in essence it is the inverse of BLEU. In general, the cases with high PINC scores include more complex and interesting rephrasings.

[Figure 5: The proportion of paraphrases (percentage of positive votes from annotators) varies greatly across different topics. Automatic filtering in Section 4.4 roughly doubles the paraphrase yield.]

[Figure 6: Numbers of paraphrases collected by different methods. The annotation efficiency (3, 4 and 5 votes are regarded as paraphrases) is significantly improved by the sentence filtering and the Multi-Armed Bandit (MAB) based topic selection.]

[Figure 7: PINC scores of the paraphrases collected. The higher the PINC, the more significant the rewording. Our proposed annotation strategy quadruples paraphrase yield, while not greatly reducing diversity as measured by PINC.]
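The sketch below illustrates the SumBasic-style salience scoring and top-percentile filtering described in Section 4.4, under simplifying assumptions (whitespace tokenization, hypothetical function and variable names); it is not the authors' pipeline.

```python
from collections import Counter

def salience_filter(sentences, original_frac=0.10, candidate_frac=0.50):
    """Score each sentence by the average unigram probability of its distinct words
    within the topic (Eq. 7), then return the pools used for sampling."""
    tokenized = {s: set(s.lower().split()) for s in sentences}
    counts = Counter(w for s in sentences for w in s.lower().split())
    total = sum(counts.values())

    def salience(words):
        return sum(counts[w] / total for w in words) / len(words)

    ranked = sorted(sentences, key=lambda s: salience(tokenized[s]), reverse=True)
    originals = ranked[:max(1, int(len(ranked) * original_frac))]   # top 10% pool
    candidates = ranked[:max(1, int(len(ranked) * candidate_frac))]  # top 50% pool
    return originals, candidates

topic_sentences = [
    "brook lopez hit a 3 and i missed it",
    "That boy Brook Lopez with a deep 3",
    "I am getting lunch now",
]
originals, candidates = salience_filter(topic_sentences)
print(originals[0], "|", len(candidates))
```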
4.5 Topic Selection using Multi-Armed Bandits (MAB) Algorithm

Another approach to increasing paraphrase yield is to choose more appropriate topics. This is particularly important because the number of paraphrases varies greatly from topic to topic, and so does the chance of encountering paraphrases during annotation (Figure 5). We treat this topic selection problem as a variation of the Multi-Armed Bandit (MAB) problem (Robbins, 1985) and adapt a greedy algorithm, the bounded ε-first algorithm of Tran-Thanh et al. (2012), to accelerate our corpus construction.

Our strategy consists of two phases. In the first, exploration phase, we dedicate a fraction, ε, of the total budget B to exploring randomly chosen arms of each slot machine (trending topic on Twitter), pulling each m times. In the second, exploitation phase, we sort all topics according to their estimated proportion of paraphrases, and sequentially annotate the $\lceil \frac{(1-\epsilon)B}{l-m} \rceil$ arms that have the highest estimated reward, until reaching a maximum of l = 10 annotations for any topic to ensure data diversity.

We tune the parameter m to 1 and ε to between 0.35 and 0.55 through simulation experiments, artificially duplicating a small amount of real annotation data. We then apply this MAB algorithm in the real world: we explore 500 random topics and then exploit 100 of them. The yield of paraphrases rises to 688 out of 2000 sentence pairs when using MAB and sentence filtering, a 4-fold increase compared to using random selection only (Figure 6).

5 Related Work

Automatic Paraphrase Identification has been widely studied (Androutsopoulos and Malakasiotis, 2010; Madnani and Dorr, 2010); the ACL Wiki gives an excellent summary of various techniques (http://aclweb.org/aclwiki/index.php?title=Paraphrase_Identification_(State_of_the_art)). Many recent high-performance approaches use system combination (Das and Smith, 2009; Madnani et al., 2012; Ji and Eisenstein, 2013). For example, Madnani et al. (2012) combine multiple sophisticated machine translation metrics using a meta-classifier. An earlier attempt on Twitter data is that of Xu et al. (2013b). They limited the search space to only those tweets that explicitly mention the same date and the same named entity; however, a considerable number of mislabeled pairs remain in their data (released at https://github.com/cocoxu/twitterparaphrase/). Zanzotto et al. (2011) also experimented with SVM tree kernel methods on Twitter data.

Departing from the previous work, we propose a latent variable model to jointly infer the correspondence between words and sentences. It is related to discriminative monolingual word alignment (MacCartney et al., 2008; Thadani and McKeown,
Chen and Dolan (2011) gathered a large-scale para- phrase corpus by asking Mechanical Turk workers to caption the action in short video segments. Sim- ilarly, Burrows et al. (2012) asked crowdsourcing workers to rewrite selected excerpts from books. Ling et al. (2014) crowdsourced bilingual parallel text using Twitter as the source of data. In contrast, we design a simple crowdsourcing task requiring only binary judgements on sentences collected from Twitter. There are several advantages as compared to existing work: a) the corpus also covers a very diverse range of topics and linguistic expressions, especially colloquial language, which is different from and thus complements previous paraphrase corpora; b) the paraphrase corpus col- lected contains a representative proportion of both negative and positive instances, while lack of good negative examples was an issue in the previous re- search (Das and Smith, 2009); c) this method is scal- able and sustainable due to the simplicity of the task and real-time, virtually unlimited text supply from Twitter. 6 Conclusions This paper introduced MULTIP, a joint word- sentence model to learn paraphrases from tempo- rally and topically grouped messages in Twitter. While simple and principled, our model achieves performance competitive with a state-of-the-art en- semble system combining latent semantic represen- tations and surface similarity. By combining our method with previous work as a product-of-experts we outperform the state-of-the-art. Our latent- variable approach is capable of learning word-level paraphrase anchors given only sentence annotations. Because our graphical model is modular and ex- tensible (for example it should be possible to re- place the deterministic-or with other aggregators), we are optimistic this work might provide a path towards weakly supervised word alignment models using only sentence-level annotations. In addition, we presented a novel and efficient annotation methodology which was used to crowd- source a unique corpus of paraphrases harvested from Twitter. We make this resource available to the research community. Acknowledgments The author would like to thank editor Sharon Gold- water and three anonymous reviewers for their thoughtful comments, which substantially improved this paper. We also thank Ralph Grishman, Sameer Singh, Yoav Artzi, Mark Yatskar, Chris Quirk, Ani Nenkova and Mitch Marcus for their feedback. This material is based in part on research spon- sored by the NSF under grant IIS-1430651, DARPA under agreement number FA8750-13-2-0017 (the DEFT program) and through a Google Faculty Re- search Award to Chris Callison-Burch. The U.S. Government is authorized to reproduce and dis- tribute reprints for governmental purposes. The views and conclusions contained in this publication are those of the authors and should not be interpreted as representing official policies or endorsements of DARPA or the U.S. Government. Yangfeng Ji is supported by a Google Faculty Research Award awarded to Jacob Eisenstein. References Agirre, E., Diab, M., Cer, D., and Gonzalez-Agirre, A. (2012). Semeval-2012 task 6: A pilot on se- mantic textual similarity. In Proceedings of the First Joint Conference on Lexical and Computa- tional Semantics (*SEM). Androutsopoulos, I. and Malakasiotis, P. (2010). A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Re- search, 38. Artstein, R. and Poesio, M. (2008). Inter-coder agreement for computational linguistics. 
Compu- tational Linguistics, 34(4). Bhagat, R. and Hovy, E. (2013). What is a para- phrase? Computational Linguistics, 39(3). Burrows, S., Potthast, M., and Stein, B. (2012). Paraphrase acquisition via crowdsourcing and machine learning. Transactions on Intelligent Systems and Technology (ACM TIST). Buzek, O., Resnik, P., and Bederson, B. B. (2010). Error driven paraphrase annotation using Me- chanical Turk. In Proceedings of the Workshop on Creating Speech and Language Data with Ama- zon’s Mechanical Turk. Chen, D. L. and Dolan, W. B. (2011). Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the As- sociation for Computational Linguistics (ACL). Collins, M. (2002). Discriminative training methods for hidden Markov models: Theory and experi- ments with perceptron algorithms. In Proceed- ings of the Conference on Empirical Methods on Natural Language Processing (EMNLP). Das, D. and Smith, N. A. (2009). Paraphrase identi- fication as probabilistic quasi-synchronous recog- nition. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th Inter- national Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP). Denkowski, M., Al-Haj, H., and Lavie, A. (2010). Turker-assisted paraphrasing for English-Arabic machine translation. In Proceedings of the Work- shop on Creating Speech and Language Data with Amazon’s Mechanical Turk. Derczynski, L., Ritter, A., Clark, S., and Bontcheva, K. (2013). Twitter part-of-speech tagging for all: Overcoming sparse and noisy data. In Proceed- ings of the Recent Advances in Natural Language Processing (RANLP). Diao, Q., Jiang, J., Zhu, F., and Lim, E.-P. (2012). Finding bursty topics from microblogs. In Pro- ceedings of the 50th Annual Meeting of the Asso- ciation for Computational Linguistics (ACL). Dietterich, T. G., Lathrop, R. H., and Lozano-Pérez, T. (1997). Solving the multiple instance prob- lem with axis-parallel rectangles. Artificial Intel- ligence, 89(1). Dolan, B., Quirk, C., and Brockett, C. (2004). Un- supervised construction of large paraphrase cor- pora: Exploiting massively parallel news sources. In Proceedings of the 20th International Confer- ence on Computational Linguistics (COLING). Dolan, W. and Brockett, C. (2005). Automatically constructing a corpus of sentential paraphrases. In Proceedings of the 3rd International Workshop on Paraphrasing. Dunning, T. (1993). Accurate methods for the statis- tics of surprise and coincidence. Computational Linguistics, 19(1). Fellbaum, C. (2010). WordNet. In Theory and Ap- plications of Ontology: Computer Applications. Springer. Francis, W. N. and Kucera, H. (1979). Brown corpus manual. Brown University. Fung, P. and Cheung, P. (2004a). Mining very-non- parallel corpora: Parallel sentence and lexicon ex- traction via bootstrapping and EM. In Proceed- ings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Fung, P. and Cheung, P. (2004b). Multi-level boot- strapping for extracting parallel sentences from a quasi-comparable corpus. In Proceedings of the International Conference on Computational Lin- guistics (COLING). Guo, W. and Diab, M. (2012). Modeling sentences in the latent space. In Proceedings of the 50th Annual Meeting of the Association for Computa- tional Linguistics (ACL). Guo, W., Li, H., Ji, H., and Diab, M. (2013). 
Link- ing tweets to news: A framework to enrich short text data in social media. In Proceedings of the 51th Annual Meeting of the Association for Com- putational Linguistics (ACL). Han, B., Cook, P., and Baldwin, T. (2012). Auto- matically constructing a normalisation dictionary for microblogs. In Proceedings of the Confer- ence on Empirical Methods on Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8). Hoffmann, R., Zhang, C., Ling, X., Zettlemoyer, L. S., and Weld, D. S. (2011). Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computa- tional Linguistics (ACL). Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., and Weischedel, R. (2006). OntoNotes: the 90% solution. In Proceedings of the Human Language Technology Conference - North American Chap- ter of the Association for Computational Linguis- tics Annual Meeting (HLT-NAACL). Ji, Y. and Eisenstein, J. (2013). Discriminative improvements to distributional sentence similar- ity. In Proceedings of the Conference on Em- pirical Methods in Natural Language Processing (EMNLP). Liang, P., Bouchard-Côté, A., Klein, D., and Taskar, B. (2006). An end-to-end discriminative approach to machine translation. In Proceedings of the 21st International Conference on Computational Lin- guistics and the 44th annual meeting of the Asso- ciation for Computational Linguistics (COLING- ACL). Ling, W., Marujo, L., Dyer, C., Alan, B., and Isabel, T. (2014). Crowdsourcing high-quality parallel data extraction from Twitter. In Proceedings of the Ninth Workshop on Statistical Machine Trans- lation (WMT). MacCartney, B., Galley, M., and Manning, C. (2008). A phrase-based alignment model for natural language inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Madnani, N. and Dorr, B. J. (2010). Generating phrasal and sentential paraphrases: A survey of data-driven methods. Computational Linguistics, 36(3). Madnani, N., Tetreault, J., and Chodorow, M. (2012). Re-examining machine translation met- rics for paraphrase identification. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT). Minnen, G., Carroll, J., and Pearce, D. (2001). Ap- plied morphological processing of english. Natu- ral Language Engineering, 7(03). Moore, R. C. (2004). On log-likelihood-ratios and the significance of rare events. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Nenkova, A. and Vanderwende, L. (2005). The im- pact of frequency on summarization. Technical report, Microsoft Research. MSR-TR-2005-101. O’Connor, B., Krieger, M., and Ahn, D. (2010). Tweetmotif: Exploratory search and topic sum- marization for Twitter. In Proceedings of the 4th International AAAI Conference on Weblogs and Social Media (ICWSM). Petrović, S., Osborne, M., and Lavrenko, V. (2012). Using paraphrases for improving first story detec- tion in news and Twitter. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics - Hu- man Language Technologies (NAACL-HLT). Riedel, S., Yao, L., and McCallum, A. (2010). Mod- eling relations and their mentions without labeled text. 
In Proceedigns of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML- PKDD). Ritter, A., Mausam, Etzioni, O., and Clark, S. (2012). Open domain event extraction from Twit- ter. In Proceedings of the 18th International Con- ference on Knowledge Discovery and Data Min- ing (SIGKDD). Ritter, A., Zettlemoyer, L., Mausam, and Etzioni, O. (2013). Modeling missing data in distant super- vision for information extraction. Transactions of the Association for Computational Linguistics (TACL). Robbins, H. (1985). Some aspects of the sequen- tial design of experiments. In Herbert Robbins Selected Papers. Springer. Sekine, S. (2005). Automatic paraphrase discovery based on context and keywords between NE pairs. In Proceedings of the 3rd International Workshop on Paraphrasing. Shinyama, Y., Sekine, S., and Sudo, K. (2002). Au- tomatic paraphrase acquisition from news articles. In Proceedings of the 2nd International Confer- ence on Human Language Technology Research (HLT). Surdeanu, M., Tibshirani, J., Nallapati, R., and Man- ning, C. D. (2012). Multi-instance multi-label learning for relation extraction. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL). Thadani, K. and McKeown, K. (2011). Optimal and syntactically-informed decoding for monolingual phrase-based alignment. In Proceedings of the 49th Annual Meeting of the Association for Com- putational Linguistics - Human Language Tech- nologies (ACL-HLT). Tran-Thanh, L., Stein, S., Rogers, A., and Jennings, N. R. (2012). Efficient crowdsourcing of un- known experts using multi-armed bandits. In Pro- ceedings of the European Conference on Artificial Intelligence (ECAI). Vanderwende, L., Suzuki, H., Brockett, C., and Nenkova, A. (2007). Beyond SumBasic: Task- focused summarization with sentence simplifica- tion and lexical expansion. Information Process- ing & Management, 43. Wan, S., Dras, M., Dale, R., and Paris, C. (2006). Using dependency-based features to take the “para-farce” out of paraphrase. In Proceedings of the Australasian Language Technology Work- shop. Wang, L., Dyer, C., Black, A. W., and Trancoso, I. (2013). Paraphrasing 4 microblog normaliza- tion. In Proceedings of the Conference on Em- pirical Methods on Natural Language Processing (EMNLP). Winkler, W. E. (1999). The state of record link- age and current research problems. Technical re- port, Statistical Research Division, U.S. Census Bureau. Xu, W., Hoffmann, R., Zhao, L., and Grishman, R. (2013a). Filling knowledge base gaps for distant supervision of relation extraction. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL). Xu, W., Ritter, A., and Grishman, R. (2013b). Gath- ering and generating paraphrases from Twitter with application to normalization. In Proceed- ings of the Sixth Workshop on Building and Using Comparable Corpora (BUCC). Yao, X., Van Durme, B., Callison-Burch, C., and Clark, P. (2013a). A lightweight and high perfor- mance monolingual word aligner. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL). Yao, X., Van Durme, B., and Clark, P. (2013b). Semi-markov phrase-based monolingual align- ment. In Proceedings of the Conference on Em- pirical Methods on Natural Language Processing (EMNLP). Zanzotto, F. M., Pennacchiotti, M., and Tsiout- siouliklis, K. (2011). Linguistic redundancy in Twitter. 
In Proceedings of the Conference on Em- pirical Methods in Natural Language Processing (EMNLP). Zettlemoyer, L. S. and Collins, M. (2007). On- line learning of relaxed CCG grammars for pars- ing to logical form. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natu- ral Language Processing and Computational Nat- ural Language Learning (EMNLP-CoNLL). Zhang, C. and Weld, D. S. (2013). Harvesting paral- lel news streams to generate paraphrases of event relations. In Proceedings of the Conference on Empirical Methods in Natural Language Process- ing (EMNLP).