Modeling Missing Data in Distant Supervision for Information Extraction

Alan Ritter
Machine Learning Department, Carnegie Mellon University
rittera@cs.cmu.edu

Luke Zettlemoyer, Mausam
Computer Sci. & Eng., University of Washington
{lsz,mausam}@cs.washington.edu

Oren Etzioni
Vulcan Inc., Seattle, WA
orene@vulcan.com

Abstract

Distant supervision algorithms learn information extraction models given only large, readily available databases and text collections. Most previous work has used heuristics for generating labeled data, for example assuming that facts not contained in the database are not mentioned in the text, and that facts in the database must be mentioned at least once. In this paper, we propose a new latent-variable approach that models missing data. This provides a natural way to incorporate side information, for instance modeling the intuition that text will often mention rare entities which are likely to be missing in the database. Despite the added complexity introduced by reasoning about missing data, we demonstrate that a carefully designed local search approach to inference is very accurate and scales to large datasets. Experiments demonstrate improved performance for binary and unary relation extraction when compared to learning with heuristic labels, including on average a 27% increase in area under the precision-recall curve in the binary case.

1 Introduction

This paper addresses the issue of missing data (Little and Rubin, 1986) in the context of distant supervision. The goal of distant supervision is to learn to process unstructured data, for instance to extract binary or unary relations from text (Bunescu and Mooney, 2007; Snyder and Barzilay, 2007; Wu and Weld, 2007; Mintz et al., 2009; Collins and Singer, 1999), using a large database of propositions as a distant source of supervision. In the case of binary relations, the intuition is that any sentence which mentions a pair of entities (e1 and e2) that participate in a relation, r, is likely to express the proposition r(e1, e2), so we can treat it as a positive training example of r. Figure 1 presents an example of this process.

[Figure 1: A small hypothetical database and heuristically labeled training data for the EMPLOYER relation.
Database (Person, EMPLOYER): Bibb Latané, UNC Chapel Hill; Tim Cook, Apple; Susan Wojcicki, Google.
True positive: "Bibb Latané, a professor at the University of North Carolina at Chapel Hill, published the theory in 1981."
False positive: "Tim Cook praised Apple's record revenue..."
False negative: "John P. McNamara, a professor at Washington State University's Department of Animal Sciences..."]

One question which has received little attention in previous work is how to handle the situation where information is missing, either from the text corpus or from the database. As an example, suppose the pair of entities (John P. McNamara, Washington State University) is absent from the EMPLOYER relation. In this case, the sentence in Figure 1 (and others which mention the entity pair) is effectively treated as a negative example of the relation. This is an issue of practical concern, as most databases of interest are highly incomplete; indeed, this is the reason we need to extend them by extracting information from text in the first place.
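To make the heuristic labeling step described above concrete, the following is a minimal sketch (our own illustrative code, not the authors' implementation); the `mentions` tuples and the `database` dictionary are assumed inputs, and entity matching is assumed to have been done beforehand.

```python
def heuristic_labels(mentions, database):
    """Distant-supervision labeling heuristic (a sketch).

    mentions: list of (sentence, e1, e2) tuples, one per entity-pair mention.
    database: dict mapping relation name -> set of (e1, e2) pairs.
    Each mention is labeled with every relation its entity pair participates
    in, or "NA" if the pair appears in no relation (a potential false negative).
    """
    labeled = []
    for sentence, e1, e2 in mentions:
        relations = [r for r, pairs in database.items() if (e1, e2) in pairs]
        labeled.append((sentence, e1, e2, relations or ["NA"]))
    return labeled

# Example with part of the Figure 1 database:
db = {"EMPLOYER": {("Bibb Latané", "UNC Chapel Hill"), ("Tim Cook", "Apple")}}
mentions = [("Tim Cook praised Apple's record revenue...", "Tim Cook", "Apple")]
print(heuristic_labels(mentions, db))  # labeled EMPLOYER, although it is a false positive
```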
We need to be cautious in how we handle missing data in distant supervision, because this is a case where data is not missing at random (NMAR). Whether a proposition is observed or missing in the text or database depends heavily on its truth value: given that it is true, we have some chance of observing it; if it is false, we will not observe it at all. To address this challenge, we propose a joint model of extraction from text and of the process by which propositions are observed or missing in both the database and the text. Our approach provides a natural way to incorporate side information in the form of a missing data model. For instance, popular entities such as Barack Obama already have good coverage in Freebase, so new extractions about them are more likely to be errors than extractions involving rare entities with poor coverage.

Our approach to missing data is general and can be combined with various IE solutions. As a proof of concept, we extend MultiR (Hoffmann et al., 2011), a recent model for distantly supervised information extraction, to explicitly model missing data. These extensions complicate the MAP inference problem which is used as a subroutine in learning. This motivated us to explore a variety of approaches to inference in the joint extraction and missing data model. We explore both exact inference based on A* search and efficient approximate inference using local search. Our experiments demonstrate that, with a carefully designed set of search operators, local search produces optimal solutions in most cases.

Experimental results demonstrate large performance gains over the heuristic labeling strategy on both binary relation extraction and weakly supervised named entity categorization. For example, our model obtains a 27% increase in area under the precision-recall curve on the sentence-level relation extraction task.

2 Related Work

There has been much interest in distantly supervised (also referred to as weakly supervised) training of relation extractors using databases. For example, Craven and Kumlien (1999) build a heuristically labeled dataset, using the Yeast Protein Database to label PubMed abstracts with the subcellular-localization relation. Wu and Weld (2007) heuristically annotate Wikipedia articles with facts mentioned in the infoboxes, enabling automated infobox generation for articles which do not yet contain them. Benson et al. (2011) use a database of music events taking place in New York City as a source of distant supervision to train event extractors from Twitter. Mintz et al. (2009) used a set of relations from Freebase as a distant source of supervision to learn to extract information from Wikipedia. Riedel et al. (2010), Hoffmann et al. (2011), and Surdeanu et al. (2012) presented a series of models casting distant supervision as a multiple-instance learning problem (Dietterich et al., 1997).

Recent work has begun to address the challenge of noise in heuristically labeled training data generated by distant supervision, and has proposed a variety of strategies for correcting erroneous labels. Takamatsu et al. (2012) present a generative model of the labeling process, which is used as a preprocessing step for improving the quality of labels before training relation extractors. Independently, Xu et al. (2013) analyze a random sample of 1,834 sentences from the New York Times, demonstrating that most entity pairs expressing a Freebase relation correspond to false negatives.
They apply pseudo-relevance feedback to add missing entries to the knowledge base before applying the MultiR model (Hoffmann et al., 2011). Min et al. (2013) extend the MIML model of Surdeanu et al. (2012) using a semi-supervised approach that assumes a fixed proportion of true positives for each entity pair.

The Min et al. (2013) approach is perhaps the most closely related of the recent approaches to distant supervision. However, there are a number of key differences: (1) they impose a hard constraint on the proportion of true positive examples for each entity pair, whereas we jointly model relation extraction and missing data in the text and KB; (2) they only handle the case of missing information in the database and not in the text; (3) their model, based on Surdeanu et al. (2012), uses hard discriminative EM to tune parameters, whereas we use perceptron-style updates; and (4) we evaluate various strategies for exact and approximate inference.

The issue of missing data has been extensively studied in the statistical literature (Little and Rubin, 1986; Gelman et al., 2003). Most methods for handling missing data assume that variables are missing at random (MAR): whether a variable is observed does not depend on its value. In situations where the MAR assumption is violated (for example, distantly supervised information extraction), ignoring the missing data mechanism will introduce bias. In this case it is necessary to jointly model the process of interest (e.g., information extraction) in addition to the missing data mechanism.

Another line of related work is iterative semantic bootstrapping (Brin, 1999; Agichtein and Gravano, 2000). Carlson et al. (2010) exploit constraints between relations to reduce semantic drift in the bootstrapping process; such constraints are potentially complementary to our approach of modeling missing data.

3 A Latent Variable Model for Distantly Supervised Relation Extraction

In this section we review the MultiR model (due to Hoffmann et al. (2011)) for distant supervision in the context of extracting binary relations. This model is extended to handle missing data in Section 4. We focus on binary relations to keep the discussion concrete; unary relation extraction is also possible.

Given a set of sentences, s = s_1, s_2, ..., s_n, which mention a specific pair of entities (e1 and e2), our goal is to correctly predict which relation is mentioned in each sentence, or "NA" if none of the relations under consideration is mentioned. Unlike the standard supervised learning setup, we do not observe the latent sentence-level relation mention variables, z = z_1, z_2, ..., z_n, which indicate which relation is mentioned between e1 and e2 in each sentence. Instead we only observe aggregate binary variables for each relation, d = d_1, d_2, ..., d_k, which indicate whether the proposition r_j(e1, e2) is present in the database (Freebase). Of course the question which arises is: how do we relate the aggregate-level variables, d_j, to the sentence-level relation mentions, z_i? A sensible answer to this question is a simple deterministic-OR function, which states that if there exists at least one i such that z_i = j, then d_j = 1. For example, if at least one sentence mentions that "Barack Obama was born in Honolulu", then that fact is true in aggregate; if none of the sentences mentions the relation, then the fact is assumed false.
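A minimal sketch of this deterministic-OR aggregation (variable names are ours, not from the original implementation):

```python
def deterministic_or(z, num_relations):
    """Aggregate sentence-level mention labels into fact-level variables.

    z: list of per-sentence relation indices, with None standing for "NA".
    Returns d, where d[j] = 1 iff at least one sentence mentions relation j.
    """
    d = [0] * num_relations
    for z_i in z:
        if z_i is not None:
            d[z_i] = 1
    return d

# e.g., three sentences, two of which mention relation 0 (say, Born-In):
print(deterministic_or([0, None, 0], num_relations=3))  # [1, 0, 0]
```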
The model also makes the converse assumption: if Freebase contains the relation BIRTHLOCATION(Barack Obama, Honolulu), then we must extract it from at least one sentence. A summary of this model, which is due to Hoffmann et al. (2011), is presented in Figure 2.

[Figure 2: MultiR (Hoffmann et al., 2011). Local extractors P(z_i = r | s_i) ∝ exp(θ · f(s_i, r)) connect the sentences s_1, ..., s_n to the relation mention variables z_1, ..., z_n, which a deterministic OR aggregates into the relation variables d_1, ..., d_k (e.g., Born-In, Lived-In, children) for an entity pair such as (Barack Obama, Honolulu).]

3.1 Learning

To learn the parameters of the sentence-level relation mention classifier, θ, we maximize the likelihood of the facts observed in Freebase conditioned on the sentences in our text corpus:

$$\theta^* = \arg\max_\theta P(d \mid s; \theta) = \arg\max_\theta \prod_{e_1, e_2} \sum_z P(z, d \mid s; \theta)$$

Here the conditional likelihood for a given entity pair is defined as follows:

$$P(z, d \mid s; \theta) = \prod_{i=1}^{n} \phi(z_i, s_i; \theta) \times \prod_{j=1}^{k} \omega(z, d_j) = \prod_{i=1}^{n} e^{\theta \cdot f(z_i, s_i)} \times \prod_{j=1}^{k} \mathbb{1}_{\neg d_j \oplus \exists i: z_i = j}$$

where 1_x is an indicator variable which takes the value 1 if x is true and 0 otherwise, the ω(z, d_j) factors are hard constraints corresponding to the deterministic-OR function, and f(z_i, s_i) is a vector of features extracted from sentence s_i and relation z_i.

An iterative gradient-ascent based approach is used to tune θ with a latent-variable perceptron-style additive update scheme (Collins, 2002; Liang et al., 2006; Zettlemoyer and Collins, 2007). The gradient of the conditional log likelihood for a single pair of entities, e1 and e2, is as follows (for details see Koller and Friedman (2009), Chapter 20):

$$\frac{\partial \log P(d \mid s; \theta)}{\partial \theta} = E_{P(z \mid s, d; \theta)}\left[ \sum_i f(s_i, z_i) \right] - E_{P(z, d \mid s; \theta)}\left[ \sum_i f(s_i, z_i) \right]$$

These expectations are too difficult to compute in practice, so instead they are approximated as maximizations. Computing this approximation to the gradient requires solving two inference problems corresponding to the two maximizations:

$$z^*_{DB} = \arg\max_z P(z \mid s, d; \theta) \qquad\qquad z^* = \arg\max_z P(z, d \mid s; \theta)$$

The MAP solution for the second term is easy to compute: because d and z are deterministically related, we can simply find the highest scoring relation, r, for each sentence, s_i, according to the sentence-level factors, φ, independently. The first term is more difficult, however, as it requires finding the best assignment to the sentence-level hidden variables z = z_1, ..., z_n conditioned on the observed sentences and the facts in the database. Hoffmann et al. (2011) show how this reduces to a well-known weighted edge cover problem which can be solved exactly in polynomial time.
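As a rough illustration of this update scheme (a sketch under our own assumptions, not the authors' implementation): the approximate gradient step adds the features of the database-constrained MAP assignment and subtracts those of the unconstrained one. Here `constrained_map`, `unconstrained_map`, and `features` are hypothetical helpers standing in for the inference routines and feature extraction described above.

```python
def perceptron_update(theta, sentences, facts, constrained_map,
                      unconstrained_map, features, lr=1.0):
    """One latent-variable perceptron-style update for a single entity pair.

    theta: dict mapping feature name -> weight (updated in place).
    facts: the relations observed in Freebase for this pair (d).
    Both *_map helpers are assumed to return one relation label per sentence.
    """
    z_db = constrained_map(sentences, facts, theta)   # approximates argmax_z P(z | s, d)
    z_free = unconstrained_map(sentences, theta)      # approximates argmax_{z,d} P(z, d | s)
    for s_i, z_pos, z_neg in zip(sentences, z_db, z_free):
        for feat, value in features(s_i, z_pos).items():
            theta[feat] = theta.get(feat, 0.0) + lr * value
        for feat, value in features(s_i, z_neg).items():
            theta[feat] = theta.get(feat, 0.0) - lr * value
    return theta
```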
4 Modeling Missing Data

The model presented in Section 3 makes two assumptions which correspond to hard constraints:

1. If a fact is not found in the database, it cannot be mentioned in the text.
2. If a fact is in the database, it must be mentioned in at least one sentence.

These assumptions drive the learning; however, if information is missing from either the text or the database, they lead to errors in the training data (false positives and false negatives, respectively). In order to gracefully handle the problem of missing data, we propose to extend the model presented in Section 3 by splitting the aggregate-level variables, d, into two parts: t, which represents whether a fact is mentioned in the text (in at least one sentence), and d′, which represents whether the fact is mentioned in the database. We introduce pairwise potentials ψ(t_j, d_j) which penalize disagreement between t_j and d_j, that is:

$$\psi(t_j, d_j) = \begin{cases} -\alpha_{MIT} & \text{if } t_j = 0 \text{ and } d_j = 1 \\ -\alpha_{MID} & \text{if } t_j = 1 \text{ and } d_j = 0 \\ 0 & \text{otherwise} \end{cases}$$

where α_MIT (Missing In Text) and α_MID (Missing In Database) are parameters of the model which can be understood as penalties for missing information in the text and the database, respectively. We refer to this model as DNMAR (for Distant Supervision with Data Not Missing At Random). A graphical model representation is presented in Figure 3.

[Figure 3: DNMAR. As in Figure 2, but each aggregate variable is split into t_j (mentioned in text) and d_j (mentioned in the database), connected by the pairwise potentials ψ(t_j, d_j).]

This model can be understood as relaxing the two hard constraints mentioned above into soft constraints. As we show in Section 7, simply relaxing these hard constraints into soft constraints and setting the two parameters α_MIT and α_MID by hand on development data results in a large improvement in precision at comparable recall over MultiR on two different applications of distant supervision: binary relation extraction and named entity categorization.

Inference in this model becomes more challenging, however, because the constrained inference problem no longer reduces to a weighted edge cover problem as before. In Section 5, we present an inference technique for the new model which is time and memory efficient and almost always finds an exact MAP solution.

Learning proceeds analogously to Section 3.1, with the exception that we now also maximize over the additional aggregate-level hidden variables t which have been introduced. As before, MAP inference is a subroutine in learning, both for the unconstrained case corresponding to the second term (which is again trivial to compute) and for the constrained case, which is more challenging as it no longer reduces to a weighted edge cover problem.

5 MAP Inference

The only difference in the new inference problem is the addition of t; z and t are deterministically related, so we can simply find a MAP assignment to z, from which t follows. The resulting inference problem can be viewed as optimization under soft constraints, where the objective includes a term, −α_MID, for each fact not in Freebase which is extracted from the text, and an effective reward, α_MIT, for extracting a fact which is contained in Freebase. The solution to the MAP inference problem is the value of z which maximizes the following objective:

$$z^*_{DB} = \arg\max_z P(z \mid s, d; \theta, \alpha) = \arg\max_z \sum_{i=1}^{n} \theta \cdot f(z_i, s_i) + \sum_{j=1}^{k} \left( \alpha_{MIT} \mathbb{1}_{d_j \wedge \exists i: z_i = j} - \alpha_{MID} \mathbb{1}_{\neg d_j \wedge \exists i: z_i = j} \right) \quad (1)$$

Whether we choose to set the parameters α_MIT and α_MID to fixed values (Section 4) or to incorporate side information through a missing data model (Section 6), inference becomes more challenging than in the model where facts observed in Freebase are treated as hard constraints (Section 3); the hard constraints are equivalent to setting α_MID = α_MIT = ∞.

We now present exact and approximate approaches to inference. Standard search methods such as A* and branch and bound have high computation and memory requirements and are therefore only feasible on problems with few variables (each entity pair defines an inference problem in which the number of variables is equal to the number of sentences that mention the pair); they are, however, guaranteed to find an optimal solution. Approximate methods scale to large problem sizes, but we lose the guarantee of finding an optimal solution.
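A small sketch of the objective in Equation 1 as a scoring function (our own illustrative code; `phi(i, r)` stands for the sentence-level score θ · f(s_i, r)):

```python
def dnmar_objective(z, in_db, phi, alpha_mit, alpha_mid):
    """Score a joint assignment z under Equation 1 for one entity pair.

    z:      list of per-sentence relation labels (None means "NA").
    in_db:  dict mapping relation -> bool, whether r(e1, e2) is in Freebase.
    phi:    callable phi(i, r) returning the sentence-level score theta . f(s_i, r).
    """
    score = sum(phi(i, r) for i, r in enumerate(z) if r is not None)
    extracted = {r for r in z if r is not None}          # relations with t_j = 1
    for r in extracted:
        if in_db.get(r, False):
            score += alpha_mit   # reward for extracting a fact contained in Freebase
        else:
            score -= alpha_mid   # penalty for extracting a fact missing from Freebase
    return score
```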
After showing how to find guaranteed exact solutions for small problem sizes (e.g., up to 200 variables), we present an inference algorithm based on local search which is empirically shown to find optimal solutions in almost every case, by comparing its solutions to those found by A*.

5.1 A* Search

We cast exact MAP inference in the DNMAR model as an application of A* search. Each partial hypothesis, h, in the search space corresponds to a partial assignment of the first m variables in z; to expand a hypothesis, we generate k new hypotheses, where k is the total number of relations. Each new hypothesis h′ contains the same partial assignment to z_1, ..., z_m as h, with each h′ having a different value of z_{m+1} = r. A* operates by maintaining a priority queue of hypotheses to expand, with each hypothesis' priority determined by an admissible heuristic. The heuristic represents an upper bound on the score of the best solution with h's partial variable assignment under the objective from Equation 1. In general, a tighter upper bound corresponds to a better heuristic and faster solutions.

To upper bound our objective, we start with the φ(z_i, s_i) factors from the partial assignment. Each unassigned variable (i > m) is scored at its maximum possible value, max_r φ(r, s_i), independently. Next, to account for the effect of the aggregate ψ(t_j, d_j) factors on the unassigned variables, we consider independently changing each unassigned z_i variable for each ψ(t_j, d_j) factor to improve the overall score. This approach can lead to inconsistencies, but it provides a good upper bound on the best possible solution with a partial assignment to z_1, ..., z_m.

5.2 Local Search

While A* is guaranteed to find an exact solution, its time and memory requirements prohibit its use on large problems involving many variables. As a more scalable alternative, we propose a greedy hill climbing method (Russell et al., 1996), which starts with a full assignment to z and repeatedly moves to the best neighboring solution z′ according to the objective in Equation 1. The neighborhood of z is defined by a set of search operators. If none of the neighboring solutions has a higher score, then we have reached a (local) maximum, at which point the algorithm terminates with the current solution, which may or may not correspond to a global maximum. This process is repeated using a number of random restarts, and the best local maximum is returned as the solution.

Search Operators: We start with a standard search operator, which considers changing each relation-mention variable, z_i, individually to maximize the overall score. At each iteration, all z_i are considered, and the one which produces the largest improvement to the overall score is changed to form the neighboring solution, z′. Unfortunately, this definition of the solution neighborhood is prone to poor local optima, because it is often necessary to traverse many low-scoring states before changing one of the aggregate variables, t_j, and achieving a higher score from the associated aggregate factor, ψ(t_j, d_j). For example, consider a case where the proposition r(e1, e2) is not in Freebase but is mentioned many times in the text, and imagine that the current solution contains no mention z_i = r. Any neighboring solution which assigns a mention to r will incur the penalty α_MID, which could outweigh the benefit, φ(r, s_i) − φ(z_i, s_i), of changing any individual z_i to r. If multiple mentions were changed to r, however, together they could outweigh the penalty for extracting a fact not in Freebase and produce an overall higher score.
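The skeleton of this greedy hill-climbing procedure might look as follows (an illustrative sketch, not the authors' code; random restarts are omitted, and the aggregate-level operator introduced next would simply contribute additional candidate neighbors):

```python
def greedy_hill_climb(z_init, relations, objective):
    """Greedy local search over z using the pointwise search operator.

    z_init:     initial full assignment (list of relation labels, None = "NA").
    relations:  candidate labels for each position, including None.
    objective:  callable scoring a full assignment, e.g. the Equation 1 score.
    """
    z = list(z_init)
    while True:
        current = objective(z)
        best_gain, best_move = 0.0, None
        for i in range(len(z)):
            for r in relations:
                if r == z[i]:
                    continue
                candidate = z[:i] + [r] + z[i + 1:]
                gain = objective(candidate) - current
                if gain > best_gain:
                    best_gain, best_move = gain, (i, r)
        if best_move is None:        # local maximum reached
            return z
        i, r = best_move
        z[i] = r
```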
To avoid the problem of getting stuck in local optima, we propose an additional search operator which considers changing all variables z_i that are currently assigned to a specific relation r to a new relation r′, resulting in an additional (k − 1)^2 possible neighbors beyond the n × (k − 1) neighbors which come from the standard search operator. This aggregate-level search operator allows for more global moves which help to avoid local optima, similar to the type-level sampling approach for MCMC (Liang et al., 2010).

At each iteration, we consider all n × (k − 1) + (k − 1)^2 possible neighboring solutions generated by both search operators and pick the one with the biggest overall improvement, or terminate the algorithm if no improvement can be made over the current solution. Twenty random restarts were used for each inference problem. We found this approach to almost always find an optimal solution: in over 100,000 problems with 200 or fewer variables from the New York Times dataset used in Section 7, an optimal solution was missed in only 3 cases, as verified by comparing against optimal solutions found using A*. Without the aggregate-level search operator, local search almost always gets stuck in a local maximum.

6 Incorporating Side Information

In Section 4, we relaxed the hard constraints made by MultiR, allowing for missing information in either the text or the database and enabling errors in the distantly supervised training data to be naturally corrected as a side-effect of learning. We made the simplifying assumption, however, that all facts are equally likely to be missing from the text or the database, which is encoded in the choice of two fixed parameters, α_MIT and α_MID. Is it possible to improve performance by incorporating side information in the form of a missing data model (Little and Rubin, 1986), taking into account how likely each fact is to be observed in the text and the database conditioned on its truth value? In our setting, the missing data model corresponds to choosing the values of α_MIT and α_MID dynamically based on the entities and relations involved.

Popular Entities: Consider two entities: Barack Obama, the 44th president of the United States, and Donald Parry, a professional rugby league footballer of the 1980s (see http://en.wikipedia.org/wiki/Donald_Parry). Since Obama is much more well known than Parry, we would not be very surprised to see information about Parry missing from Freebase, but it would seem odd if true propositions about Obama were missing. We can encode these intuitions by choosing entity-specific values of α_MID:

$$\alpha_{MID}^{(e_1, e_2)} = -\gamma \min\left(c(e_1), c(e_2)\right)$$

where c(e_i) is the number of times e_i appears in Freebase, which is used as an estimate of its coverage.
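A minimal sketch of this entity-specific choice of α_MID (our own code; `freebase_count` is a hypothetical lookup of entity occurrence counts in Freebase, and the sign convention simply follows the formula above as it plugs into the ψ / Equation 1 terms):

```python
def alpha_mid_for_pair(e1, e2, freebase_count, gamma):
    """Entity-specific alpha_MID, following alpha_MID = -gamma * min(c(e1), c(e2)).

    freebase_count: dict mapping entity name -> number of appearances in Freebase,
                    used as an estimate of the entity's coverage.
    gamma:          scalar tuned on development data.
    Pairs involving rare entities (low coverage) yield values of small magnitude,
    so the disagreement penalty tied to them is weak, reflecting that missing
    database entries for rare entities are unsurprising.
    """
    coverage = min(freebase_count.get(e1, 0), freebase_count.get(e2, 0))
    return -gamma * coverage
```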
Well Aligned Relations: Given that a pair of entities, e1 and e2, participating in a Freebase relation, r, appear together in a sentence s_i, the chance that s_i expresses r varies greatly depending on r. For example, if a sentence mentions a pair of entities which participate in both the COUNTRYCAPITAL relation and the LOCATIONCONTAINS relation (for example, Moscow and Russia), it is more likely that a random sentence will express LOCATIONCONTAINS than COUNTRYCAPITAL. We can encode this preference for matching certain relations over others by setting α^r_MIT on a per-relation basis. We choose a different value of α^r_MIT for each relation based on a quick inspection of the data and an estimate of the number of true positives. Relations such as contains, place lived, and nationality, which contain a large number of true positive matches, are assigned a large value α^r_MIT = γ_large; those with a medium number, such as capital, place of death, and administrative divisions, are assigned a medium value γ_medium; and those relations with few matches are assigned a small value γ_small. These three parameters were tuned on held-out development data.

7 Experiments

In Section 5, we presented a scalable approach to inference in the DNMAR model which almost always finds an optimal solution. Of course the real question is: does modeling missing data improve performance at extracting information from text? In this section we present experimental results showing large improvements in both precision and recall on two distantly supervised learning tasks: binary relation extraction and named entity categorization.

7.1 Binary Relation Extraction

For the sake of comparison to previous work, we evaluate performance on the New York Times text, features, and Freebase relations developed by Riedel et al. (2010), which were also used by Hoffmann et al. (2011). This dataset is constructed by extracting named entities from 1.8 million New York Times articles, which are then matched against entities in Freebase. Sentences which contain pairs of entities participating in one or more relations are then used as training examples for those relations. The sentence-level features include word sequences appearing in context with the pair of entities, in addition to part-of-speech sequences and dependency paths from the Malt parser (Nivre et al., 2004).

7.1.1 Baseline

To evaluate the effect of modeling missing data in distant supervision, we compare against the MultiR model (Hoffmann et al., 2011), a state-of-the-art approach for binary relation extraction which is the most similar previous work and which models facts in Freebase as hard constraints, disallowing the possibility of missing information in either the text or the database. To make our experiment as controlled as possible and rule out the possibility of differences in performance due to implementation details, we compare against our own re-implementation of MultiR, which reproduces Hoffmann et al.'s performance and shares as much code as possible with the DNMAR model.

7.1.2 Experimental Setup

We evaluate binary relation extraction using two evaluations. We first evaluate on a sentence-level extraction task using a manually annotated dataset provided by Hoffmann et al. (2011) (available at http://raphaelhoffmann.com/mr/). This dataset consists of sentences paired with human judgments on whether each expresses a specific relation. Secondly, we perform an automatic evaluation which compares propositions extracted from text against held-out data from Freebase.

7.1.3 Results

Sentential Extraction: Figure 4 presents precision and recall curves for the sentence-level relation extraction task on the same manually annotated data presented by Hoffmann et al. (2011). By explicitly modeling the possibility of missing information in both the text and the database, we achieve a 17% increase in area under the precision-recall curve. Incorporating additional side information in the form of a missing data model, as described in Section 6, produces even better performance, yielding a 27% increase over the baseline in area under the curve.
We also compare against the system described by Xu et al. (2013) (hereinafter called Xu13). To do this, we trained our implementation of MultiR on the labels predicted by their pseudo-relevance feedback model (we thank Wei Xu for making this data available). The differences between each pair of systems are significant with p-value less than 0.05 according to a paired t-test assuming a normal distribution, with the exception of DNMAR and Xu13: DNMAR has a 1.3% increase in AUC over Xu13, a difference which is not significant, whereas DNMAR* achieves a 10% increase in AUC over Xu13, which is significant.

[Figure 4: Overall precision and recall on the sentence-level extraction task, comparing against human judgments (curves for MultiR, Xu13, DNMAR, and DNMAR*). DNMAR* incorporates side information as discussed in Section 6.]

Per-relation precision and recall curves are presented in Figure 6. For certain relations, for instance /location/us_state/capital, there simply is not enough overlap between the information contained in Freebase and the facts mentioned in the text to learn anything useful. For these relations, entity pair matches are unlikely to actually express the relation; for instance, in the following sentence from the data:

NHPF, which has its Louisiana office in Baton Rouge, gets the funds ...

although Baton Rouge is the capital of Louisiana, the /location/us_state/capital relation is not expressed in this sentence. Another interesting observation which we can make from Figure 6 is that the benefit from modeling missing data varies from one relation to another. Some relations, for instance /people/person/place_of_birth, have relatively good coverage in both Freebase and the text, and therefore we do not see as much gain from modeling missing data. Other relations, such as /location/location/contains and /people/person/place_lived, have poorer coverage, making our missing data model very beneficial.

[Figure 5: Aggregate-level automatic evaluation comparing against held-out data from Freebase (curves for MultiR, DNMAR, and DNMAR*). DNMAR* incorporates side information as discussed in Section 6.]

Aggregate Extraction: Following previous work, we evaluate precision and recall against held-out data from Freebase in Figure 5. As mentioned by Mintz et al. (2009), this automatic evaluation underestimates precision, because many facts correctly extracted from the text are missing in the database and are therefore judged as incorrect. Riedel et al. (2013) further argue that this evaluation is biased because frequent entity pairs are more likely to have facts in Freebase, so systems which rank extractions involving popular entities higher will achieve better performance independently of how accurate their predictions are. Indeed, in Figure 5 we see that the precision of our system which models missing data is generally lower than that of the system which assumes no data is missing from Freebase, although we do roughly double the recall.
By better modeling missing data, we achieve lower precision on this automatic held-out evaluation because the system using hard constraints is explicitly trained to predict facts which occur in Freebase (not those which are mentioned in the text but unlikely to appear in the database).

[Figure 6: Per-relation precision and recall on the sentence-level relation extraction task, with one panel per relation: business_company_founders, business_person_company, location_country_administrative_divisions, location_country_capital, location_location_contains, location_neighborhood_neighborhood_of, location_us_state_capital, people_deceased_person_place_of_death, people_person_children, people_person_nationality, people_person_place_lived, and people_person_place_of_birth. The dashed line corresponds to MultiR, DNMAR is the solid line, and DNMAR*, which incorporates side information, is represented by the dotted line.]

7.2 Named Entity Categorization

As mentioned previously, the problem of missing data in distant (weak) supervision is a very general issue; so far we have investigated this problem in the context of extracting binary relations using distant supervision. We now turn to the problem of weakly supervised named entity recognition (Collins and Singer, 1999; Talukdar and Pereira, 2010).

7.2.1 Experimental Setup

To demonstrate the effect of modeling missing data in the distantly supervised named entity categorization task, we adapt the MultiR and DNMAR models to the Twitter named entity categorization dataset presented by Ritter et al. (2011). The models described so far are applied unchanged: rather than modeling a set of relations in Freebase between a pair of entities, e1 and e2, we now model a set of possible Freebase categories associated with a single entity e. This is a natural extension of distant supervision from binary to unary relations. The unlabeled data and features described by Ritter et al. (2011) are used for training the model, and their manually annotated Twitter named entity dataset is used for evaluation.

7.2.2 Results

Precision and recall at weakly supervised named entity categorization, comparing MultiR against DNMAR, is presented in Figure 7. We observe a substantial improvement in precision at comparable recall by explicitly modeling the possibility of missing information in the text and database. The missing data model leads to a 107% increase in area under the precision-recall curve (from 0.16 to 0.34), but still falls short of the results presented by Ritter et al. (2011). Intuitively this makes sense, because the model used by Ritter et al. (2011)
is based on latent Dirichlet allocation, which is better suited to this highly ambiguous unary relation data.

[Figure 7: Precision and recall at the named entity categorization task (curves for NER_MultiR and NER_DNMAR).]

8 Conclusions

In this paper we have investigated the problem of missing data in distant supervision; we introduced a joint model of information extraction and missing data which relaxes the hard constraints used in previous work to generate heuristic labels, and which provides a natural way to incorporate side information through a missing data model. Efficient inference breaks in the new model, so we presented an approach based on A* search which is guaranteed to find exact solutions; however, exact inference is not computationally tractable for large problems. To address the challenge of large problem sizes, we proposed a scalable inference algorithm based on local search, which includes a set of aggregate search operators allowing for long-distance jumps in the solution space to avoid local maxima; this approach was experimentally demonstrated to find exact solutions in almost every case. Finally, we evaluated the performance of our model on the tasks of binary relation extraction and named entity categorization, showing large performance gains in each case.

In future work we would like to apply our approach to modeling missing data to additional models, for instance those of Surdeanu et al. (2012) and Ritter et al. (2011), and also to explore new missing data models.

Acknowledgements

The authors would like to thank Dan Weld, Chris Quirk, Raphael Hoffmann, and the anonymous reviewers for helpful comments. Thanks to Wei Xu for providing data. This research was supported in part by ONR grant N00014-11-1-0294, DARPA contract FA8750-09-C-0179, a gift from Google, and a gift from Vulcan Inc., and was carried out at the University of Washington's Turing Center.

References

Eugene Agichtein and Luis Gravano. 2000. Snowball: Extracting relations from large plain-text collections. In Proceedings of the Fifth ACM Conference on Digital Libraries.

Edward Benson, Aria Haghighi, and Regina Barzilay. 2011. Event discovery in social media feeds. In Proceedings of ACL.

Sergey Brin. 1999. Extracting patterns and relations from the world wide web. In The World Wide Web and Databases.

Razvan Bunescu and Raymond Mooney. 2007. Learning to extract relations from the web using minimal supervision. In Proceedings of ACL.

Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. 2010. Toward an architecture for never-ending language learning. In Proceedings of AAAI.

Michael Collins and Yoram Singer. 1999. Unsupervised models for named entity classification. In Proceedings of EMNLP.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of EMNLP.

Mark Craven and Johan Kumlien. 1999. Constructing biological knowledge bases by extracting information from text sources. In Proceedings of ISMB.

Thomas G. Dietterich, Richard H. Lathrop, and Tomás Lozano-Pérez. 1997. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence.

Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. 2003. Bayesian Data Analysis. CRC Press.
Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of ACL-HLT.

Daphne Koller and Nir Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Percy Liang, Alexandre Bouchard-Côté, Dan Klein, and Ben Taskar. 2006. An end-to-end discriminative approach to machine translation. In Proceedings of ACL.

Percy Liang, Michael I. Jordan, and Dan Klein. 2010. Type-based MCMC. In Proceedings of ACL.

Roderick J. A. Little and Donald B. Rubin. 1986. Statistical Analysis with Missing Data. John Wiley & Sons, Inc., New York, NY, USA.

Bonan Min, Ralph Grishman, Li Wan, Chang Wang, and David Gondek. 2013. Distant supervision for relation extraction with an incomplete knowledge base. In Proceedings of NAACL-HLT.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of ACL-IJCNLP.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2004. Memory-based dependency parsing. In Proceedings of CoNLL.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Proceedings of ECML/PKDD.

Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In Proceedings of NAACL-HLT.

Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of EMNLP.

Stuart J. Russell, Peter Norvig, John F. Candy, Jitendra M. Malik, and Douglas D. Edwards. 1996. Artificial Intelligence: A Modern Approach.

Benjamin Snyder and Regina Barzilay. 2007. Database-text alignment via structured multilabel classification. In Proceedings of IJCAI.

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of EMNLP-CoNLL.

Shingo Takamatsu, Issei Sato, and Hiroshi Nakagawa. 2012. Reducing wrong labels in distant supervision for relation extraction. In Proceedings of ACL.

Partha Pratim Talukdar and Fernando Pereira. 2010. Experiments in graph-based semi-supervised learning methods for class-instance acquisition. In Proceedings of ACL.

Fei Wu and Daniel S. Weld. 2007. Autonomously semantifying Wikipedia. In Proceedings of CIKM.

Wei Xu, Raphael Hoffmann, Le Zhao, and Ralph Grishman. 2013. Filling knowledge base gaps for distant supervision of relation extraction. In Proceedings of ACL.

Luke S. Zettlemoyer and Michael Collins. 2007. Online learning of relaxed CCG grammars for parsing to logical form. In Proceedings of EMNLP-CoNLL.