Published in Transactions of the Association for Computational Linguistics, vol. 6, pp. 703-717, 2018. DOI: 10.1162/tacl_a_00250.

Learning Typed Entailment Graphs with Global Soft Constraints

Mohammad Javad Hosseini?§ Nathanael Chambers?? Siva Reddy† Xavier R. Holt‡ Shay B. Cohen? Mark Johnson‡ and Mark Steedman?
?University of Edinburgh §The Alan Turing Institute, UK ??United States Naval Academy †Stanford University ‡Macquarie University
javad.hosseini@ed.ac.uk, nchamber@usna.edu, sivar@stanford.edu
{xavier.ricketts-holt,mark.johnson}@mq.edu.au
{scohen,steedman}@inf.ed.ac.uk

Abstract

This paper presents a new method for learning typed entailment graphs from text. We extract predicate-argument structures from multiple-source news corpora, and compute local distributional similarity scores to learn entailments between predicates with typed arguments (e.g., person contracted disease). Previous work has used transitivity constraints to improve local decisions, but these constraints are intractable on large graphs. We instead propose a scalable method that learns globally consistent similarity scores based on new soft constraints that consider both the structures across typed entailment graphs and inside each graph. Learning takes only a few hours to run over 100K predicates and our results show large improvements over local similarity scores on two entailment datasets. We further show improvements over paraphrases and entailments from the Paraphrase Database, and prior state-of-the-art entailment graphs. We show that the entailment graphs improve performance in a downstream task.

1 Introduction

Recognizing textual entailment and paraphrasing is critical to many core natural language processing applications such as question-answering and semantic parsing. The surface form of a sentence that answers a question such as "Does Verizon own Yahoo?" frequently does not directly correspond to the form of the question, but is rather a paraphrase or an expression such as "Verizon bought Yahoo", that entails the answer.
The lack of a well-established form-independent semantic representation for natural language is the most important single obstacle to bridging the gap between queries and text resources. This paper seeks to learn meaning postulates (e.g., buying entails owning) that can be used to augment the standard form-dependent semantics.

[Figure 1: Examples of typed entailment graphs for arguments of types company,company and person,location.]

Our immediate goal is to learn entailment rules between typed predicates with two arguments, where the type of each predicate is determined by the types of its arguments. We construct typed entailment graphs, with typed predicates as nodes and entailment rules as edges. Figure 1 shows simple examples of such graphs with arguments of types company,company and person,location.

Entailment relations are detected by computing a similarity score between the typed predicates based on the distributional inclusion hypothesis, which states that a word (predicate) u entails another word (predicate) v if in any context that u can be used so can be v (Dagan et al., 1999; Geffet and Dagan, 2005; Herbelot and Ganesalingam, 2013; Kartsaklis and Sadrzadeh, 2016). Most previous work has taken a "local learning" approach (Lin, 1998; Weeds and Weir, 2003; Szpektor and Dagan, 2008; Schoenmackers et al., 2010), i.e., learning entailment rules independently from each other.

One problem facing local learning approaches is that many correct edges are not identified because of data sparsity, and many wrong edges are spuriously identified as valid entailments. A "global learning" approach, where dependencies between entailment rules are taken into account, can improve the local decisions significantly. Berant et al. (2011) imposed transitivity constraints on the entailments, such that the inclusion of rules i→j and j→k implies that of i→k. While they showed transitivity constraints to be effective in learning entailment graphs, the Integer Linear Programming (ILP) solution of Berant et al. (2011) is not scalable beyond a few hundred nodes. In fact, the problem of finding a maximally weighted transitive subgraph of a graph with arbitrary edge weights is NP-hard (Berant et al., 2011).

This paper instead proposes a scalable solution that does not rely on transitivity closure, but instead uses two global soft constraints that maintain structural similarity both across and within each typed entailment graph (Figure 2). We introduce an unsupervised framework to learn globally consistent similarity scores given local similarity scores (§4). Our method is highly parallelizable and takes only a few hours to apply to more than 100K predicates.1,2 Our experiments (§6) show that the global scores improve significantly over local scores and outperform state-of-the-art entailment graphs on two standard entailment rule datasets (Berant et al., 2011; Holt, 2018).

We ultimately intend the typed entailment graphs to provide a resource for entailment and paraphrase rules for use in semantic parsing and open domain question-answering, as has been done for similar resources such as the Paraphrase Database (PPDB; Ganitkevitch et al., 2013; Pavlick et al., 2015) in Wang et al. (2015); Dong et al. (2017).3 With that end in view, we have included a comparison with PPDB in our evaluation on the entailment datasets.
We also show that the learned entailment rules improve performance on a question-answering task (§7) with no tuning or prior knowledge of the task. 2 Related Work Our work is closely related to Berant et al. (2011), where entailment graphs are learned by imposing transitivity constraints on the entailment relations. However, the exact solution to the problem is not scalable beyond a few hundred predicates, while the number of predicates that we capture is two orders of magnitude larger (§5). Hence, it is nec- essary to resort to approximate methods based on 1We performed our experiments on a 32-core 2.3 GHz machine with 256GB of RAM. 2Our code, extracted binary relations and the learned entailment graphs are available at https://github.com/mjhosseini/entGraph. 3Predicates inside each clique in the entailment graphs are considered to be paraphrases. assumptions concerning the graph structure. Be- rant et al. (2012) and Berant et al. (2015) propose Tree-Node-Fix (TNF), an approximation method that scales better by additionally assuming the en- tailment graphs are “Forest Reducible", where a predicate cannot entail two (or more) predicates j and k such that neither j→k nor k→j (FRG as- sumption). However, the FRG assumption is not correct for many real-world domains. For exam- ple, a person visiting a place entails both arriving at that place and leaving that place, while the lat- ter do not necessarily entail each other. Our work injects two other types of prior knowledge about the structure of the graph that are less expensive to incorporate and yield better results on entailment rule datasets. Abend et al. (2014) learn entailment relations over multi-word predicates with different levels of compositionality. Pavlick et al. (2015) add variety of relations, including entailment, to phrase pairs in PPDB. This includes a broader range of entail- ment relations such as lexical entailment. In con- trast to our method, these works rely on supervised data and take a local learning approach. Another related strand of research is link pre- diction (Socher et al., 2013; Bordes et al., 2013; Riedel et al., 2013; Yang et al., 2015; Trouil- lon et al., 2016; Dettmers et al., 2018), where the source data are extractions from text, facts in knowledge bases, or both. Unlike our work, which directly learns entailment relations between pred- icates, these methods aim at predicting the source data, i.e., whether two entities have a particular relationship. The common wisdom is that en- tailment relations are by-product of these meth- ods (Riedel et al., 2013). However, this assump- tion has not usually been explicitly evaluated. Explicit entailment rules provide explainable re- sources that can be used in downstream tasks. Our experiments show that our method signifi- cantly outperforms a state-of-the-art link predic- tion method. 3 Computing Local Similarity Scores We first extract binary relations as predicate- argument pairs using a Combinatory Categorial Grammar (CCG; Steedman, 2000) semantic parser (§3.1). We map the arguments to their Wikipedia URLs using a named entity linker (§3.2). We ex- tract types such as person and disease for each ar- gument (§3.2). We then compute local similarity scores between predicate pairs (§3.3). 3.1 Relation Extraction The semantic parser of Reddy et al. (2014), GraphParser, is run on the NewsSpike corpus (Zhang and Weld, 2013) to extract binary re- lations between a predicate and its arguments from sentences. 
GraphParser uses CCG syn- tactic derivations and λ-calculus to convert sen- tences to neo-Davisonian semantics, a first-order logic that uses event identifiers (Parsons, 1990). For example, for the sentence, Obama visited Hawaii in 2012, GraphParser produces the logi- cal form ∃e.visit1(e, Obama)∧visit2(e, Hawaii)∧ visitin(e, 2012), where e denotes an event. We will consider a relation for each pair of ar- guments, hence, there will be three rela- tions for the above sentence: visit1,2 with ar- guments (Obama,Hawaii), visit1,in with argu- ments (Obama,2012) and visit2,in with arguments (Hawaii,2012). We currently only use extracted relations that involve two named entities or one named entity and a noun. We constrain the rela- tions to have at least one named entity to reduce ambiguity in finding entailments. We perform a few automatic post-processing steps on the output of the parser. First, we normal- ize the predicates by lemmatization of their head words. Passive predicates are mapped to active ones and we extract negations and particle verb predicates. Next, we discard unary relations and relations involving coordination of arguments. Fi- nally, whenever we see a relation between a sub- ject and an object, and a relation between object and a third argument connected by a prepositional phrase, we add a new relation between the subject and the third argument by concatenating the rela- tion name with the object. For example, for the sentence China has a border with India, we ex- tract a relation have border1,with between China and India. We perform a similar process for PPs attached to VPs. Most of the light verbs and multi- word predicates will be extracted by the above post-processing (e.g., take care1,of ) which will re- cover many salient ternary relations. While entailments and paraphrasing can bene- fit from n-ary relations, e.g., person visits a lo- cation in a time, we currently follow previous work (Lewis and Steedman, 2013a; Berant et al., 2015) in confining our attention to binary rela- tions, leaving the construction of n-ary graphs to future work. 3.2 Linking and Typing Arguments Entailment and paraphrasing depend on context. While using exact context is impractical in form- ing entailment graphs, many authors have used the type of the arguments to disambiguate polysemous predicates (Berant et al., 2011, 2015; Lewis and Steedman, 2013a; Lewis, 2014). Typing also re- duces the size of the entailment graphs. Since named entities can be referred to in many different ways, we use a named entity linking tool to normalize the named entities. In the ex- periments below, we use AIDALight (Nguyen et al., 2014), a fast and accurate named entity linker, to link named entities to their Wikipedia URLs (if any). We thus type all entities that can be grounded in Wikipedia. We first map the Wikipedia URL of the entities to Freebase (Bol- lacker et al., 2008). We select the most notable type of the entity from Freebase and map it to FIGER types (Ling and Weld, 2012) such as build- ing, disease, person and location, using only the first level of the FIGER type hierarchy.4 For exam- ple, instead of event/sports_event, we use event as type. If an entity cannot be grounded in Wikipedia or its Freebase type does not have a mapping to FIGER, we assign the default type thing to it. 3.3 Local Distributional Similarities For each typed predicate (e.g., visit1,2 with types person,location), we extract a feature vector. 
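Before turning to the similarity features, the typing step of §3.2 can be pictured as a short lookup chain. The sketch below is our own illustration, not the released implementation: the small dictionaries are hypothetical stubs standing in for AIDALight's entity linking and the Freebase-to-FIGER mapping.

```python
# Minimal sketch of the argument-typing step (Section 3.2).
# The linker and the Freebase/FIGER tables are tiny hypothetical stubs.

# Hypothetical output of the named-entity linker: surface form -> Wikipedia URL.
WIKI_LINKS = {
    "Obama": "https://en.wikipedia.org/wiki/Barack_Obama",
    "Hawaii": "https://en.wikipedia.org/wiki/Hawaii",
}

# Hypothetical Wikipedia URL -> Freebase "most notable type" lookup.
FREEBASE_NOTABLE_TYPE = {
    "https://en.wikipedia.org/wiki/Barack_Obama": "/people/person",
    "https://en.wikipedia.org/wiki/Hawaii": "/location/us_state",
}

# Hypothetical Freebase -> FIGER mapping; only the first level of the FIGER
# hierarchy is kept (e.g., location/province becomes location).
FREEBASE_TO_FIGER = {
    "/people/person": "person",
    "/location/us_state": "location/province",
}

def figer_type(argument: str) -> str:
    """Return the first-level FIGER type of an argument, or the default 'thing'."""
    url = WIKI_LINKS.get(argument)
    if url is None:                      # cannot be grounded in Wikipedia
        return "thing"
    notable = FREEBASE_NOTABLE_TYPE.get(url)
    figer = FREEBASE_TO_FIGER.get(notable)
    if figer is None:                    # no mapping to FIGER
        return "thing"
    return figer.split("/")[0]           # keep only the first level

if __name__ == "__main__":
    # "Obama visited Hawaii" yields the typed predicate visit_{1,2}(:person, :location).
    print(figer_type("Obama"), figer_type("Hawaii"))   # person location
```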
We use as feature types the set of argument-pair strings (e.g., Obama-Hawaii) that instantiate the binary relations of the predicates. The value of each feature is the pointwise mutual information (PMI) between the predicate and the feature. We use the feature vectors to compute three local similarity scores (both symmetric and directional) between typed predicates: Weeds (Weeds and Weir, 2003), Lin (Lin, 1998), and Balanced Inclusion (BInc; Szpektor and Dagan, 2008) similarities.

4 Learning Globally Consistent Entailment Graphs

We learn globally consistent similarity scores based on local similarity scores. The global scores will be used to form typed entailment graphs.

[Footnote 4: 49 types out of 113 FIGER types.]

4.1 Problem Formulation

Let $T$ be a set of types and $P$ be a set of predicates. We denote by $\bar{V}(t_1,t_2)$ the set of typed predicates $p(:t_1,:t_2)$, where $t_1,t_2 \in T$ and $p \in P$. Each $p(:t_1,:t_2) \in \bar{V}(t_1,t_2)$ takes as input arguments of types $t_1$ and $t_2$. An example of a typed predicate is win1,2(:team,:event), which can be instantiated as win1,2(Seahawks:team, Super Bowl:event). We define $V(t_1,t_2) = \bar{V}(t_1,t_2) \cup \bar{V}(t_2,t_1)$. We often denote elements of $V(t_1,t_2)$ by $i$, $j$ and $k$, where each element is a typed predicate as above. For an $i = p(:t_1,:t_2) \in V(t_1,t_2)$, we denote $\pi(i)=p$, $\tau_1(i)=t_1$ and $\tau_2(i)=t_2$. We compute distributional similarities between predicates with the same argument types. We denote by $W_0(t_1,t_2) \in [0,1]^{|V(t_1,t_2)| \times |V(t_1,t_2)|}$ the (sparse) matrix containing all local similarity scores $w^0_{ij}$ between predicates $i$ and $j$ with types $t_1$ and $t_2$, where $|V(t_1,t_2)|$ is the size of $V(t_1,t_2)$.

[Footnote 5: For each similarity measure, we define one separate matrix and run the learning algorithm separately, but for simplicity of notation, we do not show the similarity measure names.]

Predicates can entail each other with the same argument order (direct) or in the reverse order, i.e., $p(:t_1,:t_2)$ might entail $q(:t_1,:t_2)$ or $q(:t_2,:t_1)$. For graphs with the same types (e.g., $t_1=t_2=$person), we keep two copies of the predicates, one for each of the possible orderings. This allows us to model entailments with reverse argument orders, e.g., is son of1,2(:person1,:person2) → is parent of1,2(:person2,:person1).

We define $V = \bigcup_{t_1,t_2} V(t_1,t_2)$, the set of all typed predicates, and $W_0$ as a block-diagonal matrix consisting of all the local similarity matrices $W_0(t_1,t_2)$. Similarly, we define $W(t_1,t_2)$ and $W$ as the matrices consisting of the globally consistent similarity scores $w_{ij}$ we wish to learn. The global similarity scores are used to form entailment graphs by thresholding $W$. For a $\delta > 0$, we define typed entailment graphs as $G_\delta(t_1,t_2) = (V(t_1,t_2), E_\delta(t_1,t_2))$, where $V(t_1,t_2)$ are the nodes and $E_\delta(t_1,t_2) = \{(i,j) \mid i,j \in V(t_1,t_2),\, w_{ij} \geq \delta\}$ are the edges of the entailment graphs.

4.2 Learning Algorithm

Existing approaches to learning entailment graphs from text miss many correct edges because of data sparsity, i.e., the lack of explicit evidence in the corpus that a predicate $i$ entails another predicate $j$. The goal of our method is to use evidence from the existing edges that have been assigned high confidence to predict missing ones, and to remove spurious edges. We propose two global soft constraints that maintain structural similarity both across and within each typed entailment graph. The constraints are based on the following two observations.

[Figure 2: Learning entailments that are consistent (A) across different but related typed entailment graphs and (B) within each graph. $0 \leq \beta \leq 1$ determines how much different graphs are related. The dotted edges are missing, but will be recovered by considering the relationships shown by across-graph (red) and within-graph (light blue) connections.]

First, it is standard to learn a separate typed entailment graph for each (plausible) type pair because arguments provide necessary disambiguation for predicate meaning (Berant et al., 2011, 2012; Lewis and Steedman, 2013a,b; Berant et al., 2015). However, many entailment relations for which we have direct evidence only in a few subgraphs may in fact apply over many others (Figure 2A). For example, we may not have found direct evidence that mentions of a living_thing (e.g., a virus) triggering a disease are accompanied by mentions of the living_thing causing that disease (because of data sparsity), whereas we have found that mentions of a government_agency triggering an event are reliably accompanied by mentions of causing that event. While we show that typing is necessary to learning entailments (§6), we propose to learn all typed entailment graphs jointly.

Second, we encourage paraphrase predicates (where i→j and j→i) to have the same patterns of entailment (Figure 2B), i.e., to entail and be entailed by the same predicates, global soft constraints that we call paraphrase resolution. Using these soft constraints, a missing entailment (e.g., medicine treats disease → medicine is useful for disease) can be identified by considering the entailments of a paraphrase predicate (e.g., medicine cures disease → medicine is useful for disease).

$J(W \geq 0, \vec{\beta} \geq 0) = L_{\mathrm{withinGraph}} + L_{\mathrm{crossGraph}} + L_{\mathrm{pResolution}} + \lambda_1 \lVert W \rVert_1$   (1)

$L_{\mathrm{withinGraph}} = \sum_{i,j \in V} (w_{ij} - w^0_{ij})^2$   (2)

$L_{\mathrm{crossGraph}} = \frac{1}{2} \sum_{i,j \in V} \sum_{(i',j') \in N(i,j)} \beta\big(\pi(i), (\tau_1(i),\tau_2(i)), (\tau_1(i'),\tau_2(i'))\big)\, (w_{ij} - w_{i'j'})^2 + \frac{\lambda_2}{2} \lVert \vec{1} - \vec{\beta} \rVert_2^2$   (3)

$L_{\mathrm{pResolution}} = \frac{1}{2} \sum_{t_1,t_2 \in T} \sum_{\substack{i,j,k \in V(t_1,t_2) \\ k \neq i,\, k \neq j}} I_\varepsilon(w_{ij})\, I_\varepsilon(w_{ji}) \big[ (w_{ik} - w_{jk})^2 + (w_{ki} - w_{kj})^2 \big]$   (4)

Figure 3: The objective function to jointly learn global scores $W$ and the compatibility function $\beta$, given local scores $W_0$. $L_{\mathrm{withinGraph}}$ encourages global and local scores to be close; $L_{\mathrm{crossGraph}}$ encourages similarities to be consistent between different typed entailment graphs; $L_{\mathrm{pResolution}}$ encourages paraphrase predicates to have the same pattern of entailment. We use an $\ell_1$ regularization penalty to remove entailments with low confidence.

Sharing entailments across different typed entailment graphs is only semantically correct for some predicates and types. In order to learn when we can generalize an entailment from one graph to another, we define a compatibility function $\beta : P \times (T{\times}T) \times (T{\times}T) \rightarrow [0,1]$. The function is defined for a predicate and two type pairs (Figure 2A). It specifies the extent of compatibility for a single predicate between different typed entailment graphs, with 1 being completely compatible and 0 being irrelevant. In particular, $\beta\big(p, (t_1,t_2), (t'_1,t'_2)\big)$ determines how much we expect the outgoing edges of $p(:t_1,:t_2)$ and $p(:t'_1,:t'_2)$ to be similar. We constrain $\beta$ to be symmetric between $t_1,t_2$ and $t'_1,t'_2$, since the compatibility of the outgoing edges of $p(:t_1,:t_2)$ with $p(:t'_1,:t'_2)$ should be the same as that of $p(:t'_1,:t'_2)$ with $p(:t_1,:t_2)$.
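One simple way an implementation might realize the symmetry constraint on $\beta$ is to store its values under an order-independent key, as in the short sketch below. The key layout and names are our own illustration, not the released code.

```python
def beta_key(pred, types_a, types_b):
    """Key for the compatibility function beta(p, (t1,t2), (t1',t2')).
    beta is constrained to be symmetric in its two type pairs, so we store
    one value under an order-independent key."""
    a, b = tuple(types_a), tuple(types_b)
    return (pred,) + (a + b if a <= b else b + a)

# beta is a sparse map from keys to values in [0, 1]; unseen keys can be
# treated as 0 (irrelevant) until learning assigns them a value.
beta = {beta_key("trigger.1.2", ("government_agency", "event"),
                 ("living_thing", "disease")): 0.8}

# Looking the pair up in the opposite order retrieves the same value.
assert beta[beta_key("trigger.1.2", ("living_thing", "disease"),
                     ("government_agency", "event"))] == 0.8
```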
We de- note by ~β, a vectorization consisting of the values of β for all possible input predicates and types. Note that the global similarity scores W and the compatibility function ~β are not known in ad- vance. Given local similarity scores W0, we learn W and ~β jointly. We minimize the loss func- tion defined in Eq. 1 which consists of three soft constraints defined below and an `1 regularization term (Figure 3). LwithinGraph. Eq. 2 encourages global scores wij to be close to local scores w0ij, so that the global scores will not stray too far from the origi- nal scores. LcrossGraph. Eq. 3 encourages each predicate’s entailments to be similar across typed entailment graphs (Figure 2A) if the predicates have similar neighbors. We penalize the difference of entail- ments in two different graphs, when the compat- ibility function is high. For each pair of typed predicates (i,j) ∈ V (t1, t2), we define a set of neighbors (predicates with different types): N(i,j) = { (i′,j′) ∈ V (t′1, t ′ 2)|t ′ 1, t ′ 2 ∈ T, (i′,j′) 6= (i,j),π(i) = π(i′),π(j) = π(j′), a(i,j) = a(i′,j′) } , (5) where a(i,j) is true if the argument orders of i and j match, and false otherwise. For each (i′,j′) ∈ N(i,j), we penalize the difference of entailments by adding the term β(·)(wij − wi′j′ )2. We add a prior term on ~β as λ2‖~1−~β‖22, where ~1 is a vector of the same size as ~β with all 1s. Without the prior term (i.e., λ2=0), all the elements of ~β will be- come zero. Increasing λ2 will keep (some of the) elements of ~β non-zero and encourages communi- cations between related graphs. LpResolution. Eq. 4 denotes the paraphrase reso- lution global soft constraints that encourage para- phrase predicates to have the same patterns of en- tailments (Figure 2B). The function Iε(x) equals x if x > ε and zero, otherwise.6 Unlike LcrossGraph in Eq. 3, Eq. 4 operates on the edges within each graph. If both wij and wji are high, their incoming and outgoing edges from/to nodes k are encour- aged to be similar. We name this global constraint, 6In our experiments, we set ε = .3. Smaller values of ε yield similar results, but learning is slower. paraphrase resolution, since it might add missing links (e.g., i→k) if i and j are paraphrases of each other and j→k, or break the paraphrase relation, if the incoming and outgoing edges are very dif- ferent. We impose an `1 penalty on the elements of W as λ1‖W‖1, where λ1 is a nonnegative tuning hyperparameter that controls the strength of the penalty applied to the elements of W. This term removes entailments with low confidence from the entailment graphs. Note that Eq. 1 has W0 and average of W0 across different typed entailment graphs (§5.4) as its special cases. The former is achieved by setting λ1=λ2=0 and ε=1 and the lat- ter by λ1=0, λ2=∞ and ε=1. We do not explicitly weight the different components of the loss func- tion, as the effect of LcrossGraph and LpResolution can be controlled by λ2 and ε, respectively. Eq. 1 can be interpreted as an inference prob- lem in a Markov Random Field (MRF) (Kinder- mann and Snell, 1980), where the nodes of the MRF are the global scores wij and the parame- ters β ( p, (t1, t2), (t ′ 1, t ′ 2) ) . The MRF will have five log-linear factor types: one unary factor type for LwithinGraph, one three-variable factor type for the first term of LcrossGraph and a unary factor type for the prior on ~β, one four-variable factor type for LpResolution and a unary factor type for the `1 regularization term. 
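The loss in Figure 3 is straightforward to evaluate on toy data. Below is a minimal numpy sketch (ours, not the released implementation) that specializes Eqs. 1-4 to just two typed graphs defined over the same predicate vocabulary and a single scalar β; the paper's model instead sums over all graph pairs and learns a full vector β⃗.

```python
import numpy as np

def I_eps(x, eps=0.3):
    """I_eps(x) = x if x > eps, else 0 (used in Eq. 4)."""
    return x if x > eps else 0.0

def within_graph_loss(W, W0):
    """Eq. 2: keep the global scores close to the local scores."""
    return float(np.sum((W - W0) ** 2))

def paraphrase_resolution_loss(W, eps=0.3):
    """Eq. 4 for one typed graph: when i and j look like paraphrases (both
    w_ij and w_ji above eps), penalize differences between their outgoing
    edges (to every k) and their incoming edges (from every k)."""
    n, loss = W.shape[0], 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            gate = I_eps(W[i, j], eps) * I_eps(W[j, i], eps)
            if gate == 0.0:
                continue
            for k in range(n):
                if k == i or k == j:
                    continue
                loss += gate * ((W[i, k] - W[j, k]) ** 2 + (W[k, i] - W[k, j]) ** 2)
    return 0.5 * loss

def cross_graph_loss(W_a, W_b, beta, lambda2=1.5):
    """Eq. 3 specialized to two graphs over the same predicates and a single
    scalar beta: the 1/2 in Eq. 3 cancels against counting each (A, B) pair
    once from each side, leaving beta * sum of squared differences."""
    pair_term = beta * float(np.sum((W_a - W_b) ** 2))
    prior = 0.5 * lambda2 * (1.0 - beta) ** 2   # (lambda2 / 2) * ||1 - beta||^2
    return pair_term + prior

def objective(W_a, W_b, W0_a, W0_b, beta, lambda1=0.01, lambda2=1.5, eps=0.3):
    """Eq. 1: J = L_withinGraph + L_crossGraph + L_pResolution + lambda1 * ||W||_1."""
    J = within_graph_loss(W_a, W0_a) + within_graph_loss(W_b, W0_b)
    J += cross_graph_loss(W_a, W_b, beta, lambda2)
    J += paraphrase_resolution_loss(W_a, eps) + paraphrase_resolution_loss(W_b, eps)
    J += lambda1 * float(np.sum(np.abs(W_a)) + np.sum(np.abs(W_b)))
    return J

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W0_a = rng.uniform(0.0, 1.0, (4, 4))
    W0_b = rng.uniform(0.0, 1.0, (4, 4))
    # Initializing W = W0 makes L_withinGraph zero; the other terms are not.
    print(objective(W0_a.copy(), W0_b.copy(), W0_a, W0_b, beta=1.0))
```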
Figure 2 shows an example factor graph (unary factors are not shown for sim- plicity). We learn W and ~β jointly using a message passing approach based on the Block Coordinate Descent method (Xu and Yin, 2013) . We ini- tialize W = W0. Assuming that we know the global similarity scores W, we learn how much the entailments are compatible between different types (~β) and vice versa. Given W fixed, each wij sends messages to the corresponding β(·) el- ements, which will be used to update ~β. Given ~β fixed, we do one iteration of learning for each wij. Each β(·) and wij elements send messages to the related elements in W, which will be in turn up- dated. Based on the update rules (Appendix A), we always have wij ≤ 1 and ~β ≤~1. Each iteration of the learning method takes O ( ‖W‖0|T|2 + ∑ i∈V (‖wi:‖0+‖w:i‖0) 2 ) time, where ‖W‖0 is the number of nonzero elements of W (number of edges in the current graph), |T| is the number of types and ‖wi:‖0 (‖w:i‖0) is the number of nonzero elements of the ith row (col- umn) of the matrix (out-degree and in-degree of the node i).7 In practice, learning converges af- ter 5 iterations of full updates. The method is highly parallelizable, and our efficient implemen- tation does the learning in only a few hours. 5 Experimental Setup We extract binary relations from a multiple-source news corpus (§5.1) and compute local and global scores. We form entailment graphs based on the similarity scores and test our model on two entail- ment rules datasets (§5.2). We then discuss pa- rameter tuning (§5.3) and baseline systems (§5.4). 5.1 Training Corpus: Multiple-Source News We use the multiple-source NewsSpike corpus of Zhang and Weld (2013). NewsSpike was deliber- ately built to include different articles from differ- ent sources describing identical news stories. They scraped RSS news feeds from January-February 2013 and linked them to full stories collected through a web search of the RSS titles. The cor- pus contains 550K news articles (20M sentences). Since this corpus contains multiple sources cover- ing the same events, it is well-suited to our purpose of learning entailment and paraphrase relations. We extracted 29M binary relations using the procedure in Section 3.1. In our experiments, we used two cutoffs within each typed subgraph to re- duce the effect of noise in the corpus: (1) remove any argument-pair that is observed with less than C1=3 unique predicates; (2) remove any predi- cate that is observed with less than C2=3 unique argument-pairs. This leaves us with |P |=101K unique predicates in 346 entailment graphs. The maximum graph size is 53K nodes8 and the to- tal number of non-zero local scores in all graphs is 66M. In the future, we plan to test our method on an even larger corpus, but preliminary exper- iments suggest that data sparsity will persist re- gardless of the corpus size, due to the power law distribution of the terms. We compared our ex- tractions qualitatively with Stanford Open IE (Et- zioni et al., 2011; Angeli et al., 2015). Our CCG- based extraction generated noticeably better rela- 7In our experiments, the total number of edges is ≈ .01|V |2 and most of predicate pairs are seen in less than 20 subgraphs, instead of |T|2. 8There are 4 graphs with more than 20K nodes, 3 graphs with 10K to 20K nodes, and 16 graphs with 1K to 10K nodes. tions for longer sentences with long-range depen- dencies such as those involving coordination. 
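Concretely, the two cutoffs can be applied as a pair of counting filters over the extracted (predicate, argument-pair) occurrences, as in the minimal sketch below. This is our own illustration with a hypothetical input format; the paper does not say whether the two filters are iterated to a fixed point, so a single pass of each is one plausible reading.

```python
from collections import defaultdict

def apply_cutoffs(pairs, c1=3, c2=3):
    """Filter (predicate, argument_pair) extractions within one typed subgraph
    (Section 5.1): drop argument-pairs seen with fewer than c1 unique
    predicates, then predicates seen with fewer than c2 unique argument-pairs."""
    preds_of_arg = defaultdict(set)
    for pred, arg_pair in pairs:
        preds_of_arg[arg_pair].add(pred)
    kept = [(p, a) for p, a in pairs if len(preds_of_arg[a]) >= c1]

    args_of_pred = defaultdict(set)
    for pred, arg_pair in kept:
        args_of_pred[pred].add(arg_pair)
    return [(p, a) for p, a in kept if len(args_of_pred[p]) >= c2]

if __name__ == "__main__":
    toy = [("visit.1.2", ("Obama", "Hawaii")),
           ("arrive.in.1.2", ("Obama", "Hawaii")),
           ("leave.1.2", ("Obama", "Hawaii")),
           ("visit.1.2", ("Merkel", "Paris"))]   # this pair has only 1 predicate
    # Tiny toy data, so we use small cutoffs; the paper uses c1 = c2 = 3.
    print(apply_cutoffs(toy, c1=2, c2=1))
```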
5.2 Evaluation Entailment Datasets Levy/Holt’s Entailment Dataset Levy and Da- gan (2016) proposed a new annotation method (and a new dataset) for collecting relational in- ference data in context. Their method removes a major bias in other inference datasets such as Zeichner’s (Zeichner et al., 2012), where candi- date entailments were selected using a directional similarity measure. Levy & Dagan form ques- tions of the type which city (qtype), is located near (qrel), mountains (qarg)? and provide possible an- swers of the form Kyoto (aanswer), is surrounded by (arel), mountains (aarg). Annotators are shown a question with multiple possible answers, where aanswer is masked by qtype to reduce the bias to- wards world knowledge. If the annotator indicates the answer as True (False), it is interpreted that the predicate in the answer entails (does not entail) the predicate in the question. While the Levy entailment dataset removes bias, a recent evaluation identified high labeling error rate for entailments that hold only in one di- rection (Holt, 2018). Holt analyzed 150 positive examples and showed that 33% of the claimed en- tailments are correct only in the opposite direc- tion, while 15% do not entail in any direction. Holt (2018) designed a task to crowd-annotate the dataset by a) adding the reverse entailment (q→a) for each original positive entailment (a→q) in Levy’s dataset; and b) directly asking the an- notators if a positive example (or its reverse) is an entailment or not (as opposed to relying on a factoid question). We test our method on this re- annotated dataset of 18,407 examples (3,916 pos- itive and 14,491 negative), which we refer to as Levy/Holt.9 We run our CCG based binary rela- tion extraction on the examples and perform our typing procedure (§3.2) on aanswer (e.g., Kyoto) and aarg (e.g., mountains) to find the types of the arguments. We split the re-annotated dataset into dev (30%) and test (70%) such that all the exam- ples with the same qtype and qrel are assigned to only one of the sets. Berant’s Entailment Dataset Berant et al. (2011) annotated all the edges of 10 typed entail- ment graphs based on the predicates in their cor- pus. The dataset contains 3,427 edges (positive), 9www.github.com/xavi-ai/relational-implication-dataset and 35,585 non-edges (negative). We evaluate our method on all the examples of Berant’s entailment dataset. The types of this dataset do not match with FIGER types, but we perform a simple hand- mapping between their types and FIGER types.10 5.3 Parameter Tuning We selected λ1=.01 and ε=.3 based on prelim- inary experiments on the dev set of Levy/Holt’s dataset. The hyperparameter λ2 is selected from {0, 0.01, 0.1, 0.5, 1, 1.5, 2, 10,∞}.11 We do not tune λ2 for Berant’s dataset. We instead use the selected value based on the Levy/Holt dev set. In all our experiments, we remove any local score w0ij < .01. We show precision-recall curves by changing the threshold δ on the similarity scores. 5.4 Comparison We test our model by ablation of the global soft constraints LcrossGraph and LpResolution, testing simple baselines to resolve sparsity and compar- ing to the state-of-the-art resources. We also com- pare with two distributional approaches that can be used to predict predicate similarity. We com- pare the following models and resources. CG_PR is our novel model with both global soft constraints LcrossGraph and LpResolution. CG is our model without LpResolution. Local is the lo- cal distributional similarities without any change. 
AVG is the average of local scores across all the entailment graphs that contain both predicates in an entailment of interest. We set λ2 = ∞ which forces all the values of ~β to be 1, hence resulting in a uniform average of local scores. Untyped scores are local scores learned without types. We set the cutoffs C1=20 and C2=20 to have a graph with total number of edges similar to the typed entail- ment graphs. ConvE scores are cosine similarities of low- dimensional predicate representations learned by ConvE (Dettmers et al., 2018), a state-of-the-art model for link prediction. ConvE is a multi-layer convolutional network model that is highly pa- rameter efficient. We learn 200-dimensional vec- tors for each predicate (and argument) by apply- ing ConvE to the set of extractions of the above untyped graph. We learned embeddings for each predicate and its reverse to handle examples where the argument order of the two predicates are differ- 1010 mappings in total (e.g., animal to living_thing). 11The selected value was usually around 1.5. ent. Additionally, we tried TransE (Bordes et al., 2013), another link prediction method which de- spite of its simplicity, produces very competitive results in knowledge base completion. However, we do not present its full results as they were worse than ConvE.12 PPDB is based on the Paraphrase Database (PPDB) of Pavlick et al. (2015). We accept an example as entailment if it is labeled as a para- phrase or entailment in the PPDB XL lexical or phrasal collections.13 Berant_ILP is based on the entailment graphs of Berant et al. (2011).14 For Berant’s dataset, we directly compared our results to the ones reported in Berant et al. (2011). For Levy/Holt’s dataset, we used publicly available entailment rules derived from Berant et al. (2011) that gives us one point of precision and recall in the plots. While the rules are typed and can be ap- plied in a context sensitive manner, ignoring the types and applying the rules out of context yields much better results (Levy and Dagan, 2016). This is attributable to both the non-standard types used by Berant et al. (2011) and also the general data sparsity issue. In all our experiments, we first test a set of rule-based constraints introduced by Berant et al. (2011) on the examples before the prediction by our methods. In the experiments on Levy/Holt’s dataset, in order to maintain compatibility with Levy and Dagan (2016), we also run the lemma based heuristic process used by them before ap- plying our methods.We do not apply the lemma based process on Berant’s dataset in order to com- pare with Berant et al’s (2011) reported results di- rectly. In experiments with CG_PR and CG, if the typed entailment graph corresponding to an exam- ple does not have one or both predicates, we resort to the average score between all typed entailment graphs. 6 Results and Discussion To test the efficacy of our globally consistent en- tailment graphs, we compare them with the base- line systems in Section 6.1. We test the effect of approximating transitivity constraints in Section 12We also tried the average of GloVe embeddings (Pen- nington et al., 2014) of the words in each predicate, but the results were worse than ConvE. 13We also tested the largest collection (XXXL) , but the precision was very low on Berant’s dataset (below 30%). 14We also tested (Berant et al., 2015), but do not report the results as they are very similar. 
6.2. Section 6.3 concerns error analysis.

Table 1: Area under the precision-recall curve (for precision > 0.5) for different variants of the similarity measures: local, untyped, AVG, crossGraph (CG) and crossGraph + pResolution (CG_PR). We report results on two datasets. Bold indicates statistical significance (see text).

                      local   untyped   AVG    CG     CG_PR
LEVY/HOLT'S dataset
  BInc                .076    .127      .157   .162   .165
  Lin                 .074    .120      .146   .151   .149
  Weeds               .073    .115      .143   .149   .147
  ConvE               -       .112      -      -      -
BERANT'S dataset
  BInc                .138    .167      .144   .177   .179
  Lin                 .147    .158      .172   .186   .189
  Weeds               .146    .154      .171   .184   .187
  ConvE               -       .144      -      -      -

6.1 Globally Consistent Entailment Graphs

We test our method using three distributional similarity measures: Weeds similarity (Weeds and Weir, 2003), Lin similarity (Lin, 1998) and Balanced Inclusion (BInc; Szpektor and Dagan, 2008). The first two similarity measures are symmetric, while BInc is directional. [Footnote 15: Weeds similarity is the harmonic average of Weeds precision and Weeds recall, hence a symmetric measure.] Figures 4A and 4B show precision-recall curves of the different methods on Levy/Holt's and Berant's datasets, respectively, using BInc. We show the full curve for BInc as it is directional and, on the development portion of Levy/Holt's dataset, it yields better results than Weeds and Lin.

In addition, Table 1 shows the area under the precision-recall curve (AUC) for all variants of the three similarity measures. Note that each method covers a different range of precisions and recalls. We compute AUC for precisions in the range [0.5, 1], because predictions with precision better than a random guess are more important for end applications such as question-answering and semantic parsing. For each similarity measure, we tested statistical significance between the methods using bootstrap resampling with 10K experiments (Efron and Tibshirani, 1985; Koehn, 2004). In Table 1, the best result for each dataset and similarity measure is boldfaced. If the difference of another model from the best result is not statistically significant with p-value < .05, the second model is also boldfaced.

[Figure 4: Comparison of globally consistent entailment graphs to the baselines on Levy/Holt's (A) and Berant's (B) datasets. The results are compared to graphs learned under the Forest Reducible Graph assumption on Levy/Holt's (C) and Berant's (D) datasets.]

Among the distributional similarities based on BInc, BInc_CG_PR outperforms all the other models on both datasets. In comparison to the BInc score's AUC, we observe more than 100% improvement on Levy/Holt's dataset and about 30% improvement on Berant's. Given the consistent gains, our proposed model appears to alleviate the data sparsity and the noise inherent to local scores. Our method also outperforms PPDB and Berant_ILP on both datasets. The second best performing model is BInc_CG, which improves the results significantly over BInc_AVG, especially on Berant's dataset (AUC of .177 vs .144). This confirms that learning which subset of entailments should be generalized across different typed entailment graphs ($\vec{\beta}$) is effective.

The untyped models yield a single large entailment graph. It contains (noisy) edges that are not found in the smaller typed entailment graphs. Despite the noise, untyped models for all three similarity measures still perform better than the typed ones in terms of AUC. However, they do worse in the high-precision range.
For example, BInc_untyped is worse than BInc for precision > 0.85. The AVG models do surprisingly well (only about 0.5 to 3.5 below CG_PR in terms of AUC), but note that only a subset of the typed entailment graphs might have (untyped) predicates p and q of interest (usually not more than 10 typed entailment graphs out of 367 graphs). Therefore, the AVG models are gen- erally expected to outperform the untyped ones (with only one exception in our experiments), as typing has refined the entailments and averaging just improves the recall. Comparison of CG_PR with CG models confirms that explicitly encour- aging paraphrase predicates to have the same pat- terns of entailment is effective. It improves the results for BInc score, which is a directional sim- ilarity measure. We also tested applying the para- phrase resolution soft constraints alone, but the differences with the local scores were not statis- tically significant. This suggests that the para- phrase resolution is more helpful when similarities are transferred between graphs, as this can cause inconsistencies around the predicates with trans- ferred similarities, which are then resolved by the paraphrase resolution constraints. The results of the distributional representations learned by ConvE are worse than most other meth- ods. We attribute this outcome to the fact that a) while entailment relations are directional, these methods are symmetric; b) the learned embed- dings are optimized for tasks other than entailment or paraphrase detection; and c) the embeddings are learned regardless of argument types. How- ever, even the BInc_untyped baseline outperforms ConvE, showing that it is important to use a di- rectional measure that directly models entailment. We hypothesize that learning predicate represen- tations based on the distributional inclusion hy- potheses which do not have the above limitations might yield better results. 6.2 Effect of Transitivity Constraints Our largest graph has 53K nodes, we thus tested approximate methods instead of the ILP to close entailment relations under transitivity (§2). The approximate TNF method of Berant et al. (2011) did not scale to the size of our graphs with moder- ate sparsity parameters. Berant et al. (2015) also present a heuristic method, High-To-Low Forest Reducible Graph (HTL-FRG), which gets slightly better results than TNF on their dataset, and which scales to graphs of the size we work with.16 We applied the HTL-FRG method to the globally consistent similarity scores (BInc_CG_PR_HTL) and changed the threshold on the scores to get a precision-recall curve. Figures 4C and 4D show the results of this method on Levy/Holt’s and Berant’s datasets. Our experiments show, in contrast to the results of Berant et al. (2015), that the HTL-FRG method leads to worse results when applied to our global scores. This result is caused both by the use of heuristic methods in place of globally optimizing via ILP, and by the removal of many valid edges arising from the fact that the FRG assumption is not correct for many real-world domains. 16TNF did not converge after two weeks for threshold δ = .04. For δ = .12 (precisions higher than 80%), it converged, but with results slightly worse than HTL-FRG on both datasets. 
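For reference, the AUC numbers reported in Table 1 can be computed from a threshold sweep along the lines of the sketch below. This is our own code, under one plausible reading of the paper's metric: keep only curve points whose precision is at least 0.5 and integrate precision over recall with the trapezoid rule.

```python
import numpy as np

def pr_curve(scores, labels, thresholds):
    """Precision/recall of the rule set {score >= delta} for each delta."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    points = []
    for delta in thresholds:
        pred = scores >= delta
        if pred.sum() == 0:
            continue
        precision = (pred & labels).sum() / pred.sum()
        recall = (pred & labels).sum() / labels.sum()
        points.append((recall, precision))
    return sorted(points)          # sort by recall for integration

def auc_above_half_precision(points):
    """Trapezoidal area under the PR curve, restricted to precision >= 0.5."""
    kept = [(r, p) for r, p in points if p >= 0.5]
    if len(kept) < 2:
        return 0.0
    area = 0.0
    for (r0, p0), (r1, p1) in zip(kept, kept[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    scores = rng.uniform(0, 1, 1000)
    labels = rng.uniform(0, 1, 1000) < scores          # toy, score-correlated labels
    curve = pr_curve(scores, labels, np.linspace(0.01, 0.99, 50))
    print(round(auc_above_half_precision(curve), 3))
```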
Error type Example False Positive Spurious correlation (57%) Microsoft released Internet Ex- plorer → Internet Explorer was developed by Microsoft Relation nor- malization (31%) The pain may be relieved by as- pirin → The pain can be treated with aspirin Lemma based process & parsing (12%) President Kennedy came to Texas → President Kennedy came from Texas False Negative Sparsity (93%) Cape town lies at the foot of mountains → Cape town is lo- cated near mountains Wrong label & parsing (7%) Horses are imported from Aus- tralia → Horses are native to Aus- tralia Table 2: Examples of different error categories and relative frequencies. The cause of errors is boldfaced. 6.3 Error Analysis We analyzed 100 false positive (FP) and 100 false negative (FN) randomly selected examples (using BInc_CG_ST results on Levy/Holt’s dataset and at the precision level of Berant_ILP, i.e. 0.76). We present our findings in Table 2. Most of the FN errors are due to data sparsity, but a few errors are due to wrong labeling of the data and parsing er- rors. More than half of the FP errors are because of spurious correlations in the data that are captured by the similarity scores, but are not judged to con- stitute entailment by the human judges. About one third of the FP errors are because of the normal- ization we currently perform on the relations, e.g., we remove modals and auxiliaries. The remain- ing errors are mostly due to parsing and our use of Levy and Dagan’s (2016) lemma based heuristic process. 7 Extrinsic Evaluation To further test the utility of explicit entailment rules, we evaluate the learned rules on an ex- trinsic task: answer selection for machine read- ing comprehension on NewsQA, a dataset that contains questions about CNN articles (Trischler et al., 2017). Machine reading comprehension is usually evaluated by posing questions about a text passage and then assessing the answers of a system (Trischler et al., 2017). The datasets that are used for this task are often in the form of (document,question,answer) triples, where an- The board hailed Romney for his solid credentials. Who praised Mitt Romney’s credentials? Researchers announced this week that they’ve found a new gene, ALS6, which is responsible for . . . Which gene did the ALS association dis- cover ? One out of every 17 children under 3 years old in America has a food allergy, and some will outgrow their sensitivities. How many Americans suffer from food allergies? The reported compromise could itself run afoul of European labor law, opening the way for foreign workers . . . What law might the deal break? . . . Barnes & Noble CEO William Lynch said as he unveiled his company ’s Nook Tablet on Monday. Who launched the Nook Tablet? The report said opium has accounted for more than half of Afghanistan ’s gross domestic product in 2007. What makes up half of Afghanistans GDP ? Table 3: Examples where explicit entailment relations improve the rankings. The related words are boldfaced. swer is a short span of the document. Answer selection is an important task where the goal is to select the sentence(s) that contain the answer. We show improvements by adding knowledge from our learned entailments without changing the graphs or tuning them to this task in any way. Inverse sentence frequency (ISF) is a strong baseline for answer selection (Trischler et al., 2017). 
The ISF score between a sentence $S_i$ and a question $Q$ is defined as $\mathrm{ISF}(S_i, Q) = \sum_{w \in S_i \cap Q} \mathrm{IDF}(w)$, where $\mathrm{IDF}(w)$ is the inverse document frequency of the word $w$, treating each sentence in the whole corpus as one document. The state-of-the-art methods for answer selection use ISF, and by itself it already does quite well (Trischler et al., 2017; Narayan et al., 2018).

We propose to extend the ISF score with entailment rules. We define a new score

$\mathrm{ISFEnt}(S_i, Q) = \alpha\, \mathrm{ISF}(S_i, Q) + (1 - \alpha)\, |\{r_1 \in S_i, r_2 \in Q : r_1 \rightarrow r_2\}|$,

where $\alpha \in [0,1]$ is a hyper-parameter and $r_1$ and $r_2$ denote relations in the sentence and the question, respectively. The intuition is that if a sentence such as "Luka Modric sustained a fracture to his right fibula" is a paraphrase of or entails the answer of a question such as "What does Luka Modric suffer from?", it will contain the answer span. We consider an entailment decision between two typed predicates if their global similarity BInc_CG_PR is higher than a threshold δ.

We also considered entailments between unary relations (one argument) by leveraging our learned binary entailments. We split each binary entailment into two potential unary entailments. For example, the entailment visit1,2(:person,:location) → arrive1,in(:person,:location) is split into visit1(:person) → arrive1(:person) and visit2(:location) → arrivein(:location). We computed unary similarity scores by averaging over all related binary scores. This is particularly helpful when one argument is not present (e.g., adjuncts or Wh questions) or does not exactly match between the question and the answer.

Table 4: Results (in percentage) for answer selection on the NewsQA dataset.

            ACC     MRR     MAP
  ISF       36.18   48.99   48.57
  ISFEnt    37.61   50.06   49.63

We test the proposed answer selection score on NewsQA, a dataset that contains questions about CNN articles (Trischler et al., 2017). The dataset is collected in a way that encourages lexical and syntactic divergence between questions and documents. The crowdworkers who wrote questions saw only a news article headline and its summary points, but not the full article. This process encourages curiosity about the contents of the full article and prevents questions that are simple reformulations of article sentences (Trischler et al., 2017). This is a more realistic and suitable setting to test paraphrasing and entailment capabilities.

We use the development set of the dataset (5165 samples) to tune α and δ and report results on the test set (5124 examples) in Table 4. We observe about 1.4% improvement in accuracy (ACC) and 1% improvement in Mean Reciprocal Rank (MRR) and Mean Average Precision (MAP), confirming that entailment rules are helpful for answer selection. [Footnote 17: The accuracy results of Narayan et al. (2018) are not consistent with their own MRR and MAP (ACC > MRR in some cases), as they break ties between ISF scores differently when computing ACC compared to MRR and MAP. See also http://homepages.inf.ed.ac.uk/scohen/acl18external-errata.pdf.]

$w_{ij} = \mathbb{1}(c_{ij} > \lambda_1)\, (c_{ij} - \lambda_1)/\tau_{ij}$   (6)

$c_{ij} = w^0_{ij} + \sum_{(i',j') \in N(i,j)} \beta(\cdot)\, w_{i'j'} - \mathbb{1}(w_{ij} > \varepsilon)\, I_\varepsilon(w_{ji}) \sum_{k \in V(\tau_1(i),\tau_2(i))} \big[ (w_{ik} - w_{jk})^2 + (w_{ki} - w_{kj})^2 \big] + 2 \sum_{k \in V(\tau_1(i),\tau_2(i))} \big[ I_\varepsilon(w_{jk})\, I_\varepsilon(w_{kj})\, w_{ik} + I_\varepsilon(w_{ik})\, I_\varepsilon(w_{ki})\, w_{kj} \big]$   (7)

$\tau_{ij} = 1 + \sum_{(i',j') \in N(i,j)} \beta(\cdot) + 2 \sum_{k \in V(\tau_1(i),\tau_2(i))} \big[ I_\varepsilon(w_{jk})\, I_\varepsilon(w_{kj}) + I_\varepsilon(w_{ik})\, I_\varepsilon(w_{ki}) \big]$   (8)

$\beta(\cdot) = I_0\Big( 1 - \Big( \sum_{j \in V(\tau_1(i),\tau_2(i))} \sum_{(i',j') \in N(i,j)} (w_{ij} - w_{i'j'})^2 \Big) / \lambda_2 \Big)$   (9)

Figure 5: The update rules for $w_{ij}$ and $\beta(\cdot)$.
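As a concrete illustration of the ISFEnt score defined above, here is a minimal sketch (ours, not the experiment code). The entails predicate stands in for the thresholded BInc_CG_PR lookup, and the α value is an arbitrary placeholder rather than the value tuned on the NewsQA development set.

```python
def isf(sentence_words, question_words, idf):
    """ISF(S_i, Q) = sum of IDF over words shared by the sentence and question."""
    return sum(idf.get(w, 0.0) for w in set(sentence_words) & set(question_words))

def isf_ent(sentence_rels, question_rels, sentence_words, question_words,
            idf, entails, alpha=0.75):
    """ISFEnt = alpha * ISF + (1 - alpha) * #{(r1, r2): r1 in S, r2 in Q, r1 -> r2}."""
    matches = sum(1 for r1 in sentence_rels for r2 in question_rels if entails(r1, r2))
    return alpha * isf(sentence_words, question_words, idf) + (1 - alpha) * matches

if __name__ == "__main__":
    idf = {"modric": 3.2, "fracture": 2.5, "suffer": 2.0}   # toy IDF values
    entails = lambda r1, r2: (r1, r2) == ("sustain.1.2", "suffer.from.1.2")
    print(isf_ent(["sustain.1.2"], ["suffer.from.1.2"],
                  "luka modric sustained a fracture".split(),
                  "what does luka modric suffer from".split(),
                  idf, entails))
```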
Table 3 shows some of the examples where ISFEnt ranks the correct sentences higher than ISF. These examples are very challenging for methods that do not have entailment and paraphrasing knowledge, and illustrate the semantic interpretability of the entailment graphs.

We also performed a similar evaluation on the Stanford Natural Language Inference dataset (SNLI; Bowman et al., 2015) and obtained 1% improvement over a basic neural network architecture that models sentences with an n-layered LSTM (Conneau et al., 2017). However, we did not get improvements over the state-of-the-art results because only a few of the SNLI examples require external knowledge of predicate entailments. Most examples require reasoning capabilities such as A∧B → B and simple lexical entailments such as boy → person, which are often present in the training set.

8 Conclusions and Future Work

We have introduced a scalable framework to learn typed entailment graphs directly from text. We use global soft constraints to learn globally consistent entailment scores for entailment relations. Our experiments show that generalizing in this way across different but related typed entailment graphs significantly improves performance over local similarity scores on two standard text-entailment datasets. We show around a 100% increase in AUC on Levy/Holt's dataset and 30% on Berant's dataset. The method also outperforms PPDB and the prior state-of-the-art entailment graph-building approach due to Berant et al. (2011). Paraphrase resolution further improves the results. We have in addition shown the utility of entailment rules on answer selection for machine reading comprehension.

In the future, we plan to show that the global soft constraints developed in this paper can be extended to other structural properties of entailment graphs such as transitivity. Future work might also look at entailment relation learning and link prediction tasks jointly. The entailment graphs can be used to improve relation extraction, similar to Eichler et al. (2017), but covering more relations. In addition, we intend to collapse cliques in the entailment graphs into paraphrase clusters with a single relation identifier, to replace the form-dependent lexical semantics of the CCG parser with these form-independent relations (Lewis and Steedman, 2013a), and to use the entailment graphs to derive meaning postulates for use in tasks such as question-answering and the construction of knowledge graphs from text (Lewis and Steedman, 2014).

Appendix A

Figure 5 shows the update rules of the learning algorithm. The global similarity scores $w_{ij}$ are updated using Eq. 6, where $c_{ij}$ and $\tau_{ij}$ are defined in Eq. 7 and Eq. 8, respectively. $\mathbb{1}(x)$ equals 1 if the condition $x$ is satisfied and zero otherwise. The compatibility functions $\beta(\cdot)$ are updated using Eq. 9.

Acknowledgements

We thank Thomas Kober and Li Dong for helpful comments and feedback on the work, Reggie Long for preliminary experiments on openIE extractions, and Ronald Cardenas for providing baseline code for the NewsQA experiments. The authors would also like to thank Katrin Erk and the three anonymous reviewers for their valuable feedback. This work was supported in part by the Alan Turing Institute under the EPSRC grant EP/N510129/1.
The experiments were made possible by Microsoft’s donation of Azure cred- its to The Alan Turing Institute. The research was supported in part by ERC Advanced Fellow- ship GA 742137 SEMANTAX, a Google faculty award, a Bloomberg L.P. Gift award, and a Uni- versity of Edinburgh/Huawei Technologies award to Steedman. Chambers was supported in part by the National Science Foundation under Grant IIS-1617952. Steedman and Johnson were sup- ported by the Australian Research Council’s Dis- covery Projects funding scheme (project number DP160102156). References Omri Abend, Shay B. Cohen, and Mark Steedman. 2014. Lexical Inference over Multi-Word Pred- icates: A Distributional Approach. In Proceed- ings of the 52nd Annual Meeting of the Associa- tion for Computational Linguistics, pages 644– 654. Gabor Angeli, Melvin Johnson Premkumar, and Christopher D. Manning. 2015. Leveraging Linguistic Structure for Open Domain Informa- tion Extraction. In Proceedings of the 53rd An- nual Meeting of the Association for Computa- tional Linguistics, pages 344–354. Jonathan Berant, Noga Alon, Ido Dagan, and Ja- cob Goldberger. 2015. Efficient Global Learn- ing of Entailment Graphs. Computational Lin- guistics, 42:221–263. Jonathan Berant, Ido Dagan, Meni Adler, and Ja- cob Goldberger. 2012. Efficient Tree-Based Approximation for Entailment Graph Learning. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 117–125. Jonathan Berant, Jacob Goldberger, and Ido Da- gan. 2011. Global Learning of Typed Entail- ment Rules. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 610–619. Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge. In Proceedings of the ACM SIGMOD international conference on Management of data, pages 1247–1250. Antoine Bordes, Nicolas Usunier, Alberto Garcia- Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating Embeddings for Modeling Multi-Relational Data. In Advances in neural information processing systems, pages 2787– 2795. Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A Large Annotated Corpus for Learning Natural Language Inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 632–642. Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Su- pervised Learning of Universal Sentence Rep- resentations from Natural Language Inference Data. In Proceedings of the Conference on Em- pirical Methods in Natural Language Process- ing, pages 670–680. Ido Dagan, Lillian Lee, and Fernando C.N. Pereira. 1999. Similarity-Based Models of Word Cooccurrence Probabilities. Machine learning, 34(1-3):43–69. Tim Dettmers, Minervini Pasquale, Stenetorp Pontus, and Sebastian Riedel. 2018. Convolu- tional 2D Knowledge Graph Embeddings. In Proceedings of the 32th AAAI Conference on Artificial Intelligence, pages 1811–1818. Li Dong, Jonathan Mallinson, Siva Reddy, and Mirella Lapata. 2017. Learning to Paraphrase for Question Answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 875–886. Bradley Efron and Robert Tibshirani. 1985. The Bootstrap Method for Assessing Statistical Ac- curacy. Behaviormetrika, 12(17):1–35. Kathrin Eichler, Feiyu Xu, Hans Uszkoreit, and Sebastian Krause. 2017. 
Generating Pattern- Based Entailment Graphs for Relation Extrac- tion. In Proceedings of the 6th Joint Confer- ence on Lexical and Computational Semantics (* SEM 2017), pages 220–229. Oren Etzioni, Anthony Fader, Janara Christensen, Stephen Soderland, and Mausam Mausam. 2011. Open Information Extraction: The Sec- ond Generation. In Proceedings of the 22nd In- ternational Joint Conference on Artificial Intel- ligence, pages 3–10. Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The Para- phrase Database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 758– 764. Maayan Geffet and Ido Dagan. 2005. The Distri- butional Inclusion Hypotheses and Lexical En- tailment. In Proceedings of the 43rd Annual Meeting on Association for Computational Lin- guistics, pages 107–114. Aurélie Herbelot and Mohan Ganesalingam. 2013. Measuring Semantic Content in Distributional Vectors. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 440–445. Xavier R. Holt. 2018. Probabilistic Models of Relational Implication. Master’s thesis, Mac- quarie University. Dimitri Kartsaklis and Mehrnoosh Sadrzadeh. 2016. Distributional Inclusion Hypothesis for Tensor-based Composition. In Proceedings of the 26th International Conference on Compu- tational Linguistics: Technical Papers, pages 2849–2860. Ross Kindermann and J Laurie Snell. 1980. Markov Random Fields and their Applications, volume 1. American Mathematical Society. Philipp Koehn. 2004. Statistical Significance Tests for Machine Translation Evaluation. In Proceedings of the Conference on Empiri- cal Methods in Natural Language Processing, pages 388–395. Omer Levy and Ido Dagan. 2016. Annotating Re- lation Inference in Context via Question An- swering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 249–255. Mike Lewis. 2014. Combined Distributional and Logical Semantics. Ph.D. thesis, University of Edinburgh. Mike Lewis and Mark Steedman. 2013a. Com- bined Distributional and Logical Semantics. Transactions of the Association for Computa- tional Linguistics, 1:179–192. Mike Lewis and Mark Steedman. 2013b. Unsu- pervised Induction of Cross-Lingual Semantic Relations. In Proceedings of the Conference on Empirical Methods in Natural Language Pro- cessing, pages 681–692. Mike Lewis and Mark Steedman. 2014. Combin- ing Formal and Distributional Models of Tem- poral and Intensional Semantics. In Proceed- ings of the ACL Workshop on Semantic Parsing, pages 28–32. Dekang Lin. 1998. Automatic Retrieval and Clus- tering of Similar Words. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, pages 768–774. Xiao Ling and Daniel S. Weld. 2012. Fine- Grained Entity Recognition. In Proceedings of the National Conference of the Association for Advancement of Artificial Intelligence, pages 94–100. Shashi Narayan, Ronald Cardenas, Nikos Pa- pasarantopoulos, Shay B. Cohen, Mirella Lap- ata, Jiangsheng Yu, and Yi Chang. 2018. Doc- ument Modeling with External Attention For Sentence Extraction. In Proceedings of the 56th Annual Meeting of the Association for Compu- tational Linguistics, pages 2020–2030. Dat Ba Nguyen, Johannes Hoffart, Martin Theobald, and Gerhard Weikum. 2014. AIDA- light: High-Throughput Named-Entity Disam- biguation. In Workshop on Linked Data on the Web, pages 1–10. Terence Parsons. 1990. 
Events in the Semantics of English: A Study in Subatomic Semantics. MIT Press, Cambridge, MA. Ellie Pavlick, Pushpendre Rastogi, Juri Gan- itkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2015. PPDB 2.0: Better Para- phrase Ranking, Fine-Grained Entailment Rela- tions, Word Embeddings, and Style Classifica- tion. In Proceedings of the 53rd Annual Meet- ing of the Association for Computational Lin- guistics, pages 425–430. Jeffrey Pennington, Richard Socher, and Christo- pher D. Manning. 2014. GloVe: Global Vec- tors for Word Representation. In Proceedings of the Conference on Empirical Methods in Nat- ural Language Processing, pages 1532–1543. Siva Reddy, Mirella Lapata, and Mark Steed- man. 2014. Large-Scale Semantic Parsing with- out Question-Answer Pairs. Transactions of the Association for Computational Linguistics, 2:377–392. Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation Ex- traction with Matrix Factorization and Univer- sal Schemas. In Proceedings of the Conference of the North American Chapter of the Associ- ation for Computational Linguistics: Human Language Technologies, pages 74–84. Stefan Schoenmackers, Oren Etzioni, Daniel S. Weld, and Jesse Davis. 2010. Learning First- Order Horn Clauses From Web Text. In Pro- ceedings of the Conference on Empirical Meth- ods in Natural Language Processing, pages 1088–1098. Richard Socher, Danqi Chen, Christopher D. Man- ning, and Andrew Ng. 2013. Reasoning with Neural Tensor Networks for Knowledge Base Completion. In Advances in neural information processing systems, pages 926–934. Mark Steedman. 2000. The Syntactic Process. MIT Press, Cambridge, MA. Idan Szpektor and Ido Dagan. 2008. Learning En- tailment Rules for Unary Templates. In Pro- ceedings of the 22nd International Conference on Computational Linguistics, pages 849–856. Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A Ma- chine Comprehension Dataset. In Proceedings of the 2nd Workshop on Representation Learn- ing for NLP, pages 191–200. Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex Embeddings for Simple Link Pre- diction. In Proceedings of the 33rd Interna- tional Conference on International Conference on Machine Learning, pages 2071–2080. Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a Semantic Parser Overnight. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 1332–1342. Julie Weeds and David Weir. 2003. A Gen- eral Framework for Distributional Similarity. In Proceedings of the Conference on Empiri- cal Methods in Natural Language Processing, pages 81–88. Yangyang Xu and Wotao Yin. 2013. A Block Coordinate Descent Method for Regularized Multiconvex Optimization with Applications to Nonnegative Tensor Factorization and Com- pletion. SIAM Journal on imaging sciences, 6(3):1758–1789. Bishan Yang, Wen-tau Yih, Xiaodong He, Jian- feng Gao, and Li Deng. 2015. Embedding En- tities and Relations for Learning and Inference in Knowledge Bases. In Proceedings of the In- ternational Conference on Learning Represen- tations. Naomi Zeichner, Jonathan Berant, and Ido Dagan. 2012. Crowdsourcing Inference-Rule Evalua- tion. In Proceedings of the 50th Annual Meet- ing of the Association for Computational Lin- guistics, pages 156–160. Congle Zhang and Daniel S. Weld. 2013. Harvest- ing Parallel News Streams to Generate Para- phrases of Event Relations. 
In Proceedings of the Conference on Empirical Methods in Natu- ral Language Processing, pages 1776–1786.