Domain-Targeted, High Precision Knowledge Extraction

Bhavana Dalvi Mishra, Niket Tandon, Peter Clark
Allen Institute for Artificial Intelligence
2157 N Northlake Way Suite 110, Seattle, WA 98103
{bhavanad,nikett,peterc}@allenai.org

Abstract

Our goal is to construct a domain-targeted, high precision knowledge base (KB), containing general (subject,predicate,object) statements about the world, in support of a downstream question-answering (QA) application. Despite recent advances in information extraction (IE) techniques, no suitable resource for our task already exists; existing resources are either too noisy, too named-entity centric, or too incomplete, and typically have not been constructed with a clear scope or purpose. To address these, we have created a domain-targeted, high precision knowledge extraction pipeline, leveraging Open IE, crowdsourcing, and a novel canonical schema learning algorithm (called CASI), that produces high precision knowledge targeted to a particular domain - in our case, elementary science. To measure the KB's coverage of the target domain's knowledge (its "comprehensiveness" with respect to science) we measure recall with respect to an independent corpus of domain text, and show that our pipeline produces output with over 80% precision and 23% recall with respect to that target, a substantially higher coverage of tuple-expressible science knowledge than other comparable resources. We have made the KB publicly available1.

1 This KB, named the "Aristo Tuple KB", is available for download at http://data.allenai.org/tuple-kb

1 Introduction

While there have been substantial advances in knowledge extraction techniques, the availability of high precision, general knowledge about the world remains elusive. Specifically, our goal is a large, high precision body of (subject,predicate,object) statements relevant to elementary science, to support a downstream QA application task. Although there are several impressive, existing resources that can contribute to our endeavor, e.g., NELL (Carlson et al., 2010), ConceptNet (Speer and Havasi, 2013), WordNet (Fellbaum, 1998), WebChild (Tandon et al., 2014), Yago (Suchanek et al., 2007), FreeBase (Bollacker et al., 2008), and ReVerb-15M (Fader et al., 2011), their applicability is limited by both

• limited coverage of general knowledge (e.g., FreeBase and NELL primarily contain knowledge about Named Entities; WordNet uses only a few (< 10) semantic relations)

• low precision (e.g., many ConceptNet assertions express idiosyncratic rather than general knowledge)

Our goal in this work is to create a domain-targeted knowledge extraction pipeline that can overcome these limitations and output a high precision KB of triples relevant to our end task. Our approach leverages existing techniques of open information extraction (Open IE) and crowdsourcing, along with a novel schema learning algorithm.

There are three main contributions of this work. First, we present a high precision extraction pipeline able to extract (subject,predicate,object) tuples relevant to a domain with precision in excess of 80%. The input to the pipeline is a corpus, a sense-disambiguated domain vocabulary, and a small set of entity types. The pipeline uses a combination of text filtering, Open IE, Turker annotation on samples, and precision prediction to generate its output.
Second, we present a novel canonical schema induction method (called CASI) that identifies clusters of similar-meaning predicates, and maps them to the most appropriate general predicate that captures that canonical meaning. Open IE, used in the early part of our pipeline, generates triples containing a large number of predicates (expressed as verbs or verb phrases), but equivalences and generalizations among them are not captured. Synonym dictionaries, paraphrase databases, and verb taxonomies can help identify these relationships, but only partially so, because the meaning of a verb often shifts as its subject and object vary, something that these resources do not explicitly model. To address this challenge, we have developed a corpus-driven method that takes into account the subject and object of the verb, and thus can learn argument-specific mapping rules, e.g., the rule "(x:Animal,found in,y:Location) → (x:Animal,live in,y:Location)" states that if some animal is found in a location then the animal also lives in that location. Note that 'found in' can have a very different meaning in the schema "(x:Substance,found in,y:Material)". The result is a KB whose general predicates are more richly populated, still with high precision.

Finally, we contribute the science KB itself as a resource publicly available2 to the research community. To measure how "complete" the KB is with respect to the target domain (elementary science), we use an (independent) corpus of domain text to characterize the target science knowledge, and measure the KB's recall at high (>80%) precision over that corpus (its "comprehensiveness" with respect to science). This measure is similar to recall at the point P=80% on the PR curve, except measured against a domain-specific sample of data that reflects the distribution of the target domain knowledge. Comprehensiveness thus gives us an approximate notion of the completeness of the KB for (tuple-expressible) facts in our target domain, something that has been lacking in earlier KB construction research. We show that our KB has comprehensiveness (recall of domain facts at >80% precision) of 23% with respect to science, a substantially higher coverage of tuple-expressible science knowledge than other comparable resources. We are making the KB publicly available.

2 The Aristo Tuple KB is available for download at http://allenai.org/data/aristo-tuple-kb

Outline

We discuss the related work in Section 2. In Section 3, we describe the domain-targeted pipeline, including how the domain is characterized to the algorithm and the sequence of filters and predictors used. In Section 4, we describe how the relationships between predicates in the domain are identified and the more general predicates further populated. Finally, in Section 5, we evaluate our approach, including evaluating its comprehensiveness (high-precision coverage of science knowledge).

2 Related Work

There has been substantial, recent progress in knowledge bases that (primarily) encode knowledge about Named Entities, including Freebase (Bollacker et al., 2008), Knowledge Vault (Dong et al., 2014), DBPedia (Auer et al., 2007), and others that hierarchically organize nouns and named entities, e.g., Yago (Suchanek et al., 2007).
While these KBs are rich in facts about named entities, they are sparse in general knowledge about common nouns (e.g., that bears have fur). KBs covering general knowledge have received less attention, although there are some notable exceptions constructed using manual methods, e.g., WordNet (Fellbaum, 1998), crowdsourcing, e.g., ConceptNet (Speer and Havasi, 2013), and, more recently, using automated methods, e.g., WebChild (Tandon et al., 2014). While useful, these resources have been constructed to target only a small set of relations, providing only limited coverage for a domain of interest.

To overcome relation sparseness, the paradigm of Open IE (Banko et al., 2007; Soderland et al., 2013) extracts knowledge from text using an open set of relationships, and has been used to successfully build large-scale (arg1,relation,arg2) resources such as ReVerb-15M (containing 15 million general triples) (Fader et al., 2011). However, although broad in coverage, Open IE techniques typically produce noisy output. Our extraction pipeline can be viewed as an extension of the Open IE paradigm: we start with targeted Open IE output, and then apply a sequence of filters to substantially improve the output's precision, and learn and apply relationships between predicates.

Figure 1: The extraction pipeline. A vocabulary-guided sequence of open information extraction, crowdsourcing, and learning of predicate relationships is used to produce high precision tuples relevant to the domain of interest.

The task of finding and exploiting relationships between different predicates requires identifying both equivalence between relations (e.g., clustering to find paraphrases), and implication (hierarchical organization of relations). One class of approach is to use existing resources, e.g., verb taxonomies, as a source of verbal relationships, e.g., (Grycner and Weikum, 2014), (Grycner et al., 2015). However, the hierarchical relationship between verbs, out of context, is often unclear, and some verbs, e.g., "have", are ambiguous. To address this, we characterize semantic relationships not only by a verb but also by the types of its arguments. A second class of approach is to induce semantic equivalence from data, e.g., using algorithms such as DIRT (Lin and Pantel, 2001), RESOLVER (Yates and Etzioni, 2009), WiseNet (Moro and Navigli, 2012), and AMIE (Galárraga et al., 2013). These allow relational equivalences to be inferred, but are also noisy. In our pipeline, we combine these two approaches, by clustering relations using a similarity measure computed from both existing resources and data.

A novel feature of our approach is that we not only cluster the (typed) relations, but also identify a canonical relation that all the other relations in a cluster can be mapped to, without recourse to human-annotated training data or a target relational vocabulary (e.g., from Freebase). This makes our problem setting different from that of universal schema (Riedel et al., 2013), where the clusters of relations are not explicitly represented and mapping to canonical relations can be achieved given an existing KB like Freebase. Although no existing methods can be directly applied in our problem setting, the AMIE-based schema clustering method of (Galárraga et al., 2014) can be modified to do this also. We have implemented this modification (called AMIE*, described in Section 5.3), and we use it as a baseline to compare our schema clustering method (CASI) against.
Finally, interactive methods have been used to create common sense knowledge bases. For example, ConceptNet (Speer and Havasi, 2013; Liu and Singh, 2004) includes a substantial amount of knowledge manually contributed by people through a Web-based interface, and has been used in numerous applications (Faaborg and Lieberman, 2006; Dinakar et al., 2012). More recently there has been work on interactive methods (Dalvi et al., 2016; Wolfe et al., 2015; Soderland et al., 2013), which can be seen as a "machine teaching" approach to KB construction. These approaches focus on human-in-the-loop methods to create domain-specific knowledge bases, and have proven effective in domains where expert human input is available. In contrast, our goal is to create extraction techniques that need little human supervision, and result in comprehensive coverage of the target domain.

3 The Extraction Pipeline

We first describe the overall extraction pipeline. The pipeline is a chain of filters and transformations, outputting (subject,predicate,object) triples at the end. It uses a novel combination of familiar technologies, plus a novel schema learning module, described in more detail in Section 4.

3.1 Inputs and Outputs

Unlike many prior efforts, our goal is a domain-focused KB. To specify the KB's extent and focus, we use two inputs:

1. A domain vocabulary listing the nouns and verbs relevant to the domain. In our particular application, the domain is elementary science, and the domain vocabulary is the typical vocabulary of a Fourth Grader (~10 year old child), augmented with additional science terms from 4th Grade Science texts, comprising about 6000 nouns, 2000 verbs, 2000 adjectives, and 600 adverbs.

2. A small set of types for the nouns, listing the primary types of entity relevant to the domain. In our domain, we use a manually constructed inventory of 45 types (animal, artifact, body part, measuring instrument, etc.).

In addition, the pipeline also uses:

3. A large, searchable text corpus to provide sentences for knowledge extraction. In our case, we use the Web via a search engine (Bing), followed by filters to extract clean sentences from search results.

3.2 Word Senses

Although, in general, nouns are ambiguous, in a targeted domain there is typically a clear, primary sense that can be identified. For example, while in general the word "pig" can refer to an animal, a person, a mold, or a block of metal, in 4th Grade Science it universally refers to an animal3. We leverage this for our task by assuming one sense per noun in the domain vocabulary, and notate these senses by manually assigning each noun to one of the entity types in the type inventory.

3 There are exceptions, e.g., in 4th Grade Science "bat" can refer to either the animal or the sporting implement, but these cases are rare.

Verbs are more challenging, because even within a domain they are often polysemous out of context (e.g., "have"). To handle this, we refer to verbs along with their argument types, the combination expressed as a verbal schema, e.g., (Animal,"have",BodyPart). This allows us to distinguish different contextual uses of a verb without introducing a proliferation of verb sense symbols. Others have taken a similar approach of using type restrictions to express verb semantics (Pantel et al., 2007; Del Corro et al., 2014).
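To make this representation concrete, the following minimal sketch (in Python) shows how a noun is mapped to its single in-domain type and how a verb is contextualized by the types of its arguments. The class name and the specific type assignments are our own illustrations, not part of the released pipeline.

from dataclasses import dataclass

# One sense per noun: each vocabulary noun is manually assigned a single type
# from the 45-type inventory (only a hypothetical fragment is shown here).
NOUN_TYPE = {
    "pig": "Animal",
    "bear": "Animal",
    "fur": "BodyPart",
    "cave": "Location",
}

@dataclass(frozen=True)
class Schema:
    """A verbal schema: a verb phrase typed by its argument types,
    e.g., (Animal, "have", BodyPart)."""
    subj_type: str
    verb_phrase: str
    obj_type: str

def schema_of(subj_head: str, verb_phrase: str, obj_head: str) -> Schema:
    """Lift a headword tuple onto its verbal schema via the noun-to-type map."""
    return Schema(NOUN_TYPE[subj_head], verb_phrase, NOUN_TYPE[obj_head])

# (pig, have, fur) and (bear, have, fur) share the schema (Animal, "have", BodyPart),
# while (bear, live in, cave) maps to (Animal, "live in", Location), keeping the
# different contextual uses of a verb distinct without separate verb sense symbols.
print(schema_of("pig", "have", "fur"))
print(schema_of("bear", "live in", "cave"))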
3.3 The Pipeline

The pipeline is sketched in Figure 1 and exemplified in Table 1, and consists of six steps:

3.3.1 Sentence Selection

The first step is to construct a collection of (loosely) domain-appropriate sentences from the larger corpus. There are multiple ways this could be done, but in our case we found the most effective way was as follows:

a. List the core topics in the domain of interest (science), here producing 81 topics derived from syllabus guides.

b. For each topic, author 1-3 query templates, parameterized using one or more of the 45 domain types. For example, for the topic "animal adaptation", a template was "[Animal] adaptation environment", parameterized by the type Animal. The purpose of query templates is to steer the search engine to domain-relevant text.

c. For each template, automatically instantiate its type(s) in all possible ways using the domain vocabulary members of those types.

d. Use each instantiation as a search query over the corpus, and collect sentences in the top (here, 10) documents retrieved.

In our case, this resulted in a generally domain-relevant corpus of 7M sentences.

3.3.2 Tuple Generation

Second, we run an open information extraction system over the sentences to generate an initial set of (np,vp,np) tuples. In our case, we use OpenIE 4.2 (Soderland et al., 2013; Mausam et al., 2012).

3.3.3 Headword Extraction and Filtering

Third, the np arguments are replaced with their headwords, by applying a simple headword filtering utility. We discard tuples with infrequent vps or verbal schemas (here vp frequency < 10, schema frequency < 5).

Table 1: Illustrative outputs of each step of the pipeline for the term "leaf".
Inputs: corpus + vocabulary + types
1. Sentence selection: "In addition, green leaves have chlorophyll."
2. Tuple Generation: ("green leaves", "have", "chlorophyll")
3. Headword Extraction: ("leaf", "have", "chlorophyll")
4. Refinement and Scoring: ("leaf", "have", "chlorophyll") @0.89 (score)
5. Phrasal tuple generation: ("leaf", "have", "chlorophyll") @0.89; ("green leaf", "have", "chlorophyll") @0.89
6. Relation Canonicalization: ("leaf", "have", "chlorophyll") @0.89; ("green leaf", "have", "chlorophyll") @0.89; ("leaf", "contain", "chlorophyll") @0.89; ("green leaf", "contain", "chlorophyll") @0.89

3.3.4 Refinement and Scoring

Fourth, to improve precision, Turkers are asked to manually score a proportion (in our case, 15%) of the tuples, then a model is constructed from this data to score the remainder. For the Turk task, Turkers were asked to label each tuple as true or false/nonsense. Each tuple is labeled 3 times, and a majority vote is applied to yield the overall label. The semantics we apply to tuples (and which we explain to Turkers) is one of plausibility: if the fact is true for some of the arg1's, then score it as true. For example, if it is true that some birds lay eggs, then the tuple (bird, lay, egg) should be marked true. The degree of manual vs. automated scoring can be selected here depending on the precision/cost constraints of the end application.

We then build a model using this data to predict scores on other tuples. For this model, we use logistic regression applied to a set of tuple features. These tuple features include normalized count features, schema and type level features, PMI statistics, and semantic features.
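Before describing each feature family, here is a minimal sketch of this scoring step. The feature names and input format below are illustrative stand-ins for the features listed above, not the released implementation.

from sklearn.linear_model import LogisticRegression
import numpy as np

def featurize(t):
    """Illustrative feature vector for one headword tuple; the actual feature
    set (described next) includes normalized counts, schema/type features,
    PMI statistics, and semantic features."""
    return np.array([
        t["normalized_extraction_count"],  # count-based evidence
        t["num_unique_sentences"],         # distinct supporting sentences
        t["schema_frequency"],             # frequency of the tuple's schema
        t["argument_abstractness"],        # abstract vs. concrete arguments
        t["has_modal_verb"],               # modality in the source sentence
        t["triple_pmi"],                   # PMI from the Google n-gram corpus
    ])

def train_scorer(labeled_tuples, majority_labels):
    """Fit the scorer on the ~15% Turker-labeled sample (labels are the
    majority votes); scores for the remaining tuples come from predict_proba."""
    X = np.stack([featurize(t) for t in labeled_tuples])
    return LogisticRegression().fit(X, majority_labels)

# scorer = train_scorer(labeled, labels)
# score = scorer.predict_proba(featurize(t).reshape(1, -1))[0, 1]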
Normalized count features are based on the number of occurrences of the tuple, and the number of unique sentences the tuple is extracted from. Schema and type level features are derived from the subject and object types, and the frequency of the schema in the corpus. Semantic features are based on whether the subject and object are abstract vs. concrete (using Turney et al.'s abstractness database (Turney et al., 2011)), and whether there are any modal verbs (e.g., may, should, etc.) in the original sentence. PMI features are derived from the count statistics of the subject, predicate, object and entire triple in the Google n-gram corpus (Brants and Franz, 2006).

3.3.5 Phrasal Tuple Generation

Fifth, for each headword tuple (n,vp,n), retrieve the original phrasal triples (np,vp,np) it was derived from, and add sub-phrase versions of these phrasal tuples to the KB. For example, if a headword tuple (cat, chase, mouse) was derived from (A black furry cat, chased, a grey mouse) then the algorithm considers adding
(black cat, chase, mouse)
(black furry cat, chase, mouse)
(black cat, chase, grey mouse)
(black furry cat, chase, grey mouse)
Valid noun phrases are those following a pattern of (zero or more) modifiers followed by the head noun, as in the example above. The system only retains constructed phrasal tuples for which both subject and object phrases satisfy PMI and count thresholds4, computed using the Google N-gram corpus (Brants and Franz, 2006). In general, if the headword tuple is scored as correct and the PMI and count thresholds are met, then the phrasal originals and variants are also correct. (We evaluate this in Section 5.2.)

4 e.g., "black bear" is a usable phrase provided it occurs > k1 times in the N-gram corpus and log[p("black bear") / (p("black") · p("bear"))] > k2 in the N-gram corpus, where constants k1 and k2 were chosen to optimize performance on a small test set.

3.3.6 Canonical Schema Induction

Finally, we induce a set of schema mapping rules over the tuples that identify clusters of equivalent and similar relations, and map them to a canonical, generalized relation. These canonical, generalized relations are referred to as canonical schemas, and the induction algorithm is called CASI (Canonical Schema Induction). The rules are then applied to the tuples, resulting in additional general tuples being added to the KB. The importance of this step is that generalizations among seemingly disparate tuples are made explicit. While we could then discard tuples that are mapped to a generalized form, we instead retain them in case a query is made to the KB that requires the original fine-grained distinctions. In the next section, we describe how these schema mapping rules are learned.

4 Canonical Schema Induction (CASI)

4.1 Task: Induce schema mapping rules

The role of the schema mapping rules is to make generalizations among seemingly disparate tuples explicit in the KB. To do this, the system identifies clusters of relations with similar meaning, and maps them to a canonical, generalized relation. The mappings are expressed using a set of schema mapping rules, and the rules can be applied to infer additional, general triples in the KB. Informally, mapping rules should combine evidence from both external resources (e.g., verb taxonomies) and data (tuples in the KB). This observation allows us to formally define an objective function to guide the search for mapping rules. We define:

• a schema is a structure (type1, verb phrase, type2), where the types are from the input type inventory.
• a schema mapping rule is a rule of the form schemai → schemaj, stating that a triple using schemai can be re-expressed using schemaj.

• a canonical schema is a schema that does not occur on the left-hand side of any mapping rule, i.e., it does not point to any other schema.

To learn a set of schema mapping rules, we select from the space of possible mapping rules so as to:

• maximize the quality of the selected mapping rules, i.e., maximize the evidence that the selected rules express valid paraphrases or generalizations. That is, we are looking for synonymous and type-of edges between schemas. This evidence is drawn from both existing resources (e.g., WordNet) and from statistical evidence (among the tuples themselves).

• satisfy the constraint that every schema points to a canonical schema, or is itself a canonical schema.

We can view this task as a subgraph selection problem in which the nodes are schemas, and directed edges are possible mapping rules between schemas. The learning task is to select subgraphs such that all nodes in a subgraph are similar, and point to a single, canonical node (Figure 2). We refer to the blue nodes in Figure 2 as induced canonical schemas. To solve this selection problem, we formulate it as a linear optimization task and solve it using integer linear programming (ILP), as we now describe.

Figure 2: Learning schema mapping rules can be viewed as a subgraph selection problem, whose result (illustrated) is a set of clusters of similar schemas, all pointing to a single, canonical form.

4.2 Features for learning schema mapping rules

To assess the quality of candidate mapping rules, we combine features from the following sources: Moby, WordNet, association rules, and statistical features from our corpus. These features indicate synonymy or type-of links between schemas. For each schema Si, e.g., (Animal, live in, Location), we define the relation ri as being the verb phrase (e.g., "live in"), and vi as the root verb of ri (e.g., "live").

• Moby: We use verb phrase similarity scores derived from the Moby thesaurus. The Moby score Mij for a schema pair is computed by a lookup in this dataset for the relation pair ri, rj or the root verb pair vi, vj. This is a directed feature, i.e., Mij ≠ Mji.

• WordNet: If there exists a troponym link path from ri to rj, then we define the WordNet score Wij for this schema pair as the inverse of the number of edges that need to be traveled to reach rj from ri. If such a path does not exist, then we look for a path from vi to vj. Since we do not know the exact WordNet synset applicable for each schema, we consider all possible synset choices and pick the best score as Wij. This is a directed feature, i.e., Wij ≠ Wji.
Note that even though WordNet is a high quality resource, it is not completely sufficient for our purposes. Out of 955 unique relations (verb phrases) in our KB, only 455 (47%) are present in WordNet. We can deal with these out-of-WordNet verb phrases by relying on the other sets of features described next.

• AMIE: AMIE is an association rule mining system that can produce association rules of the form "?a eat ?b → ?a consume ?b". We have two sets of AMIE features: typed and untyped. Untyped features are of the form ri → rj, e.g., eat → consume, whereas typed features are of the form Si → Sj, e.g., (Animal,eat,Food) → (Animal,consume,Food). AMIE produces real-valued scores5 between 0 and 1 for each rule. We define AUij and ATij as the untyped and typed AMIE rule scores respectively.

5 We use the PCA confidence scores produced by AMIE.

• Specificity: We define the specificity of each relation as its IDF score in terms of the number of argument pairs it occurs with, compared to the total number of argument type pairs in the corpus. The specificity score of a schema mapping rule favors more general predicates on the parent side of the rules:

specificity(r) = IDF(r)
SP(r) = specificity(r) / max_{r'} specificity(r')
Sij = SP(ri) − SP(rj)

Further, we have a small set of very generic relations like "have" and "be" that are treated as relation stopwords by setting their SP(r) scores to 1.

These features encode different aspects of similarity between schemas, as described in Table 2. In this work we combine semantic, high-quality features from WordNet and the Moby thesaurus with weak distributional similarity features from AMIE to generate schema mapping rules. We have observed that thesaurus features are very effective for predicates which are less ambiguous, e.g., eat, consume, live in. Association rule features, on the other hand, have evidence for predicates which are very ambiguous, e.g., have, be. Thus these features are complementary. Further, these features indicate different kinds of relations between two schemas: synonymy, type-of, and temporal implication (refer to Table 2). In this work, we want to learn the schema mapping rules that capture synonymy and type-of relations and discard the temporal implications. This makes our problem setting different from that of knowledge base completion methods, e.g., (Socher et al., 2013). Our proposed method CASI uses an ensemble of semantic and statistical features, enabling us to promote the synonymy and type-of edges, and to select the most general schema as the canonical schema per cluster.

Table 2: The different features used in relation canonicalization capture different aspects of similarity.
Feature source | Type (semantic, distributional) | Uses which parts of schema? (subject, predicate, object) | What kind of relations do they encode? (synonym, type-of, temporal implication)
Moby | X X X X
WordNet | X X X
AMIE-typed | X X X X X X
AMIE-untyped | X X X X X X

4.3 ILP model used in CASI

The features described in Section 4.2 provide partial support for possible schema mapping rules in our dataset. The final set of rules we select needs to comply with the asymmetry, transitive closure, and at-most-one-parent-per-schema constraints. We use an integer linear program to find the optimal set of schema mapping rules that satisfy these constraints, shown formally in Figure 3.

We decompose the schema mapping problem into multiple independent sub-problems by considering schemas related to a pair of argument types; e.g., all schemas that have domain or range types Animal, Location would be considered as a separate sub-problem. This way we can scale our method to large sets of schemas. The ILP for each sub-problem is presented in Equation 1.

maximize_{Xij}   Σ_{i,j} Xij · (λ1·Mij + λ2·Wij + λ3·ATij + λ4·AUij + λ5·Sij) − δ·‖X‖1
subject to:
  Xij ∈ {0,1}   ∀⟨i,j⟩            (Xij are boolean)
  Xij + Xji ≤ 1   ∀ i,j            (schema mapping relation is asymmetric)
  Σ_j Xij ≤ 1   ∀ i                (select at most one parent per schema)
  Xij + Xjk − Xik ≤ 1   ∀⟨i,j,k⟩   (schema mapping relation is transitive)      (1)

Figure 3: The ILP used for canonical schema induction.

In Equation 1, each Xij is a boolean variable representing whether we pick the schema mapping rule Si → Sj. As described in Section 4.2, Mij, Wij, ATij, AUij, and Sij represent the scores produced by the Moby, WordNet, AMIE-typed, AMIE-untyped, and Specificity features respectively for the schema mapping rule Si → Sj. The objective function maximizes the weighted combination of these scores. Further, the solution picked by this ILP satisfies constraints such as asymmetry, transitive closure, and at most one parent per schema. We also apply an L1 sparsity penalty on X, retaining only those schema mapping edges for which the model is reasonably confident.
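To make the formulation concrete, here is a minimal sketch of one such sub-problem. It uses the open-source PuLP library purely for illustration (the actual system uses the SCPSolver engine, discussed below), and it assumes the combined feature evidence for each candidate rule has already been computed.

import pulp

def solve_casi_subproblem(n, score, delta=0.7):
    """Select schema mapping rules X[i, j] (schema_i -> schema_j) for one
    sub-problem, as in Equation 1. score(i, j) stands for the precomputed
    weighted sum lambda1*M_ij + lambda2*W_ij + lambda3*AT_ij + lambda4*AU_ij
    + lambda5*S_ij."""
    prob = pulp.LpProblem("casi_subproblem", pulp.LpMaximize)
    X = {(i, j): pulp.LpVariable(f"x_{i}_{j}", cat="Binary")
         for i in range(n) for j in range(n) if i != j}

    # Objective: evidence for the selected rules minus delta * ||X||_1
    # (for binary X, the L1 norm is simply the number of selected rules).
    prob += pulp.lpSum(x * (score(i, j) - delta) for (i, j), x in X.items())

    for i in range(n):
        # Select at most one parent (canonical target) per schema.
        prob += pulp.lpSum(X[i, j] for j in range(n) if j != i) <= 1
        for j in range(n):
            if i == j:
                continue
            # The schema mapping relation is asymmetric.
            prob += X[i, j] + X[j, i] <= 1
            # Transitivity as written in Equation 1 (O(n^3) constraints);
            # the rewrite discussed next reduces these to O(n^2).
            for k in range(n):
                if k != i and k != j:
                    prob += X[i, j] + X[j, k] - X[i, k] <= 1

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [(i, j) for (i, j), x in X.items() if x.value() == 1]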
For n schemas, there are O(n³) transitivity constraints, which make the ILP very inefficient. Berant et al. (2011) proposed two approximations to handle a large number of transitivity rules, by decomposing the ILP or solving it in an incremental way. Instead, we rewrite the ILP rules in such a way that we can efficiently solve our mapping problem without introducing any approximations. The last two constraints of this ILP can be rewritten as follows:

( Σ_j Xij ≤ 1, ∀i   AND   Xij + Xjk − Xik ≤ 1, ∀⟨i,j,k⟩ )
    ⟹   if Xij = 1 then Xjk = 0, ∀k

This results in O(n²) constraints and makes the ILP efficient. The impact of this technique in terms of runtime is described in Section 5.3.

We then use an off-the-shelf ILP optimization engine called SCPSolver (Planatscher and Schober, 2015) to solve the ILP problems. The output of our ILP model is the set of schema mapping rules. We then apply these rules to the KB tuples to generate additional, general tuples. Some examples of the learned rules are:

(Organism, have, Phenomenon) → (Organism, undergo, Phenomenon)
(Animal, have, Event) → (Animal, experience, Event)
(Bird, occupy, Location) → (Bird, inhabit, Location)

5 Evaluation

5.1 KB Comprehensiveness

Our overall goal is a high-precision KB that has reasonably "comprehensive" coverage of facts in the target domain, on the grounds that these are the facts that a domain application is likely to query about. This notion of KB comprehensiveness is an important but under-discussed aspect of knowledge bases. For example, in the automatic KB construction literature, while a KB's size is often reported, this does not reveal whether the KB is near-complete or merely a drop in the ocean of that required (Razniewski et al., 2016; Stanovsky and Dagan, 2016). More formally, we define comprehensiveness as: recall, at high (> 80%) precision, of domain-relevant facts. This measure is similar to recall at the point P=80% on the PR curve, except recall is measured with respect to a different distribution of facts (namely facts about elementary science) rather than a held-out sample of data used to build the KB. The particular target precision value is not critical; what is important is that the same precision point is used when comparing results. We choose 80% as subjectively reasonable: at least 4 out of 5 queries to the KB should be answered correctly.

Table 3: Precision and coverage of tuple-expressible elementary science knowledge by existing resources vs. our KB. Precision estimates are within +/-3% with a 95% confidence interval.
KB | Precision | Coverage of tuple-expressible science knowledge (recall on science KB) | KB comprehensiveness w.r.t. science domain (science recall @80% precision)
WebChild | 89% | 3.4% | 3.4%
NELL | 85% | 0.1% | 0.1%
ConceptNet | 40% | 8.4% | n/a (p<80%)
ReVerb-15M | 55% | 11.5% | n/a (p<80%)
Our KB | 81% | 23.2% | 23.2%

There are several ways this target distribution of required facts can be modeled.
To fully realize the ambition of this metric, we would directly identify a sample of required end-task facts, e.g., by manual analysis of questions posed to the end-task system, or from logs of the interaction between the end-task system and the KB. However, given the practical challenges of doing this at scale, we take a simpler approach and approximate this end-task distribution using facts extracted from an (independent) domain-specific text corpus (we call this a reference corpus). Note that these facts are only a sample of domain-relevant facts, not the entirety. Otherwise, we could simply run our extractor over the reference corpus and have all we need. Now we are in a strong position, because the reference corpus gives us a fixed point of reference to measure comprehensiveness: we can sample facts from it and measure what fraction the KB "knows", i.e., can answer as true (Figure 4).

Figure 4: Comprehensiveness (frequency-weighted coverage C of the required facts D) can be estimated using coverage A of a reference KB B as a surrogate sampling of the target distribution.

For our specific task of elementary science QA, we have assembled a reference corpus6 of ~1.2M sentences comprising multiple elementary science textbooks, multiple dictionary definitions of all fourth grade vocabulary words, and simple Wikipedia pages for all fourth grade vocabulary words (where such pages exist). To measure our KB's comprehensiveness (of facts within the expressive power of our KB), we randomly sampled 4147 facts, expressed as headword tuples, from the reference corpus. These were generated semi-automatically using parts of our pipeline, namely information extraction followed by Turker scoring to obtain true facts7. We call these facts the Reference KB8. To the extent our tuple KB contains facts in this Reference KB (and under the simplifying assumption that these facts are representative of the science knowledge our QA application needs), we say our tuple KB is comprehensive. Doing this yields a value of 23% comprehensiveness for our KB (Table 3).

6 This corpus, named the "Aristo MINI Corpus", is available for download at http://allenai.org/data/aristo-tuple-kb
7 This method will of course miss many facts in the reference corpus, e.g., when extraction fails or when the fact is in a non-sentential form, e.g., a table. However, we only assume that the distribution of extracted facts is representative of the domain.
8 These 4147 test facts are published with the dataset at http://allenai.org/data/aristo-tuple-kb

We also measured the precision and science coverage of other, existing fact KBs. For precision, we took a random sample of 1000 facts in each KB, and followed the same methodology as earlier so that the comparison is valid: Turkers label each fact as true or false/nonsense, each fact is labeled 3 times, and the majority label is the overall label. The precisions are shown in Table 3. For ConceptNet, we used only the subset of facts with frequency > 1, as frequency=1 facts are particularly noisy (thus the precision of the full ConceptNet would be lower). We also computed the science coverage (= comprehensiveness, if p>80%) using our reference KB. Note that these other KBs were not designed with elementary science in mind and so, not surprisingly, they do not cover many of the relations in our domain.
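Concretely, the comprehensiveness numbers reported here reduce to a simple recall computation over the sampled reference facts. A minimal sketch follows, with matching simplified to exact headword-tuple lookup; in the actual evaluation the relation and argument matching is more permissive, as described next.

def comprehensiveness(kb_tuples, reference_facts):
    """Fraction of sampled reference facts that the KB 'knows', i.e., can
    answer as true. Both inputs are collections of
    (subject_headword, relation, object_headword) triples."""
    kb = set(kb_tuples)
    known = sum(1 for fact in reference_facts if fact in kb)
    return known / len(reference_facts)

# E.g., evaluated over the 4147 sampled reference facts, a KB thresholded to
# stay above 80% precision yields a value of roughly 0.23 (Table 3).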
To make the comparison as fair as possible, given that these other KBs use different relational vocabularies, we first constructed a list of 20 very general relations (similar to the ConceptNet relations, e.g., causes, uses, part-of, requires), and then mapped the relations used in both our reference facts and in the other KBs to these 20 relations. To decide whether a reference fact is in one of these other KBs, only the general relations need to match, and only the subject and object headwords need to match. This permits substantial linguistic variation during evaluation (e.g., "contain", "comprise", "part of", etc. would all be considered matching). In other words, this is a generous notion of "a KB containing a fact", in order to be as fair as possible.

As Table 3 illustrates, these other KBs cover very little of the target science knowledge. In the case of WebChild and NELL, the primary reason for low recall is low overlap between their target and ours. NELL has almost no predicate overlap with our Reference KB, reflecting its Named Entity centric content. WebChild is rich in part-of and location information, and covers 60% of the part-of and location facts in our Reference KB. However, these are only 4.5% of all the facts in the Reference KB, resulting in an overall recall (and comprehensiveness) of 3%. In contrast, ConceptNet and ReVerb-15M have substantially more relational overlap with our Reference KB, hence their recall numbers are higher. However, both have lower precision, limiting their utility.

This evaluation demonstrates the limited science coverage of existing resources, and the degree to which we have overcome this limitation. The extraction methods used to build these resources are not directly comparable, since they start with different input/output settings and involve significantly different degrees of supervision. Rather, the results suggest that general-purpose KBs (e.g., NELL) may have limited coverage for specific domains, and that our domain-targeted extraction pipeline can significantly alleviate this in terms of precision and coverage when that domain is known.

5.2 Performance of the Extraction Pipeline

In addition, we measured the average precision of facts present in the KB after every stage of the pipeline (Table 4). We can see that the pipeline takes as input 7.5M OpenIE tuples with a precision of 54% and produces a good quality science KB of over 340K facts with 80.6% precision, organized into 15K schemas. The table also shows that precision is largely preserved as we introduce phrasal triples and general tuples.

Table 4: Evaluation of the KB at different stages of extraction. Precision estimates are within +/-3% with a 95% confidence interval.
Extraction stage | #schemas | #tuples | Avg. output precision (%)
2. Tuple generation | - | 7.5M | 54.2
3. Headword tuples | 29.3K | 462K | 68.0
4. Tuple scoring | 15.8K | 156K | 87.2
5. Phrasal tuples | 15.8K | 286K | 86.5
6. Canonical schemas | 15.8K | 340K | 80.6

5.3 Evaluation of Canonical Schema Induction

In this section we focus on the usefulness and correctness of our canonical schema induction method. The parameters of the ILP model (see Equation 1), i.e., λ1 . . . λ5 and δ, are tuned based on the sample accuracy of the individual feature sources, using a small schema mapping problem with schemas applicable to the vocabulary types Animal and Body-Part.
The tuned values are λ1 = 0.7, λ2 = 0.9, λ3 = 0.3, λ4 = 0.1, λ5 = 0.2, and δ = 0.7.

Further, with O(n³) transitivity constraints we could not solve even a single ILP problem with 100 schemas within a time limit of 1 hour, whereas when we rewrite them with O(n²) constraints, as explained in Section 4.3, we could solve 443 ILP sub-problems within 6 minutes, with an average runtime per ILP of 800 msec.

As discussed in Section 2, we not only cluster the (typed) relations, but also identify a canonical relation that all the other relations in a cluster can be mapped to, without recourse to human-annotated training data or a target relational vocabulary. Although no existing methods do this directly, the AMIE-based schema clustering method of (Galárraga et al., 2014) can be extended to do this by incorporating the association rules learned by AMIE (both typed and untyped) inside our ILP framework to output schema mapping rules. We call this extension AMIE*, and use it as a baseline to compare the performance of CASI against.

5.3.1 Canonical Schema Usefulness

The purpose of canonicalization is to allow equivalence between seemingly different schemas to be recognized. For example, the KB query ("polar bear", "reside in", "tundra")?9 can be answered by a KB triple ("polar bear", "inhabit", "tundra") if schema mapping rules map one or both to the same canonical form, e.g., ("polar bear", "live in", "tundra"), using the rules:

(Animal, inhabit, Location) → (Animal, live in, Location)
(Animal, reside in, Location) → (Animal, live in, Location)

9 e.g., posed by a QA system trying to answer the question "Which is the likely location for a polar bear to reside in? (A) Tundra (B) Desert (C) Grassland"

One way to quantitatively evaluate this is to measure the impact of schema mapping on the comprehensiveness metric. Table 5 shows that, before applying any canonical schema induction method, the comprehensiveness score of our KB was 20%. The AMIE* method improves this score to 20.9%, whereas our method achieves a comprehensiveness of 23.2%. This latter improvement over the original KB is statistically significant at the 99% confidence level (the sample size is the 4147 facts sampled from the reference corpus).

Table 5: Use of the CASI-induced schemas significantly (at the 99% confidence level) improves the comprehensiveness of the KB.
Canonical schema induction method | Comprehensiveness
None | 20.0%
AMIE* | 20.9%
CASI (our method) | 23.2%

5.3.2 Canonical Schema Correctness

A second metric of interest is the correctness of the schema mapping rules (just because comprehensiveness improves does not imply every mapping rule is correct). We evaluate the correctness of the schema mapping rules using the following metric:

Precision of schema mapping rules: We asked Turkers to directly assess whether particular schema mapping rules were correct, for a random sample of rules. To make the task clear, Turkers were shown the schema mapping rule (expressed in English) along with an example fact that was rewritten using that rule (to give a concrete example of its use), and they were asked to select one option, "correct", "incorrect", or "unsure", for each rewrite rule. We asked this question to three different Turkers and considered the majority vote as the final evaluation10.

10 We discarded the unsure votes. For more than 95% of the rules, at least 2 out of 3 Turkers reached clear consensus on whether the rule is correct vs. incorrect, indicating that the Turker task was clearly defined.

The comparison results are shown in Table 6. Starting with 15.8K schemas, AMIE* canonicalized only 822 of those into 102 canonical schemas (using 822 schema mapping rules). In contrast, our method CASI canonicalized 4.2K schemas into 2.5K canonical schemas. We randomly sampled 500 schema mapping rules generated by each method and asked Turkers to evaluate their correctness, as described earlier. As shown in Table 6, the precision of the rules produced by CASI is 68%, compared with 59% for AMIE*. Thus CASI could canonicalize five times more schemas with 9% higher precision.

Table 6: CASI canonicalizes five times more schemas than AMIE*, and also achieves a small (9%) increase in precision, demonstrating how additional knowledge resources can help the canonicalization process (Section 4.2). Precision estimates are within +/-4% with a 95% confidence interval.
Canonical schema induction method | #input schemas | #schema mapping rules | #induced canonical schemas | Precision of schema mapping rules
AMIE* | 15.8K | 822 | 102 | 59%
CASI (our method) | 15.8K | 4.2K | 2.5K | 68%

5.4 Discussion and Future Work

Next, we identify some of the limitations of our approach and directions for future work.

1. Extracting Richer Representations of Knowledge: While triples can capture certain kinds of knowledge, there are other kinds of information, e.g., detailed descriptions of events or processes, that cannot be easily represented by a set of independent tuples. An extension of this work would be to extract event frames, capable of representing a richer set of roles in a wider context than a triple fact. For example, in the news domain, when representing an event such as "public shooting", one would like to store the shooter, victims, weapon used, date, time, location, and so on. Building high-precision extraction techniques that can go beyond binary relations towards event frames is a potential direction of future research.

2. Richer KB Organization: Our approach organizes entities and relations into flat entity types and schema clusters. An immediate direction for extending this work could be a better KB organization with deep semantic hierarchies for predicates and arguments, allowing inheritance of knowledge among entities and triples.

3. Improving comprehensiveness beyond 23%: Our comprehensiveness score is currently at 23%, indicating that 77% of potentially useful science facts are still missing from our KB. There are multiple ways to improve this coverage, including but not limited to: 1) processing more science corpora through our extraction pipeline, 2) running standard KB completion methods on our KB to add the facts that are likely to be true given the existing facts, and 3) improving our canonical schema induction method further, to avoid cases where the query fact is present in our KB but with a slight linguistic variation.

4. Quantification Sharpening: Similar to other KBs, our tuples have the semantics of plausibility: if the fact is generally true for some of the arg1s, then score it as true. Although frequency filtering typically removes facts that are rarely true for the arg1s, there is still variation in the quantifier strength of facts (i.e., does the fact hold for all, most, or some arg1s?) that can affect downstream inference. We are exploring methods for quantification sharpening, e.g., (Gordon and Schubert, 2010), to address this.

5. Can the pipeline be easily adapted to a new domain? Our proposed extraction pipeline expects high-quality vocabulary and type information as input.
In many domains, it is easy to import types from existing resources like WordNet or FreeBase. For other domains, such as medicine or law, it might require domain experts to encode this knowledge. However, we believe that manually encoding types is a much simpler task than manually defining all the schemas relevant for an individual domain. Further, various design choices, e.g., the precision vs. recall tradeoff of the final KB, the amount of expert input available, etc., will depend on the domain and end task requirements.

6 Conclusion

Our goal is to construct a domain-targeted, high precision knowledge base of (subject,predicate,object) triples to support an elementary science application. We have presented a scalable knowledge extraction pipeline that is able to extract a large number of facts targeted to a particular domain. The pipeline leverages Open IE, crowdsourcing, and a novel schema learning algorithm, and has produced a KB of 340,163 facts at 80.6% precision for elementary science QA. We have also introduced a metric of comprehensiveness for measuring KB coverage with respect to a particular domain. Applying this metric to our KB, we have achieved a comprehensiveness of over 23% of science facts within the KB's expressive power, substantially higher than the science coverage of other comparable resources. Most importantly, the pipeline offers for the first time a viable way of extracting large amounts of high-quality knowledge targeted to a specific domain. We have made the KB publicly available at http://data.allenai.org/tuple-kb.

Acknowledgments

We are grateful to Paul Allen whose long-term vision continues to inspire our scientific endeavors. We would also like to thank Peter Turney and Isaac Cowhey for their important contributions to this project.

References

S. Auer, C. Bizer, J. Lehmann, G. Kobilarov, R. Cyganiak, and Z. Ives. 2007. DBpedia: A nucleus for a web of open data. In ISWC/ASWC.

Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In IJCAI, volume 7, pages 2670–2676.

Jonathan Berant, Ido Dagan, and Jacob Goldberger. 2011. Global learning of typed entailment rules. In ACL.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In SIGMOD.

Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram version 1 LDC2006T13. Philadelphia: Linguistic Data Consortium.

Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R Hruschka Jr, and Tom M Mitchell. 2010. Toward an architecture for never-ending language learning. In AAAI, volume 5, page 3.

Bhavana Dalvi, Sumithra Bhakthavatsalam, Chris Clark, Peter Clark, Oren Etzioni, Anthony Fader, and Dirk Groeneveld. 2016. IKE - An Interactive Tool for Knowledge Extraction. In AKBC@NAACL-HLT.

Luciano Del Corro, Rainer Gemulla, and Gerhard Weikum. 2014. Werdy: Recognition and disambiguation of verbs and verb phrases with syntactic and semantic pruning. In 2014 Conference on Empirical Methods in Natural Language Processing, pages 374–385. ACL.

Karthik Dinakar, Birago Jones, Catherine Havasi, Henry Lieberman, and Rosalind Picard. 2012. Common sense reasoning for detection, prevention, and mitigation of cyberbullying. ACM Transactions on Interactive Intelligent Systems (TiiS), 2(3):18.
Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. 2014. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In KDD.

Alexander Faaborg and Henry Lieberman. 2006. A goal-oriented web browser. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 751–760. ACM.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1535–1545. Association for Computational Linguistics. ReVerb-15M available at http://openie.cs.washington.edu.

Christiane Fellbaum. 1998. WordNet. Wiley Online Library.

Luis Galárraga, Christina Teflioudi, Katja Hose, and Fabian M. Suchanek. 2013. AMIE: Association rule mining under incomplete evidence in ontological knowledge bases. In WWW.

Luis Galárraga, Geremy Heitz, Kevin Murphy, and Fabian M. Suchanek. 2014. Canonicalizing open knowledge bases. In CIKM.

Jonathan Gordon and Lenhart K Schubert. 2010. Quantificational sharpening of commonsense knowledge. In AAAI Fall Symposium: Commonsense Knowledge.

Adam Grycner and Gerhard Weikum. 2014. Harpy: Hypernyms and alignment of relational paraphrases. In COLING.

Adam Grycner, Gerhard Weikum, Jay Pujara, James R. Foulds, and Lise Getoor. 2015. RELLY: Inferring hypernym relationships between relational phrases. In EMNLP.

Dekang Lin and Patrick Pantel. 2001. DIRT - Discovery of inference rules from text. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 323–328. ACM.

Hugo Liu and Push Singh. 2004. ConceptNet: A practical commonsense reasoning tool-kit. BT Technology Journal, 22(4):211–226.

Mausam, Michael Schmitz, Stephen Soderland, Robert Bart, and Oren Etzioni. 2012. Open language learning for information extraction. In EMNLP.

Andrea Moro and Roberto Navigli. 2012. WiSeNet: Building a wikipedia-based semantic network with ontologized relations. In CIKM.

Patrick Pantel, Rahul Bhagat, Bonaventura Coppola, Timothy Chklovski, and Eduard H Hovy. 2007. ISP: Learning inferential selectional preferences. In HLT-NAACL, pages 564–571.

Hannes Planatscher and Michael Schober. 2015. SCPSolver. http://scpsolver.org.

Simon Razniewski, Fabian M Suchanek, and Werner Nutt. 2016. But what do we actually know? In Proc. AKBC'16.

Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In HLT-NAACL.

Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Y. Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In NIPS.

Stephen Soderland, John Gilmer, Robert Bart, Oren Etzioni, and Daniel S. Weld. 2013. Open Information Extraction to KBP Relations in 3 Hours. In TAC.

Robert Speer and Catherine Havasi. 2013. ConceptNet 5: A large semantic network for relational knowledge. In The People's Web Meets NLP, pages 161–176. Springer.

Gabriel Stanovsky and Ido Dagan. 2016. Creating a large benchmark for open information extraction. In EMNLP.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: A Core of Semantic Knowledge. In WWW.

Niket Tandon, Gerard de Melo, Fabian Suchanek, and Gerhard Weikum. 2014. WebChild: Harvesting and Organizing Commonsense Knowledge from the Web. In WSDM.

Peter D. Turney, Yair Neuman, Dan Assaf, and Yohai Cohen. 2011. Literal and metaphorical sense identification through concrete and abstract context. In EMNLP.

Travis Wolfe, Mark Dredze, James Mayfield, Paul McNamee, Craig Harman, Timothy W. Finin, and Benjamin Van Durme. 2015. Interactive knowledge base population. CoRR, abs/1506.00301.

Alexander Yates and Oren Etzioni. 2009. Unsupervised methods for determining object and relation synonyms on the web. Journal of Artificial Intelligence Research.