Inherent Disagreements in Human Textual Inferences

Ellie Pavlick, Brown University, ellie_pavlick@brown.edu
Tom Kwiatkowski, Google Research, tomkwiat@google.com

Abstract

We analyze humans' disagreements about the validity of natural language inferences. We show that, very often, disagreements are not dismissible as annotation ‘‘noise’’, but rather persist as we collect more ratings and as we vary the amount of context provided to raters. We further show that the type of uncertainty captured by current state-of-the-art models for natural language inference is not reflective of the type of uncertainty present in human disagreements. We discuss implications of our results in relation to the recognizing textual entailment (RTE)/natural language inference (NLI) task. We argue for a refined evaluation objective that requires models to explicitly capture the full distribution of plausible human judgments.

1 Introduction

Entailment is arguably one of the most fundamental of language understanding tasks, with Montague himself calling entailment ‘‘the basic aim of semantics’’ (Montague, 1970). Computational work on recognizing textual entailment (RTE) (also called natural language inference, or NLI) has a long history, ranging from early efforts to model logical phenomena (Cooper et al., 1996), to later statistical methods for modeling practical inferences needed for applications like information retrieval and extraction (Dagan et al., 2006), to current work on learning common sense human inferences from hundreds of thousands of examples (Bowman et al., 2015; Williams et al., 2018).

Broadly speaking, the goal of the NLI task is to train models to make the inferences that a human would make. Currently, ‘‘the inferences that a human would make’’ are determined by asking multiple human raters to label pairs of sentences, and then seeking some consensus among them: for example, having raters choose among discrete labels and taking a majority vote (Dagan et al., 2006; Bowman et al., 2015; Williams et al., 2018), or having raters use a continuous Likert scale and taking an average (Pavlick and Callison-Burch, 2016a; Zhang et al., 2017). That is, the prevailing assumption across annotation methods is that there is a single ‘‘true’’ inference about h given p that we should train models to predict, and that this label can be approximated by aggregating multiple (possibly noisy) human ratings, as is typical in many other labelling tasks (Snow et al., 2008; Callison-Burch and Dredze, 2010).

Often, however, we observe large disagreements among humans about whether or not h can be inferred from p (see Figure 1). The goal of this study is to establish whether such disagreements can safely be attributed to ‘‘noise’’ in the annotation process (resolvable via aggregation), or rather are a reproducible signal and thus should be treated as part of the NLI label assigned to the p/h pair. Specifically, our primary contributions are:

• We perform a large-scale study of humans' sentence-level inferences and measure the degree to which observed disagreements persist across samples of annotators.

• We show that current state-of-the-art NLI systems do not capture this disagreement by default (by virtue of treating NLI as probabilistic) and argue that NLI evaluation should explicitly incentivize models to predict distributions over human judgments.
• We discuss our results with respect to the definition of the NLI task, and its increased usage as a diagnostic task for evaluating ‘‘general purpose’’ representations of natural language.

© 2019 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license. Transactions of the Association for Computational Linguistics, vol. 7, pp. 677–694, 2019. https://doi.org/10.1162/tacl_a_00293. Action Editor: Christopher Potts. Submission batch: 5/2019; Published: 11/2019.

Figure 1: Example p/h pair on which humans exhibit strong disagreements about whether h can be inferred from p. Here, the disagreement appears to stem from the implicature, but we observe similar disagreements on a variety of linguistic phenomena.

2 The RTE/NLI Task

The task of RTE/NLI is fundamentally concerned with drawing conclusions about the world on the basis of limited information, but specifically in the setting when both the information and the conclusions are expressed in natural language. That is, given a proposition p, should one infer some other proposition h to be true?

Traditionally, in formal linguistics, the definition of entailment used is that defined in formal logic—namely, p entails h if h is true in every possible world in which p is true. This logical definition takes for granted that lexical and constructional meanings are fixed in such a way that it is possible to fully pre-specify and then repeatedly apply those meanings across all contexts. From the point of view of evaluating NLP systems' ability to reason about entailment, these are clearly difficult criteria to operationalize. Thus, within NLP, we have rarely if ever evaluated directly against this definition. Rather, work has been based on the following informal definition:

p entails h if, typically, a human reading p would infer that h is most likely true. . . [assuming] common human understanding of language [and] common background knowledge (Dagan et al., 2006).

This definition was intended to undergo refinement over time, with Dagan et al. (2006) explicitly stating that the definition was ‘‘clearly not mature yet’’ and should evolve in response to observed shortcomings, and, in fact, substantial discussion surrounded the original definition of the RTE task. In particular, Zaenen et al. (2005) argued that the definition needed to be made more precise, so as to circumscribe the extent to which ‘‘world knowledge’’ should be allowed to factor into inferences, and to explicitly differentiate between distinct forms of textual inference (e.g., entailment vs. conventional implicature vs. conversational implicature). Manning (2006) made a counter-argument, pushing back against a prescriptivist definition of what types of inferences are or are not licensed in a specific context, instead advocating that annotation tasks should be ‘‘natural’’ for untrained annotators, and that the role of NLP should be to model the inferences that humans make in practical settings (which include not just entailment, but also pragmatic inferences such as implicatures).
Both supported the use of the term ‘‘inference’’ over ‘‘entailment’’ to acknowledge the divergence between the working NLP task definition and the notion of entailment as used in formal semantics.[1] Since the task's introduction, there has been no formal consensus around which of the two approaches offers the better cost–benefit tradeoff: precise (at risk of being impractical), or organic (at risk of being ill-defined). That said, there has been a clear gravitation toward the latter, apparent in the widespread adoption of inference datasets that explicitly prioritize natural inferences over rigorous annotation guidelines (Bowman et al., 2015; Williams et al., 2018), and in the overall shift to the word ‘‘inference’’ over ‘‘entailment.’’ There has also been significant empirical evidence supporting the argument that humans' semantic inferences are uncertain and context-sensitive (Poesio and Artstein, 2005; Versley, 2008; Simons et al., 2010; Recasens et al., 2011; de Marneffe et al., 2012; Passonneau et al., 2012; Pavlick and Callison-Burch, 2016a,b; Tonhauser et al., 2018, among others), suggesting computational models would benefit from focusing on ‘‘speaker meaning’’ over ‘‘sentence meaning’’ when it comes to NLI (Manning, 2006; Westera and Boleda, 2019).

[1] We, too, adopt the word ‘‘inference’’ for this reason.

Thus, in this paper, we assume that NLP will maintain this hands-off approach to NLI, avoiding definitions of what inferences humans should make or which types of knowledge they should invoke. We take the position that, ultimately, our goal in NLP is to train models that reverse-engineer the inferences a human would make when hearing or reading language in the course of their daily lives, however ad-hoc the process that generates those inferences might be. Therefore, our question in this paper is not yet what process humans use to draw inferences from natural language, but merely: Left to their own devices, do humans, in general, tend to follow the same process?

Note that this question is independent of the decision of whether to treat annotations as discrete versus gradable. Even if NLI is treated as a gradable phenomenon (as we believe it should be), a world in which all humans share the same notion of uncertainty necessitates very different models, annotation practices, and modes of evaluation than a world in which people may disagree substantially in specific situations, use different heuristics, and/or have different preferences about how to resolve uncertainty. Specifically, current practices—in which we aggregate human judgments through majority vote/averaging and evaluate models on their ability to predict this aggregated label—are only appropriate if humans all tend to use the same process for resolving uncertainties in practice.

3 NLI Data and Annotation

To perform our analysis, we collect NLI judgments at 50× redundancy for sentence pairs drawn from a variety of existing NLI datasets. Our annotation procedure is described in detail in this section. All of the data and collected annotations are available at https://github.com/epavlick/NLI-variation-data.

3.1 Sentence Pairs

We draw our p/h pairs from the training sets of each of the following five datasets: RTE2 (Dagan et al., 2006), SNLI (Bowman et al., 2015), MNLI (Williams et al., 2018), JOCI (Zhang et al., 2017), and DNC (Poliak et al., 2018b).
Table 1 shows randomly sampled positive (p → h) and negative (p ↛ h) examples from each. These datasets differ substantially in the procedures used to generate the data, and in the types of inferences they attempt to test. RTE2 consists of premise/hypothesis pairs derived predominantly from the output of information retrieval systems run over newswire text and annotated by experts (researchers in the field). SNLI consists of premises derived from image captions with hypotheses written and judged by non-expert (crowdsourced) annotators. MNLI was constructed in the same way as SNLI but contains premises drawn from a range of text genres, including letters, fiction, and telephone conversations. JOCI is intended to target ‘‘common sense’’ inferences, and contains premises drawn from existing NLI datasets[2] paired with hypotheses that were automatically generated via either templates or seq2seq models and then refined by humans. The DNC consists predominantly of naturally occurring premises paired with template-generated hypotheses, and comprises a number of sub-corpora aimed at testing systems' understanding of specific linguistic phenomena (e.g., lexical semantics, factuality, named entity recognition). We draw from this variety of datasets in order to ensure a diversity of types of textual inference and to mitigate the risk that the disagreements we observe are driven by a specific linguistic phenomenon or dataset artifact on which humans' interpretations particularly differ.

[2] We skip the subset of JOCI that was drawn from SNLI, to avoid redundancy with our own SNLI sample.

We sample 100 p/h pairs from each dataset. In every dataset, we limit to pairs in which the premise and the hypothesis are both less than or equal to 20 words, to minimize cognitive load during annotation. We attempt to stratify across expected labels to ensure an interesting balance of inference types. For RTE2, SNLI, and MNLI, this means stratifying across three categories (ENTAILMENT/CONTRADICTION/NEUTRAL). For JOCI, the p/h pairs are labeled on a five-point Likert scale, where 1 denotes that h is ‘‘impossible’’ given p and 5 denotes that h is ‘‘very likely’’ given p, and thus we stratify across these five classes. In the DNC, all sub-corpora consist of binary labels (ENTAILMENT/NON-ENTAILMENT) but some sub-corpora contain finer-grained labels than others (e.g., three-way or five-way labels). Thus, when sampling, we first stratify across sub-corpora[3] and then across the most fine-grained label type available for the given sub-corpus.

[3] We skip two sub-corpora (VerbCorner and Puns), the former because it contains nonce words and thus is difficult to ask humans to label without some training, and the latter because of the potential for noisy labels due to the fact that some people, bless their hearts, just don't appreciate puns.

Table 1: Examples of p/h pairs from each of our source datasets. For each dataset, the first pair is one labeled by the original dataset as a valid inference (one that should be drawn), the second as an invalid inference (either h is contradictory given p (p → ¬h), or h simply cannot be inferred (p ↛ h)). For DNC, examples shown are from the VerbNet and Sentiment sub-corpora.

SNLI: Three dogs on a sidewalk. → There are more than one dog here.
SNLI: A red rally car taking a slippery turn in a race. → ¬ The car is stopped at a traffic light.
MNLI: Historical heritage is very much the theme at Ichidani. → Ichidani's historical heritage is important.
MNLI: okay i uh i have five children all together → ¬ I do not have any children.
RTE2: Self-sufficiency has been turned into a formal public awareness campaign in San Francisco, by Mayor Gavin Newsom. → Gavin Newsom is a politician of San Francisco.
RTE2: The unconfirmed case concerns a rabies-like virus known only in bats → ¬ A case of rabies was confirmed.
JOCI: It was Charlie's first day of work at the new firm → The firm is a business.
JOCI: A young girl is holding her teddy bear while riding a pony. → ¬ The bear attacks.
DNC: Tony bent the rod. → Tony caused the bending.
DNC: When asked about the restaurant, Jonah said, ‘Sauce was tasteless.’ ↛ Jonah liked the restaurant.

3.2 Annotation

We show each p/h pair to 50 independent raters on Amazon Mechanical Turk. We ask them to indicate, using a sliding bar which ranges from −50 to 50,[4] how likely it is that h is true given that p is true, where −50 means that h is definitely not true (p → ¬h), 50 means that h is definitely true (p → h), and 0 means that h is consistent with but not necessarily true given p (p ↛ h). Raters also have the option to indicate with a checkbox that either/both of the sentences do not make sense and thus no judgment can be made. We attempt to pitch the task intuitively and keep the instructions light, for reasons discussed in Section 2. We provide brief instructions followed by a few examples to situate the task. Our exact instructions and examples are shown in Table 2. Raters label pairs in batches of 20, meaning we have a minimum of 20 ratings per rater. We pay $0.30 per set of 20. We restrict to raters who have a 98% or better approval rating with at least 100 HITs approved, and who are located in a country in which English is the native language (US, Canada, UK, Australia, New Zealand).

[4] Raters do not see specific numbers on the slider.

Table 2: Instructions and examples shown to raters. Raters indicated their responses using a sliding bar which ranged from −50 to 50. In the instructions actually shown, the examples were shown alongside a sliding bar reflecting the desired rating. Exact UI not shown for compactness.

For each pair of sentences, assume that the first sentence (S1) is true, describes a real scenario, or expresses an opinion. Using your best judgment, indicate how likely it is that the second sentence (S2) is also true, describes the same scenario, or expresses the same opinion. If either sentence is not interpretable, check the ‘‘Does Not Make Sense’’ box. Several examples are given below.

Example 1: In the below example, the slider is far to the right because we can be very confident that if a person is ‘‘on a beach’’ then that person is ‘‘outside’’.
S1: A woman is on a beach with her feet in the water.
S2: The woman is outside.

Example 2: In the below example, the slider is far to the left because we can be very confident that if a person is ‘‘on a beach’’ then that person is NOT ‘‘in her living room’’.
S1: A woman is on a beach with her feet in the water.
S2: The woman is in her living room.
Example 3: In the below example, the slider is in the center because knowing that the woman is on the beach does not give us any information about the color of her hair, and so we cannot reasonably make a judgment about whether or not her hair is brown.
S1: A woman is on a beach with her feet in the water.
S2: The woman has brown hair.

3.3 Preprocessing

Filtering. In total, we had 509 workers complete our HITs, with an average of 2.5 tasks (50 sentence pairs) per worker. We follow the methods from White et al. (2018) and remove workers who demonstrate consistently low correlations with others' judgments. Specifically, for each sentence pair s, for each worker wi, we compute the Spearman correlation between wi's labels and every other wj who labeled s. Across all pairs of workers, the mean correlation is 0.48. We consider a pair of workers on a given assignment to be an outlier if the correlation between those workers' ratings falls outside 1.5 times the interquartile range of all the correlations (White et al., 2018). We find 234 pairs to be outliers, and that they can be attributed to 14 individual workers. We therefore remove all annotations from these 14 workers from our analysis. Additionally, we remove ratings from 37 workers[5] who have fewer than 15 useable data points (i.e., judgments not including cases in which they choose the ‘‘does not make sense’’ option), as this would prevent us from properly estimating and thus correcting for their individual annotation bias (described in the following section). Finally, we remove p/h pairs that, after removing all problematic workers and ‘‘does not make sense’’ judgments, are left with fewer than 15 judgments. In the end, we have 496 p/h pairs with a mean of 39 labels per pair.

[5] Results presented throughout are based on data with these workers removed. However, rerunning analysis with these workers included did not affect our overall takeaways.

Normalization. One confound that results from collecting annotations on a continuous scale is that each rater may choose to use the scale differently. Thus, we apply z-score normalization to each worker's labels for each assignment, meaning each worker's ratings are rescaled such that the mean across all labels from a single worker within a single batch is 0 and the standard deviation is 1. This normalization is not perfect, as every batch has a slightly different set of pairs, and so normalized scores are not comparable across batches. For example, if, by chance, a batch were to contain mostly pairs for which the ‘‘true’’ label was p → h, a score of zero would imply p → h, whereas if a batch were to include mostly pairs for which the ‘‘true’’ label was p → ¬h, zero would correspond to p → ¬h. However, for the purposes of our analysis, this is not problematic; because our interest is comparing disagreements between annotations on each specific p/h pair, it is only important that two workers' labels on the same pair are comparable, not that judgments across pairs are comparable.[6]

[6] On our own manual inspection, it is nearly always the case that the mean (0) is roughly interpretable as neutral, with only moderate deviations from one example to the next. Nonetheless, when interpreting the figures in the following sections, note that the center of one pair's distribution is not necessarily comparable to the center of another's.
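For concreteness, the sketch below illustrates the filtering and normalization steps just described. It is not the authors' released code: the long-format table layout and the column and function names (worker, batch, item, score) are assumptions made for illustration only.

```python
import pandas as pd
from scipy.stats import spearmanr

# Illustrative sketch of the Section 3.3 preprocessing (not the authors' code).
# `ratings` is assumed to be a long-format table with one row per judgment:
# worker id, batch (HIT) id, item id, and raw slider score in [-50, 50].

def pairwise_worker_correlations(ratings: pd.DataFrame) -> pd.DataFrame:
    """Spearman correlation between every pair of workers who share a batch."""
    rows = []
    for batch, group in ratings.groupby("batch"):
        pivot = group.pivot_table(index="item", columns="worker", values="score")
        workers = list(pivot.columns)
        for i, w1 in enumerate(workers):
            for w2 in workers[i + 1:]:
                shared = pivot[[w1, w2]].dropna()
                if len(shared) >= 2:
                    rho, _ = spearmanr(shared[w1], shared[w2])
                    rows.append({"batch": batch, "w1": w1, "w2": w2, "rho": rho})
    return pd.DataFrame(rows)

def outlier_pairs(corrs: pd.DataFrame) -> pd.DataFrame:
    """Worker pairs whose correlation falls outside 1.5x the IQR of all correlations."""
    q1, q3 = corrs["rho"].quantile([0.25, 0.75])
    iqr = q3 - q1
    return corrs[(corrs["rho"] < q1 - 1.5 * iqr) | (corrs["rho"] > q3 + 1.5 * iqr)]

def z_normalize(ratings: pd.DataFrame) -> pd.DataFrame:
    """Z-score each worker's raw scores within each batch (mean 0, std 1)."""
    out = ratings.copy()
    out["z"] = out.groupby(["worker", "batch"])["score"].transform(
        lambda s: (s - s.mean()) / s.std(ddof=0)
    )
    return out
```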
4 Analysis of Human Judgments

4.1 Experimental Design

We aim to establish whether the disagreements observed between humans' NLI judgments can be attributed to ‘‘noise’’ in the annotation process. We make the assumption that, if the disagreements are attributable to noise, then the observed human judgments can be modeled as a simple Gaussian distribution, where the mean is the true label. This model can account for the fact that some cases might be inherently harder than others—this could, for example, be reflected by higher variance—but, overall, the labels are nonetheless in accordance with the assumption that there exists a fundamentally ‘‘true’’ label for each p/h pair which we can faithfully represent via a single label or value, obtainable via aggregation.

For each sentence pair, we randomly split the collected human labels into train and test. Specifically, we hold out 10 labels from each pair to use as our test set. The training data are composed of the remaining labels, which vary in number from 5 to 40, depending on how many labels were left for that pair after preprocessing (see Section 3.3). The average number of training labels is 29. For each sentence pair, we use the training data to fit two models: 1) a single Gaussian and 2) a Gaussian Mixture Model (GMM) where the number of components is chosen during training,[7] meaning that the model may still choose to fit only one component if appropriate. We compute the log likelihood assigned to the held-out test data under each model, and observe how often, and to what extent, the additional components permitted by the GMM yield a better fit for the held-out judgments.

[7] We use the Variational Bayesian estimation of a Gaussian mixture provided in scikit-learn, with the maximum number of components set to be the number of points in the training data: https://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html

If the mixture model frequently chooses to use more than one effective component, and if doing so results in a better fit for the held-out data than the unimodal Gaussian, we interpret this as evidence that, for many sentence pairs, human judgments exhibit reproducibly multimodal distributions. Thus, for such sentence pairs, the current practice of aggregating human judgments into a single label would fail to accurately capture the types of semantic inferences that humans might make about the given p/h pair.
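As a concrete illustration of this per-pair comparison, the sketch below fits both models to the judgments for a single p/h pair using the scikit-learn estimator named in footnote 7 and compares average held-out log likelihoods. The judgment values are invented for illustration; this is a minimal sketch, not the experimental code.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture, GaussianMixture

# Hypothetical z-normalized judgments for one p/h pair, split into train/test.
train = np.array([-1.2, -1.1, -1.0, -0.9, -0.8, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 0.1])
test = np.array([-1.0, 0.9, 1.0])
X_train, X_test = train.reshape(-1, 1), test.reshape(-1, 1)

# Single Gaussian baseline: a one-component mixture.
single = GaussianMixture(n_components=1).fit(X_train)

# Variational Bayesian GMM with up to len(train) components (as in footnote 7);
# weights of unneeded components are driven toward zero, so the model can still
# behave like a single Gaussian when that fits the judgments best.
mixture = BayesianGaussianMixture(n_components=len(train), max_iter=500).fit(X_train)

# Average held-out log likelihood per judgment under each model. A positive
# difference means the extra components generalize to unseen raters.
ll_single = single.score(X_test)
ll_mixture = mixture.score(X_test)
print(f"single: {ll_single:.3f}  mixture: {ll_mixture:.3f}  delta: {ll_mixture - ll_single:.3f}")
```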
4.2 Results

Figure 2: Log likelihood assigned to test data under the single-component Gaussian (x-axis) vs the k-component GMM (y-axis). Results show an average over 10 random train/test splits; error bars not shown to reduce clutter. Overall, multimodal distributions generalize better to unseen human judgments than do single Gaussians.

Are distributions unimodal? Figure 2 shows, for each sentence pair, the test log likelihood under the one-component Gaussian model versus the k-component GMM. If the data were in fact sampled from an underlying distribution defined by a single Gaussian, we would expect the points to be distributed approximately randomly around the y = x line. That is, most of the time the GMM would provide no advantage over the single Gaussian. What we see instead is that the majority of points fall on or above the y = x line, indicating that, when there is a difference, the additional components deemed necessary in training tend to generalize to unseen human judgments. Very few points fall below the y = x line, indicating that when models choose to fit multiple components, they are correctly modeling the true data distribution, rather than overfitting the training set. We note that the majority of points fall on y = x, indicating that most examples do exhibit consensus around one ‘‘true’’ label.[8]

[8] We verified that, if forced to fit more than one component, the model often overfits, confirming that these examples are indeed best modeled as unimodal distributions.

Figure 3: Weights of effective components for each p/h pair. The y-axis corresponds to the pairs in our data, sorted by weight of the second component. The figure should be interpreted as follows: When the line is all blue (pair #400), the GMM found a single component with a weight of 1. When the line contains mixed colors, the model found multiple components with the depicted weights (e.g., pair #0 has two components of equal weight).

Figure 3 shows, for each sentence pair, the weights of the effective components according to the GMM. We see that for 20% of the sentence pairs, there is a nontrivial second component (weight > 0.2), but rarely are there more than two components with significant weights.

Figure 4 shows several examples of sentences for which the annotations exhibit clear bimodal distributions. These examples show the range of linguistic phenomena[9] that can give rise to uncertainty. In the first example, from SNLI, there appears to be disagreement about the degree to which two different descriptions could potentially refer to the same scenario. In the second example, from DNC and derived from VerbNet (Chklovski and Pantel, 2004), there is disagreement about the manner aspect of ‘‘swat’’, that is, whether or not ‘‘swatting’’ is necessarily ‘‘forceful’’. In the third example, from DNC and derived from the MegaVerdicality dataset (White and Rawlins, 2017), there appears to be disagreement about the degree to which ‘‘confess that’’ should be treated as factive.

[9] By corpus, RTE exhibits the least variation and JOCI exhibits the most, though all of the corpora are comparable. We did not see particularly interesting trends when we broke down the analysis by corpus explicitly, so, for brevity, we omit the finer-grained analysis.

Figure 4: Examples of sentence pairs with bi-modal human judgment distributions. Examples are drawn from SNLI, the VerbNet portion of DNC, and the MegaVerdicality portion of DNC (from left to right). Training distribution is in blue; test in orange. Dotted black line shows the model fit when using a single component; shaded gray shows the model learned when allowed to fit k components. Distributions are over z-normalized scores in which 0 roughly corresponds to neutral (p ↛ h) but not precisely (§3.3).

These examples highlight legitimate disagreements in semantic interpretations, which can be difficult to control without taking a highly prescriptivist approach to annotation. Doing so,
however, would compromise both the ‘‘naturalness’’ of the task for annotators and the empiricist approach to representation learning currently desired in NLP (as discussed in Section 2).

Does context reduce disagreement? One fair objection to these results is that sentence-level inferences are problematic due to the lack of context provided. It is reasonable to believe that the divergences in judgments stem from the fact that, when details of the context are left unspecified, different raters choose to fill in these details differently. This would inevitably lead to different inferences, but would not be reflective of differences in humans' representations of linguistic ‘‘meaning’’ as it pertains to NLI. We thus explore whether providing additional context will yield less-divergent human judgments. To do this, we construct a small dataset in which we can collect annotations with varying levels of context, as described next.

Method. We sample sentences from Wikipedia, restricting to sentences that are at least four words long and contain a subject and a verb. We consider each of these sentences to be a candidate premise (p), and generate a corresponding hypothesis (h) by replacing a word w1 from p with a substitute w2, where w2 has a known lexical semantic relationship to w1. Specifically, we use a set of 300 word pairs: 100 hypernym/hyponym pairs, 100 antonym pairs, and 100 co-hyponym pairs. We chose these categories in order to ensure that our analysis consists of meaningful substitutions and that it covers a variety of types of inference judgments. Our hypernyms and antonyms are taken from WordNet (Fellbaum, 1998), with hypernyms limited to first-sense immediate hypernyms. Our co-hyponyms are taken from an internal database, which we constructed by running Hearst patterns (Hearst, 1992) over a large text corpus. The 300 word pairs we used are available for inspection at https://github.com/epavlick/NLI-variation-data.
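The kind of WordNet lookup involved in the hypernym and antonym substitutions above can be sketched with NLTK as follows. This is only an illustration under our own assumptions, not the pipeline used to build the released word-pair list, and the co-hyponym pairs (drawn from an internal Hearst-pattern database) are not covered here.

```python
from nltk.corpus import wordnet as wn  # requires the WordNet data to be installed

def first_sense_hypernym(word: str):
    """Immediate hypernym of the first (noun) sense of `word`, if WordNet has one."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets or not synsets[0].hypernyms():
        return None
    return synsets[0].hypernyms()[0].lemmas()[0].name().replace("_", " ")

def first_antonym(word: str):
    """An antonym of `word` from any of its WordNet senses, if one is listed."""
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            if lemma.antonyms():
                return lemma.antonyms()[0].name().replace("_", " ")
    return None

# Hypothetical usage: substitute w1 -> w2 in a premise to form a hypothesis.
print(first_sense_hypernym("dog"))  # e.g., "canine"
print(first_antonym("wet"))         # e.g., "dry"
```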
After making the substitution, we score each candidate p and h with a language model (Józefowicz et al., 2016) and disregard pairs for which the perplexity of h is more than 5 points above that of p. This threshold was chosen based on manual inspection of a sample of the output, and is effective at removing sentences in which the substitution yielded a meaningless hypothesis—for example, by replacing a w1 that was part of a multiword expression.

For each resulting p/h pair, we collect ratings at three levels: word level, in which p and h are each a single word; sentence level, in which p and h are each a sentence; and paragraph level, in which p is a full paragraph and h is a sentence (as depicted in Figure 6). We use the same annotation design as described in Section 3.2. To quantify the level of disagreement in the observed judgments, we compute two measures: 1) variance of observed ratings and 2) Δ log likelihood, that is, the change in log likelihood of held-out data that results from using a k-component GMM over a single-component Gaussian (as described in the previous section). We note that Δ log likelihood is a more direct measure of the type of disagreement in which we are interested in this paper (i.e., disagreements stemming from multimodal distributions of judgments that are not well summarized by a single label/value). High-variance distributions may correspond to ‘‘difficult’’ cases which are nonetheless still unimodal.

Figure 5: Distributions of variances (top) and Δ log likelihood (bottom) for human ratings resulting from word, sentence, and paragraph contexts. The average variances of all levels are significantly different at p < 0.05 (word < sentence < paragraph). Average ΔLL for words was significantly lower than for sentences and paragraphs, but there is no significant difference between sentences and paragraphs.

Results. Figure 5 shows the distribution of each metric as a function of the level of context given to raters. The trend is counter to our initial intuition: Both measures of disagreement actually increase when raters see more context. On average, we see a variance of 0.34 ± 0.02 when raters are shown only words, 0.41 ± 0.02 when raters are shown sentences, and 0.56 ± 0.02 when raters are given a full paragraph of context (95% confidence intervals). The trend for Δ log likelihood is similar: Disagreement at the word level (0.11 ± 0.02) is significantly lower than at the sentence (0.21 ± 0.04) and paragraph (0.22 ± 0.03) level, though there is no significant difference in Δ log likelihood between sentence-level and paragraph-level context.

Figure 6: In the word case, human judges were shown only the words (bolded); in the sentence case, judges were shown pairs of sentences (gray highlight); in the paragraph case, judges were shown all of the text. Judges did not see markup (bold/highlight) when presented the text to judge. Gray bars show the distribution of z-normalized scores, ticks show raw (unnormalized) scores, and bell curves are estimated by the GMM.

Figure 6 shows an example p/h pair for which additional context increased the variance among annotators. In the example shown, humans are generally in agreement that ‘‘boating’’ may or may not imply ‘‘picnicking’’ when no additional context is given. However, when information is provided which focuses on boating on a specific canal, emphasizing the activities that the water itself is used for, people diverge in their inference judgments, with one group centered around contradiction and a smaller group centered around neutral.

We interpret these results as preliminary evidence that disagreement is not necessarily controllable by providing additional context surrounding the annotation (i.e., we do not see evidence that increasing context helps, and it may in fact hurt). We hypothesize that, in fact, less context may result in higher agreement due to the fact that humans can more readily call on conventionalized ‘‘default’’ interpretations. For example, in the case of single words, people likely default to reading them as referring expressions for a single entity/event, and thus make judgments consistent with the prototypical lexical entailment relations between these words.
Additional context provides increased opportunity for inferences based on pragmatics and world knowledge (e.g., inferences about the question under discussion and the speaker's intent), which are less likely to follow consistent conventions across all raters.

We consider this study exploratory, as there are some confounds. Most notably, increasing the amount of context clearly increases cognitive load on annotators, and thus we would expect to see increased variance even if there were no increase in actual interpretive disagreements. However, the increase in the Δ log likelihood metric is more meaningful, because randomly distributed noise (which we might expect in the case of high cognitive load/low annotator attention) should lead to higher variance but not multimodality. More work is needed to explore this trend further, and to determine whether increasing context would be a viable and productive means for reducing disagreements on this task.

5 Analysis of Model Predictions

5.1 Motivation

Another natural question arising from the analysis presented thus far is whether the phenomenon under investigation even poses a problem for NLP systems at all. That is, whether or not humans' judgments can be summarized by a single aggregate label or value might be a moot question, since state-of-the-art models do not, in practice, predict a single value but rather a distribution over values. It may be the case that these predicted distributions already reflect the distributions observed in the human judgments, and thus that the models can be viewed as already adequately capturing the aspects of semantic uncertainty that cause the observed human disagreements. We thus measure the extent to which the softmax distribution produced by a state-of-the-art NLI model, trained on the dataset from which the p/h pairs were drawn, reflects the same distribution as our observed human judgments.

5.2 Experimental Design

Data. NLI is standardly treated as a classification task. Thus, in order to interface with existing NLI models, we discretize[10] our collected human judgments by mapping the raw (unnormalized) score (which is between −50 and 50) into K evenly sized bins, where K is equal to the number of classes that were used in the original dataset from which the p/h pair was drawn. Specifically, for pairs drawn from datasets which use the three-way ENTAILMENT/CONTRADICTION/NEUTRAL labels (i.e., SNLI, MNLI, and RTE2), we consider human scores less than −16.7 to be CONTRADICTION, those greater than 16.7 to be ENTAILMENT, and those in between to be NEUTRAL. For the binary tasks (DNC), we use the same three-way thresholds, but consider scores below 16.7 to be NON-ENTAILMENT and those above to be ENTAILMENT.

[10] We experimented with multiple variations on this mapping, including using the z-normalized (rather than the raw) human scores, and using bins based on percentiles rather than evenly spaced over the full range. None of these variants noticeably affected the results of our analysis or the conclusions presented in the following section.
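A minimal sketch of this binning is given below, with the JOCI remapping described next treated as three-way. The threshold values follow the text; the function and variable names are ours, not from the released code.

```python
def discretize(raw_score: float, label_set: str) -> str:
    """Map a raw slider score in [-50, 50] to a discrete NLI label.

    Thresholds follow Section 5.2: +/-16.7 splits the range into three evenly
    sized bins for the three-way datasets; the binary datasets reuse only the
    upper threshold.
    """
    if label_set == "three-way":  # SNLI, MNLI, RTE2 (and the remapped JOCI)
        if raw_score < -16.7:
            return "CONTRADICTION"
        if raw_score > 16.7:
            return "ENTAILMENT"
        return "NEUTRAL"
    if label_set == "binary":     # DNC
        return "ENTAILMENT" if raw_score > 16.7 else "NON-ENTAILMENT"
    raise ValueError(f"unknown label set: {label_set}")

# Majority label over one pair's discretized judgments (scores invented).
scores = [42.0, 35.5, -3.0, 20.1, 18.9]
labels = [discretize(s, "three-way") for s in scores]
print(max(set(labels), key=labels.count))  # ENTAILMENT
```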
After some experimentation, we ultimately choose to map the JOCI scores to a three-way classification scheme as well, rather than the original five-way scheme, using 1 = CONTRADICTION, {2, 3, 4} = NEUTRAL, and 5 = ENTAILMENT. This decision was made after observing that, although our overall results and conclusions remained the same regardless of the way we performed the mapping, the three-way mapping led to higher levels of agreement between the original labels and our newly collected labels, and thus gave the model the best chance of learning the distribution against which it will be tested.[11] Agreement between the original labels (i.e., those in the published version of the data) and our discretized newly collected labels is given in the first column of Table 3. We note that measuring agreement and model accuracy in terms of these discrete distributions is not ideal, and it would be preferable to train the model to directly predict the full distributions, but because we do not have sufficient training data to do this (we only collected full distributions for 100 p/h pairs per dataset) we must work in terms of the discrete labels provided by the existing training datasets.

[11] We also try removing JOCI from our analysis entirely, since it is the noisiest dataset, and still reach the same conclusions from our subsequent analysis.

       Orig./Ours  BERT/Orig.  BERT/Ours  ∩
SNLI   0.790       0.890       0.830      76
MNLI   0.707       0.818       0.687      62
RTE2   0.690       0.460       0.470      36
DNC    0.780       0.900       0.800      74
JOCI   0.651       0.698       0.581      41

Table 3: Left to right: Agreement between datasets' original labels and the majority label according to our (discretized) re-annotation; accuracy of the BERT NLI model against original labels; accuracy of BERT against re-annotation labels; number of p/h pairs (out of 100) on which all three label sources (original, re-annotation, model prediction) agree on the most likely label. Our analysis in §5.3 is performed only over pairs in ∩.

Model. We use pretrained BERT (Devlin et al., 2019),[12] fine-tuned on the training splits of the datasets from which our test data was drawn. That is, we fine-tune BERT five times, once on each dataset, and then test each model on the subset of our re-annotated p/h pairs that were drawn from the dataset on which it was fine-tuned. We remove from each training set the 100 p/h pairs that we had re-annotated (i.e., the data we use for testing). We use the BERT NLI model off-the-shelf, without any changes to architecture, hyperparameters, or training setup. Table 3 shows the accuracy of each model on the test set (i.e., our 100 re-annotated sentences) when judged against 1) the original (discrete) label for that pair given in the standard version of the dataset (i.e., the same type of label on which the model was trained) and 2) our new (discretized) label derived from our re-annotation. Table 3 also gives the agreement between the original discrete labels and the discretized re-annotation labels.

[12] https://github.com/google-research/bert

Metrics. We want to quantify how well the model's predicted softmax distribution captures the distribution over possible labels we see when we solicit judgments from a large sample of annotators. To do this, we consider the model softmax to be a basic multinomial distribution, and compute 1) the probability of the observed human labels under that multinomial and 2) the cross-entropy between the softmax and the observed human distributions.
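The sketch below shows one way these two quantities could be computed for a single p/h pair, along with the random-sample reference point described next. The softmax values and label counts are invented for illustration, and the cross-entropy shown is one natural reading of the metric (empirical human distribution against the model's softmax); this is not the authors' evaluation code.

```python
import numpy as np
from scipy.stats import multinomial

# One hypothetical p/h pair: model softmax and discretized human label counts.
softmax = np.array([0.05, 0.15, 0.80])   # P(CONTRADICTION, NEUTRAL, ENTAILMENT)
human_counts = np.array([1, 15, 24])     # discretized judgments from ~40 raters
n = human_counts.sum()

# 1) Log probability of the observed label counts under the softmax multinomial.
log_prob_observed = multinomial.logpmf(human_counts, n, softmax)

# 2) Cross-entropy between the empirical human distribution and the softmax.
human_dist = human_counts / n
cross_entropy = -np.sum(human_dist * np.log(softmax))

# Reference point: the same quantity for a sample actually drawn from the softmax,
# i.e., what we would expect if raters behaved like the model's distribution.
simulated_counts = np.random.multinomial(n, softmax)
log_prob_simulated = multinomial.logpmf(simulated_counts, n, softmax)

print(log_prob_observed, cross_entropy, log_prob_simulated)
```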
As a point of comparison, we compute the same metrics for a random sample, of equal size to the set of observed labels, drawn from the multinomial defined by the softmax. We focus only on p/h pairs on which all three label sources (i.e., the original label provided by the official dataset, the new label we produce by taking the majority vote of our newly collected, discretized human judgments, and the model’s prediction) agree. That is, because we want to evaluate whether the model captures the distrib- ution (not just the majority class that it was trained to predict) we want to focus only on cases where it at least gets the majority class right. Because we want to compare against the full distribution of discretized human labels we collected, we don’t want to consider cases where the majority class according to this distribution disagrees with the majority class according to the model’s training data, since this would unfairly penalize the model. Table 3 shows the number of pairs (out of 100) on which these three label sources agree, for each dataset. 5.3 Results Overall, the softmax is a poor approximation of the distribution observed across the human 686 Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00293 by Carnegie Mellon University user on 06 April 2021 https://github.com/google-research/bert Cross Ent. Log Prob. Exp. 0.03 (0.03, 0.03) −1.6 (−1.7, −1.5) Obs. 0.37 (0.33, 0.42) −21.5 (−22.6, −20.1) Table 4: Softmax is not a good estimate of the distribution of human labels. Exp. refers to the sim- ilarity values we expect due to random variation (i.e., what we get when we compute against a ran- dom sample drawn from the multinomial defined by the softmax). Obs. refers to the similarity values between the softmax distribution and the human distribution. Numbers in parentheses give 95% confidence intervals. Results are effectively the same for each of individual corpora, so we report only the aggregate results. judges. The log probability assigned to the ob- servations (i.e., the set of human labels) by the predicted (softmax) multinomial is significantly and substantially lower than the probability that we would expect to be assigned if the observations had been in fact sampled from the predicted distribution. Similarly, the cross entropy between the predicted and the observed distribution is significantly higher than what can be attributed to random noise (Table 4). Figure 7 shows some examples of p/h pairs for which the softmax substantially misrepresents the nature of the uncertainty that exists among the human labels, in one case because the model predicts with certainty when humans find the judgment ambiguous (due to the need to resolve an ambiguous co-reference) and in the other because the model suggests ambiguity when humans are in clear consensus. Overall, the results indicate that while softmax allows the model to represent uncertainty in the NLI task, this uncertainty does not necessarily mimic the uncertainty that exists among humans’ perceptions about which inferences can and cannot be made. It is worth noting that the softmax distributions tend to reflect the model’s confidence on the dataset as a whole, rather than uncertainty on individual examples. For example, in the RTE2 dataset, the model nearly always splits probability mass over multiple labels, whereas in SNLI, the model typically concentrates probability mass onto a single label. 
This is not surprising behavior, but serves to corroborate the claim that modeling probabilistic entailment via softmax layers does Figure 7: Examples of p/h pairs on which the model’s predictions about the distribution (blue) misrepresent the nature of the uncertainty observed among human judgments (orange). In the first example (from RTE2) the model assumes ambiguity when humans consider the inference to be unambiguous (Cross-Ent = 0.36; PMF = 2.2e-6). In the second example (from SNLI) the model is certain when humans are actually in disagreement (Cross-Ent = 0.43; PMF = 5.9e-18) not correspond to modeling annotator uncertainty about inference judgments on specific items. 6 Discussion The results in Sections 4 and 5 suggest that 1) human NLI judgments are not adequately captured by a single aggregate score and 2) NLI systems trained to predict an aggregate score do not learn human-like models of uncertainty ‘‘for free’’. These takeaways are significant for work in computational semantics and language technology in general primarily because NLI has, historically (Cooper et al., 1996; Dagan et al., 2006) as well as presently (White et al., 2017), been proposed as a means for evaluating a model’s ‘‘intrinsic’’ understanding of language: As originally framed by Dagan et al. (2006), NLI was proposed as an intermediate task for evaluating whether a model will be useful in applications, and currently, NLI is increasingly used as a means for ‘‘probing’’ neural models to assess their knowledge of arbitrary linguistic phenomena (Dasgupta et al., 2018; Ettinger et al., 2018; Poliak et al., 2018b; White et al., 2017; Poliak et al., 2018a; McCoy et al., 2019). In other words, NLI has largely become an evalu- ation lingua franca through which we diagnose what a semantic representation knows. With the increased interest in ‘‘general-purpose’’, ‘‘task 687 Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00293 by Carnegie Mellon University user on 06 April 2021 independent’’ semantic representations,13,14 it is particularly important that intrinsic evaluations are reliable, if comparison of such representations are to be meaningful. As discussed, the preference among many in NLP (the authors included) is to avoid tasks which take a prescriptivist approach to language and meaning. Instead, we attempt to design tasks which capture humans’ linguistic behavior in as natural a setting as possible (acknowledging that truly natural annotation is difficult) with the hope that models trained to perform such tasks will be the best match for the ‘‘real world’’ settings in which we hope to deploy them. That is, we generally prefer to punt on precise definitions, and instead train our models to ‘‘do what humans do’’. In this paper, we have shown that defining ‘‘what humans do’’ is not straightforward, as humans do not necessarily handle ambiguity or communi- cate uncertainty in the same way as one another. Thus, as was the case for pipelined systems (Zadrozny and Elkan, 2002; Finkel et al., 2006; Bunescu, 2008) and related discussions of model calibration (Kuleshov and Liang, 2015), we argue that the best approach is to propagate uncertainty downstream, so that end tasks can decide if and how to handle inferences on which humans are likely to disagree. 
From the point of view of current neural NLI models—and the sentence encoders on top of which they are built—this means that a representation should be evaluated in terms of its ability to predict the full distribution of human inferences (e.g., by reporting cross- entropy against a distribution of human ratings), rather than to predict a single aggregate score (e.g., by reporting accuracy against a discrete ma- jority label or correlation with a mean score). We have shown that models that are trained to predict an aggregate score do not, by default, model the same type of uncertainty as that which is captured by distributions over many human raters’ judgments. Thus, several challenges would need to be overcome to switch to the proposed NLI evaluation. First, NLI evaluation sets would need to be annotated by sufficiently many raters such that we can have an accurate estimate of the distribution against which to evaluate. Although the data collected for the purposes of this paper 13https://www.clsp.jhu.edu/workshops/18- workshop/general-purpose-sentence- representation-learning/ 14https://repeval2019.github.io could serve as a start towards this end, a larger effort to augment or replace existing evaluation sets with full distributions of judgments would be necessary in order to yield a meaningful redefinition of the NLI task. Second, changes would be required to enable models to learn to predict these distributions. One approach could be to annotate training data, not just evaluation data, with full distributions, and optimize for the objective directly. This would clearly incur additional costs, but could be overcome with more creative crowdsourcing techniques (Dumitrache et al., 2013; Poesio et al., 2019). However, re- quiring direct supervision of full distributions is arguably an unsatisfying solution: Rarely if ever do humans witness multiple people responding to identical stimuli. Rather, more plausibly, we form generalizations about the linguistic phenomena that give rise to uncertainty on the basis of a large number of singly labeled examples. Thus, ideally, progress can be made by developing new architectures and/or training objectives that enable models to learn a notion of uncertainty that is consistent with the full range of possible human inferences, despite observing labels from only one or a few people on any given p/h pair. Overcoming these challenges, and moving towards models which can both understand sources of linguistic uncertainty and anticipate the range of ways that people might resolve it would be exciting both for NLI and for representation learning in general. 7 Related Work Defining Entailment and NLI. As outlined in Section 2, there has been substantive discussion about the definition of the NLI task. This debate can largely be reduced to a debate about sentence meaning versus speaker meaning. The former aligns more closely with the goals of formal semantics and seeks a definition of the NLI task that precisely circumscribes the ways in which vague notions of ‘‘world knowledge’’ and ‘‘com- mon sense’’ can factor into inference (Zaenen et al., 2005). The latter takes the perspective that the NLI task should maintain an informal defi- nition in which p → h as long as h is some- thing that a human would be ‘‘happy to infer’’ from p, where the humans making the inferences are assumed to be ‘‘awake, careful, moderately intelligent and informed . . . but not . . . seman- ticists or similar academics’’ (Manning, 2006). 
688 Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00293 by Carnegie Mellon University user on 06 April 2021 https://www.clsp.jhu.edu/workshops/18-workshop/general-purpose-sentence-representation-learning/ https://www.clsp.jhu.edu/workshops/18-workshop/general-purpose-sentence-representation-learning/ https://www.clsp.jhu.edu/workshops/18-workshop/general-purpose-sentence-representation-learning/ https://repeval2019.github.io Garoufi (2007) provides an overview of attempts that have been made to circumscribe the annota- tion process by providing finer-grained annota- tion options, in order to bring it more in line with the sentence-meaning task definition. Westera and Boleda (2019), in the context of advocating for distributional models of semantics in general, makes a case in favor of the speaker-meaning ap- proach, arguing that issues like entailment, refer- ence, and truth conditions should not fall within the purview of sentence meaning at all, despite being quintessential topics of formal semantic study. Chatzikyriakidis et al. (2017) overview NLI datasets, observing that datasets tend to be designed with one of these perspectives in mind, and thus all datasets ‘‘fail to capture the wealth of inferential mechanisms present in NLI and seem to be driven by the dominant discourse in the field at the time of their creation.’’ An orthogonal line of discussion about the definition of entailment focuses on the question of whether truth-conditional semantics should be strictly binary (propositions are either true or false) or rather treated as continuous/probabilistic val- ues. Currently, at least within computationally minded work on textual inference, the prevailing opinion is in favor of the latter (i.e., allowing semantic judgments to be probabilistic) with few (if any) advocating that we should build systems that only support discrete true/false decisions. Still, significant theoretical and algorithmic work has gone into making probabilistic logics work in practice. Such work includes (controversial) for- malisms such as fuzzy set theory (Zadeh, 1994, 1996), as well as more generally accepted formal- isms which assume access to boolean ground- ings, such as probabilistic soft logic (Friedman et al., 1999; Kimmig et al., 2012; Beltagy et al., 2014) and Markov logic networks (Richardson and Domingos, 2006). Also related is work on collecting and analyzing graded entailment judg- ments (de Marneffe et al., 2012). We note that the question of strict vs. graded entailment judg- ments pertains to modeling of uncertainty within an individual rater’s judgments. This is indepen- dent of the question of if/how to model disagree- ments between raters, which is the our focus in this work. Embracing Rater Disagreement. Significant past work has looked an annotator disagreement in linguistic annotations, and has advocated that this disagreement should be taken as signal rather than noise (Aroyo et al., 2018; Palomaki et al., 2018). Plank et al. (2014) showed that incorporating rater uncertainty into the loss function for a POS tagger improves downstream performance. Similar approaches have been applied in parsing (Martı́nez Alonso et al., 2015) and supersense tagging (Martı́nez Alonso et al., 2016). 
Specif- ically relevant to this work is past discussion of disagreement on semantic annotation tasks, in- cluding anaphora resolution (Poesio and Artstein, 2005), coreference (Versley, 2008; Recasens et al., 2011), word sense disambiguation (Erk and McCarthy, 2009; Passonneau et al., 2012; Jurgens, 2013), veridicality (Geis and Zwicky, 1971; Karttunen et al., 2014; de Marneffe et al., 2012), semantic frames (Dumitrache et al., 2019), and grounding (Reidsma and op den Akker, 2008). Most of this work focuses on the uncertainty of individual raters, oftentimes concluding that such uncertainty can be addressed by shifting to a graded rather than discrete labeling schema and/or that uncertainty can be leveraged as a means for detecting inherently ambiguous items. In contrast, we do not look at measures of uncertainty/ambiguity from the point of view of an individual (though this is a very interest- ing question); rather, we focus on disagreements that exist between raters. We agree strongly that semantic judgments should be treated as graded, and that ambiguous items should be acknowl- edged as such. Still, this is independent of the issue of inter-rater disagreement: Two raters can disagree when making graded judgments as much as they can when making discrete judgments, and they can disagree when they are both uncertain as much as they can when they are both certain. Thus, the central question of this work is whether aggre- gation (via average or majority vote) is a faith- ful representation of the underlying distribution of judgments across annotators. Arguably, such aggregation is a faithful (albeit lossy) representa- tion of high-variance unimodal distributions, but not of multi-modal ones. In this regard, particularly relevant to our work is de Marneffe et al. (2012) and de Marneffe et al. (2018), who observed similarly persistent disagreement in graded judgments of veridicality, and made a case for attempting to model the full distribution as opposed to a single aggregate score. Smith et al. (2013) present related theoretical work, which proposes specific mechanisms by 689 Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00293 by Carnegie Mellon University user on 06 April 2021 which humans might handle lexical uncertainty in the context of inference. Their model assumes pragmatic speakers and listeners who reason simultaneously about one another’s goals and about the lexicon itself, and could be used to explain differing inferences in cases where raters share different beliefs about the speaker (author) of p and/or about the lexicon. Schaekermann et al. (2016) develop a proof-of-concept annotation in- terface specifically intended to recognize whether or not inter-rater disagreement is ‘‘resolvable’’ via more annotation, or rather is likely to persist, although they don’t discuss natural language semantics directly. Finally, Tanenhaus et al. (1985) discuss the role of formal semantics and generative grammar in inference, and specifically differentiates between work which treats grammar as a causal process of how inferences occur versus work which treats grammar as a descriptive framework of the structure of language. Such dis- cussion is relevant going forward, as engineers of NLI systems must determine both how to define the evaluation task, as well as the role that concepts from formal semantics should play within such systems. 8 Conclusion We provide an in-depth study of disagreements in human judgments on the NLI task. 
8 Conclusion

We provide an in-depth study of disagreements in human judgments on the NLI task. We show that many disagreements persist even after increasing the number of annotators and the amount of context provided, and that models which represent these annotations as multimodal distributions generalize better to held-out data than those which do not. We evaluate whether a state-of-the-art NLI model (BERT) captures these disagreements by virtue of producing softmax distributions over labels and show that it does not. We argue that, if NLI is to serve as an adequate intrinsic evaluation of semantic representations, then models should be evaluated in terms of their ability to predict the full expected distribution over all human raters, rather than a single aggregate score.

Acknowledgments

Thank you to the Action Editor Chris Potts and the anonymous reviewers for their input on earlier drafts of this paper. This work evolved substantially as a result of their suggestions and feedback. Thank you to Dipanjan Das, Michael Collins, Sam Bowman, Ankur Parikh, Emily Pitler, Yuan Zhang, and the rest of the Google Language team for many useful discussions.

References

Lora Aroyo, Anca Dumitrache, Praveen Paritosh, Alex Quinn, and Chris Welty, editors. 2018. Proceedings of the Workshop on Subjectivity, Ambiguity and Disagreement in Crowdsourcing (SAD), volume 1 of 1. HCOMP, Zurich, Switzerland.

Islam Beltagy, Katrin Erk, and Raymond Mooney. 2014. Probabilistic soft logic for semantic textual similarity. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1210–1219, Baltimore, MD. Association for Computational Linguistics.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.

Razvan Bunescu. 2008. Learning with probabilistic features for improved pipeline models. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 670–679, Honolulu, HI. Association for Computational Linguistics.

Chris Callison-Burch and Mark Dredze. 2010. Creating speech and language data with Amazon's Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 1–12, Los Angeles, CA. Association for Computational Linguistics.

Stergios Chatzikyriakidis, Robin Cooper, Simon Dobnik, and Staffan Larsson. 2017. An overview of natural language inference data collection: The way forward? In Proceedings of the Computing Natural Language Inference Workshop.

Timothy Chklovski and Patrick Pantel. 2004. VerbOcean: Mining the Web for fine-grained semantic verb relations. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 33–40, Barcelona, Spain. Association for Computational Linguistics.

Robin Cooper, Dick Crouch, Jan Van Eijck, Chris Fox, Johan Van Genabith, Jan Jaspars, Hans Kamp, David Milward, Manfred Pinkal, Massimo Poesio, and Steve Pulman. 1996. Using the framework. Technical Report LRE 62-051 D-16, The FraCaS Consortium.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges.
Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, pages 177–190. Springer.

Ishita Dasgupta, Demi Guo, Andreas Stuhlmüller, Samuel J. Gershman, and Noah D. Goodman. 2018. Evaluating compositionality in sentence embeddings. arXiv preprint arXiv:1802.04302.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, MN. Association for Computational Linguistics.

Anca Dumitrache, Lora Aroyo, and Chris Welty. 2019. A crowdsourced frame disambiguation corpus with ambiguity. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2164–2170, Minneapolis, MN. Association for Computational Linguistics.

Anca Dumitrache, Lora Aroyo, Christopher A. Welty, Robert-Jan Sips, and Anthony Levas. 2013. Dr. Detective: Combining gamification techniques and crowdsourcing to create a gold standard for the medical domain. In CrowdSem Workshop at the International Semantic Web Conference.

Katrin Erk and Diana McCarthy. 2009. Graded word sense assignment. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 440–449, Singapore. Association for Computational Linguistics.

Allyson Ettinger, Ahmed Elgohary, Colin Phillips, and Philip Resnik. 2018. Assessing composition in sentence vector representations. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1790–1801, Santa Fe, NM. Association for Computational Linguistics.

Christiane Fellbaum. 1998. WordNet. Wiley Online Library.

Jenny Rose Finkel, Christopher D. Manning, and Andrew Y. Ng. 2006. Solving the problem of cascading errors: Approximate Bayesian inference for linguistic annotation pipelines. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 618–626, Sydney, Australia. Association for Computational Linguistics.

Nir Friedman, Lise Getoor, Daphne Koller, and Avi Pfeffer. 1999. Learning probabilistic relational models. In IJCAI, volume 99, pages 1300–1309.

Konstantina Garoufi. 2007. Towards a Better Understanding of Applied Textual Entailment. Ph.D. thesis, Citeseer.

Michael L. Geis and Arnold M. Zwicky. 1971. On invited inferences. Linguistic Inquiry, 2(4):561–566.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In COLING 1992 Volume 2: The 15th International Conference on Computational Linguistics.

Rafal Józefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.

David Jurgens. 2013. Embracing ambiguity: A comparison of annotation methodologies for crowdsourcing word sense labels. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 556–562, Atlanta, GA. Association for Computational Linguistics.

Lauri Karttunen, Stanley Peters, Annie Zaenen, and Cleo Condoravdi. 2014.
The chameleon-like nature of evaluative adjectives. Empirical Issues in Syntax and Semantics, 10:233–250.

Angelika Kimmig, Stephen Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. 2012. A short introduction to probabilistic soft logic. In Proceedings of the NIPS Workshop on Probabilistic Programming: Foundations and Applications, pages 1–4.

Volodymyr Kuleshov and Percy S. Liang. 2015. Calibrated structured prediction. In Advances in Neural Information Processing Systems, pages 3474–3482.

Christopher D. Manning. 2006. Local textual inference: It's hard to circumscribe, but you know it when you see it – and NLP needs it. https://nlp.stanford.edu/manning/papers/Textual Inference.pdf

Marie-Catherine de Marneffe, Mandy Simons, and Judith Tonhauser. 2018. Factivity in doubt: Clause-embedding predicates in naturally occurring discourse. Sinn und Bedeutung 23 (Poster).

Marie-Catherine de Marneffe, Christopher D. Manning, and Christopher Potts. 2012. Did it happen? The pragmatic complexity of veridicality assessment. Computational Linguistics, 38(2):301–333.

Héctor Martínez Alonso, Anders Johannsen, and Barbara Plank. 2016. Supersense tagging with inter-annotator disagreement. In Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016), pages 43–48, Berlin, Germany. Association for Computational Linguistics.

Héctor Martínez Alonso, Barbara Plank, Arne Skjærholt, and Anders Søgaard. 2015. Learning to parse with IAA-weighted loss. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1357–1361, Denver, CO. Association for Computational Linguistics.

Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics.

Richard Montague. 1970. Universal grammar. Theoria, 36(3):373–398.

Jennimaria Palomaki, Olivia Rhinehart, and Michael Tseng. 2018. A case for a range of acceptable annotations. In Proceedings of the Workshop on Subjectivity, Ambiguity, and Disagreement (SAD). HCOMP.

Rebecca J. Passonneau, Vikas Bhardwaj, Ansaf Salleb-Aouissi, and Nancy Ide. 2012. Multiplicity and word sense: Evaluating and learning from multiply labeled word sense annotations. Language Resources and Evaluation, 46(2):219–252.

Ellie Pavlick and Chris Callison-Burch. 2016a. Most ‘‘babies'' are ‘‘little'' and most ‘‘problems'' are ‘‘huge'': Compositional entailment in adjective-nouns. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2164–2173, Berlin, Germany. Association for Computational Linguistics.

Ellie Pavlick and Chris Callison-Burch. 2016b. So-called non-subsective adjectives. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics, pages 114–119, Berlin, Germany. Association for Computational Linguistics.

Barbara Plank, Dirk Hovy, and Anders Søgaard. 2014. Learning part-of-speech taggers with inter-annotator agreement loss. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 742–751, Gothenburg, Sweden. Association for Computational Linguistics.

Massimo Poesio and Ron Artstein. 2005.
The reliability of anaphoric annotation, reconsidered: Taking ambiguity into account. In Proceedings of the Workshop on Frontiers in Corpus Annotations II: Pie in the Sky, pages 76–83, Ann Arbor, MI. Association for Computational Linguistics.

Massimo Poesio, Jon Chamberlain, Silviu Paun, Juntao Yu, Alexandra Uma, and Udo Kruschwitz. 2019. A crowdsourced corpus of multiple judgments and disagreement on anaphoric interpretation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1778–1789, Minneapolis, MN. Association for Computational Linguistics.

Adam Poliak, Yonatan Belinkov, James Glass, and Benjamin Van Durme. 2018a. On the evaluation of semantic phenomena in neural machine translation using natural language inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 513–523, New Orleans, LA. Association for Computational Linguistics.

Adam Poliak, Aparajita Haldar, Rachel Rudinger, J. Edward Hu, Ellie Pavlick, Aaron Steven White, and Benjamin Van Durme. 2018b. Collecting diverse natural language inference problems for sentence representation evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 67–81, Brussels, Belgium. Association for Computational Linguistics.

Marta Recasens, Eduard Hovy, and M. Antònia Martí. 2011. Identity, non-identity, and near-identity: Addressing the complexity of coreference. Lingua, 121(6):1138–1152.

Dennis Reidsma and Rieks op den Akker. 2008. Exploiting ‘subjective' annotations. In Coling 2008: Proceedings of the Workshop on Human Judgements in Computational Linguistics, pages 8–16, Manchester, UK. Coling 2008 Organizing Committee.

Matthew Richardson and Pedro Domingos. 2006. Markov logic networks. Machine Learning, 62(1–2):107–136.

Mike Schaekermann, Edith Law, Alex C. Williams, and William Callaghan. 2016. Resolvable vs. irresolvable ambiguity: A new hybrid framework for dealing with uncertain ground truth. In 1st Workshop on Human-Centered Machine Learning at SIGCHI.

Mandy Simons, Judith Tonhauser, David Beaver, and Craige Roberts. 2010. What projects and why. Semantics and Linguistic Theory, 20:309–327.

Nathaniel J. Smith, Noah Goodman, and Michael Frank. 2013. Learning and using language via recursive pragmatic reasoning about other agents. In Advances in Neural Information Processing Systems, pages 3039–3047.

Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Ng. 2008. Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 254–263, Honolulu, HI. Association for Computational Linguistics.

M. Tanenhaus, G. Carlson, and M. S. Seidenberg. 1985. Do listeners compute linguistic representations? In D. Dowty, L. Karttunen, and A. Zwicky, editors, Natural Language Parsing. Cambridge University Press.

Judith Tonhauser, David I. Beaver, and Judith Degen. 2018. How projective is projective content? Gradience in projectivity and at-issueness. Journal of Semantics, 35(3):495–542.

Y. Versley. 2008.
Vagueness and referential ambiguity in a large-scale annotated corpus. Research on Language and Computation, 6:333–353.

Matthijs Westera and Gemma Boleda. 2019. Don't blame distributional semantics if it can't do entailment. In Proceedings of the 13th International Conference on Computational Semantics - Long Papers, pages 120–133, Gothenburg, Sweden. Association for Computational Linguistics.

Aaron S. White, Valentine Hacquard, and Jeffrey Lidz. 2018. Semantic information and the syntax of propositional attitude verbs. Cognitive Science, 42(2):416–456.

Aaron Steven White, Pushpendre Rastogi, Kevin Duh, and Benjamin Van Durme. 2017. Inference is everything: Recasting semantic resources into a unified evaluation framework. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 996–1005, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Aaron Steven White and Kyle Rawlins. 2017. The role of veridicality and factivity in clause selection. In 48th Annual Meeting of the North East Linguistic Society, Reykjavík. http://iceland2017.nelsconference.org/wp-content/uploads/2017/08/White-Rawlins.pdf.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, LA. Association for Computational Linguistics.

Lotfi A. Zadeh. 1994. Fuzzy logic, neural networks, and soft computing. Communications of the ACM, 37(3):77–84.

Lotfi A. Zadeh. 1996. Fuzzy logic = computing with words. IEEE Transactions on Fuzzy Systems, 4(2):103–111.

Bianca Zadrozny and Charles Elkan. 2002. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 694–699. ACM.

Annie Zaenen, Lauri Karttunen, and Richard Crouch. 2005. Local textual inference: Can it be defined or circumscribed? In Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, pages 31–36, Ann Arbor, MI. Association for Computational Linguistics.

Sheng Zhang, Rachel Rudinger, Kevin Duh, and Benjamin Van Durme. 2017. Ordinal common-sense inference. Transactions of the Association for Computational Linguistics, 5:379–395. https://www.aclweb.org/anthology/Q17-1027. doi:10.1162/tacl_a_00068.