Latent Structures for Coreference Resolution Sebastian Martschat and Michael Strube Heidelberg Institute for Theoretical Studies gGmbH Schloss-Wolfsbrunnenweg 35, 69118 Heidelberg, Germany (sebastian.martschat|michael.strube)@h-its.org Abstract Machine learning approaches to coreference resolution vary greatly in the modeling of the problem: while early approaches operated on the mention pair level, current research fo- cuses on ranking architectures and antecedent trees. We propose a unified representation of different approaches to coreference reso- lution in terms of the structure they operate on. We represent several coreference reso- lution approaches proposed in the literature in our framework and evaluate their perfor- mance. Finally, we conduct a systematic anal- ysis of the output of these approaches, high- lighting differences and similarities. 1 Introduction Coreference resolution is the task of determining which mentions in a text are used to refer to the same real-world entity. The era of statistical natural lan- guage processing saw the shift from rule-based ap- proaches (Hobbs, 1976; Lappin and Leass, 1994) to increasingly sophisticated machine learning models. While early approaches cast the problem as binary classification of mention pairs (Soon et al., 2001), recent approaches make use of complex structures to represent coreference relations (Yu and Joachims, 2009; Fernandes et al., 2014). The aim of this paper is to devise a framework for coreference resolution that leads to a unified rep- resentation of different approaches to coreference resolution in terms of the structure they operate on. Previous work in other areas of natural lan- guage processing such as parsing (Klein and Man- ning, 2001) and machine translation (Lopez, 2009) has shown that providing unified representations of approaches to a problem deepens its understanding and can also lead to empirical improvements. By im- plementing popular approaches in this framework, we can highlight structural differences and similar- ities between them. Furthermore, this establishes a setting to systematically analyze the contribution of the underlying structure to performance, while fix- ing parameters such as preprocessing and features. In particular, we analyze approaches to corefer- ence resolution and point out that they mainly dif- fer in the structures they operate on. We then note that these structures are not annotated in the train- ing data (Section 2). Motivated by this observation, we develop a machine learning framework for struc- tured prediction with latent variables for coreference resolution (Section 3). We formalize the mention pair model (Soon et al., 2001; Ng and Cardie, 2002), mention ranking architectures (Denis and Baldridge, 2008; Chang et al., 2012) and antecedent trees (Fer- nandes et al., 2014) in our framework and high- light key differences and similarities (Section 4). Fi- nally, we present an extensive comparison and anal- ysis of the implemented approaches, both quantita- tive and qualitative (Sections 5 and 6). Our analy- sis shows that a mention ranking architecture with latent antecedents performs best, mainly due to its ability to structurally model determining anaphoric- ity. Finally, we briefly describe how entity-centric approaches fit into our framework (Section 7). An open source toolkit which implements the ma- chine learning framework and the approaches dis- cussed in this paper is available for download1. 
1http://smartschat.de/software 405 Transactions of the Association for Computational Linguistics, vol. 3, pp. 405–418, 2015. Action Editor: Mark Johnson. Submission batch: 3/2015; Revision batch 6/2015; Published 7/2015. c©2015 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license. 2 Modeling Coreference Resolution The aim of automatic coreference resolution is to predict a clustering of mentions such that each clus- ter contains all mentions that are used to refer to the same entity. However, most coreference resolution models reduce the problem to predicting coreference between pairs of mentions, and jointly or cascad- ingly consolidating these predictions. Approaches differ in the scope (pairwise, per anaphor, per docu- ment, ...) they employ while learning a scoring func- tion for these pairs, and the way the consolidating is handled. The different ways to employ the scope and to consolidate decisions can be understood as operat- ing on latent structures: as pairwise links are not annotated in the data, coreference approaches create structures (either heuristically or data-driven) that guide the learning of the pairwise scoring function. To understand this better, let us consider two ex- amples. Mention pair models (Soon et al., 2001; Ng and Cardie, 2002) cast the problem as first cre- ating a list of mention pairs, and deciding for each pair whether the two mentions are coreferent. Af- terwards the decisions are consolidated by a cluster- ing algorithm such as best-first or closest-first. We therefore can consider this approach to operate on a list of mention pairs where each pair is handled in- dividually. In contrast, antecedent tree models (Fer- nandes et al., 2014; Björkelund and Kuhn, 2014) consider the whole document at once and predict a tree consisting of anaphor-antecedent pairs. 3 A Structured Prediction Framework In this section we introduce a structured prediction framework for learning coreference predictors with latent variables. When devising the framework, we focus on accounting for the latent structures under- lying coreference resolution approaches. The frame- work is a generalization of previous work on latent antecedents and trees for coreference resolution (Yu and Joachims, 2009; Chang et al., 2012; Fernandes et al., 2014). 3.1 Setting In all prediction tasks, the goal is to learn a mapping f from inputs x ∈ X to outputs y ∈ Yx. A predic- tion task is structured if the output elements y ∈Yx exhibit some structure. As we work in a latent vari- able setting, we assume that Yx = Hx ×Zx, and therefore y = (h,z) ∈ Hx × Zx. We call h the hidden or latent part, which is not observed in the data, and z the observed part (during training). We assume that z can be inferred from h, and that in a pair (h,z), h and z are always consistent. We first define the input space X and the output spaces Hx and Zx for x ∈X . 3.2 The Input Space X The input space consists of documents. We repre- sent a document x ∈ X as follows. Let us assume that Mx is the set of mentions (expressions which may be used to refer to entities) in the document. We write Mx = {m1, . . . ,mk}, where the mi are in ascending order with respect to their position in the document. We then consider M0x = {m0} ∪ Mx, where m0 precedes every mi ∈ Mx (Chang et al., 2012; Fernandes et al., 2014). m0 plays the role of a dummy mention for anaphoricity detection: if m0 is chosen as the an- tecedent, the corresponding mention is deemed as non-anaphoric. 
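As an illustration of this input representation, the following is a minimal sketch (our own, not taken from the paper or its toolkit) of a document with its ordered mention list Mx and the prepended dummy mention m0; all class and field names are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass(frozen=True)
class Mention:
    """A mention; index 0 is reserved for the dummy mention m_0."""
    index: int                    # position in the document (0 = dummy)
    tokens: Tuple[str, ...] = ()  # surface tokens; empty for the dummy

@dataclass
class Document:
    """A document x with its ordered mention list M_x = {m_1, ..., m_k}."""
    mentions: List[Mention] = field(default_factory=list)

    def mentions_with_dummy(self) -> List[Mention]:
        """M_x^0 = {m_0} ∪ M_x, with m_0 preceding every real mention."""
        return [Mention(index=0)] + self.mentions

# Choosing m_0 as the antecedent of a later mention marks that mention
# as non-anaphoric.
doc = Document([Mention(1, ("Obama",)), Mention(2, ("the", "president")), Mention(3, ("he",))])
print([m.index for m in doc.mentions_with_dummy()])  # [0, 1, 2, 3]
```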
This enables joint coreference reso- lution and anaphoricity determination. 3.3 The Latent Space Hx for an Input x Let x ∈X be some document. As we saw in the pre- vious section, approaches to coreference resolution predict a latent structure which is not annotated in the data but is used to infer coreference information. Inspired by previous work on coreference (Bengtson and Roth, 2008; Fernandes et al., 2014; Martschat and Strube, 2014), we now develop a graph-based representation for these structures. A valid latent structure for the document x is a labeled directed graph h = (V,A,LA) where • the set of nodes are the mentions, V = M0x, • the set of edges A consists of links between mentions pointing back in the text, A ⊆{(mj,mi) |j > i}⊆ Mx ×M0x. • LA : A → L assigns a label ` ∈ L to each edge. L is a finite set of labels, for example signaling coreference or non-coreference. We split h into subgraphs (called substructures from now on), which we notate as h = h1⊕. . .⊕hn, 406 with hi = (Vi,Ai,LAi) ∈ Hx,i, where Hx,i is the latent space for an input x restricted to the mentions appearing in hi. hi encodes coreference decisions for a subset of mentions in x. m0 m1 m2 m3 − + − − + + Figure 1: Graph-based representation of the mention pair model. The dashed box shows one substructure of the structure. Figure 1 depicts a graph that captures the latent structure underlying the mention pair model. Men- tion pairs are represented as node connected by an edge. The edge either has label “+” (if the mentions are coreferent) or “−” (otherwise). As the mention pair model considers each mention pair individually, each edge is one substructure of the latent structure (expressed via the dashed box). We describe this representation in more detail in Section 4.1. 3.4 The Observed Output Space Zx for an Input x Let x ∈X be some document. The observed output space consists of all functions ex : Mx → N that map mentions to entity identifiers. Two mi,mj ∈ Mx are coreferent if and only if ex(mi) = ex(mj). ex is inferred from the latent structure, e.g. by taking the transitive closure over coreference decisions. This representation corresponds to the way coref- erence is annotated in corpora. 3.5 Linear Models Let us write H = ∪x∈XHx for the full latent space (analogously Z). Our goal is to learn the mapping f : X → H×Z. We assume that the mapping is parametrized by a weight vector θ ∈ Rd, and there- fore write f = fθ. We restrict ourselves to linear models. That is, fθ(x) = arg max (h,z)∈Hx×Zx 〈θ,φ(x,h,z)〉, where φ: X ×H×Z → Rd is a joint feature func- tion for inputs and candidate outputs. Since h = h1 ⊕ . . .⊕hn, we have fθ(x) = arg max (h,z)∈Hx×Zx 〈θ,φ(x,h,z)〉 = n⊕ i=1 arg max (hi,z)∈Hx,i×Zx 〈θ,φ(x,hi,z)〉. In this paper, we only consider feature functions which factor with respect to the edges in hi = (Vi,Ai,LAi), i.e. φ(x,hi,z) = ∑ a∈Ai φ(x,a,z). Hence, the features examine properties of mention pairs, such as head word of each mention, number of each mention, or the existence of a string match. We describe the feature set used for all approaches represented in our framework in Section 5.2. 3.6 Decoding Given an input x ∈ X and a weight vector θ ∈ Rd, we obtain the prediction by solving the arg max equation described in the previous subsection. This can be viewed as searching the output space Hx×Zx for the highest scoring output pair (h,z). The details of the search procedure depend on the space Hx of latent structures and the factorization into substructures. 
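To make the edge-factored scoring and the per-substructure maximization concrete, here is a small self-contained sketch (ours, not the authors' toolkit); the feature function phi, the candidate enumeration, and all names are illustrative assumptions.

```python
import numpy as np

def substructure_score(theta, arcs, phi):
    """<theta, phi(x, h_i, z)> for an edge-factored substructure h_i:
    the feature vector is the sum of the per-arc feature vectors phi(a)."""
    return sum(float(theta @ phi(arc)) for arc in arcs)

def decode(theta, candidate_substructures, phi):
    """argmax over the candidates of each substructure; the overall prediction
    is the concatenation h_1 ⊕ ... ⊕ h_n of the per-substructure maximizers."""
    return [max(cands, key=lambda arcs: substructure_score(theta, arcs, phi))
            for cands in candidate_substructures]

# Toy example: arcs are (anaphor, antecedent) index pairs and phi is a
# two-dimensional indicator-style feature function (illustrative only).
phi = lambda arc: np.array([1.0, float(arc[1] == 0)])   # [bias, dummy-antecedent]
theta = np.array([0.5, -1.0])
candidates = [
    [[(2, 0)], [(2, 1)]],            # candidate substructures for anaphor m_2
    [[(3, 0)], [(3, 1)], [(3, 2)]],  # candidate substructures for anaphor m_3
]
print(decode(theta, candidates, phi))  # picks a non-dummy arc for each anaphor
```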
For the structures we consider in this paper, the maximization can be solved exactly via greedy search. For structures with complex con- straints like transitivity, more complex or even ap- proximate search methods need to be used (Klenner, 2007; Finkel and Manning, 2008). 3.7 Learning We assume a supervised learning setting with latent variables, i.e., we have a training set of documents D = {( x(i),z(i) ) | i = 1, . . . ,m } at our disposal. Note that the latent structures are not encoded in this training set. In principle we would like to directly optimize for the evaluation metric we are interested in. Un- fortunately, the evaluation metrics used in corefer- ence do not allow for efficient optimization based on mention pairs, since they operate on the entity level. For example, the CEAFe metric (Luo, 2005) needs to compute optimal entity alignments between gold and system entities. These alignments do not factor with respect to mention pairs. We therefore have to use some surrogate loss. 407 Algorithm 1 Structured latent perceptron with cost- augmented inference. Input: Training set D, a cost function c, number of epochs n. function PERCEPTRON(D, c, n) set θ = (0, . . . ,0) for epoch = 1, . . . ,n do for (x,z) ∈D do for each substructure do ĥopt,i = arg max hi∈const(Hx,z,i) 〈θ,φ(x,hi,z)〉 (ĥi, ẑ) = arg max (hi,z)∈Hx,i×Zx (〈θ,φ(x,hi,z)〉 +c(x,hi, ĥopt,i,z)) if ĥi does not partially encode z then set θ = θ + φ(x,ĥopt,i,z) −φ(x,ĥi, ẑ) Output: A weight vector θ. We employ a structured latent perceptron (Sun et al., 2009) extended with cost-augmented inference (Crammer et al., 2006) to learn the parameters of the models we discuss. While this restricts us to a particular objective to optimize, it comes with var- ious advantages: the implementation is simple and fast, we can incorporate error functions via cost- augmentation, the structures are plug-and-play if we provide a decoder, and the (structured) perceptron with cost-augmented inference has exhibited good performance for coreference resolution (Chang et al., 2012; Fernandes et al., 2014). To describe the algorithm, we need some addi- tional terminology. Let (x,z) be a training exam- ple. Let (ĥ, ẑ) = fθ(x) be the prediction under the model parametrized by θ. Let Hx,z be the space of all latent structures for an input x that are consistent with a coreference output z. Structures in Hx,z pro- vide substitutes for gold structures in training. Some approaches restrict Hx,z, for example by learning only from the closest antecedent of a mention (Denis and Baldridge, 2008). Hence, we consider the con- strained space const(Hx,z) ⊆ Hx,z, where const is a function that depends on the approach in focus. ĥopt = arg max h∈const(Hx,z) 〈θ,φ(x,h,z)〉 is the optimal constrained latent structure under the current model which is consistent with z. We write ĥi and ĥopt,i for the ith substructure of the latent structure. To estimate θ, we iterate over the training data. For each input, we compute the optimal constrained prediction consistent with the gold information, ĥopt,i. We then compute the optimal prediction (ĥi, ẑ), but also include the cost function c in our maximization problem. This favors solutions with high cost, which leads to a large margin approach. If ĥi does not partially encode the gold data, we update the weight vector. This is repeated for a given number of epochs2. Algorithm 1 gives a more for- mal description. 4 Latent Structures In the previous section we developed a machine learning framework for coreference resolution. 
It is flexible with respect to • the latent structure h ∈Hx for an input x, • the substructures of h ∈Hx, • the constrained space of latent structures con- sistent with a gold solution const(Hx,z), and • the cost function c and its factorization. In this paper, we focus on giving a unified represen- tation and in-depth analysis of prevalent coreference models from the literature. Future work should in- vestigate devising and analyzing novel representa- tions for coreference resolution in the framework. We express three main coreference models in our framework, the mention pair model (Soon et al., 2001), the mention ranking model (Denis and Baldridge, 2008; Chang et al., 2012) and antecedent trees (Yu and Joachims, 2009; Fernandes et al., 2014; Björkelund and Kuhn, 2014). We character- ize each approach by the latent structure it operates on during learning and inference (we assume that all approaches we consider share the same features). Furthermore, we also discuss the factorization into substructures and typical cost functions used in the literature. 4.1 Mention Pair Model We first consider the mention pair model. In its orig- inal formulation, it extracts mention pairs from the 2We also shuffle the data before each epoch and use averag- ing (Collins, 2002). 408 data and labels these as positive or negative. During testing, all pairs are extracted and some clustering algorithm such as closest-first or best-first is applied to the list of pairs. During training, some heuristic is applied to help balancing positive and negative ex- amples. The most popular heuristic is to take the closest antecedent of an anaphor as a positive exam- ple, and all pairs in between as negative examples. Latent Structure. In our framework, we can rep- resent the mention pair model as a labeled graph. In particular, let the set of edges be all backward- pointing edges, i.e. A = {(mj,mi) |j > i}. In the testing phase, we operate on the whole set A. Dur- ing training, we consider only a subset of edges, as defined by the heuristic used by the approach. The labeling function maps a pair of mentions to a positive (“+”) or a negative label (“−”) via LA(mj,mi) = { + mj,mi are coreferent, − otherwise. One such graph is depicted in Figure 1 (Section 3). A clustering algorithm (like closest-first or best- first) is then employed to infer the coreference infor- mation from this latent structure. Substructures. In the mention pair model, the parts of the substructures are the individual edges: each pair of mentions is considered as an instance from which the model learns and which the model predicts individually. Cost Function. As discussed above, mention pair approaches employ heuristics to resample the training data. This is a common method to in- troduce cost-sensitivity into classification (Elkan, 2001; Geibel and Wysotzk, 2003). Hence, mention pair approaches do not use cost functions in addition to the resampling. 4.2 Mention Ranking Model The mention ranking model captures competition between antecedents: for each anaphor, the highest- scoring antecedent is selected. For training, this ap- proach needs gold antecedents to compare to. There are two main approaches to determine these: first, they are heuristically extracted similarly to the men- tion pair model (Denis and Baldridge, 2008; Rah- man and Ng, 2011). Second, latent antecedents are employed (Chang et al., 2012): in such models, the highest-scoring preceding coreferent mention of an anaphor under the current model is selected as the gold antecedent. 
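The two ways of obtaining gold antecedents for training can be sketched as follows (a simplified illustration under our own naming, not the paper's implementation): the closest-antecedent heuristic fixes the gold antecedent in advance, while the latent variant lets the current model pick among all coreferent predecessors.

```python
def closest_gold_antecedent(j, gold_chain_ids):
    """Closest-antecedent heuristic (in the spirit of Denis & Baldridge 2008):
    the gold antecedent of m_j is the nearest preceding mention in the same
    gold chain, or the dummy mention 0 if there is none."""
    for i in range(j - 1, 0, -1):
        if gold_chain_ids.get(i) is not None and gold_chain_ids.get(i) == gold_chain_ids.get(j):
            return i
    return 0

def latent_gold_antecedent(j, gold_chain_ids, score):
    """Latent-antecedent selection (in the spirit of Chang et al. 2012): among
    all preceding coreferent mentions, take the one scoring highest under the
    current model; non-anaphoric mentions keep the dummy antecedent 0."""
    candidates = [i for i in range(1, j)
                  if gold_chain_ids.get(i) is not None
                  and gold_chain_ids.get(i) == gold_chain_ids.get(j)]
    if not candidates:
        return 0
    return max(candidates, key=lambda i: score(j, i))

# Toy usage: mention 4 is coreferent with mentions 1 and 2.
chains = {1: "e1", 2: "e1", 4: "e1"}
print(closest_gold_antecedent(4, chains))                  # 2
print(latent_gold_antecedent(4, chains, lambda j, i: -i))  # 1 (model scores favor m_1)
```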
m0 m1 m2 m3 m4 m5 Figure 2: Latent structure underlying the mention ranking and the antecedent tree approach. The black nodes and arcs represent one substructure for the mention ranking approach. Latent Structure. The mention ranking ap- proach can be represented as an unlabeled graph. In particular, we allow any graph with edges A ⊆ {(mj,mi) | j > i} such that for all j there is exactly one i with (mj,mi) ∈ A (each anaphor has exactly one antecedent). Figure 2 shows an example graph. We can represent heuristics for creating train- ing data by constraining the latent structures con- sistent with the gold information Hx,z. Again, the most popular heuristic is to consider the clos- est antecedent of a mention as the gold an- tecedent during training (Denis and Baldridge, 2008). This corresponds to constraining Hx,z such that const(Hx,z) = {h} with h = (V,A,LA) and (mj,mi) ∈ A if and only if mi is the closest an- tecedent of mj. When learning from latent an- tecedents, the unconstrained space Hx,z is consid- ered. To infer coreference information from this la- tent structure, we take the transitive closure over all anaphor-antecedent decisions encoded in the graph. Substructures. The distinctive feature of the mention ranking approach is that it consid- ers each anaphor in isolation, but all candidate antecedents at once. We therefore define sub- structures as follows. The jth substructure is the graph hj with nodes Vj = {m0, . . . ,mj} and Aj = {(mj,mi) | there is i with j > i s.t. (mj,mi) ∈ A}. Aj contains the antecedent decision for mj. One such substructure encoding the antecedent decision for m3 is colored black in Figure 2. 409 Cost Function. Cost functions for the mention ranking model can reward the resolution of spe- cific classes. The most sophisticated cost func- tion was proposed by Durrett and Klein (2013), who distinguish between three errors: finding an antecedent for a non-anaphoric mention, misclassi- fying an anaphoric mention as non-anaphoric, and finding a wrong antecedent for an anaphoric men- tion. We will use a variant of this cost function in our experiments (described in Section 5.3). 4.3 Antecedent Trees Finally, we consider antecedent trees. This structure encodes all antecedent decisions for all anaphors. In our framework they can be understood as an exten- sion of the mention ranking approach to the docu- ment level. So far, research did not investigate con- straints on the space of latent structures consistent with the gold annotation. Latent Structure. Antecedent trees are based on the same structure as the mention ranking approach. Substructures. In the antecedent tree approach, the latent structure does not factor in parts: the whole graph encoding all antecedent information for all mentions is treated as an instance. Cost Function. The cost function from the men- tion ranking model naturally extends to the tree case by summing over all decisions. Furthermore, in principle we can take the structure into account. However, we are not aware of any approaches which go beyond (variations of) Hamming loss (Hamming, 1950). 5 Experiments We now evaluate model variants based on different latent structures on a large benchmark corpus. The aim of this section is to compare popular approaches to coreference only in terms of the structure they op- erate on, fixing preprocessing and feature set. In Section 6 we complement this comparison with a qualitative analysis of the influence of the structures on the output. 
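Before turning to the experiments, here is a small sketch (again ours, with illustrative names) of how decoding looks for the ranking-style structures: each anaphor greedily selects its best antecedent, the dummy mention m0 encodes non-anaphoricity, and the transitive closure over the selected arcs yields the predicted entities; antecedent trees predict the same kind of arcs, only jointly for the whole document.

```python
def rank_antecedents(n_mentions, score):
    """Greedy decoding for the mention ranking model: every anaphor m_j
    (j = 1..n) independently selects its highest-scoring antecedent among
    the dummy mention m_0 and all preceding mentions m_1..m_{j-1}."""
    antecedents = {}
    for j in range(1, n_mentions + 1):
        antecedents[j] = max(range(0, j), key=lambda i: score(j, i))
    return antecedents

def chains_from_antecedents(antecedents):
    """Transitive closure over anaphor-antecedent links; antecedent 0 (the
    dummy mention) marks a mention as non-anaphoric and opens a new entity."""
    entity_of = {}
    next_id = 0
    for j in sorted(antecedents):
        i = antecedents[j]
        if i == 0:
            entity_of[j] = next_id
            next_id += 1
        else:
            entity_of[j] = entity_of[i]
    return entity_of

# Toy usage with an arbitrary scoring function (illustrative only).
score = lambda j, i: 1.0 if (i > 0 and (j - i) == 2) else 0.0
links = rank_antecedents(4, score)
print(links)                           # {1: 0, 2: 0, 3: 1, 4: 2}
print(chains_from_antecedents(links))  # {1: 0, 2: 1, 3: 0, 4: 1}
```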
5.1 Data and Evaluation Metrics The aim of our evaluation is to assess the effec- tiveness and competitiveness of the models imple- mented in our framework in a realistic coreference setting, i.e. without using gold information such as gold mentions. As all models we consider share the same preprocessing and features, this allows for a fair comparison of the individual structures. We train, evaluate and analyze the models on the English data of the CoNLL-2012 shared task on multilingual coreference resolution (Pradhan et al., 2012). The shared task organizers provide the train- ing/development/ test split. We use the 2802 training documents for training the models, and evaluate and analyze the models on the development set contain- ing 343 documents. The 349 test set documents are only used for final evaluation. We work in a setting that corresponds to the shared task’s closed track (Pradhan et al., 2012). That is, we make use of the automatically created annotation layers (parse trees, NE information, ...) shipped with the data. As additional resources we use only WordNet 3.0 (Fellbaum, 1998) and the number/gender data of Bergsma and Lin (2006). For evaluation we follow the practice of the CoNLL-2012 shared task and employ the reference implementation of the CoNLL scorer (Pradhan et al., 2014) which computes the popular evaluation met- rics MUC (Vilain et al., 1995), B3 (Bagga and Bald- win, 1998), CEAFe (Luo, 2005) and their average. The average is the metric for ranking the systems in the CoNLL shared tasks on coreference resolution (Pradhan et al., 2011; Pradhan et al., 2012). 5.2 Features We employ a rich set of features frequently used in the literature (Ng and Cardie, 2002; Bengtson and Roth, 2008; Björkelund and Kuhn, 2014). The set consists of the following features: • the mention type (name, def. noun, indef. noun, citation form of pronoun, demonstrative) of anaphor, antecedent and both, • gender, number, semantic class, named en- tity class, grammatical function and length in words of anaphor, antecedent and both, • semantic head, first/last/preceding/next token of anaphor, antecedent and both, • distance between anaphor and antecedent in sentences, • modifier agreement, • whether anaphor and antecedent embed each other, • whether there is a string match, head match or 410 an alias relation, • whether anaphor and antecedent have the same speaker. If the antecedent in the pair under consideration is m0, i.e. the dummy mention, we do not extract any feature (Chang et al., 2012). State-of-the-art models greatly benefit from fea- ture conjunctions. Approaches for building such conjunctions include greedy extension (Björkelund and Kuhn, 2014), entropy-guided induction (Fernan- des et al., 2014) and linguistically motivated heuris- tics (Durrett and Klein, 2013). We follow Durrett and Klein (2013) and conjoin every feature with each mention type feature. 5.3 Model Variants We now consider several instantiations of the ap- proaches discussed in the previous section in order of increasing complexity. These instantiations cor- respond to specific coreference models proposed in the literature. With the framework described in this paper, we are able to give a unified account of repre- senting and learning these models. We always train on automatically predicted mentions. We start with the mention pair model. To create training graphs, we employ a slight modification of the closest pair heuristic (Soon et al., 2001), which worked best in preliminary experiments. 
For each mention mj that is in some coreference chain and has an antecedent mi, we add an edge to mi with label “+”. For all k with i < k < j, we add an edge from mj to mk with label “−”. If mj does not have an antecedent, we add edges from mj to mk with label “−” for all 0 < k < j. Compared to the heuristic of Soon et al. (2001), who only learn from anaphoric mentions, this improves precision. During testing, if no pair (mj, mi) is deemed coreferent for a mention mj, we consider the mention not anaphoric. Otherwise, we employ best-first clustering and take the mention in the highest-scoring pair as the antecedent of mj (Ng and Cardie, 2002).

The mention ranking model tries to improve on the mention pair model by capturing the competition between antecedents. We consider two variants of the mention ranking model, each of which employs dummy mentions for anaphoricity determination. The first variant, Closest (Denis and Baldridge, 2008), constrains the latent structures consistent with the gold annotation: for each mention, the closest antecedent is chosen as the gold antecedent. If the mention does not have any antecedent, we take the dummy mention m0 as the antecedent. The second variant, Latent (Chang et al., 2012), aims to learn from more meaningful antecedents by dropping the constraints and therefore selecting the best-scoring antecedent (which may also be m0) under the current model during training.

We view the antecedent tree model (Fernandes et al., 2014) as a natural extension of the mention ranking model. Instead of predicting an antecedent for each mention, we predict an entire tree of anaphor-antecedent pairs. This should yield more consistent entities. As in previous work, we only consider the latent variant.

For the mention ranking model and for antecedent trees we use a cost function similar to previous work (Durrett and Klein, 2013; Fernandes et al., 2014). For a pair of mentions (mj, mi), we consider

    cpair(mj, mi) =  λ    if i > 0 and mj, mi are not coreferent,
                     2λ   if i = 0 and mj is anaphoric,
                     0    otherwise,

where λ > 0 will be tuned on development data. Let ĥi = (Vi, Ai, LAi). cpair is extended to a cost function for the whole latent structure ĥi by

    c(x, ĥi, ĥopt,i, z) = Σ_{(mj, mk) ∈ Ai} cpair(mj, mk).

The use of such a cost function is necessary to learn reasonable weights, since most automatically extracted mentions in the data are not anaphoric.

5.4 Experimental Setup

We evaluate the models on the development and the test sets. When evaluating on the test set, we train on the concatenation of the training and development sets. After preliminary experiments with the ranking model with closest antecedents on the development set, we set the number of perceptron epochs to 5 and set λ = 100 in the cost function.

We assess statistical significance of the difference in F1 score between two approaches via an approximate randomization test (Noreen, 1989). We say an improvement is statistically significant if p < 0.05.
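As a concrete illustration of the cost function from Section 5.3 (a sketch under our own naming, with λ = 100 as tuned above, not the authors' implementation), the pairwise costs and their sum over a predicted substructure can be computed as follows.

```python
def pair_cost(i, j_is_anaphoric, coreferent, lam=100.0):
    """Pairwise cost c_pair((m_j, m_i)): lam for a wrong non-dummy antecedent,
    2 * lam for resolving an anaphoric mention to the dummy mention m_0
    (i == 0), and 0 for a correct decision. Function and argument names are
    illustrative; lam=100 follows the value tuned on development data."""
    if i > 0 and not coreferent:
        return lam
    if i == 0 and j_is_anaphoric:
        return 2.0 * lam
    return 0.0

def structure_cost(arcs, lam=100.0):
    """Cost of a predicted substructure: the sum of pairwise costs over its
    arcs, given as tuples (j, i, j_is_anaphoric, coreferent)."""
    return sum(pair_cost(i, anaphoric, coref, lam)
               for (_, i, anaphoric, coref) in arcs)

# Toy usage: one wrong antecedent and one anaphoric mention sent to the dummy.
arcs = [(2, 1, True, False), (3, 0, True, False)]
print(structure_cost(arcs))  # 300.0
```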
Model                        MUC R   MUC P   MUC F1    B3 R    B3 P    B3 F1   CEAFe R  CEAFe P  CEAFe F1  Avg. F1

CoNLL-2012 English development data
Fernandes et al. (2014)      64.88   74.74   69.46     51.85   65.35   57.83    51.50    57.72    54.43     60.57
Björkelund and Kuhn (2014)   68.58   73.04   70.74     57.97   62.28   60.03    54.57    59.23    56.80     62.52
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Mention Pair                 66.68   71.71   69.10     53.57   62.44   57.67    52.56    53.87    53.21     59.99
Ranking: Closest             67.85   76.66   71.99∗    55.33   65.45   59.97∗   53.16    61.28    56.93∗    62.96
Ranking: Latent              68.02   76.73   72.11⋄×   55.61   66.91   60.74†⋄  54.48    61.36    57.72†⋄×  63.52
Antecedent Trees             65.91   77.92   71.41     52.72   67.98   59.39    52.13    60.82    56.14     62.31

CoNLL-2012 English test data
Fernandes et al. (2014)      65.83   75.91   70.51     51.55   65.19   57.58    50.82    57.28    53.86     60.65
Björkelund and Kuhn (2014)   67.46   74.30   70.72     54.96   62.71   58.58    52.27    59.40    55.61     61.63
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Mention Pair                 67.16   71.48   69.25     51.97   60.55   55.93    51.02    51.89    51.45     58.88
Ranking: Closest             67.96   76.61   72.03∗    54.07   64.98   59.03∗   51.45    59.02    54.97∗    62.01
Ranking: Latent              68.13   76.72   72.17⋄    54.22   66.12   59.58†⋄  52.33    59.47    55.67†⋄   62.47
Antecedent Trees             65.79   78.04   71.39     50.92   67.76   58.15    50.55    58.34    54.17     61.24

Table 1: Results of different systems and model variants on CoNLL-2012 English development and test data. Models below the dashed lines are implemented in our framework. The best F1 score results for each dataset and metric are boldfaced. ∗ indicates significant improvements in F1 score of Ranking: Closest compared to Mention Pair; † indicates significant improvements of Ranking: Latent compared to Ranking: Closest; ⋄ indicates significant improvements of Ranking: Latent compared to Antecedent Trees; × indicates significant improvements of Ranking: Latent compared to Björkelund and Kuhn (2014). We do not perform significance tests on differences in average F1 since this measure constitutes an average over other F1 scores.

5.5 Results

Table 1 shows the result of all model configurations discussed in the previous section on CoNLL-2012 English development and test data. In order to put the numbers into context, we also report the results of Björkelund and Kuhn (2014), who present a system that implements an antecedent tree model with non-local features. Their system is the highest-performing system on the CoNLL data which operates in a closed track setting. We also compare with Fernandes et al. (2014), the winning system of the CoNLL-2012 shared task (Pradhan et al., 2012).3 Both systems were trained on training data for evaluating on the development set, and on the concatenation of training and development data for evaluating on the test set.

3 We do not compare with the system of Durrett and Klein (2014) since it uses Wikipedia as an additional resource, and therefore does not work under the closed track setting. Its performance is 61.71 average F1 (71.24 MUC F1, 58.71 B3 F1 and 55.18 CEAFe F1) on CoNLL-2012 English test data.

Despite its simplicity, the mention pair model yields reasonable performance. The gap to Björkelund and Kuhn (2014) is roughly 2.8 points in average F1 score on test data.

Compared to the mention pair model, the variants of the mention ranking model improve the results for all metrics, largely due to increased precision. Switching from regarding the closest antecedent as the gold antecedent to latent antecedents yields an improvement of roughly 0.5 points in average F1. All improvements of the mention ranking model with closest antecedents compared to the mention pair model are statistically significant. Furthermore, with the exception of the differences in MUC F1, all improvements are significant when switching from closest antecedents to latent antecedents.
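The significance test mentioned in Section 5.4 can be sketched as follows (our own illustrative implementation of a generic two-sided approximate randomization test, not the exact script behind the reported numbers): the outputs of two systems are swapped per document at random, and the p-value estimates how often a score difference at least as large as the observed one arises by chance.

```python
import random

def approximate_randomization(per_doc_a, per_doc_b, metric, trials=10000, seed=23):
    """Two-sided approximate randomization test (Noreen, 1989) for the
    difference in a corpus-level score between two systems. per_doc_a and
    per_doc_b hold the per-document outputs of systems A and B on the same
    documents; metric maps such a list to a corpus-level score (e.g. F1).
    Returns an estimated p-value (with add-one smoothing)."""
    observed = abs(metric(per_doc_a) - metric(per_doc_b))
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for out_a, out_b in zip(per_doc_a, per_doc_b):
            if rng.random() < 0.5:            # swap the two systems' outputs
                out_a, out_b = out_b, out_a   # for this document
            shuffled_a.append(out_a)
            shuffled_b.append(out_b)
        if abs(metric(shuffled_a) - metric(shuffled_b)) >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (trials + 1)

# Toy usage with per-document scores standing in for full system outputs.
mean = lambda xs: sum(xs) / len(xs)
print(approximate_randomization([0.62, 0.64, 0.61], [0.60, 0.63, 0.59], mean, trials=2000))
```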
Model              Recall errors   Recall max   % of max   Precision errors   Precision max   % of max
Mention Pair            4867          14609        33%           4187             13585          31%
Ranking: Closest        4695                       32%           3336             12932          26%
Ranking: Latent         4671                       32%           3357             12951          26%
Antecedent Trees        4979                       34%           3042             12358          25%

Table 2: Overview of recall and precision errors.

The mention ranking model with latent antecedents outperforms the state-of-the-art system by Björkelund and Kuhn (2014) by more than 0.8 points average F1. These results show the competitiveness of a simple mention ranking architecture. Regarding the individual F1 scores compared to Björkelund and Kuhn (2014), the improvements in the MUC and CEAFe metrics on development data are statistically significant. The improvements on test data are not statistically significant.

Using antecedent trees yields higher precision than using the mention ranking model. However, recall is much lower. The performance is similar to the antecedent tree models of Fernandes et al. (2014) and Björkelund and Kuhn (2014).

6 Analysis

The numbers discussed in the previous section do not give insights into where the models make different decisions. Are there specific linguistic classes of mention pairs where one model is superior to the other? How do the outputs differ? How can these differences be explained by the different structures employed by the models?

In order to answer these questions, we need to perform a qualitative analysis of the differences in system output for the approaches. To do so, we employ the error analysis method presented in Martschat and Strube (2014). In this method, recall errors are extracted by comparing spanning trees of reference entities with the system output. Edges in the spanning tree missing from the output are extracted as errors. For extracting precision errors, the roles of reference and system entities are switched. To define the spanning trees, we follow Martschat and Strube (2014) and use a notion based on Ariel's accessibility theory (Ariel, 1990) for reference entities, while we take system antecedent decisions for system entities.

6.1 Overview

We extracted all errors of the model variants described in the previous section on CoNLL-2012 English development data.

Table 2 gives an overview of all recall and precision errors. For each model variant the table shows the number of recall and precision errors, and the maximum number of errors.4 The numbers confirm the findings obtained from Table 1: the ranking models beat the mention pair model largely due to fewer precision errors. The antecedent tree model outputs more precise entities by establishing fewer coreference links: it makes fewer decisions and fewer precision errors than the other configurations, but at the expense of an increased number of recall errors.

The more sophisticated models make consistently fewer linking decisions than the mention pair model. We therefore hypothesize that the improvements in the numbers mainly stem from improved anaphoricity determination. The mention pair model handles anaphoricity determination implicitly: if no pair (mj, mi) is deemed coreferent for a mention mj, the model does not select an antecedent for mj.5 Since the mention ranking model allows us to include the search for the best antecedent during prediction, we can explicitly model the anaphoricity decision by including the dummy mention during search.

We now examine the errors in more detail to investigate this hypothesis.
To do so, we will investigate error classes and compare the models in terms of how they handle these classes. This is a practice common in the analysis of coreference resolution approaches (Stoyanov et al., 2009; Martschat and Strube, 2014). We distinguish between errors where both mentions are a proper name or a common noun, errors where the anaphor is a pronoun, and the remaining errors.

Tables 3 and 4 summarize recall and precision errors for subcategories of these classes.6 We now compare individual models.

4 For recall, the maximum number of errors is the number of errors made by a system that assigns each mention to its own entity. For precision, the maximum number of errors is the total number of anaphor-antecedent decisions made by the model.
5 Initial experiments which included the dummy mention during learning for the mention pair model yielded worse results. This is arguably due to the large number of non-anaphoric mentions, which causes highly imbalanced training data.
6 For the pronoun subcategories, we map each pronoun to its canonical form. For example, we map him to he.

                           Name/noun                       Anaphor pronoun
Model             Both name   Mixed   Both noun   I/you/we   he/she   it/they   Remaining
Upper bound          3579      948      2063        2967      1990      2471       591
Mention Pair          815      657      1074         394       373      1005       549
Ranking: Closest      879      637      1221         348       247       806       557
Ranking: Latent       857      647      1158         370       251       822       566
Antecedent Trees      911      686      1258         441       247       863       572

Table 3: Recall errors of model variants on CoNLL-2012 English development data.

                        Name/noun                              Anaphor pronoun
                  Both name      Mixed      Both noun      I/you/we      he/she      it/they     Remaining
Model            err.  corr.  err.  corr.  err.  corr.   err.  corr.  err.  corr.  err.  corr.  err.  corr.
Mention Pair      885  2673    83     79   1055  1098     836  2479    289  1546    864  1408    175   115
Ranking: Closest  587  2620    93     96    494   960     873  2521    324  1692    844  1510    121    97
Ranking: Latent   640  2664    92    102    567  1038     862  2461    318  1692    835  1594     42    43
Antecedent Trees  595  2628    57     82    442   924     836  2398    318  1691    757  1557     37    36

Table 4: Precision errors (err.) and correct links (corr.) of model variants on CoNLL-2012 English development data.

6.2 Mention Ranking vs. Mention Pair

For pairs of proper names and pairs of common nouns, employing the ranking model instead of the mention pair model leads to a large decrease in precision errors, but an increase in recall errors. For pronouns and mixed pairs, we observe decreases in recall errors and slight increases in precision errors, except for it/they, where both recall and precision errors decrease.

We can attribute the largest differences to determining anaphoricity: in 82% of all precision errors between two proper names made by the mention pair model, but not by the ranking model, the mention appearing later in the text is non-anaphoric. The ranking model correctly determines this. Similar numbers hold for common noun pairs.

While most nouns and names are not anaphoric, most pronouns are. Hence, determining anaphoricity is less of an issue here. From the resolved it/they recall errors of the ranking model compared to the mention pair model, we can attribute 41% to better antecedent selection: the mention pair model decided on a wrong antecedent, while the ranking model was able to leverage the competition between the antecedents to decide on a correct one. The remaining 59% stem from selecting a correct antecedent for pronouns that were classified as non-anaphoric by the mention pair model. We observe similar trends for the other pronoun classes.
Overall, the majority of error reduction can be attributed to improved determination of anaphoric- ity, which can be modeled structurally in the men- tion ranking model (we do not use any features when a dummy mention is involved, therefore non- anaphoricity decisions always get the score 0). However, for pronoun resolution, where there are 414 many competing compatible antecedents for a men- tion, the model is able to learn better weights by leveraging the competition. These findings suggest that extending the mention pair model to explicitly determine anaphoricity should improve results espe- cially for non-pronominal coreference. 6.3 Latent Antecedent vs. Closest Antecedent Using latent instead of closest antecedents leads to fewer recall errors and more precision errors for non-pronominal coreference. Pronoun resolution re- call errors slightly increase, while precision errors slightly decrease. While these changes are minor, there is a large reduction in the remaining precision errors. Most of these correspond to predictions which are consid- ered very difficult, such as links between a proper name anaphor and a pronoun antecedent (Bengtson and Roth, 2008). Via latent antecedents, the model can avoid learning from the most unreliable pairs. 6.4 Antecedent Trees vs. Ranking Compared to the ranking model with latent an- tecedents, the antecedent tree model commits con- sistently more recall errors and fewer precision er- rors. This is partly due to the fact that the antecedent tree model also predicts fewer links between men- tions than the other models. The only exception is he/she, where there is not much of a difference. The only difference between the ranking model with latent antecedents and the antecedent tree model is that weights are updated document-wise for antecedent trees, while they are updated per anaphor for the ranking model. This leads to more precise predictions, at the expense of recall. 6.5 Summary Our analysis shows that the mention ranking model mostly improves precision over the mention pair model. For non-pronominal coreference, the im- provements can be mainly attributed to improved anaphoricity determination. For pronoun resolution, both anaphoricity determination and capturing an- tecedent competition lead to improved results. Em- ploying latent antecedents during training mainly helps in resolving very difficult cases. Due to the update strategy, employing antecedent trees leads to a more precision-oriented approach, which signifi- cantly improves precision at the expense of recall. 7 Beyond Pairwise Predictions In this paper we concentrated on representing and analyzing the most prevalent approaches to coref- erence resolution, which are based on predicting whether pairs of mentions are coreferent. Hence, we choose graphs as latent structures and let the feature functions factor over edges in the graph, which cor- respond to pairs of mentions. However, entity-based approaches (Rahman and Ng, 2011; Stoyanov and Eisner, 2012; Lee et al., 2013, inter alia) obtain coreference chains by pre- dicting whether sets of mentions are coreferent, go- ing beyond pairwise predictions. While a detailed discussion of such approaches is beyond the scope of this paper, we now briefly describe how we can generalize the proposed framework to accommodate for such approaches. When viewing coreference resolution as predic- tion of latent structures, entity-based models op- erate on structures that relate sets of mentions to each other. 
This can be expressed by hypergraphs, which are graphs where edges can link more than two nodes. Hypergraphs have already been used to model coreference resolution (Cai and Strube, 2010; Sapena, 2012). To model entity-based approaches, we extend the valid latent structures to labeled directed hyper- graphs. These are tuples h = (V,A,LA), where • the set of nodes are the mentions, V = M0x, • the set of edges A ⊆ 2V × 2V consists of di- rected hyperedges linking two sets of mentions, • LA : A → L assigns a label ` ∈ L to each edge. L is a finite set of labels. For example, the entity-mention model (Yang et al., 2008) predicts coreference in a left-to-right fash- ion. For each anaphor mj, it considers the set Ej ⊆ 2{m0,...,mj−1} of preceding partial entities that have been estab- lished so far (such as e = {m1,m3,m6}). In terms of our framework, substructures for this ap- proach are hypergraphs with hyperedges ({mj} ,e) for e ∈ Ej, encoding the decision to which partial entity mj refers. 415 The definitions of features and the decoding prob- lem carry over from the graph-based framework (we drop the edge factorization assumption for features). Learning requires adaptations to cope with the de- pendency between coreference decisions. For exam- ple, for the entity-mention model, establishing that an anaphor mj refers to a partial entity e influences the search space for decisions for anaphors mk with k > j. We leave a more detailed discussion to future work. 8 Related Work The main contributions of this paper are a frame- work for representing coreference resolution ap- proaches and a systematic comparison of main coreference approaches in this framework. Our representation framework generalizes ap- proaches to coreference resolution which employed specific latent structures for representation, such as latent antecedents (Chang et al., 2012) and an- tecedent trees (Fernandes et al., 2014). We give a unified representation of such approaches and show that seemingly disparate approaches such as the mention pair model also fit in a framework based on latent structures. Only few studies systematically compare ap- proaches to coreference resolution. Most previous work highlights the improved expressive power of the presented model by a comparison to a men- tion pair baseline (Culotta et al., 2007; Denis and Baldridge, 2008; Cai and Strube, 2010). Rahman and Ng (2011) consider a series of mod- els with increasing expressiveness, ranging from a mention pair to a cluster-ranking model. However, they do not develop a unified framework for compar- ing approaches, and their analysis is not qualitative. Fernandes et al. (2014) compare variations of an- tecedent tree models, including different loss func- tions and a version with a fixed structure. They only consider antecedent trees and also do not provide a qualitative analysis. Kummerfeld and Klein (2013) and Martschat and Strube (2014) present a large- scale qualitative comparison of coreference systems, but they do not investigate the influence of the latent structures the systems operate on. Furthermore, the systems in their studies differ in terms of mention extraction and feature sets. 9 Conclusions We observed that many approaches to coreference resolution can be uniformly represented by the latent structure they operate on. We devised a framework that accounts for such structures, and showed how we can express the mention pair model, the mention ranking model and antecedent trees in this frame- work. 
An evaluation of the models on CoNLL-2012 data showed that all models yield competitive results. While antecedent trees give results with the high- est precision, a mention ranking model with latent antecedent performs best, obtaining state-of-the-art results on CoNLL-2012 data. An analysis based on the method of Martschat and Strube (2014) highlights the strengths of the mention ranking model compared to the mention pair model: it is able to structurally model anaphoricity deter- mination and antecedent competition, which leads to improvements in precision for non-pronominal coreference resolution, and in recall for pronoun res- olution. The effect of latent antecedents is negligible and has a large effect only on very difficult cases of coreference. The flexibility of the framework, toolkit and analysis methods presented in this paper helps re- searchers to devise, analyze and compare represen- tations for coreference resolution. Acknowledgments This work has been funded by the Klaus Tschira Foundation, Heidelberg, Germany. The first au- thor has been supported by a HITS PhD scholar- ship. We thank the anonymous reviewers and our colleagues Benjamin Heinzerling, Yufang Hou and Nafise Moosavi for feedback on earlier drafts of this paper. Furthermore, we are grateful to Anders Björkelund for helpful comments on cost functions. References Mira Ariel. 1990. Accessing Noun Phrase Antecedents. Routledge, London, U.K.; New York, N.Y. Amit Bagga and Breck Baldwin. 1998. Algorithms for scoring coreference chains. In Proceedings of the 1st International Conference on Language Resources and Evaluation, Granada, Spain, 28–30 May 1998, pages 563–566. 416 Eric Bengtson and Dan Roth. 2008. Understanding the value of features for coreference resolution. In Pro- ceedings of the 2008 Conference on Empirical Meth- ods in Natural Language Processing, Waikiki, Hon- olulu, Hawaii, 25–27 October 2008, pages 294–303. Shane Bergsma and Dekang Lin. 2006. Bootstrapping path-based pronoun resolution. In Proceedings of the 21st International Conference on Computational Lin- guistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, 17– 21 July 2006, pages 33–40. Anders Björkelund and Jonas Kuhn. 2014. Learning structured perceptrons for coreference resolution with latent antecedents and non-local features. In Proceed- ings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, Md., 22–27 June 2014, pages 47–57. Jie Cai and Michael Strube. 2010. End-to-end coref- erence resolution via hypergraph partitioning. In Proceedings of the 23rd International Conference on Computational Linguistics, Beijing, China, 23–27 Au- gust 2010, pages 143–151. Kai-Wei Chang, Rajhans Samdani, Alla Rozovskaya, Mark Sammons, and Dan Roth. 2012. Illinois-Coref: The UI system in the CoNLL-2012 shared task. In Proceedings of the Shared Task of the 16th Confer- ence on Computational Natural Language Learning, Jeju Island, Korea, 12–14 July 2012, pages 113–117. Michael Collins. 2002. Discriminative training meth- ods for Hidden Markov Models: Theory and experi- ments with perceptron algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, Philadelphia, Penn., 6–7 July 2002, pages 1–8. Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev- Shwartz, and Yoram Singer. 2006. Online passive- aggressive algorithms. Journal of Machine Learning Research, 7:551–585. 
Aron Culotta, Michael Wick, and Andrew McCallum. 2007. First-order probabilistic models for coreference resolution. In Proceedings of Human Language Tech- nologies 2007: The Conference of the North American Chapter of the Association for Computational Linguis- tics, Rochester, N.Y., 22–27 April 2007, pages 81–88. Pascal Denis and Jason Baldridge. 2008. Specialized models and ranking for coreference resolution. In Pro- ceedings of the 2008 Conference on Empirical Meth- ods in Natural Language Processing, Waikiki, Hon- olulu, Hawaii, 25–27 October 2008, pages 660–669. Greg Durrett and Dan Klein. 2013. Easy victories and uphill battles in coreference resolution. In Proceed- ings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Wash., 18–21 October 2013, pages 1971–1982. Greg Durrett and Dan Klein. 2014. A joint model for en- tity analysis: Coreference, typing, and linking. Trans- actions of the Association of Computational Linguis- tics, 2:477–490. Charles Elkan. 2001. The foundations of cost-sensitive learning. In Proceedings of the 17th International Joint Conference on Artificial Intelligence, Seattle, Wash., 4–10 August, 2001, pages 973–978. Christiane Fellbaum, editor. 1998. WordNet: An Elec- tronic Lexical Database. MIT Press, Cambridge, Mass. Eraldo Fernandes, Cı́cero dos Santos, and Ruy Milidiú. 2014. Latent trees for coreference resolution. Compu- tational Linguistics, 40(4):801–835. Jenny Rose Finkel and Christopher Manning. 2008. En- forcing transitivity in coreference resolution. In Com- panion Volume to the Proceedings of the 46th Annual Meeting of the Association for Computational Linguis- tics, Columbus, Ohio, 15–20 June 2008, pages 45–48. Peter Geibel and Fritz Wysotzk. 2003. Perceptron based learning with example dependent and noisy costs. In Proceedings of the 20th International Conference on Machine Learning, Washington, D.C., 21–24 August 2003, pages 218–225. Richard W. Hamming. 1950. Error detecting and er- ror correcting codes. Bell System Technical Journal, 26(2):147–160. Jerry R. Hobbs. 1976. Pronoun resolution. Technical Report 76-1, Dept. of Computer Science, City College, City University of New York. Dan Klein and Christopher D. Manning. 2001. Parsing and hypergraphs. In Proceedings of the Seventh In- ternational Workshop on Parsing Technologies (IWPT- 2001), 17-19 October 2001, Beijing, China, pages 123–134. Manfred Klenner. 2007. Enforcing consistency on coref- erence sets. In Proceedings of the International Con- ference on Recent Advances in Natural Language Pro- cessing, Borovets, Bulgaria, 27–29 September 2007, pages 323–328. Jonathan K. Kummerfeld and Dan Klein. 2013. Error- driven analysis of challenges in coreference resolution. In Proceedings of the 2013 Conference on Empiri- cal Methods in Natural Language Processing, Seattle, Wash., 18–21 October 2013, pages 265–277. Shalom Lappin and Herbert J. Leass. 1994. An algo- rithm for pronominal anaphora resolution. Computa- tional Linguistics, 20(4):535–561. Heeyoung Lee, Angel Chang, Yves Peirsman, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2013. Deterministic coreference resolution based on entity- centric, precision-ranked rules. Computational Lin- guistics, 39(4):885–916. 417 Adam Lopez. 2009. Translation as weighted deduction. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguis- tics, Athens, Greece, 30 March – 3 April 2009, pages 532–540. Xiaoqiang Luo. 2005. 
On coreference resolution per- formance metrics. In Proceedings of the Human Lan- guage Technology Conference and the 2005 Confer- ence on Empirical Methods in Natural Language Pro- cessing, Vancouver, B.C., Canada, 6–8 October 2005, pages 25–32. Sebastian Martschat and Michael Strube. 2014. Recall error analysis for coreference resolution. In Proceed- ings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29 October 2014, pages 2070–2081. Vincent Ng and Claire Cardie. 2002. Improving machine learning approaches to coreference resolution. In Pro- ceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Penn., 7– 12 July 2002, pages 104–111. Eric W. Noreen. 1989. Computer-Intensive Methods for Testing Hypotheses. An Introduction. Wiley, New York. Sameer Pradhan, Lance Ramshaw, Mitchell Marcus, Martha Palmer, Ralph Weischedel, and Nianwen Xue. 2011. CoNLL-2011 Shared Task: Modeling unre- stricted coreference in OntoNotes. In Proceedings of the Shared Task of the 15th Conference on Compu- tational Natural Language Learning, Portland, Oreg., 23–24 June 2011, pages 1–27. Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL- 2012 Shared Task: Modeling multilingual unrestricted coreference in OntoNotes. In Proceedings of the Shared Task of the 16th Conference on Computational Natural Language Learning, Jeju Island, Korea, 12–14 July 2012, pages 1–40. Sameer Pradhan, Xiaoqiang Luo, Marta Recasens, Ed- uard Hovy, Vincent Ng, and Michael Strube. 2014. Scoring coreference partitions of predicted mentions: A reference implementation. In Proceedings of the 52nd Annual Meeting of the Association for Compu- tational Linguistics (Volume 2: Short Papers), Balti- more, Md., 22–27 June 2014, pages 30–35. Altaf Rahman and Vincent Ng. 2011. Narrowing the modeling gap: A cluster-ranking approach to corefer- ence resolution. Journal of Artificial Intelligence Re- search, 40:469–521. Emili Sapena. 2012. A constraint-based hyper- graph partitioning approach to coreference resolution. Ph.D. thesis, Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, Barcelona, Spain. Wee Meng Soon, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to corefer- ence resolution of noun phrases. Computational Lin- guistics, 27(4):521–544. Veselin Stoyanov and Jason Eisner. 2012. Easy-first coreference resolution. In Proceedings of the 24th In- ternational Conference on Computational Linguistics, Mumbai, India, 8–15 December 2012, pages 2519– 2534. Veselin Stoyanov, Nathan Gilbert, Claire Cardie, and Ellen Riloff. 2009. Conundrums in noun phrase coref- erence resolution: Making sense of the state-of-the- art. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing, Singapore, 2–7 Au- gust 2009, pages 656–664. Xu Sun, Takuya Matsuzaki, Daisuke Okanohara, and Jun’ichi Tsujii. 2009. Latent variable perceptron al- gorithm for structured classification. In Proceedings of the 21th International Joint Conference on Artificial Intelligence, Pasadena, Cal., 14–17 July 2009, pages 1236–1242. Marc Vilain, John Burger, John Aberdeen, Dennis Con- nolly, and Lynette Hirschman. 1995. A model- theoretic coreference scoring scheme. In Proceedings of the 6th Message Understanding Conference (MUC- 6), pages 45–52, San Mateo, Cal. 
Morgan Kaufmann. Xiaofeng Yang, Jian Su, Jun Lang, Chew Lim Tan, Ting Liu, and Sheng Li. 2008. An entity-mention model for coreference resolution with Inductive Logic Pro- gramming. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Hu- man Language Technologies, Columbus, Ohio, 15–20 June 2008, pages 843–851. Chun-Nam John Yu and Thorsten Joachims. 2009. Learning structural SVMs with latent variables. In Proceedings of the 26th International Conference on Machine Learning, Montréal, Québec, Canada, 14–18 June 2009, pages 1169–1176. 418