Transactions of the Association for Computational Linguistics, 1 (2013) 231–242. Action Editor: Noah Smith. Submitted 11/2012; Revised 2/2013; Published 5/2013. © 2013 Association for Computational Linguistics.

Modeling Semantic Relations Expressed by Prepositions

Vivek Srikumar and Dan Roth
University of Illinois, Urbana-Champaign, Urbana, IL 61801
{vsrikum2, danr}@illinois.edu

Abstract

This paper introduces the problem of predicting semantic relations expressed by prepositions and develops statistical learning models for predicting the relations, their arguments and the semantic types of the arguments. We define an inventory of 32 relations, building on the word sense disambiguation task for prepositions and collapsing related senses across prepositions. Given a preposition in a sentence, our computational task is to jointly model the preposition relation and its arguments, along with their semantic types, as a way to support the relation prediction. The annotated data, however, provides only the relation label, and not the arguments and their types. We address this by presenting two models for preposition relation labeling. Our generalization of the latent structural SVM gives close to 90% accuracy on relation labeling. Further, by jointly predicting the relation, arguments and their types along with the preposition sense, we show that we can not only improve relation accuracy, but also significantly improve sense prediction accuracy.

1 Introduction

This paper addresses the problem of predicting semantic relations conveyed by prepositions in text. Prepositions express many semantic relations between their governor and object. Predicting these can help advance text understanding tasks like question answering and textual entailment. Consider the sentence:

(1) The book of Prof. Alexander on primary school methods is a valuable teaching resource.

Here, the preposition on indicates that the book and primary school methods are connected by the relation Topic, and of indicates the Creator-Creation relation between Prof. Alexander and the book. Predicting these relations can help answer questions about the subject of the book and also recognize the entailment of sentences like Prof. Alexander has written about primary school methods.

Being highly polysemous, the same preposition can indicate different kinds of relations, depending on its governor and object. Furthermore, several prepositions can indicate the same semantic relation. For example, consider the sentence:

(2) Poor care led to her death from pneumonia.

The preposition from in this sentence expresses the relation Cause(death, pneumonia). In a different context, it can denote other relations, as in the phrases copied from the film (Source) and recognized from the start (Temporal). On the other hand, the relation Cause can be expressed by several prepositions; for example, the following phrases express a Cause relation: died of pneumonia and tired after the surgery.

We characterize semantic relations expressed by transitive prepositions and develop accurate models for predicting the relations, identifying their arguments and recognizing the semantic types of the arguments. Building on the word sense disambiguation task for prepositions, we collapse semantically related senses across prepositions to derive our relation inventory. These relations act as predicates in a predicate-argument representation, where the arguments are the governor and the object of the preposition.
While ascertaining the arguments is a largely syntactic decision, we point out that syntactic parsers do not always make this prediction correctly. However, as illustrated in the examples above, identifying the relation depends on the governor and object of the preposition.

Given a sentence and a preposition, our goal is to model the predicate (i.e., the preposition relation) and its arguments (i.e., the governor and object). Very often, the relation label is influenced not by the surface form of the arguments but rather by their semantic types. In sentence (2) above, we want the predicate to be Cause when the object of the preposition is any illness. We therefore propose to model the argument types along with the preposition relations and arguments, using different notions of types. These three related aspects of the relation prediction task are further explained in Section 3, leading up to the problem definition.

Though we wish to predict relations, arguments and types, there is no corpus that annotates all three. The SemEval 2007 shared task of word sense disambiguation for prepositions provides sense annotations for prepositions; we use this data to generate training and test corpora for the relation labels. In Section 4, we present two models for the preposition relation identification problem. The first model considers all possible argument candidates from various sources, along with all argument types, to predict the preposition relation label. The second model treats the arguments and types as latent variables during learning, using a generalization of the latent structural SVM of Yu and Joachims (2009). We show in Section 5 that this model not only predicts the arguments and types, but also improves relation prediction performance.

The primary contributions of this paper are:

1. We introduce a new inventory of preposition relations that covers the 34 prepositions that formed the basis of the SemEval 2007 task of preposition sense disambiguation.

2. We model preposition relations, arguments and their types jointly, and propose a learning algorithm that learns to predict all three using training data that annotates only relation labels.

3. We show that jointly predicting relations with word sense not only improves the relation predictor, but also gives a significant improvement in sense prediction.

2 Prepositions & Predicate-Argument Semantics

Semantic role labeling (cf. Gildea and Jurafsky, 2002; Palmer et al., 2010; Punyakanok et al., 2008, among others) is the task of converting text into a predicate-argument representation. Given a trigger word or phrase in a sentence, this task solves two related prediction problems: (a) identifying the relation label, and (b) identifying and labeling the arguments of the relation.

This problem has been studied in the context of verb and nominal triggers using the PropBank (Palmer et al., 2005) and NomBank (Meyers et al., 2004) annotations over the Penn Treebank, and also using the FrameNet lexicon (Fillmore et al., 2003), which allows arbitrary words to trigger semantic frames.

This paper focuses on semantic relations expressed by transitive prepositions.[1] We can define the two prediction tasks for prepositions as follows: identifying the relation label for a preposition, and predicting the arguments of the relation. Prepositions can mark arguments (both core and adjunct) for verbal and nominal predicates. In addition, they can also trigger relations that are not part of other predicates.
For example, in sentence (3) below, the prepositional phrase starting with to is an argument of the verb visit, but the preposition in triggers an independent relation indicating the location of the aquarium.

(3) The children enjoyed the visit to the aquarium in Coney Island.

[1] By transitive prepositions we refer to the standard usage of prepositions that take an object. In particular, we do not consider prepositional particles in our analysis.

FrameNet covers some prepositional relations, but allows only temporal, locative and directional senses of prepositions to evoke frames, accounting for only 3% of the targets in the SemEval 2007 shared task of FrameNet parsing. In fact, the state-of-the-art FrameNet parser of Das et al. (2010) does not consider any frame-inducing prepositions.

Baldwin et al. (2009) highlight the importance of studying prepositions for a complete linguistic analysis of sentences and survey work in the NLP literature that addresses the syntax and semantics of prepositions. One line of work (Ye and Baldwin, 2006) addressed the problem of preposition semantic role labeling by considering prepositional phrases that act as arguments of verbs according to the PropBank annotation. They built a system that predicts the labels of these prepositional phrases alone; however, by definition, this covered only verb-attached prepositions. Zapirain et al. (2012) studied the impact of automatically learned selectional preferences for predicting arguments of verbs and showed that modeling prepositional phrases separately improves the performance of argument prediction.

Preposition semantics has also been studied via the Preposition Project (Litkowski and Hargraves, 2005) and the related SemEval 2007 shared task of word sense disambiguation of prepositions (Litkowski and Hargraves, 2007). The Preposition Project identifies preposition senses based on their definitions in the Oxford Dictionary of English. There are 332 different labels to be predicted, with a wide variance in the number of senses per preposition, ranging from 2 (during and as) to 25 (on). For example, according to the preposition sense inventory, the preposition from in sentence (2) above would be labeled with the sense from:12(9) to indicate a cause. Dahlmeier et al. (2009) added sense annotation for seven prepositions in four sections of the Penn Treebank with the goal of studying their interaction with verb arguments.

Using the SemEval data, Tratz and Hovy (2009) and Hovy et al. (2010) showed that the arguments offer an important cue for identifying the sense of the preposition, and Tratz (2011) showed further improvements by refining the sense inventory. However, though these works used a dependency parser to identify arguments, they augment the parser's predictions with part-of-speech based heuristics in order to overcome parsing errors.

We argue that, while disambiguating the sense of a preposition does indeed reveal nuances of its meaning, it leads to a proliferation of labels to be predicted. Most importantly, sense labels do not transfer to other prepositions that express the same meaning. For example, both finish lunch before noon and finish lunch by noon express a Temporal relation. According to the Preposition Project, the sense label for the first preposition is before:1(1), and that for the second is by:17(4).
This both defeats the purpose of identifying the relations to aid natural language understanding, and makes the prediction task harder than it should be: using the standard word sense classification approach, we need to train a separate classifier for each word because the labels are defined per preposition. In other words, we cannot share features across the different prepositions. This motivates the need to combine such senses of prepositions into the same class label.

In this direction, O'Hara and Wiebe (2009) describe an inventory of preposition relations obtained using Penn Treebank function tags and frame elements from FrameNet. Srikumar and Roth (2011) merged the preposition senses of seven prepositions into relation labels. Litkowski (2012) also suggests collapsing the definitions of prepositions into a smaller set of semantic classes. To aid better generalization and to reduce label complexity, we follow this line of work to define a set of relation labels which abstract word senses across prepositions.[2]

[2] Since the preposition sense data is annotated over FrameNet sentences, the sense annotation can be used to extend FrameNet (Litkowski, 2012). We believe that the abstract labels proposed in this paper can further help in this effort.

3 Preposition-triggered Relations

This section describes the inventory of preposition relations introduced in this paper, and then identifies the components of the preposition relation extraction problem.

3.1 Preposition Relation Inventory

We build our relation inventory using the sense annotation in the Preposition Project, focusing on the 34 prepositions annotated for the SemEval 2007 shared task of preposition sense disambiguation.[3]

[3] We consider the following prepositions: about, above, across, after, against, along, among, around, as, at, before, behind, beneath, beside, between, by, down, during, for, from, in, inside, into, like, of, off, on, onto, over, round, through, to, towards, and with. This does not include multi-word prepositions such as because of and due to.

As discussed in Section 2, we construct the inventory of preposition relations by collapsing semantically related preposition senses across different prepositions. For each sense that is defined, the Preposition Project also specifies related prepositions. These definitions and related prepositions provide a starting point for identifying senses that can be merged across prepositions. We followed this with a manual cleanup phase. Some senses do not cleanly align with a single relation because their definitions include idiomatic or figurative usage. For example, the sense in:7(5) of the preposition in includes, according to its definition, both spatial and figurative notions of the spatial sense (that is, both in London and in a film). In such cases, we sampled 20 examples from the SemEval 2007 training set and assigned the relation label based on the majority. If sufficient examples could not be sampled, these senses were added to the label Other, which is not a semantically coherent category and represents the 'overflow' case.

Overall, we have 32 labels, which are listed in Table 1.[4] A companion publication (available on the authors' website) provides detailed definitions of each relation and the senses that were merged to create each label. Since we define relations to be groups of preposition sense labels, each sense can be uniquely mapped to a relation label.
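Because each sense maps to exactly one relation, converting a sense-annotated example into a relation-annotated one amounts to a table lookup. The sketch below illustrates the idea with the handful of sense labels mentioned above; the full mapping covers all 332 senses, and the dictionary and function names here are ours, for illustration only.

    # Illustrative fragment of the sense-to-relation mapping; the full mapping
    # covers every Preposition Project sense of the 34 prepositions.
    SENSE_TO_RELATION = {
        "from:12(9)":  "Cause",     # e.g., "death from pneumonia"
        "before:1(1)": "Temporal",  # e.g., "finish lunch before noon"
        "by:17(4)":    "Temporal",  # e.g., "finish lunch by noon"
    }

    def relation_label(sense_label):
        """Map a per-preposition sense label to its cross-preposition relation."""
        return SENSE_TO_RELATION[sense_label]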
Hence, we can use the annotated sense data from SemEval 2007 to obtain a corpus of relation-labeled sentences.

To validate the labeling scheme, two native speakers of English annotated 200 sentences from the SemEval training corpus, using only the definitions of the labels as the annotation guidelines. We measured Cohen's kappa coefficient (Cohen, 1960) between the annotators to be 0.75, and between each annotator and the original corpus to be 0.76 and 0.74 respectively.

3.2 Preposition Relation Extraction

The input to the prediction problem consists of a preposition in a sentence, and the goal is to jointly model the following: (i) the relation expressed by the preposition, and (ii) the arguments of the relation, namely the governor and the object.

We use sentence (2) in the introduction as our running example in the following discussion. In our running example, the relation label is Cause. We represent the predicted relation label by r.

Relation                    Example
Activity                    good at boxing
Agent                       opened by Annie
Attribute                   walls of stone
Beneficiary                 fight for Napoleon
Cause                       died of cancer
Co-Participants             pick one among these
Destination                 leaving for London
Direction                   drove towards the border
EndState                    driven to tears
Experiencer                 warm towards her
Instrument                  cut with a knife
Journey                     travel by road
Location                    living in London
Manner                      scream like an animal
MediumOfCommunication       new show on TV
Numeric                     increase by 10%
ObjectOfVerb                murder of the boys
Opponent/Contrast           fight with him
Other                       all others
Participant/Accompanier     steak with wine
PartWhole                   member of gang
PhysicalSupport             lean against the wall
Possessor                   son of a friend
ProfessionalAspect          works in publishing
Purpose                     tools for making it
Recipient                   unkind to her
Separation                  ousted from power
Source                      purchased from the shop
Species                     city of Prague
StartState                  recover from illness
Temporal                    arrived on Monday
Topic                       books on Shakespeare

Table 1: List of preposition relations

[4] Note that, even though we do not consider intransitive prepositions, the definitions of some relations in Table 1 could be extended to apply to prepositional particles, as in drive down (Direction) and run about (Manner).

Arguments: The relation label crucially depends on correctly identifying the arguments of the preposition, which are death and pneumonia in our running example. While a parser can identify the arguments of a preposition, relying solely on the parser may impose an upper limit on the accuracy of relation prediction. We build an oracle experiment to highlight this limitation. Table 2 shows the recall of the easy-first dependency parser of Goldberg and Elhadad (2010) on Section 23 of the Penn Treebank for identifying the governor and object of prepositions. We define heuristics that generate candidate governors and objects for a preposition: for the governor, this set includes the previous verb or noun, and for the object, it includes only the next noun. The row labeled Best(Parser, Heuristics) shows the performance of an oracle predictor which selects the true governor/object if it is present among the parser's prediction and the heuristics' candidates. We see that, even in this in-domain setting, if we could re-rank the candidates, we would achieve a big improvement in argument identification.

                          Recall
                          Governor   Object
Parser                    88.88      92.37
Best(Parser, Heuristics)  92.50      93.06

Table 2: Identifying the governor and object of prepositions in the Penn Treebank data.
Here, Best(Parser, Heuristics) reports the performance of an oracle that picks the true governor and object, if present among the candidates proposed by the parser and the heuristics. This represents an in-domain upper bound for governor and object detection. See text for further details.

To overcome erroneous parser decisions, we entertain governor and object candidates proposed both by the parser and by the heuristics. In the following discussion, we denote the chosen governor and object by g and o respectively.

Argument types: While the primary purpose of this work is to model preposition relations and their arguments, the relation prediction is strongly dependent on the semantic types of the arguments. To illustrate this, consider the following incomplete sentence: The message was delivered at ... . This preposition can express either a Temporal or a Location relation, depending on the object (for example, noon vs. the doorstep).

Agirre et al. (2008) show that modeling the semantic type of the arguments jointly with the attachment decision can improve PP attachment accuracy. In this work, we point out that argument types should be modeled jointly with both aspects of the preposition relation labeling problem.

Types are an abstraction that captures common properties of groups of entities. For example, WordNet provides generalizations of words in the form of their hypernyms. In our running example, we wish to generalize the relation label for death from pneumonia to include cases such as suffering from flu. Figure 1 shows the hypernym hierarchy for the word pneumonia. In this case, synsets in the hypernym hierarchy, like pathological state or physical condition, would also include ailments like flu.

pneumonia => respiratory disease => disease => illness => ill health
  => pathological state => physical condition => condition => state
  => attribute => abstraction => entity

Figure 1: Hypernym hierarchy for the word pneumonia

We define a semantic type to be a cluster of words. In addition to WordNet hypernyms, we also cluster verbs, nouns and adjectives using the dependency-based word similarity of Lin (1998) and treat cluster membership as types. These are described in detail in Section 5.1.

Relation prediction involves not only identifying the arguments, but also selecting the right semantic type for them; together, these help predict the relation label. Given an argument candidate and a collection of possible types (given by WordNet or by the similarity-based clusters), we need to select one of the types. For example, in the WordNet case, we need to pick one of the hypernyms in the hypernym hierarchy. Thus, for the governor and the object, we have a set of type labels, comprising one element for each type category. We denote these by tg (governor types) and to (object types) respectively.

3.3 Problem definition

The input to our prediction task is a preposition in a sentence. Our goal is to jointly model the relation it expresses, the governor and object of the relation, and the types of each argument (both WordNet hypernyms and cluster membership). We denote the input by x, which consists not only of the preposition but also of a set of candidates for the governor and the object and, for each type category, the list of candidate types for the governor and object.
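To make the shape of the input x concrete, the sketch below assembles governor and object candidates (the parser's prediction plus the heuristics mentioned above) and WordNet hypernym type candidates (four levels, following Section 5.1). It assumes NLTK's WordNet interface and Penn Treebank part-of-speech tags; the function names are illustrative and differ from the pre-processing tools we actually use (Section 5.2).

    from nltk.corpus import wordnet as wn

    def governor_candidates(tokens, tags, prep_index, parser_governor=None):
        """Governor candidates: the parser's prediction plus the nearest
        preceding verb and the nearest preceding noun."""
        candidates = set()
        if parser_governor is not None:
            candidates.add(parser_governor)
        for prefix in ("VB", "NN"):
            for i in range(prep_index - 1, -1, -1):
                if tags[i].startswith(prefix):
                    candidates.add(tokens[i])
                    break
        return candidates

    def object_candidates(tokens, tags, prep_index, parser_object=None):
        """Object candidates: the parser's prediction plus the next noun."""
        candidates = set() if parser_object is None else {parser_object}
        for i in range(prep_index + 1, len(tokens)):
            if tags[i].startswith("NN"):
                candidates.add(tokens[i])
                break
        return candidates

    def hypernym_type_candidates(word, depth=4):
        """WordNet type candidates: hypernym synsets within `depth` levels,
        collected over all senses of the word."""
        candidates = set()
        for synset in wn.synsets(word):
            frontier = [synset]
            for _ in range(depth):
                frontier = [h for s in frontier for h in s.hypernyms()]
                candidates.update(s.name() for s in frontier)
        return candidates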
The prediction, which we denote by y, consists of the relation r, which can be one of the valid relation labels in Table 1, and the governor and object, denoted by g and o, each of which is one of the text segments proposed by the parser or the heuristics. Additionally, y also consists of type predictions for the governor and object, denoted by tg and to respectively, each of which is a vector of labels, one for each type category. Table 3 summarizes the notation described above. We refer to the i-th element of a vector using subscripts and use the superscript * to denote gold labels. Recall that we have gold labels only for the relation labels and not for the arguments and their types.

Symbol    Meaning
x         Input (pre-processed sentence and preposition)
r         Relation label for the preposition
g, o      Governor and object of the relation
tg, to    Vectors of type assignments for the governor and object respectively
y         Full structure (r, g, o, tg, to)

Table 3: Summary of notation

4 Learning preposition relations

A key challenge in modeling preposition relations is that our training data annotates only the relation labels, and not the arguments and types. In this section, we introduce two approaches for predicting preposition relations using this data.

4.1 Feature Representation

We use the notation Φ(x, y) to indicate the feature function for an input x and the full output y. We build Φ using the features of the components of y:

1. Arguments: For g and o, which represent an assignment to the governor and object, we denote the features extracted from the arguments as φ_A(x, g) and φ_A(x, o) respectively.

2. Types: Given a type assignment t^g_i to the i-th type category of the governor, we define features φ_T(x, g, t^g_i). Similarly, we define features φ_T(x, o, t^o_i) for the types of the object.

We combine the argument and type features to define the features for classifying the relation, which we denote by φ(x, g, o, t^g, t^o):

    \phi(x, g, o, t^g, t^o) = \sum_{a \in \{g, o\}} \Big( \phi_A(x, a) + \sum_i \phi_T(x, a, t^a_i) \Big)    (1)

Section 5 describes the actual features used in our experiments.

Observe that, given the arguments and their types, the task of predicting relations is simply a multiclass classification problem. Thus, following the standard convention for multiclass classification, the overall feature representation for the relation and argument prediction is defined by conjoining the relation r with the features for the corresponding arguments and types, φ. This gives us the full feature representation, Φ(x, y).

4.2 Model 1: Predicting only relations

The first model aims at predicting only the relation labels and not the arguments and types. This falls into the standard multiclass classification setting, where we wish to predict one of 32 labels. To do so, we sum over all possible assignments to the rest of the structure and define features for the input as

    \hat{\phi}(x) = \sum_{g, o, t^g, t^o} \phi(x, g, o, t^g, t^o)    (2)

Effectively, doing so uses all the governor and object candidates and all their semantic types to get a feature representation for the relation classification problem. Once again, for a relation label r, the overall feature representation is defined by conjoining the relation r with the features \hat{\phi}, which we write as φ_R(x, r). Note that this summation is computationally inexpensive in our case because the sum decomposes according to equation (1).
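The construction of φ, φ̂ and φ_R can be sketched with sparse feature counters as below. Here phi_A and phi_T are trivial stand-ins for the real feature extractors of Section 5.1, and x is assumed to carry the candidate sets of Section 3.3; all attribute and function names are illustrative, not our actual implementation.

    from collections import Counter

    def phi_A(x, arg):
        """Argument features (Section 5.1); a trivial stand-in for illustration."""
        return Counter({("word", arg): 1})

    def phi_T(x, arg, t):
        """Type features: an indicator for the type label."""
        return Counter({("type", t): 1})

    def phi(x, g, o, t_g, t_o):
        """Equation (1): features for one assignment of arguments and types
        (t_g and t_o hold one type label per type category)."""
        feats = Counter()
        for arg, types in ((g, t_g), (o, t_o)):
            feats.update(phi_A(x, arg))
            for t in types:
                feats.update(phi_T(x, arg, t))
        return feats

    def phi_hat(x):
        """Equation (2): the sum over all candidate assignments. Because the sum
        decomposes as in (1), it reduces to pooling the features of every
        argument candidate and every one of its candidate types."""
        feats = Counter()
        for arg in list(x.governor_candidates) + list(x.object_candidates):
            feats.update(phi_A(x, arg))
            for category_types in x.type_candidates[arg]:  # one list per type category
                for t in category_types:
                    feats.update(phi_T(x, arg, t))
        return feats

    def phi_R(x, r):
        """Features for relation label r: conjoin r with phi_hat (multiclass convention)."""
        return Counter({(r, f): v for f, v in phi_hat(x).items()})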
With a learned weight vector w, the relation label is predicted as

    r = \arg\max_{r'} \; w^T \phi_R(x, r')    (3)

We use a structural SVM (Tsochantaridis et al., 2004) to train a weight vector w that predicts the relation label as above. The training is parameterized by C, which represents the tradeoff between generalization and the hinge loss.

4.3 Model 2: Learning from partial annotations

In the second model, even though our annotation does not provide gold labels for the arguments and types, our goal is to predict them. At inference time, if we had a weight vector w, we could predict the full structure using inference as follows:

    y = \arg\max_{y'} \; w^T \Phi(x, y')    (4)

We propose an iterative learning algorithm to learn this weight vector. In the following discussion, for a labeled example (x, y*), we refer to the missing part of its structure as h(y*); that is, h(y*) is the assignment to the arguments of the relation and their types. We use the notation r(y) to denote the relation label specified by a structure y.

Our learning algorithm is closely related to recently developed latent variable based frameworks (Yu and Joachims, 2009; Chang et al., 2010a; Chang et al., 2010b), where the supervision provides only partial annotation. We begin by defining two additional inference procedures:

1. Latent inference: Given a weight vector w and a partially labeled example (x, y*), we can 'complete' the rest of the structure by inferring the highest scoring assignment to the missing parts. In the algorithm, we call this procedure LatentInf(w, x, y*), which solves the following maximization problem:

    \hat{y} = \arg\max_{y} \; w^T \Phi(x, y), \quad \text{s.t. } r(y) = r(y^*)    (5)

2. Loss augmented inference: This is a variant of the standard loss augmented inference for structural SVMs, which solves the following maximization problem for a given x and a fully labeled y*:

    \arg\max_{y} \; w^T \Phi(x, y) + \Delta(y, y^*)    (6)

Here, ∆(y, y*) denotes the loss function. In standard structural SVMs, the loss is defined over the entire structure. In the latent structural SVM formulation of Yu and Joachims (2009), the loss is defined only over the part of the structure that has a gold label. In this work, we use the standard Hamming loss over the entire structure, but scale the loss for the elements of h(y) by a parameter α < 1. This is a generalization of the latent structural SVM, which corresponds to the setting α = 0. The intuition behind a non-zero α is that, in addition to penalizing the learning algorithm if it violates the annotated part of the structure, we also incorporate a small penalty for the rest of the structure.

Using these two inference procedures, we define the learning algorithm as Algorithm 1. The weight vector is initialized using Model 1. The algorithm then finds the best arguments and types for all examples in the training set (steps 3–5). Doing so gives an estimate of the arguments and types for each example, giving us 'fully labeled' structured data. The algorithm then proceeds to use this data to train a new weight vector using the standard structural SVM with the loss augmented inference listed above (step 6). These two steps are repeated several times. Note that, as with the summation in Model 1, solving the inference problems described above is computationally inexpensive.
Algorithm 1: Algorithm for learning Model 2
Input: Examples D = {(x_i, r(y*_i))}, where examples are labeled only with the relation labels.
1: Initialize the weight vector w using Model 1
2: for t = 1, 2, ... do
3:   for (x_i, y*_i) ∈ D do
4:     ŷ_i ← LatentInf(w, x_i, y*_i)   (Eq. 5)
5:   end for
6:   w ← LearnSSVM({(x_i, ŷ_i)}) with the loss augmented inference of Eq. 6
7: end for
8: return w

Algorithm 1 is parameterized by C and α. The parameter α controls the extent to which the labels hypothesized using the previous iteration's weight vector influence the learning.
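In code, Algorithm 1 is a short loop around the two inference procedures defined above. The sketch below is illustrative rather than a description of our actual implementation: train_model1 stands for the structural SVM of Section 4.2, latent_inference for Eq. (5), and learn_ssvm for structural SVM training with the loss augmented inference of Eq. (6).

    def train_model2(examples, C, alpha=0.1, rounds=20):
        """Algorithm 1: learn Model 2 from examples labeled only with relations.
        `examples` is a list of (x, r_star) pairs."""
        w = train_model1(examples, C)                    # step 1: initialize from Model 1
        for _ in range(rounds):                          # step 2
            completed = []
            for x, r_star in examples:                   # steps 3-5: complete the structure
                y_hat = latent_inference(w, x, r_star)   # Eq. (5), constrained to r(y) = r_star
                completed.append((x, y_hat))
            w = learn_ssvm(completed, C, alpha)          # step 6: retrain on 'fully labeled' data
        return w                                         # step 8

The defaults shown (α = 0.1 and 20 rounds) match the settings reported in Section 5.2.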
4.4 Joint inference between preposition senses and relations

By defining preposition relations as disjoint sets of preposition senses, we effectively have a hierarchical relationship between senses and relations. This suggests that joint inference can be employed between the sense and relation predictions, with a validity constraint connecting the two. The idea of employing inference to combine independently trained predictors into a coherent output structure has been used for various NLP tasks in recent years, starting with the work of Roth and Yih (2004; 2007).

We use the features defined by Hovy et al. (2010), which we write as φ_s(x, s) for a given input x and sense label s, and train a separate preposition sense model on the SemEval data with these features using the structural SVM algorithm. Thus, we have two weight vectors: the one for predicting preposition relations described earlier, and the preposition sense weight vector. At prediction time, for a given input, we find the highest scoring joint assignment to the relation, arguments, types and sense, subject to the constraint that the sense and the relation agree according to the definition of the relations.

5 Experiments and Results

The primary research goal of our experiments is to evaluate the different models (Model 1, Model 2 and joint relation–sense inference) for predicting preposition relations. In additional analysis experiments, we also show that the definition of preposition relations indeed captures cross-preposition semantics by taking advantage of shared features, and we highlight the need for going beyond the syntactic parser.

5.1 Types and Features

Types: As described in Section 3, we use WordNet hypernyms as one of the type categories. We use all hypernyms within four levels of the hypernym hierarchy, for all senses of the word.

The second type category is defined by word-similarity driven clusters. We briefly describe the clustering process here. The thesaurus of Lin (1998) specifies similar lexical items for a given word, along with a similarity score between 0 and 1; it treats nouns, verbs and adjectives separately. We use the scores to cluster groups of similar words using a greedy set-covering approach. Specifically, we randomly select a word which is not yet in any cluster as the center of a new cluster and add to it all words whose similarity score is greater than σ. We repeat this process until all words are in some cluster. A word can appear in more than one cluster because all words similar to a cluster center are added to that center's cluster. We repeat this process for σ ∈ {0.1, 0.125, 0.15, 0.175, 0.2, 0.25}; as σ increases, the clusters become more selective and hence smaller. Table 4 shows example noun clusters created using σ = 0.15. For a given word, the identifiers of the clusters to which it belongs serve as the type label candidates for this type category.[5]

[5] The clusters can be downloaded from the authors' website.

Jimmy Carter; Ronald Reagan; richard nixon; George Bush; Lyndon Johnson; Richard M. Nixon; Gerald Ford
metalwork; porcelain; handicraft; jade; bronzeware; carving; pottery; ceramic; earthenware; jewelry; stoneware; lacquerware
degradation; erosion; pollution; logging; desertification; siltation; urbanization; felling; poaching; soil erosion; depletion; water pollution; deforestation
expert; Wall Street analyst; analyst; economist; telecommunications analyst; strategist; media analyst
fox news channel; NBC News; MSNBC; Fox News; CNBC; CNNfn; C-Span
Tuesdays; Wednesdays; weekday; Mondays; Fridays; Thursdays; sundays; Saturdays

Table 4: Examples of noun clusters generated using the set-covering approach for σ = 0.15
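The set-covering procedure described above is simple enough to sketch directly. The sketch below assumes the thesaurus has been loaded into a dictionary mapping each word to its similar words and scores; this data structure, and the toy example at the end, are illustrative assumptions rather than the distributed format of the Lin thesaurus.

    import random

    def greedy_clusters(similarity, sigma):
        """Greedy set-covering clustering of a word-similarity thesaurus.
        `similarity` maps each word to a dict {similar_word: score}. A word can
        end up in several clusters, since every word similar enough to a chosen
        center joins that center's cluster."""
        uncovered = set(similarity)
        clusters = []
        while uncovered:
            center = random.choice(sorted(uncovered))  # a word not yet in any cluster
            members = {center} | {w for w, score in similarity[center].items() if score > sigma}
            clusters.append(members)
            uncovered -= members
        return clusters

    # Toy usage; in practice one clustering is built per threshold in
    # {0.1, 0.125, 0.15, 0.175, 0.2, 0.25}, and cluster identifiers become type labels.
    toy = {"pneumonia": {"flu": 0.3, "bronchitis": 0.4},
           "flu": {"pneumonia": 0.3},
           "bronchitis": {"pneumonia": 0.4}}
    print(greedy_clusters(toy, sigma=0.15))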
Features: Our argument features, denoted by φ_A in Section 4.1, are derived from the preposition sense feature set of Hovy et al. (2010) and extract the following from the argument: 1. the word, its part of speech, lemma and a capitalization indicator; 2. a conflated part of speech (one of Noun, Verb, Adjective, Adverb, or Other); 3. an indicator for existence in WordNet; 4. WordNet synsets for the first and all senses; 5. the WordNet lemma, lexicographer file names and part, member and substance holonyms; 6. Roget thesaurus divisions for the word; 7. the first and last two and three letters; and 8. indicators for known affixes. Our type features (φ_T) are simply indicators for the type label, conjoined with the type category.

One advantage of abstracting word senses into relations is that we can share features across different prepositions. The base feature set (for both types and arguments) defined above does not encode information about the preposition being classified; we do so by conjoining the features with the preposition. In addition, since the relation labels are shared across all prepositions, we include the base features as a shared representation between prepositions.

We consider two variants of our feature sets. We refer to the features described above as the typed features. In addition, we define the typed+gen features by conjoining the argument and type features of typed with the name of the generator that proposes the argument. Recall that governor candidates are proposed by the dependency parser or by the heuristics described earlier. Hence, for a governor, the typed+gen features conjoin the corresponding typed features with one of parser, previous-verb, previous-noun, previous-adjective, or previous-word.

5.2 Experimental setup and data

All our experiments are based on the SemEval 2007 data for preposition sense disambiguation (Litkowski and Hargraves, 2007), comprising word sense annotations for 16,176 training and 8,058 test examples of prepositions labeled with their senses. We pre-processed the sentences with part-of-speech tags from the Illinois POS tagger and dependency graphs from the parser of Goldberg and Elhadad (2010).[6] For the experiments described below, we used the relation-annotated training set to train the models and evaluated prediction accuracy on the test set.

[6] We used the Curator (Clarke et al., 2012) for all pre-processing.

We chose the structural SVM parameter C using five-fold cross-validation on 1,000 random examples chosen from the training set. For Model 2, we picked α = 0.1 using a validation set consisting of a separate set of 1,000 training examples. We ran Algorithm 1 for 20 rounds.

Predicting the most frequent relation for a preposition gives an accuracy of 21.18%. Even though the performance of this most-frequent-relation predictor is poor, it does not represent the problem's difficulty and is not a good baseline. For comparison, for preposition senses, Ye and Baldwin (2007) obtained an accuracy of 69.3% using features from the neighboring words, and Hovy et al. (2010) reach up to 84.8% accuracy with features designed for the preposition sense task. Our re-implementation of the latter system, using a different set of pre-processing tools, gets an accuracy of 83.53%.

For preposition relations, our baseline system for relation labeling uses the typed feature set, but without any type information. This produces an accuracy of 88.01% with Model 1 and 88.64% with Model 2. We report the statistical significance of results using our implementation of Dan Bikel's stratified-shuffling based statistical significance tester.[7]

[7] http://www.cis.upenn.edu/~dbikel/software.html

5.3 Main results: Relation prediction

Our main results, presented in Table 5, compare the baseline model (without types) against the other systems, using both models described in Section 4. First, we see that adding type information (typed) improves performance over the baseline. Expanding the feature space (typed+gen) gives further improvements. Finally, jointly predicting the relations with preposition senses gives another improvement.

                            Accuracy
Setting                     Model 1    Model 2
No types                    88.01      88.64
typed                       88.77      89.14
typed+gen                   89.90*     89.43*
Joint typed+gen & sense     89.99*     90.26*†

Table 5: Main results: accuracy of relation labeling. Results in bold are statistically significant (p < 0.01) improvements over the system that is unaware of types. Superscripts * and † indicate significant improvements over typed and typed+gen respectively at p < 0.01. For Model 2, the improvement of typed over the model without types is significant at p < 0.05.

Our objective is not predicting preposition sense. However, we observe that with Model 2, jointly predicting the sense and relations improves not only the performance of relation identification, but, via the joint inference between relations and senses, also leads to a large improvement in sense prediction accuracy. Table 6 shows the accuracy for sense prediction. We see that while Model 1 does not lead to a significant improvement in accuracy, Model 2 gives an absolute improvement of over 1%.

Setting                     Sense accuracy
Hovy (re-implementation)    83.53
Joint + Model 1             83.78
Joint + Model 2             84.78*

Table 6: Sense prediction performance. Joint inference with Model 1, while improving relation performance, does not help sense accuracy in comparison to our re-implementation of the Hovy sense disambiguation system. However, with Model 2, the improvement is statistically significant at p < 0.01.

5.4 Ablation experiments

Feature sharing across prepositions: In our first analysis experiment, we seek to highlight the utility of sharing features between different prepositions. To do so, we compare the performance of a system trained without shared features against the type-independent system, which uses shared features. To discount the influence of other factors, we use Model 1 in the typed setting without any types. Table 7 reports the accuracy of relation prediction for these two feature sets. We observed a similar improvement in performance even when type features are added, when the setting is changed to typed+gen, or with Model 2.

Setting        Accuracy
Independent    87.17
+ Shared       88.01

Table 7: Comparing the effect of feature sharing across prepositions.
We see that having a shared representation that goes across prepositions improves the accuracy of relation prediction (p < 0.01).

Different argument candidate generators: Our second ablation study looks at the effect of the various argument candidate generators. Recall that, in addition to the dependency governor and object, our models also use the previous word and the previous noun, adjective and verb as governor candidates, and the next noun as an object candidate. We refer to the candidates generated by the parser as Parser only and the others as Heuristics only. Table 8 compares the performance of these two argument candidate generators against the full set, using Model 1 in both the typed and typed+gen settings.

                   Feature sets
Generator          typed      typed+gen
Parser only        87.12      87.12
Heuristics only    87.63      88.84
All                88.01      89.12

Table 8: The performance of different argument candidate generators. We see that considering a larger set of candidate generators gives a big accuracy improvement.

We see that the heuristics give better accuracy than the parser-based system. This is because the heuristics often contain the governor/object predicted by the dependency parser. This is not always the case, though, because using all generators gives a slightly better performing system (not statistically significant). In the overall system, we retain the dependency parser as one of the generators in order to capture long-range governor/object candidates that may not be in the set selected by the heuristics.

6 Discussion

There are two key differences between Models 1 and 2. First, the former predicts only the relation label, while the latter predicts the entire structure. Table 9 shows example predictions of Model 2 for the relation label and the WordNet argument types. These examples show how the argument types can be thought of as an explanation for the choice of relation label.

                                  Hypernyms
Input                 Relation    Governor      Object
died of pneumonia     Cause       experience    disease
suffered from flu     Cause       experience    disease
recovered from flu    StartState  change        disease

Table 9: Example predictions according to Model 2. The hypernyms column shows a representative of the synset chosen for the WordNet types. We see that the combination of experience and disease suggests the relation Cause, while change and disease indicate the relation StartState.

The main difference between the two models is in the treatment of the unlabeled (or latent) parts of the structure (namely, the arguments and the types) during training and inference. During training, for each example, Model 1 aggregates features from all governors and objects, even if they are possibly irrelevant, which may lead to a much bigger model in terms of the number of active weights. On the other hand, for Model 2, Algorithm 1 uses the single highest scoring prediction of the latent variables, according to the current parameters, to refine the parameters. Indeed, in our experiments, we observed that the number of non-zero weights in the weight vector of Model 2 is much smaller than that of Model 1. For instance, in the typed setting, the weight vector for Model 1 had 2.57 million non-zero elements while that for Model 2 had only 1.0 million. Similarly, for the typed+gen setting, Model 1 had 5.41 million non-zero elements in the weight vector while Model 2 had only 2.21 million.

The learning algorithm itself is a generalization of the latent structural SVM of Yu and Joachims (2009): by setting α to zero, we recover the latent structural SVM.
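Concretely, the loss used in the loss augmented inference of Eq. (6) can be sketched as below, where y_star is the structure completed by the latent inference step of Algorithm 1 and the attribute names mirror the notation of Table 3; this is an illustrative sketch, not our exact implementation.

    def delta(y, y_star, alpha):
        """Hamming loss over the full structure, with the latent part
        h(y) = (g, o, t_g, t_o) scaled by alpha. Setting alpha = 0 recovers the
        loss of the latent structural SVM, which penalizes only the relation."""
        relation_loss = float(y.r != y_star.r)
        latent_loss = (
            (y.g != y_star.g)
            + (y.o != y_star.o)
            + sum(a != b for a, b in zip(y.t_g, y_star.t_g))
            + sum(a != b for a, b in zip(y.t_o, y_star.t_o))
        )
        return relation_loss + alpha * latent_loss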
However, we found via cross-validation that this is not the best setting of the parameter. A theoretical understanding of the sparsity of the weights learned by the algorithm and a study of its convergence properties are avenues for future research.

7 Conclusion

We addressed the problem of modeling semantic relations expressed by prepositions. We approached this task by defining a set of preposition relations that combine preposition senses across prepositions. Doing so allowed us to leverage existing annotated preposition sense data to induce a corpus of preposition relation labels. We modeled a preposition relation in terms of its arguments, namely the governor and object of the preposition, and the semantic types of the arguments. Using a generalization of the latent structural SVM, we trained a relation, argument and type predictor using only annotated relation labels. This allowed us to reach an accuracy of 89.43% on relation prediction. By employing joint inference with a preposition sense predictor, we further improved the relation accuracy to 90.23%.

Acknowledgments

The authors wish to thank Martha Palmer, Nathan Schneider, the anonymous reviewers and the editor for their valuable feedback. The authors gratefully acknowledge the support of the Defense Advanced Research Projects Agency (DARPA) Machine Reading Program under Air Force Research Laboratory (AFRL) prime contract no. FA8750-09-C-0181. This material is also based on research sponsored by DARPA under agreement number FA8750-13-2-0008. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA, AFRL or the U.S. Government.

References

E. Agirre, T. Baldwin, and D. Martinez. 2008. Improving parsing and PP attachment performance with sense information. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 317–325, Columbus, USA.

T. Baldwin, V. Kordoni, and A. Villavicencio. 2009. Prepositions in applications: A survey and introduction to the special issue. Computational Linguistics, 35(2):119–149.

M. Chang, D. Goldwasser, D. Roth, and V. Srikumar. 2010a. Discriminative learning over constrained latent representations. In Proceedings of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), pages 429–437, Los Angeles, USA.

M. Chang, V. Srikumar, D. Goldwasser, and D. Roth. 2010b. Structured output learning with indirect supervision. In Proceedings of the International Conference on Machine Learning (ICML), pages 199–206, Haifa, Israel.

J. Clarke, V. Srikumar, M. Sammons, and D. Roth. 2012. An NLP Curator (or: How I Learned to Stop Worrying and Love NLP Pipelines). In Proceedings of the International Conference on Language Resources and Evaluation (LREC), pages 3276–3283, Istanbul, Turkey.

J. Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46.

D. Dahlmeier, H. T. Ng, and T. Schultz. 2009. Joint learning of preposition senses and semantic roles of prepositional phrases. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 450–458, Singapore.
D. Das, N. Schneider, D. Chen, and N. Smith. 2010. Probabilistic frame-semantic parsing. In Proceedings of Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 948–956, Los Angeles, USA.

C. Fillmore, C. Johnson, and M. Petruck. 2003. Background to FrameNet. International Journal of Lexicography, 16(3):235–250.

D. Gildea and D. Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288.

Y. Goldberg and M. Elhadad. 2010. An efficient algorithm for easy-first non-directional dependency parsing. In Proceedings of Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 742–750, Los Angeles, USA.

D. Hovy, S. Tratz, and E. Hovy. 2010. What's in a preposition? Dimensions of sense disambiguation for an interesting word class. In Coling 2010: Posters, pages 454–462, Beijing, China.

D. Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 768–774, Montreal, Canada.

K. Litkowski and O. Hargraves. 2005. The Preposition Project. In ACL-SIGSEM Workshop on the Linguistic Dimensions of Prepositions and their Use in Computational Linguistics Formalisms and Applications, pages 171–179, Colchester, UK.

K. Litkowski and O. Hargraves. 2007. SemEval-2007 Task 06: Word-sense disambiguation of prepositions. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 24–29, Prague, Czech Republic.

K. Litkowski. 2012. Proposed next steps for The Preposition Project. Technical Report 12-01, CL Research.

A. Meyers, R. Reeves, C. Macleod, R. Szekely, V. Zielinska, B. Young, and R. Grishman. 2004. The NomBank project: An interim report. In HLT-NAACL 2004 Workshop: Frontiers in Corpus Annotation, pages 24–31, Boston, USA.

T. O'Hara and J. Wiebe. 2009. Exploiting semantic role resources for preposition disambiguation. Computational Linguistics, 35(2):151–184.

M. Palmer, D. Gildea, and P. Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.

M. Palmer, D. Gildea, and N. Xue. 2010. Semantic Role Labeling, volume 3. Morgan & Claypool Publishers.

V. Punyakanok, D. Roth, and W. Yih. 2008. The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics, 34(2).

D. Roth and W. Yih. 2004. A linear programming formulation for global inference in natural language tasks. In Proceedings of the Annual Conference on Computational Natural Language Learning (CoNLL), pages 1–8, Boston, USA.

D. Roth and W. Yih. 2007. Global inference for entity and relation identification via a linear programming formulation. Introduction to Statistical Relational Learning.

V. Srikumar and D. Roth. 2011. A joint model for extended semantic role labeling. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Edinburgh, Scotland.

S. Tratz and D. Hovy. 2009. Disambiguation of preposition sense using linguistically motivated features. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Student Research Workshop and Doctoral Consortium, pages 96–100, Boulder, USA.

S. Tratz. 2011. Semantically-enriched Parsing for Natural Language Understanding. Ph.D. thesis, University of Southern California.
I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. 2004. Support vector machine learning for interdependent and structured output spaces. In Proceedings of the International Conference on Machine Learning (ICML), pages 104–111, Banff, Canada.

P. Ye and T. Baldwin. 2006. Semantic role labeling of prepositional phrases. ACM Transactions on Asian Language Information Processing (TALIP), 5(3):228–244.

P. Ye and T. Baldwin. 2007. MELB-YB: Preposition sense disambiguation using rich semantic features. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 241–244, Prague, Czech Republic.

C. Yu and T. Joachims. 2009. Learning structural SVMs with latent variables. In Proceedings of the International Conference on Machine Learning (ICML), pages 1–8, Montreal, Canada.

B. Zapirain, E. Agirre, L. Màrquez, and M. Surdeanu. 2012. Selectional preferences for semantic role classification. Computational Linguistics, pages 1–33.