Generating Training Data for Semantic Role Labeling based on Label Transfer from Linked Lexical Resources

Silvana Hartmann†, Judith Eckle-Kohler†, Iryna Gurevych†‡
† UKP Lab, Technische Universität Darmstadt
‡ UKP Lab, German Institute for Educational Research
http://www.ukp.tu-darmstadt.de

Abstract

We present a new approach for generating role-labeled training data using Linked Lexical Resources, i.e., integrated lexical resources that combine several resources (e.g., WordNet, FrameNet, Wiktionary) by linking them on the sense or on the role level. Unlike resource-based supervision in relation extraction, we focus on complex linguistic annotations, more specifically FrameNet senses and roles. The automatically labeled training data (www.ukp.tu-darmstadt.de/knowledge-based-srl/) are evaluated on four corpora from different domains for the tasks of word sense disambiguation and semantic role classification. Results show that classifiers trained on our generated data equal those resulting from a standard supervised setting.

1 Introduction

In this work, we present a novel approach to automatically generate training data for semantic role labeling (SRL). It follows the distant supervision paradigm and performs knowledge-based label transfer from rich external knowledge sources to large corpora.

SRL has been shown to improve many NLP applications that rely on a deeper understanding of semantics, such as question answering, machine translation or recent work on classifying stance and reason in online debates (Hasan and Ng, 2014) and reading comprehension (Berant et al., 2014). Even though unsupervised approaches continue to gain popularity, SRL is typically still solved using supervised training on labeled data. Creating such labeled data requires manual annotations by experts, resulting in corpora of highly limited size. (Even though crowdsourcing has been used, it is still problematic for SRL labeling: the task is very complex, which results in manually adapted definitions (Fossati et al., 2013) or constrained role sets (Feizabadi and Padó, 2014).) This is especially true for the task of FrameNet SRL, where the amount of annotated data available is small.

FrameNet SRL annotates fine-grained semantic roles in accordance with the theory of Frame Semantics (Fillmore, 1982), as illustrated by the following example showing an instance of the Feeling frame including two semantic roles:

[He]Experiencer [felt]Feeling [no sense of guilt]Emotion in the betrayal of personal confidence.

Our novel approach to training data generation for FrameNet SRL uses the paradigm of distant supervision (Mintz et al., 2009), which has become popular in relation extraction. In distant supervision, the overall goal is to align text and a knowledge base, using some notion of similarity. Such an alignment allows us to transfer information from the knowledge base to the text, and this information can serve as labeling for supervised learning. Hence, unlike semi-supervised methods, which typically employ a supervised classifier and a small number of seed instances to do bootstrap learning (Yarowsky, 1995), distant supervision creates training data in a single run. A particular type of knowledge base relevant for distant supervision are linked lexical resources (LLRs): integrated lexical resources that combine several resources (e.g., WordNet, FrameNet, Wiktionary) by linking them on the sense or on the role level.

Previous approaches to generating training data for SRL (Fürstenau and Lapata, 2012) do not use lexical resources apart from FrameNet.
For the task of word sense disambiguation (WSD), recent work on automatic training data generation based on LLRs has only used WordNet (Cholakov et al., 2014), not considering other sense inventories such as FrameNet.

Our distant supervision approach for automatic training data generation employs two types of knowledge sources: LLRs and linguistic knowledge formalized as rules to create data labeled with FrameNet senses and roles. It relies on large corpora, because we attach labels to corpus instances only sparsely.

We generate training data for two commonly distinguished subtasks of SRL: first, for disambiguation of the frame-evoking lexical element relative to the FrameNet sense inventory, a WSD task; and second, for argument identification and labeling of the semantic roles, which depends on the disambiguation result. Regarding the subtask of FrameNet WSD, we derive abstract lexico-syntactic patterns from lexical information linked to FrameNet senses in an LLR and recover them in large-scale corpora to create a sense (frame) labeled corpus. We address the subsequent steps of argument identification and role labeling by making use of linguistic rules and role-level links in an LLR, creating a large role-labeled corpus with more than 500,000 roles.

We extrinsically evaluate the quality of the automatically labeled corpora for frame disambiguation and role classification for verbs, using four FrameNet-labeled test sets from different domains, and show that the generated training data are complementary to the FrameNet fulltext corpus: augmenting it with the automatically labeled data improves on using the FrameNet training corpus alone. We also evaluate our approach on German data to show that it generalizes across languages. We discuss in detail how our method relates to and complements recent developments in FrameNet SRL. The need for additional training data has also been reported for state-of-the-art systems (FitzGerald et al., 2015).

Our work has three main contributions: (i) for automatic sense labeling, we significantly extend Cholakov et al. (2014)'s distant supervision approach by using discriminating patterns and a different sense inventory, i.e., FrameNet. We show that discriminating patterns can improve the quality of the automatic sense labels. (ii) We use a distant supervision approach – building on LLRs – to address the complex problem of training data generation for FrameNet role labeling, which builds upon the sense labeling in (i). (iii) Our detailed evaluation and analysis show that our approach for data generation is able to generalize across domains and languages.

The rest of this paper is structured as follows: after introducing our approach to training data generation in section 2, we describe the automatic sense labeling (section 3) and role classification (section 4) in detail. In section 5 we apply our approach to German data.
We present related work in section 6 and discuss our approach in relation to state-of-the-art FrameNet SRL in section 7, followed by discussion and outlook in section 8. Section 9 concludes.

2 Knowledge-based Label Transfer

[Figure 1: Automatic training data generation – overview.]

Our distant supervision method for generating training data for SRL consists of two stages, first generating sense-labeled data, then extending these to role-labeled data, as shown in Fig. 1. Both stages use large-scale corpora and LLRs as knowledge sources.

Knowledge sources. For the first stage, presented in detail in section 3, we use sense-level information from the LLR Uby (LLR 1 in Fig. 1) and exploit the sense links between the Uby versions of FrameNet, WordNet, Wiktionary and VerbNet. More specifically, we employ (i) sense examples from FrameNet, WordNet, and Wiktionary, and (ii) VerbNet information, i.e., syntactic subcategorization frames, as well as semantic roles and selectional preference information of the arguments. It is important to note that the sense examples in FrameNet (called lexical unit examples) are a different resource than the FrameNet fulltext corpus.

For the second stage, presented in section 4, we use the LLR SemLink (LLR 2 in Fig. 1) and exploit the role-level links between VerbNet semantic roles and FrameNet roles. SemLink (Bonial et al., 2013) contains manually curated mappings of the fine-grained FrameNet roles to 28 roles in the VerbNet role inventory, including more than 1,600 role-level links.

Formalization. More formally, we can cast our distant supervision method for automatic training data generation as a knowledge-based label transfer approach. Given a set X of seed instances derived from knowledge sources and a label space Y, a set of labeled seed instances consists of pairs {x_i, y_i}, where x_i ∈ X and y_i ∈ Y, i = 1, ..., n. For an unlabeled instance u_j ∈ U, j = 1, ..., m, where U is a large corpus and U ∩ X = ∅, we employ label transfer from {x_i, y_i} to u_j based on a common representation r_{x_i} and r_{u_j}, using a matching criterion c. The label y_i is transferred to u_j if c is met.

For the creation of sense-labeled data, we perform pattern-based labeling, where Y is the set of sense labels, r_{x_i} and r_{u_j} are sense patterns generated from corpus instances and from LLRs including sense-level links, and c considers the similarity of the patterns based on a similarity metric.

We create role-labeled data with rule-based labeling, where Y is the set of role labels, and r_{x_i} and r_{u_j} are attribute representations of roles using syntactic and semantic attributes. Attribute representations are derived from parsed corpus instances and from linguistic knowledge, also including role-level links from LLRs; here, c is fulfilled if the attribute representations match.

Experimental setup. In our distant supervision approach to training data generation, we (i) create our training data in a single run (and not iteratively), and (ii) perform sparse labeling in order to create training data, i.e., we need a very large corpus (e.g., unlabeled web data) in order to obtain a sufficient number of training instances. We analyze the resulting labeled corpora and evaluate them extrinsically using a classifier trained on the automatically labeled data on separate test datasets from different domains. This way, we can also show that a particular strength of our approach is to generalize across domains.
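To make the formalization concrete, the following is a minimal sketch of the single-pass transfer loop; all names are illustrative, and represent and matches are placeholders for the pattern-based and rule-based instantiations described in sections 3 and 4:

```python
from typing import Callable, Iterable, Tuple

def transfer_labels(
    seeds: Iterable[Tuple[object, str]],        # labeled seed instances (x_i, y_i)
    corpus: Iterable[object],                   # unlabeled instances u_j from a large corpus
    represent: Callable[[object], object],      # builds the common representation r
    matches: Callable[[object, object], bool],  # matching criterion c on (r_x, r_u)
):
    """Single-run, non-iterative label transfer: y_i is copied to u_j iff c is met."""
    seed_reprs = [(represent(x), y) for x, y in seeds]
    for u in corpus:
        r_u = represent(u)
        labels = [y for r_x, y in seed_reprs if matches(r_x, r_u)]
        if labels:  # sparse labeling: instances without a match are simply skipped
            yield u, labels
```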
In the next section we present the automatic creation of sense-labeled corpora (stage 1).

3 Automatic Labeling for Word Sense

In this work, we are the first to apply distant supervision-based verb sense labeling to the FrameNet verb sense inventory. We extend the methodology by Cholakov et al. (2014), who also exploit sense-level information from the LLR Uby for the automatic sense labeling of corpora with verb senses, but use WordNet as a sense inventory. We use the same types of information and similarity metric (which we call sim in this paper), but our label space Y is given by the FrameNet verb sense inventory (|Y| = 4,670), and therefore we exploit 42,000 sense links from FrameNet to WordNet, VerbNet and Wiktionary.

The similarity metric sim ∈ [0, 1] is based on Dice's coefficient and considers the common n-grams, n = 2, ..., 4:

(1)   sim(r_{x_i}, r_{u_j}) = \sum_{n=2}^{4} \frac{|G_n(p_1) \cap G_n(p_2)| \cdot n}{norm_w}

where w ≥ 1 is the size of the window around the target verb, G_n(p_i), i ∈ {1, 2}, is the set of n-grams occurring in r_{x_i} and r_{u_j}, and norm_w is the normalization factor defined by the sum of the maximum number of common n-grams in the window w. (Using n-grams instead of unigrams takes word order into account, which is particularly important for verb senses, as syntactic and semantic properties often correlate.)

Step 1A: Seed pattern extraction and filtering. We call the sense patterns r_{x_i} generated from seed instances x_i in the LLR Uby seed patterns and follow Cholakov et al. (2014) for the generation of those seed patterns, i.e., we create lemma sense patterns (LSPs), consisting of the target verb lemma and lemmatized context only, and second, abstract sense patterns (ASPs), consisting of the target verb lemma and a number of rule-based generalizations of the context words. An example of each of the sense patterns for the FrameNet sense Feeling of the verb feel in the sense example He felt no sense of guilt in the betrayal of personal confidence is:

1. LSP: he feel no sense of guilt in
2. ASP: PP feel cognition of feeling in act

ASPs generalize to a large number of contexts, which is particularly important for identifying productively used verb senses, while LSPs serve to identify fixed constructions such as multiword expressions.

A drawback of Cholakov et al. (2014)'s method for seed pattern extraction is that it extracts a certain number of very similar (or even identical) seed patterns for different senses. Those seed patterns may lead to noise in the sense-labeled data. To prevent this problem, we developed an optional discriminating filter that removes problematic seed patterns.

The intuition behind the discriminating filter is the following: some of the ASP and LSP patterns which we extract from the seed instances discriminate better between senses than others; i.e., if the same or a very similar pattern is extracted for sense w_i and sense w_j of a word w, i, j ∈ (1, ..., n), n = number of senses of w, i ≠ j, this pattern does not discriminate well and should not be used when labeling new senses. We filter the ASP and LSP patterns by comparing each pattern for sense w_i to the patterns of all the other senses w_j, i ≠ j, using the similarity metric sim; if we find two patterns for w_i, w_j whose similarity score exceeds a filtering threshold f, we greedily discard them both.

The filtering may increase precision at the cost of recall, because it reduces the number of seed patterns. Since we use the approach on large corpora, we still expect sufficient recall.
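The following sketch shows one possible implementation of sim and of the greedy discriminating filter. Our reading of the normalization factor norm_w (the maximum weighted number of common n-grams in a window of w tokens) and the choice of window size inside the filter are assumptions, not details fixed by the text above:

```python
from itertools import combinations

def ngrams(tokens, n):
    """Set of n-grams over a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def sim(p1, p2, w):
    """N-gram overlap similarity of Eq. (1) for two token lists p1, p2
    taken from a window of w >= 1 tokens around the target verb."""
    # Assumed norm_w: the largest possible weighted overlap in a window of
    # w tokens, i.e. the sum over n of n * (w - n + 1).
    norm = sum(n * max(w - n + 1, 0) for n in range(2, 5))
    score = sum(n * len(ngrams(p1, n) & ngrams(p2, n)) for n in range(2, 5))
    return score / norm if norm else 0.0

def discriminating_filter(patterns_by_sense, f):
    """Greedily discard pattern pairs of different senses of the same word
    whose similarity exceeds the filtering threshold f."""
    discarded = set()
    for s1, s2 in combinations(patterns_by_sense, 2):
        for p1 in patterns_by_sense[s1]:
            for p2 in patterns_by_sense[s2]:
                # window size taken as the longer pattern (a simplification)
                if sim(p1, p2, w=max(len(p1), len(p2))) > f:
                    discarded.update({(s1, tuple(p1)), (s2, tuple(p2))})
    return {s: [p for p in ps if (s, tuple(p)) not in discarded]
            for s, ps in patterns_by_sense.items()}
```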
Our results show that discriminating filtering improves the quality of the automatically labeled corpus.

Essentially, our discriminating filter integrates the goal of capturing sense distinctions into our approach. The same goal is pursued by Corpus Analysis Patterns (CPA patterns, Hanks (2013)), which have been created to capture sense distinctions in word usage by combining argument structures, collocations and an ontology of semantic types for arguments. In contrast to our fully automatic approach, developing CPA patterns based on corpus evidence was a lexicographic effort. The following example compares two ASP patterns to a CPA pattern from Popescu et al. (2014):

1. CPA: [[Human]] | [[Institution]] abandon [[Activity]] | [[Plan]]
2. ASP: JJ person abandon JJ cognition of JJ quantity
3. ASP: person abandon communication which VVD PP JJ in

Our abstract ASP patterns look similar, as they also abstract argument fillers to semantic classes and preserve certain function words.

Step 1B: Sense label transfer. Using the approach of Cholakov et al. (2014), we create sense patterns r_{u_j} from all sentences u_j of an unlabeled corpus that contain a target verb, for instance the sentence u_1: I feel strangely sad and low-spirited today for the verb feel. For every u_j, its sense pattern r_{u_j} is then compared to the labeled seed patterns using the similarity metric sim. From the most similar seed patterns {r_{x_i}, y_i} that have a similarity score above a threshold t, the set of candidate labels {y_i} is extracted. (Unless t is very high, there is usually more than one candidate sense label.) The approach picks a random sense label from {y_i} and attaches it to u_j. If an ASP and an LSP receive the same similarity score, LSPs get precedence over ASPs, i.e., the labeled sense is selected from the senses associated with the LSP. Using this method, our example sentence u_1 receives the label Feeling.

This approach leads to a sparse labeling of the unlabeled corpus, i.e., many unlabeled sentences are discarded because their similarity to the seed patterns is too low. It however scales well to large corpora because it requires only shallow pre-processing, as we will show in the next section: we apply our approach to a large web corpus and analyze the resulting sense-labeled corpora.
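A compact sketch of Step 1B, reusing sim from the sketch above; the tie-breaking among equally similar patterns follows the description (LSPs take precedence, then a random candidate is chosen), while the data layout is our own assumption:

```python
import random

def label_sentence(pattern_u, seed_patterns, t, w):
    """seed_patterns: list of (pattern_tokens, sense, kind) with kind in
    {"LSP", "ASP"}; returns a sense label for u_j, or None (sparse labeling)."""
    scored = [(sim(p, pattern_u, w), sense, kind) for p, sense, kind in seed_patterns]
    best = max((score for score, _, _ in scored), default=0.0)
    if best < t:
        return None  # similarity too low: the sentence is discarded
    top = [(sense, kind) for score, sense, kind in scored if score == best]
    lsp_senses = [sense for sense, kind in top if kind == "LSP"]
    candidates = lsp_senses or [sense for sense, _ in top]  # LSPs beat ASPs on ties
    return random.choice(candidates)  # random pick among the candidate labels
```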
3.1 Creating the sense-labeled corpus

Unlabeled corpora. We used parts 1 to 4 of the ukWAC corpus (Baroni et al., 2009) as input for the automatic sense label transfer for the 695 verb lemmas in our test sets (see section 3.2, Test data).

Seed patterns. The seed patterns for Step 1A of the sense labeling were extracted from a) the FrameNet example sentences, and b) sense examples from resources linked to FrameNet in Uby, namely WordNet, Wiktionary, and VerbNet. Without a discriminating filter, this results in more than 41,700 LSPs and 322,000 ASPs, 11% and 89% of the total number, respectively. Adding a strict discriminating filter f = 0.07 reduces the patterns to 39,000 LSPs and 217,000 ASPs. Proportionally more ASPs are filtered, leading to 15% LSPs and 85% ASPs. The number of senses with patterns decreases from 4,900 to 3,900.

Threshold setting. In order to determine the parameter values for the label transfer, i.e., which values for threshold t and filter f result in a high-quality training corpus, we perform an extrinsic evaluation on a development set: we use a set of automatically labeled corpora based on ukWAC section 1, generated with different threshold values, to train a verb sense disambiguation (VSD) system. We evaluate precision P (the number of correct instances / number of labeled instances), recall R (the number of labeled instances / all instances), and F1 (harmonic mean of P and R) of the systems on the development split FNFT-dev of the FrameNet 1.5 fulltext corpus (FNFT), used by Das and Smith (2011). A detailed description of the VSD system follows in the next section.

We varied the thresholds of the discriminating filter f (Step 1A) and the threshold t (Step 1B) over the values (0.07, 0.1, 0.14, 0.2), as was suggested by Cholakov et al. (2014) for t. We also compare corpora with and without the discriminating filter f. To save space, we only report results with f for t = 0.2 in Table 1.

  f     t     P      R      F1
  -     0.07  0.672  0.723  0.696
  -     0.1   0.672  0.712  0.692
  -     0.14  0.665  0.642  0.653
  -     0.2   0.68   0.633  0.656
  0.2   0.2   0.683  0.566  0.619
  0.14  0.2   0.689  0.566  0.621
  0.1   0.2   0.702  0.544  0.613
  0.07  0.2   0.713  0.526  0.605

Table 1: Combinations of f and t evaluated on FNFT-dev; configurations for best F1, R and P in bold.

As expected, increasing the pattern similarity threshold t at which a corpus sentence is labeled with a sense increases the precision at the cost of recall. Similarly, employing a discriminating filter f at t = 0.2 increases precision compared to using no filter, and leads to the best precision on the validation set. Note that the discriminating filter gets stricter, i.e., discriminates more, with a lower f value. Accordingly, low f values lead to the highest precision of 0.713 for the strict thresholds t = 0.2 and f = 0.07, indicating that precision-oriented applications can benefit from higher discrimination.

Automatically labeled corpora. The setting with the highest F1 in Table 1 leads to the very large sense-labeled corpus WaS XL. We also use the f and t values with the highest precision in order to evaluate the benefits of the discriminating filter, leading to WaS L.

                         instances  senses  verbs  s/v  i/s
  WaS XL (t=0.07)        1.6×10^6   1,460   637    1.8  1,139
  WaS X (t=0.2)          193,000    1,249   602    1.7  155
  WaS L (t=0.2, f=0.07)  109,000    1,108   593    1.5  98
  FNFT*                  5,974      856     575    1.5  10

Table 2: Sense statistics of automatically labeled corpora.

The size of these corpora ranges from 100,000 to 1.6 million sense instances with an average of 1.5 to 1.8 senses per verb, compared to 6,000 verb sense instances in FNFT*, FNFT filtered by the 695 verbs in our four test sets, see Table 2.

We compare WaS L to WaS X, the corpus labeled with t = 0.2 but without filter f, in order to evaluate the impact of adding the discriminating filter. Compared to the latter corpus, WaS L contains 44% fewer sense instances, but only 12% fewer distinct senses, and 75% of the senses which are also covered by WaS XL. The number of instances per sense is Zipf-distributed with values ranging from 1 to over 40,000, leading to the average of 1,139 reported in Table 2 for WaS XL.
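For clarity, the P, R, and F1 used throughout the evaluation follow the definitions given in the threshold-setting paragraph above (note that R measures labeling coverage rather than correctness-based recall); a minimal sketch:

```python
def vsd_scores(n_correct, n_labeled, n_total):
    """P = correct / labeled, R = labeled / all, F1 = harmonic mean of P and R."""
    p = n_correct / n_labeled if n_labeled else 0.0
    r = n_labeled / n_total if n_total else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```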
3.2 VSD experiments

To compare the quality of our automatically sense-labeled corpora to manually labeled corpora, we perform extrinsic evaluation in a VSD task.

VSD system. We use a standard supervised setup for sense disambiguation: we extract lexical, syntactic, and semantic features from the various training sets and the test sets. For pre-processing, we use DKPro Core (Eckart de Castilho and Gurevych, 2014), e.g., the Stanford tokenizer, TreeTagger for POS tagging and lemmatization, StanfordNamedEntityRecognizer for NER and the Stanford Parser for dependency parsing. We train a logistic regression classifier in the WEKA implementation (Hall et al., 2009) using the same features as Cholakov et al. (2014).

Training corpora. We trained our VSD system on WaS XL and WaS L, and, for comparison, on the training split FNFT-train of FNFT 1.5 used by Das and Smith (2011).

Test data. For evaluation, we used four different FrameNet-labeled datasets. The statistics of the test datasets are compiled in Table 3; a brief description of each dataset follows.

             verbs  senses  s/v  inst(s)  inst(r)
  Fate       526    725     1.4  1,326    3,490
  MASC       44     143     3.3  2,012    4,142
  Semeval    278    335     1.2  644      1,582
  FNFT-test  424    527     1.2  1,235    3,078
  FNFT-dev   490    598     1.2  1,450    3,857

Table 3: Test dataset statistics on verbs; inst(s/r): number of ambiguous sense and role instances in the datasets.

We use the frame and role annotations in the Semeval 2010 task 10 evaluation and trial dataset (Ruppenhofer et al., 2010). It consists of literature texts. The Fate corpus contains frame annotations on the RTE-2 textual entailment challenge test set (Burchardt and Pennacchiotti, 2008). It is based on newspaper texts, texts from information extraction datasets such as ACE and MUC-4, and texts from question answering datasets such as CLEF and TREC. These two datasets were created prior to the release of FrameNet 1.5. For those sets, only senses (verb-frame combinations) that still occur in FrameNet 1.5 and their roles were included in the evaluation. The MASC WordSense sentence corpus (Passonneau et al., 2012) is a balanced corpus that contains sense annotations for 1,000 instances of 100 words from the MASC corpus. It contains WordNet sense labels; we use a slightly smaller subset of verbs annotated with FrameNet 1.5 labels. (This subset is currently not part of the MASC download, but according to personal communication with the developers it will be published soon.) We also evaluate on the test split FNFT-test of the FrameNet fulltext corpus used in Das and Smith (2011).

3.3 VSD results and analysis

Impact of pattern filters. A comparison of results between the WaS corpora (first block of Table 4) shows that the filters in WaS L improve precision for three out of four test sets, which shows that stronger filtering can benefit precision-oriented applications.

Precision on the MASC corpus is lower when using a discriminating filter. Due to the larger polysemy in MASC – on average 3.3 senses per verb (see s/v in Table 3) – it contains rare senses. The reduction of sense instances caused by the discriminating filter leads to some loss of instances for those senses and a lower precision on MASC.

Analysing the results in detail for the example verb tell shows that WaS XL contains all 10 senses of tell in MASC; WaS L contains 9 of them. However, the number of training instances per sense for WaS L can be lower by a factor of 10 or more compared to WaS XL (e.g., tens to hundreds, hundreds to thousands), leading to only few instances per sense. The sparsity problem could either be solved by using a less strict filter, or by labeling additional instances from ukWAC, in order to preserve more instances of the rare senses for stricter thresholds t and f.
These results also show that the noise that is added to the corpora in a low-discrimination, high-recall setting will to a certain extent be drowned out by the large number of sense instances.

For WaS XL, recall is significantly higher for all test sets, leading to a higher F1. All significance scores reported in this paper are based on Fisher's exact test at significance level p < 0.05.

Comparison to FNFT-train. We also compare the results of our WaS corpora to a VSD system trained on the reference corpus FNFT-train (see Table 4).

               FNFT-test              Fate                   MASC                   Semeval
               P      R      F1       P      R      F1       P      R      F1       P      R      F1
  WaS XL       0.647* 0.816* 0.722    0.628* 0.65*  0.639    0.66*  0.793* 0.72     0.665  0.761* 0.71
  WaS L        0.68*  0.618  0.648    0.66   0.505* 0.572    0.639  0.707* 0.671    0.694  0.62*  0.655
  FNFT-train   0.729  0.643  0.683    0.7    0.38   0.493    0.598  0.339  0.433    0.706  0.55   0.618
  B-WaS XL     0.736  0.767* 0.751    0.686  0.619* 0.651    0.67*  0.699* 0.684    0.724  0.71*  0.717
  U-WaS XL     0.668* 0.935* 0.78     0.63*  0.683* 0.656    0.642* 0.833* 0.725    0.667  0.849* 0.747

Table 4: VSD P, R, F1; * marks significant differences to the system trained on FNFT-train.

On Semeval, precision does not deviate significantly from the FNFT-train system. On FNFT-test, it is significantly lower. For WaS XL, precision is significantly lower on Fate, but significantly higher on MASC. For WaS L, the precision is similar on MASC and Fate. For WaS XL, the recall is significantly higher than for FNFT-train on all test sets, leading to a higher F1. This is the result of the larger sense coverage of the FrameNet lexicon, which provided the seeds for the automatic labeling.

Training our system directly on the FrameNet lexical unit examples is, however, not a viable alternative: it leads to a system with precision similar to our WaS corpora, but very low recall (between 0.22 and 0.37). By using the sense examples in our seed patterns, we retain their benefits on sense coverage, and improve the system recall and F1 at the same time.

Comparative analysis. In a detailed analysis of our results, we compare the performance of the WaS XL and FNFT-train based systems on those verbs of each test set that are evaluated for both systems, i.e., the intersection It. On It, precision and F1 are higher for FNFT-train for all test sets except MASC. For MASC, precision is similar, but recall is 0.21 points higher. For the verbs in It, the average number of training senses in WaS XL is two senses higher than for FNFT-train. This larger sense coverage of WaS XL is beneficial to recall on the MASC test set, which shows high polysemy.

Evaluating on the set difference between the systems (test verbs that remain after the intersection is removed), we see that the lemma coverage of WaS XL is complementary to FNFT-train. The difference Dt is not empty for either system, but the number of verbs that can be evaluated additionally for WaS XL is much larger than the one for FNFT-train. The proportion of instances only evaluated for a specific training set to all evaluated instances ranges between 11% and 48% for WaS XL, and between 5% and 30% for FNFT-train.

Combining training data. The complementary nature of the sets led us to evaluate two combinations of training sets: U-WaS XL consists of the union of WaS XL and FNFT-train; B-WaS XL implements a backoff strategy and thus consists of FNFT-train and those instances of WaS XL whose lemmas are not contained in the intersection with FNFT-train (i.e., if FNFT-train does not contain enough senses for a lemma, supplement with WaS XL).
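A minimal sketch of the two combination strategies; the instance objects and their lemma attribute are hypothetical stand-ins for the actual training instances:

```python
def combine_union(fnft_train, was_xl):
    """U-WaS XL: plain union of the two training sets."""
    return list(fnft_train) + list(was_xl)

def combine_backoff(fnft_train, was_xl, lemma_of=lambda inst: inst.lemma):
    """B-WaS XL: keep all of FNFT-train and back off to WaS XL only for
    lemmas that FNFT-train does not cover."""
    covered = {lemma_of(inst) for inst in fnft_train}
    return list(fnft_train) + [inst for inst in was_xl if lemma_of(inst) not in covered]
```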
The third block in Table 4 shows that precision is higher or not significantly lower for B-WaS XL compared to FNFT-train, while recall and F1 are higher. U-WaS XL leads to higher recall compared to B-WaS XL, and to the overall highest F1. This shows that our automatically labeled corpus WaS XL is complementary to the manually labeled FNFT-train and contributes to a better coverage on diverse test sets.

Multiword verbs. Our approach to training data generation also includes multiword verbs such as carry out. We treat those verb senses as additional senses of the head verb, for which we also create sense patterns, i.e., the sense for carry out is a specific sense of carry. As a result, we do not need to rely on additional multiword detection strategies for VSD.

Our WaS XL contains more than 100,000 sense instances of 194 multiword verbs, of which 35 have multiple FrameNet senses. We specifically evaluated the performance of our VSD system on multiwords and their head verbs from MASC, which contains 81 relevant sense instances. The precision is 0.66, compared to 0.59 when training on FNFT-train, at slightly higher coverage. While the test set is too small to provide significant results, there is an indication that the automatically labeled data also contribute to the disambiguation of multiword verbs.

This analysis concludes our section on automatic sense labeling. In the next section, we will describe our method for automatically adding FrameNet role labels to the WaS corpora.

4 Automatic Labeling for Semantic Roles

In this section, we present our linguistically informed approach to the automated labeling of FrameNet roles in arbitrary text. Our method builds on the results of rich linguistic pre-processing including dependency parsing and uses role-level links in the LLR SemLink and the sense labels from section 3. First, a set of deterministic rules is applied to label syntactic arguments with VerbNet semantic roles. Then, we map the VerbNet semantic roles to FrameNet roles based on role-level links in SemLink and the automatically created sense labels.

Step 2A: VerbNet role label transfer. Our precision-oriented deterministic rules build on the results of linguistic pre-processing. Our pre-processing pipeline, built from DKPro Core components (Eckart de Castilho and Gurevych, 2014), performs lemmatization, POS tagging, named-entity recognition and parsing with the Stanford Parser (de Marneffe et al., 2006), as well as semantic tagging with WordNet semantic fields. (We used the most-frequent-sense disambiguation heuristic, which works well for the coarse-grained semantic types given by the WordNet semantic fields. Named-entity tags are also mapped to WordNet semantic fields.) Step 1 provides FrameNet sense labels for the target verbs.

Dependency parsing annotates dependency graphs, linking a governor to its dependents within a sentence. Governors and dependents are represented by the heads of the phrases they occur in. For verbal governors, the dependency graphs correspond to predicate argument structures, with the governor being the predicate and the dependents corresponding to the argument heads. Our rules attach VerbNet role labels to dependent heads of their verbal governors. We can then derive argument spans by expanding the dependent heads by their phrases. The semantic role inventory as given by VerbNet is our label space Y (|Y| = 28).
Rule-based role labeling can be seen as label transfer where a corpus instance u_j is given by the dependent of a verbal governor and its sentential context, including all linguistic annotations. Then r_{u_j} is compared to a prototypical attribute representation r_{x_i} of a semantic role, derived from linguistic knowledge. (It took a computational linguist three days to develop the rules, using a sample of the VerbNet annotations on PropBank from SemLink as a development set.)

More specifically, we iterate over the collapsed dependencies annotated by the Stanford parser and apply a hierarchically organized chain of 57 rules to the dependents of all verbal governors. In this rule chain, Location and Time roles are assigned first, in case a dependent is a location or has the semantic field value time. Then, the other roles are annotated. This is done based on the dependency type in combination with named entity tags or semantic fields, either of the dependent or the verbal governor or both. An example rule is: for the dependency nsubj, the role Experiencer is annotated if the governor's semantic field is perception or emotion, and the role Agent otherwise. This way, I in our example I [feel]Feeling strangely sad and low-spirited today is annotated with the Experiencer role.

Some rules also check the semantic field of the dependent, e.g., the dependency prep_with triggers the annotation of the role Instrument if the dependent is neither a person nor a group. Often, it is not possible to determine a single VerbNet role based on the available linguistic information (32 rules assign one role, 5 rules assign 2 roles, and 20 rules assign 3 roles); e.g., the distinction between Theme and Co-Theme cannot be made. In such cases, multiple roles are annotated, which are all considered in the subsequent Step 2B. Evaluated on a test sample of VerbNet annotations on PropBank, the percentage of correctly annotated roles among all annotated roles is 96.8% – instances labeled with multiple roles are considered correct if the set of roles contains the gold label. The percentage of instances where a rule assigns at least one role was 77.4%.

Step 2B: Mapping VerbNet roles to FrameNet roles. Finally, the annotated VerbNet roles are mapped to FrameNet roles using (i) the automatically annotated FrameNet sense and (ii) the SemLink mapping of VerbNet roles to FrameNet roles for this FrameNet sense (frame). The information on the FrameNet frame is crucial to constrain the one-to-many mapping of VerbNet roles to fine-grained FrameNet roles. For example, the VerbNet role Agent is mapped to a large number of different FrameNet roles across all frames.

While the SemLink mapping allows unique FrameNet roles to be assigned in many cases, there are still a number of cases left where the rule-based approach annotates a set of FrameNet roles. Examples are Interlocutor 1 and Interlocutor 2 for the Discussion frame, or Agent and Cause for the Cause harm frame. For the former, the distinction between the roles is arbitrary, while for the latter further disambiguation may be desired.

As the SemLink mapping is not complete, our approach results in partially labeled data, i.e., a sentence may contain only a single predicate-role pair, even though other arguments of the predicate are present. Our experiments show that we can train semantic role classifiers successfully on partially labeled data.

We used the training set from Das and Smith (2011) (annotated with FrameNet roles) as a development set. Evaluated on the test set from Das and Smith (2011), the percentage of correctly annotated roles among all annotated roles is 76.74%. (As in Step 2A, instances labeled with a set of roles are considered correct if the set contains the gold label.)
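A condensed sketch of Steps 2A and 2B. The dependency and token attributes (.type, .governor, .dependent, .semantic_field, .ner) and the semlink dictionary are assumed data structures; only three of the 57 rules are shown, and the real chain also contains rules that assign two or three candidate roles:

```python
def verbnet_roles(dep):
    """Step 2A: a fragment of the hierarchical rule chain."""
    d, g = dep.dependent, dep.governor
    # Location and Time are assigned first
    if d.ner == "LOCATION" or d.semantic_field == "location":
        return {"Location"}
    if d.semantic_field == "time":
        return {"Time"}
    # nsubj: Experiencer for perception/emotion governors, else Agent
    if dep.type == "nsubj":
        return {"Experiencer"} if g.semantic_field in {"perception", "emotion"} else {"Agent"}
    # prep_with: Instrument unless the dependent is a person or group
    if dep.type == "prep_with" and d.semantic_field not in {"person", "group"}:
        return {"Instrument"}
    return set()  # no rule fires: the argument stays unlabeled

def framenet_roles(dep, frame, semlink):
    """Step 2B: constrain the one-to-many VerbNet-to-FrameNet role mapping
    by the automatically assigned frame; semlink maps (frame, vn_role) to a
    set of FrameNet roles and may be incomplete (hence partial labeling)."""
    fn_roles = set()
    for vn_role in verbnet_roles(dep):
        fn_roles |= semlink.get((frame, vn_role), set())
    return fn_roles  # may still contain several roles, or be empty
```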
4.1 Creating the role-labeled corpus

We use the two sense-labeled corpora WaS XL and WaS L as input for the automatic role label transfer, creating the role-labeled corpora WaSR XL and WaSR L. We distinguish two variants of these corpora: one that only contains those role instances with a unique role label, marked with the suffix -uni in Table 5, and one that additionally includes sets of labels, marked with the suffix -set.

                instances  roles  senses  r/s  i/r
  WaSR XL-uni   549,777    1,485  809     1.8  370
  WaSR L-uni    34,678     968    597     1.6  36
  WaSR XL-set   823,768    2,054  849     2.4  401
  WaSR L-set    53,935     1,349  648     2.1  40
  FNFT*         12,988     2,867  800     3.6  4.5

Table 5: Role statistics of automatically labeled corpora.

For WaSR XL, Step 2A results in 1.9 million arguments labeled with VerbNet roles. This number is reduced by 66% in Step 2B as a result of the incomplete mapping between VerbNet and FrameNet senses and roles in SemLink. Table 5 shows that the resulting corpora contain 34,000 (WaSR L-uni) and 549,000 (WaSR XL-uni) uniquely assigned role instances for the verbs in our test sets – a lot compared to the 13,000 instances in FNFT*, FNFT filtered by the 695 verbs in our four test sets. The counts are even higher for the corpora including sets of labels. Due to the sparse labeling approach, our WaSR corpora contain on average up to 1.8 roles per predicate, compared to an average of 3.6 roles per predicate in FNFT*. This number rises to 2.4 when instances with sets of labels are added.

4.2 Role classification experiments

Role classification system. We trained a supervised system for semantic role classification as a log-linear model per verb-frame using the features described in Fürstenau and Lapata (2012).

Note that we do not evaluate the task of argument identification. Argument identification is performed by our rule-based VerbNet role transfer and follows common syntactic heuristics based on dependency parsing. Following Zapirain et al. (2013), we specifically consider the subtask of role classification, as we focus on the quality of our data on the semantic level. In this context it is important that the features of our role classifier do not use span information: they include lemma and POS of the argument head, its governing word, and the words right and left of the argument head, the position of the argument relative to the predicate, and the grammatical relation between the argument head and the predicate. Pre-processing is the same as for VSD.

Training and test data. We compare our role classifier trained on WaSR XL-(set/uni) and WaSR L-(set/uni) to the one based on FNFT-train. Test datasets are the same as for VSD, see Table 3.

4.3 SRL results and analysis

Results on WaSR corpora. We evaluate P, R, and F1 on all frame-verb combinations for which there is more than one role in our training data. Training the system on WaSR XL-set and WaSR L-set includes training instances with sets of role labels. Therefore, sets of role labels are among the predicted labels. In the evaluation, we count the label sets as correct if they contain the gold label.

As expected, WaSR XL-set leads to higher precision and recall than WaSR XL-uni, resulting from the larger role coverage in the training set and the lenient evaluation setting, see Table 6.
We omit the WaSR L-* corpora from Table 6, because the benefits of the strict filtering for the sense corpora do not carry over to the role-labeled corpora: scores are lower for WaSR L-* on all test sets because of the smaller number of role-labeled instances in the WaSR L-* corpora (see Table 5).

                 FNFT-test              Fate                   MASC                   Semeval
                 P      R      F1       P      R      F1       P      R      F1       P      R      F1
  WaSR XL-uni    0.658* 0.333* 0.442    0.619  0.281* 0.387    0.652* 0.253* 0.365    0.689  0.394* 0.501
  WaSR XL-set    0.705* 0.398* 0.509    0.733* 0.337* 0.462    0.648* 0.297* 0.408    0.722  0.441* 0.547
  FNFT-train     0.741  0.831  0.783    0.652  0.642  0.647    0.724  0.527  0.61     0.705  0.625  0.663
  B-WaSR XL-uni  0.728* 0.878* 0.796    0.645  0.698* 0.67     0.718  0.574* 0.638    0.696  0.71*  0.703
  U-WaSR XL-uni  0.691* 0.883* 0.776    0.629  0.701* 0.663    0.677* 0.579* 0.624    0.671  0.721* 0.695

Table 6: Role classification P, R, F1; * marks significant differences to the system trained on FNFT-train.

[Figure 2: Role classification learning curves.]

Comparison to FNFT-train. Table 6 compares the results of WaSR XL-* to the system trained on FNFT-train. Note that we emulated the lenient evaluation setting for FNFT-train by retrieving the label set Sl in WaSR XL-set for a label l predicted by the FNFT-train system and counting l as correct if any of the labels in Sl matches the gold label. We, however, did not find any difference to the regular evaluation; it appears that the labeling errors of the FNFT-train-based system are different from the label sets resulting from our labeling method.

The precision for WaSR XL-uni matches the precision for FNFT-train for the Semeval and Fate test sets (the difference is not significant). This is remarkable considering that only partially labeled data are available for training. For WaSR XL-set, the precision scores for Semeval and Fate improve over the FNFT-train system, significantly for Fate. Recall of the WaSR corpora is significantly lower throughout, as a result of the sparse, partial labeling and the lower role coverage of our automatically labeled corpora.

Comparative analysis. We compare the performance of our WaSR XL-uni and the FNFT-train based system on the intersection of the evaluated senses between both systems. Precision of FNFT-train is higher on the intersection, except for Semeval, where it is similar. FNFT-train evaluates on average two more roles per sense than the WaSR corpora. Evaluating only on the difference, the instances not contained in the intersection, we see that WaSR XL-uni contributes some instances that are not covered by FNFT-train. These constitute between 7% and 18% of the total evaluated instances, compared to 26% to 50% of instances added by FNFT-train. The precision of WaSR XL-uni on the intersection for MASC is high at 0.68, compared to 0.55 for FNFT-test (not shown in Table 6). These results indicate that our WaSR XL-uni is complementary to FNFT-train.

Combining training data. To give further evidence of the complementary nature of the automatically labeled corpus, we run experiments that combine WaSR XL-uni with FNFT-train. We again use the union of the datasets (U-WaSR XL-uni) and backing off to WaSR XL-uni when FNFT-train does not provide enough roles for a sense (B-WaSR XL-uni).

Table 6 shows better results for the backoff corpus than for the union. Recall is significantly higher compared to FNFT-train, and precision values are not significantly lower except for FNFT-test.
This demonstrates that our automatically role-labeled corpora can supplement a manually labeled corpus and benefit the resulting system.

WaSR sampling. Because our WaSR corpora show a Zipfian distribution of roles (there are a few roles with a very large number of instances), we randomly sample nine training sets from WaSR XL with a different maximal number of training instances per role s, such that s = 5 · 2^i for i ∈ {0, 1, ..., 8}, i.e., s ranges from 5 to 1280. Fig. 2 shows the learning curves for precision on WaSR XL-*. It shows that distributional effects occur, i.e., that certain sample sizes s lead to higher precision for a test set than using the full corpus. The MASC test set particularly benefits from the sampling: combining FNFT-train with the best sample from the WaSR XL-set corpus (sampling 160 instances per role) results in the overall highest precision (0.738) and F1 (0.65).
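The per-role cap can be sketched directly from the sampling scheme above (the role_of accessor is a hypothetical stand-in for however role labels are stored on the instances):

```python
import random
from collections import defaultdict

def sample_per_role(instances, i, role_of=lambda inst: inst.role):
    """Cap the Zipf-distributed role counts at s = 5 * 2**i instances per
    role, i in {0, ..., 8}, so s ranges from 5 to 1280."""
    s = 5 * 2 ** i
    by_role = defaultdict(list)
    for inst in instances:
        by_role[role_of(inst)].append(inst)
    sample = []
    for insts in by_role.values():
        sample.extend(random.sample(insts, min(s, len(insts))))
    return sample
```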
5 German Experiments

To show that our method generalizes to other languages, we applied it to German data.

We used the SALSA2 corpus (Burchardt et al., 2006) as a source of German data with FrameNet-like labels. As SALSA2 does not provide additional lexical unit examples, we split the corpus into a training set S-train that is also used for the extraction of seed patterns, a development set S-dev, and a test set S-test. The proportion of train, development and test instances is 0.6, 0.2, 0.2; data statistics are shown in Table 7. The unlabeled corpus used is based on deWAC sections 1-5 (Baroni et al., 2009).

                    verbs  senses  roles  inst(s)  inst(r)
  S-test            390    684     1,045  3,414    8,010
  S-dev             390    678     1,071  3,516    8,139
  S-train           458    1,167   1,511  9,460    22,669
  WaS-de (t=0.07)   333    920     -      602,207  -
  WaSR-de-set       193    277     155    80,370   115,332
  WaSR-de-uni       172    241     210    51,241   57,822

Table 7: German dataset statistics on verbs.

VSD corpus and evaluation. The LLR used to generate more than 22,000 seed patterns consists of S-train and the German Wiktionary, based on the linking by Hartmann and Gurevych (2013). DeWAC is labeled based on those patterns, and the thresholds t and f are determined in a VSD task on S-dev using a subset of the corpus based on sections 1-3. The threshold t = 0.07 together with a discriminating filter of f = 0.07 results in the best precision, and t = 0.07 alone in the best F1 score. Therefore, we perform extrinsic evaluation in a VSD task on S-test with WaS-de (t=0.07) and on the combinations U-WaS-de (union with S-train) and B-WaS-de (backoff variant).

The results in Table 8 show that the performance of the WaS-de-based system is worse than the S-train-based one, but the backoff version reaches the best scores overall, indicating that our WaS-de corpora are complementary to S-train.

            P       R       F1
  WaS-de    0.672*  0.912*  0.773
  B-WaS-de  0.711   0.958*  0.816
  U-WaS-de  0.676*  0.961*  0.794
  S-train   0.707   0.946   0.809

Table 8: German VSD P, R, F1; * marks significant differences to S-train.

SRL corpus and evaluation. We adapt the rule-based VerbNet role labeling to German dependencies from the mate-tools parser (Seeker and Kuhn, 2012), and perform Steps 2A and 2B on WaS-de, resulting in WaSR-de-set/uni (see Table 7). We train our role classification system on these corpora in order to evaluate them extrinsically. Training on WaSR-de-uni results in a precision of 0.69 – better than for English, but still significantly lower than for the S-train system with 0.828. Recall is very low at 0.17. This is due to the low role coverage of the WaSR corpora shown in Table 7.

The evaluation shows that our approach can be applied to German. For VSD, the automatically labeled data can be used to improve on using S-train alone; improvements in precision are not significant, which has several potential causes, e.g., the smaller set of LLRs used for seed pattern extraction compared to English, and the smaller size of the resulting corpora. The smaller corpora also result in very low recall for the role classification.

Future work could be to extend the German dataset by adding additional resources to the LLR, for instance GermaNet (Hamp and Feldweg, 1997). Extending the SemLink mapping to frames unique to SALSA should additionally contribute to an improved role coverage.

6 Related Work

Relevant related work is research on (i) the automatic acquisition of sense-labeled data for verbs, (ii) the automatic acquisition of role-labeled data for FrameNet SRL, and (iii) approaches to FrameNet SRL using lexical resources and LLRs, including rule-based and knowledge-based approaches.

Automatic acquisition of sense-labeled data. Most previous work on automatically sense-labeling corpora for WSD focussed on nouns and WordNet as a sense inventory, e.g., Leacock et al. (1998), Mihalcea and Moldovan (1999), Martinez (2008), Duan and Yates (2010). In this section, we describe work that specifically considers verbs. Besides the already introduced work by Cholakov et al. (2014), which we extended by discriminating patterns and adapted to the FrameNet verb sense inventory, this includes work by Kübler and Zhekova (2009), who extract example sentences from several English dictionaries and various types of corpora, including web corpora. They use a Lesk-like algorithm to annotate target words in the extracted sentences with WordNet senses and use them as training data for WSD. They evaluate on an all-words task and do not find performance improvements when training on the automatically labeled data alone or on a combination of automatically labeled and gold data.

Automatic acquisition of role-labeled data. Previous work in the automatic acquisition of role-labeled data uses annotation projection methods, i.e., aligning a role-annotated sentence to a new sentence on the syntactic level and transferring the role annotations to the aligned words.

The goals of Fürstenau and Lapata (2012)'s work are most similar to ours. They perform annotation projection of FrameNet roles for English verbs. For this, they pair sentences in the British National Corpus with frame-annotated sentences, align their syntactic structures (including arguments), and project annotations to the new sentences. They simulate a "low-resource" scenario that only provides few training instances (called seed sentences) by varying the number of seed sentences and added labeled sentences. They use the automatically labeled data together with seed training data to train a supervised system and find improvements over self-training.

A main difference to our approach is that Fürstenau and Lapata (2012) do not use external information from LLRs or other lexical resources like WordNet. Like our approach, their approach creates a sparse labeling by a) discarding sentences that do not align well to their seeds, and b) discarding candidate pairs for which not all roles could be mapped. This leads to a high-precision approach that does not allow partially labeled data. Such an approach does have disadvantages, e.g., a potentially lower domain variability of the corpus, since they only label sentences very similar to the seed sentences.
Repeating their experiments for German, Fürstenau (2011) finds that the variety of the automatically annotated sentences decreases when a larger expansion corpus is used. In our approach, the ASP patterns generalize from the seed sentences (cf. section 3), leading us to assume that our knowledge-based approach could be more generous with respect to such variability; we already successfully evaluated it on four datasets from various domains, but would like to further confirm our assumption in a direct comparison.

Another approach to training data generation for PropBank-style semantic role labeling is described in Woodsend and Lapata (2014). Using comparable corpora, they extract rewrite rules to generate paraphrases of the original PropBank sentences. They use a model trained on PropBank as the seed corpus to filter out noise introduced by the rewrite rules. A model trained on the extended PropBank corpus outperforms the state-of-the-art system on the CoNLL-2009 dataset. Recently, Pavlick et al. (2015) presented a similar method to expand the FNFT corpus through automatic paraphrasing. Noise was filtered out using crowdsourcing, and the resulting frame-labeled corpus showed a lexical coverage three times as high as the original FNFT. However, they did not evaluate the augmented corpus as training data for semantic role classification.

FrameNet SRL using lexical resources. Similar to our approach of automatically creating role-labeled data in section 4, there are other rule-based approaches to FrameNet SRL that rely on FrameNet and other lexical resources (Shi and Mihalcea, 2004; Shi and Mihalcea, 2005). Both describe a rule-based system for FrameNet SRL that builds on the results of syntactic parsing for the rule-based assignment of semantic roles to syntactic constituents. The role assignment uses rules induced from the FrameNet fulltext corpus. These rules encode sentence-level features of syntactic realizations of frames; they are combined with word-level semantic features from WordNet, including the countability of nouns or attribute relations of an adjective indicating which nouns it can modify. Since the coverage of the induced rules is low, they are complemented by default rules.

The approach to SRL introduced by Litkowski (2010) uses a dictionary built from FrameNet fulltext annotations to recognize and assign semantic roles. Their system first performs frame disambiguation and then tries to match syntactic constituents produced by a parser with syntactic patterns included in the generated dictionary. Their system is evaluated on the SemEval-2 task for linking events and their participants in discourse. It shows very low recall, which is mainly due to the low coverage of their FrameNet dictionary with regard to syntactic patterns.

Our approach differs from previous rule-based approaches to SRL in that we do not use the rule-based system directly, but use it to create labeled training data for training a supervised system. This transductive semi-supervised learning setup should be able to deal better with the noise introduced by the rule-based system than the inductive rule-based approaches.

The work by Kshirsagar et al. (2015) uses lexical resources to enhance FrameNet SRL. They also use the FrameNet sense examples and SemLink, but in a completely different manner.
Regarding the sense examples, they employ domain adaptation techniques to augment the feature space extracted from the FrameNet training set with features from the sense examples, thereby increasing role labeling F1 by 3% compared to the baseline system SEMAFOR. We use the FrameNet example sentences only indirectly: as seed sentences for the frame label transfer (cf. Step 1), they provide distant supervision for the automatic frame labeling. Our approach is complementary to the one by Kshirsagar et al. (2015), who use the sense examples for role labeling.

Kshirsagar et al. (2015) only briefly report on their experiments using SemLink. They used the translation of PropBank labels to FrameNet in the SemLink corpus as additional training data, but found that this strategy hurt role labeling performance. They credit this to the low coverage and errors in SemLink, which might be amplified by the use of a transitive linking (from PropBank to FrameNet via VerbNet). In this work, we successfully employ SemLink: we use the VerbNet-FrameNet (sense- and role-level) linking from SemLink in our role label transfer approach (Step 2). The resulting automatically role-labeled training data improve role classification in combination with the FNFT-train set (cf. section 4.3). We assume that the large-scale generation of training data smoothes over the noise resulting from errors in the SemLink mapping.

Kshirsagar et al. (2015) additionally use features from PropBank SRL as guide features and exploit the FrameNet hierarchy to augment the feature space, a method complementary to our approach. Their best results combine the use of example sentences and the FrameNet hierarchy for feature augmentation. They only evaluate on the FNFT-test set, as has become standard for FrameNet SRL evaluation. Our distantly supervised corpus might be useful for domain adaptation to other datasets, as our role classification evaluation shows.

According to our above analysis, our strategy is complementary to the approach by Kshirsagar et al. (2015). It would be interesting to evaluate to what degree our automatically labeled corpus would benefit their system.

7 Relation to FrameNet SRL

In this section, we discuss the potential impact of our work on state-of-the-art FrameNet SRL.

Our experimental setup evaluates frame disambiguation and role classification separately, which is a somewhat artificial setup. We show that our automatically generated training data are of high quality and contribute to improved classification performance. This section motivates that the data can also be useful in a state-of-the-art SRL setting.

For a long time, the SEMAFOR system has been the state-of-the-art FrameNet SRL system (Das et al., 2010; Das et al., 2014). Recently, systems were introduced that use new ways of generating training features and neural-network based representation learning strategies. We already introduced Kshirsagar et al. (2015). Hermann et al. (2014) use distributed representations for frame disambiguation. Others integrate features based on document-level context into a new open-source SRL system, Framat++ (Roth and Lapata, 2015), or present an efficient dynamic program formalization for FrameNet role labeling (Täckström et al., 2015). They all report improvements on SEMAFOR results for full FrameNet SRL.

Hermann et al. (2014) report state-of-the-art results for FrameNet frame disambiguation.
Their approach is based on distributed representations of frame instances and their arguments (embeddings) and performs frame disambiguation by mapping a new instance to the embedding space and assigning the closest frame label (conditioned on the lemma for seen predicates). They report that they improve frame identification accuracy over SEMAFOR by 4% for ambiguous instances in the FNFT-test set, up to 73.39% accuracy. They also improve over the SEMAFOR system for full SRL, reporting an F1 of 68.69% compared to 64.54% from Das et al. (2014). Our frame disambiguation results are not directly comparable to their results. We also evaluate on ambiguous instances, but only on verbal predicates, which are typically more polysemous than nouns and adjectives and more difficult to disambiguate.

The currently best-performing FrameNet SRL system is the one presented by FitzGerald et al. (2015). They present a multitask learning setup for semantic role labeling which they evaluate for PropBank and FrameNet SRL. The setup is based on a specifically designed neural network model that embeds input and output data in a shared, dense vector space. Using the frame identification model from Hermann et al. (2014), their results significantly improve on the previous state of the art for full FrameNet SRL, reaching an F1 of 70.9% on FNFT-test – but only when training the model jointly on FrameNet training data and PropBank-labeled data in a multitask setup.

FitzGerald et al. (2015) report that the performance of their system on FrameNet test data suffers from the small training set available – training only on FrameNet training data yields results similar to Täckström et al. (2015). The joint training setup does not benefit PropBank SRL due to the small size of the FrameNet training set in comparison to the PropBank data. This shows that additional training data for FrameNet, for instance our automatically labeled corpora, could also benefit a state-of-the-art system. An explicit evaluation of this assumption or comparison to this system is left to future work.

Based on the discussion above, and on the frame and role classification experiments evaluated on four test sets, we expect that the data we generate with our method are complementary to the standard FrameNet training data and can be used to enhance state-of-the-art SRL systems. We leave empirical evaluation of this claim to future work. By publishing our automatically labeled corpora for research purposes, we support efforts by other researchers to analyze them and integrate them into their systems.

8 Discussion and Outlook

The evaluation shows that our purely knowledge-based approach for automatic label transfer results in high-quality training data for English SRL that is complementary to the FNFT corpus.

For VSD, our data lead to precision similar to a standard supervised setup, but at higher recall. Learning curves indicate that with an even larger corpus we may be able to further improve precision. For role classification, the sparse labeling leads to a low role recall, but high precision is achieved for the covered roles. One cause of the sparse labeling is the incomplete mapping between VerbNet and FrameNet roles in SemLink; in future work we would like to extend the SemLink mapping automatically to enhance the coverage of our method, and to disambiguate ambiguous labels to further increase precision.
8 Discussion and Outlook

The evaluation shows that our purely knowledge-based approach to automatic label transfer results in high-quality training data for English SRL that are complementary to the FNFT corpus.

For VSD, our data lead to precision similar to a standard supervised setup, but at higher recall. Learning curves indicate that with an even larger corpus we may be able to further improve precision. For role classification, the sparse labeling leads to low role recall, but high precision is achieved for the covered roles. One cause of the sparse labeling is the incomplete mapping between VerbNet and FrameNet roles in SemLink; in future work, we would like to extend the SemLink mapping automatically to enhance the coverage of our method, and to disambiguate ambiguous labels to further increase precision.

As a knowledge-based approach, our method is particularly well-suited for languages and domains for which role-labeled corpora are lacking, but LLRs are available or can be created automatically. We therefore applied our approach to German data; the resulting sense-labeled corpus is complementary to the training data from SALSA. The role classification results should improve with a larger corpus.

State-of-the-art SRL systems still rely on supervised training, even when advanced methods such as deep learning are used. In section 7, we discussed in detail how our method relates to and complements the most recent developments in FrameNet SRL. It would be interesting to evaluate the benefits that our automatically labeled data can add to an advanced SRL system. We expect particularly strong benefits in the context of domain adaptation: currently, FrameNet SRL systems are only evaluated on in-domain test data.

Our method can be adapted to other sense and role inventories covered by LLRs (e.g., VerbNet and PropBank) and to related approaches to SRL and semantic parsing (e.g., QA-SRL (He et al., 2015)); the latter requires a mapping of the role inventory to a suitable LLR, for instance mapping the role labels in QA-SRL to SemLink. We would also like to compare our approach to other methods for training data generation, for instance methods based on alignments (Fürstenau and Lapata, 2012) or on paraphrasing (Woodsend and Lapata, 2014).

9 Conclusion

We presented a novel approach to automatically generate training data for FrameNet SRL. It follows the distant supervision paradigm and performs knowledge-based label transfer from rich external knowledge sources to large-scale corpora without relying on manually labeled corpora.

By transferring labels to a large, diverse web corpus (ukWaC), the potential of our approach for generating data for different domains becomes apparent. By applying it to German data, we showed that our approach is applicable across languages. As a further result of our work, we publish the automatically labeled corpora and release our implementation for knowledge-based role labeling (cf. Step 2A in section 4) as open source software.

Automatic label transfer using linked resources has become popular in relation extraction (Mintz et al., 2009) and has been applied to VSD (Cholakov et al., 2014), but not to SRL. In this work, we showed that knowledge-based label transfer from LLRs to large-scale corpora also offers great opportunities for complex semantic tasks like SRL.

Acknowledgments

This work has been supported by the German Research Foundation under grant No. GU 798/9-1, grant No. GU 798/17-1, and grant No. GRK 1994/1. We thank the action editors and anonymous reviewers for their thoughtful comments. Additional thanks go to Nancy Ide and Collin Baker for providing the MASC dataset.

References

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language Resources and Evaluation, 43(3):209–226.

Jonathan Berant, Vivek Srikumar, Pei-Chun Chen, Abby Vander Linden, Brittany Harding, Brad Huang, Peter Clark, and Christopher D. Manning. 2014. Modeling Biological Processes for Reading Comprehension. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1499–1510, Doha, Qatar.

Claire Bonial, Kevin Stowe, and Martha Palmer. 2013. Renewing and Revising SemLink.
In Proceedings of the 2nd Workshop on Linked Data in Linguistics (LDL-2013): Representing and Linking Lexicons, Terminologies and Other Language Data, pages 9–17, Pisa, Italy.

Aljoscha Burchardt and Marco Pennacchiotti. 2008. FATE: a FrameNet-Annotated Corpus for Textual Entailment. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), pages 539–546, Marrakech, Morocco.

Aljoscha Burchardt, Kathrin Erk, Anette Frank, Andrea Kowalski, Sebastian Padó, and Manfred Pinkal. 2006. The SALSA Corpus: a German Corpus Resource for Lexical Semantics. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), pages 969–974, Genoa, Italy.

Kostadin Cholakov, Judith Eckle-Kohler, and Iryna Gurevych. 2014. Automated Verb Sense Labelling Based on Linked Lexical Resources. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014), pages 68–77, Gothenburg, Sweden.

Dipanjan Das and Noah A. Smith. 2011. Semi-Supervised Frame-Semantic Parsing for Unknown Predicates. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1435–1444, Portland, Oregon, USA.

Dipanjan Das, Nathan Schneider, Desai Chen, and Noah A. Smith. 2010. Probabilistic Frame-Semantic Parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 948–956, Los Angeles, CA, USA.

Dipanjan Das, Desai Chen, André F. T. Martins, Nathan Schneider, and Noah A. Smith. 2014. Frame-Semantic Parsing. Computational Linguistics, 40(1):9–56.

Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating Typed Dependency Parses from Phrase Structure Parses. In Proceedings of the 5th Edition of the International Conference on Language Resources and Evaluation, pages 449–454, Genoa, Italy.

Weisi Duan and Alexander Yates. 2010. Extracting Glosses to Disambiguate Word Senses. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 627–635, Los Angeles, CA, USA.

Richard Eckart de Castilho and Iryna Gurevych. 2014. A Broad-Coverage Collection of Portable NLP Components for Building Shareable Analysis Pipelines. In Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT (OIAF4HLT) at COLING 2014, pages 1–11, Dublin, Ireland.

Parvin Sadat Feizabadi and Sebastian Padó. 2014. Crowdsourcing Annotation of Non-Local Semantic Roles. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), pages 226–230, Gothenburg, Sweden.

Charles J. Fillmore. 1982. Frame Semantics. In Linguistics in the Morning Calm, pages 111–137. Hanshin Publishing Company, Seoul, South Korea.

Nicholas FitzGerald, Oscar Täckström, Kuzman Ganchev, and Dipanjan Das. 2015. Semantic Role Labeling with Neural Network Factors. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 960–970, Lisbon, Portugal.

Marco Fossati, Claudio Giuliano, and Sara Tonelli. 2013. Outsourcing FrameNet to the Crowd.
In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 742–747, Sofia, Bulgaria.

Hagen Fürstenau and Mirella Lapata. 2012. Semi-Supervised Semantic Role Labeling via Structural Alignment. Computational Linguistics, 38(1):135–171.

Hagen Fürstenau. 2011. Semi-Supervised Semantic Role Labeling via Graph Alignment, volume 32 of Saarbrücken Dissertations in Computational Linguistics and Language Technology. German Research Center for Artificial Intelligence and Saarland University, Saarbrücken, Germany.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1):10–18.

Birgit Hamp and Helmut Feldweg. 1997. GermaNet - a Lexical-Semantic Net for German. In Proceedings of the ACL Workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 9–15, Madrid, Spain.

Patrick Hanks. 2013. Lexical Analysis: Norms and Exploitations. MIT Press, Cambridge, MA, USA.

Silvana Hartmann and Iryna Gurevych. 2013. FrameNet on the Way to Babel: Creating a Bilingual FrameNet Using Wiktionary as Interlingual Connection. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), pages 1363–1373, Sofia, Bulgaria.

Kazi Saidul Hasan and Vincent Ng. 2014. Why are You Taking this Stance? Identifying and Classifying Reasons in Ideological Debates. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 751–762, Doha, Qatar.

Luheng He, Mike Lewis, and Luke Zettlemoyer. 2015. Question-Answer Driven Semantic Role Labeling: Using Natural Language to Annotate Natural Language. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 643–653, Lisbon, Portugal.

Karl Moritz Hermann, Dipanjan Das, Jason Weston, and Kuzman Ganchev. 2014. Semantic Frame Identification with Distributed Word Representations. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1448–1458, Baltimore, Maryland, USA.

Meghana Kshirsagar, Sam Thomson, Nathan Schneider, Jaime Carbonell, Noah A. Smith, and Chris Dyer. 2015. Frame-Semantic Role Labeling with Heterogeneous Annotations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 218–224, Beijing, China.

Sandra Kübler and Desislava Zhekova. 2009. Semi-Supervised Learning for Word Sense Disambiguation: Quality vs. Quantity. In Proceedings of the International Conference RANLP-2009, pages 197–202, Borovets, Bulgaria.

Claudia Leacock, George A. Miller, and Martin Chodorow. 1998. Using Corpus Statistics and WordNet Relations for Sense Identification. Computational Linguistics, 24(1):147–165.

Ken Litkowski. 2010. CLR: Linking Events and Their Participants in Discourse Using a Comprehensive FrameNet Dictionary. In Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval '10), pages 300–303, Los Angeles, CA, USA.

David Martinez. 2008. On the Use of Automatically Acquired Examples for All-Nouns Word Sense Disambiguation. Journal of Artificial Intelligence Research, 33:79–107.

Rada Mihalcea and Dan Moldovan. 1999. An Automatic Method for Generating Sense Tagged Corpora.
In Proceedings of the American Association for Artificial Intelligence (AAAI 1999), Orlando, Florida, USA.

Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant Supervision for Relation Extraction without Labeled Data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1003–1011, Suntec, Singapore.

Rebecca J. Passonneau, Collin F. Baker, Christiane Fellbaum, and Nancy Ide. 2012. The MASC Word Sense Corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3025–3030, Istanbul, Turkey.

Ellie Pavlick, Juri Ganitkevitch, Tsz Ping Chan, Xuchen Yao, Benjamin Van Durme, and Chris Callison-Burch. 2015. Domain-Specific Paraphrase Extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 57–62, Beijing, China.

Octavian Popescu, Martha Palmer, and Patrick Hanks. 2014. Mapping CPA Patterns onto OntoNotes Senses. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 882–889, Reykjavik, Iceland.

Michael Roth and Mirella Lapata. 2015. Context-Aware Frame-Semantic Role Labeling. Transactions of the Association for Computational Linguistics, 3:449–460.

Josef Ruppenhofer, Caroline Sporleder, Roser Morante, Collin Baker, and Martha Palmer. 2010. SemEval-2010 Task 10: Linking Events and Their Participants in Discourse. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 45–50, Uppsala, Sweden.

Wolfgang Seeker and Jonas Kuhn. 2012. Making Ellipses Explicit in Dependency Conversion for a German Treebank. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3132–3139, Istanbul, Turkey.

Lei Shi and Rada Mihalcea. 2004. Open Text Semantic Parsing Using FrameNet and WordNet. In Demonstration Papers at HLT-NAACL 2004, pages 19–22, Stroudsburg, PA, USA.

Lei Shi and Rada Mihalcea. 2005. Putting Pieces Together: Combining FrameNet, VerbNet and WordNet for Robust Semantic Parsing. In Computational Linguistics and Intelligent Text Processing, pages 100–111. Springer Berlin Heidelberg.

Oscar Täckström, Kuzman Ganchev, and Dipanjan Das. 2015. Efficient Inference and Structured Learning for Semantic Role Labeling. Transactions of the Association for Computational Linguistics, 3:29–41.

Kristian Woodsend and Mirella Lapata. 2014. Text Rewriting Improves Semantic Role Labeling. Journal of Artificial Intelligence Research, 51:133–164.

David Yarowsky. 1995. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196, Cambridge, Massachusetts, USA.

Beñat Zapirain, Eneko Agirre, Lluís Màrquez, and Mihai Surdeanu. 2013. Selectional Preferences for Semantic Role Classification. Computational Linguistics, 39(3):631–663.