Imitation Learning of Agenda-based Semantic Parsers

Jonathan Berant, Stanford University, yonatan@cs.stanford.edu
Percy Liang, Stanford University, pliang@cs.stanford.edu

Abstract

Semantic parsers conventionally construct logical forms bottom-up in a fixed order, resulting in the generation of many extraneous partial logical forms. In this paper, we combine ideas from imitation learning and agenda-based parsing to train a semantic parser that searches partial logical forms in a more strategic order. Empirically, our parser reduces the number of constructed partial logical forms by an order of magnitude, and obtains a 6x-9x speedup over fixed-order parsing, while maintaining comparable accuracy.

1 Introduction

Semantic parsing, the task of mapping natural language to semantic representations (e.g., logical forms), has emerged in recent years as a promising paradigm for developing question answering systems (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Wong and Mooney, 2007; Kwiatkowski et al., 2010; Liang et al., 2011) and other natural language interfaces (Zettlemoyer and Collins, 2007; Tellex et al., 2011; Matuszek et al., 2012). Recently, there have been two major trends: The first is to scale semantic parsing to large knowledge bases (KB) such as Freebase (Cai and Yates, 2013; Kwiatkowski et al., 2013; Berant and Liang, 2014). The second is to learn semantic parsers without relying on annotated logical forms, but instead on their denotations (answers) (Clarke et al., 2010; Liang et al., 2011); this lessens the annotation burden and has been instrumental in fueling the first trend (Berant et al., 2013).

Figure 1: A parsing chart for the utterance "what city was abraham lincoln born in". Numbers in chart cells indicate the number of possible semantic parses constructed over that span, and arrows point to some of the logical forms that were constructed (e.g., Type.City ⊓ PlaceOfBirthOf.AbeLincoln, AbeLincoln, LincolnTown, USSLincoln). There are more than one million possible semantic parses for this utterance.

In this paper, we are interested in training semantic parsers from denotations on large KBs. The challenge in this setting is that the vocabulary of the target logical language often contains thousands of logical predicates, and there is a mismatch between the structure of the natural language and the logical language. As a result, the space of possible semantic parses for even a short utterance grows quickly. For example, consider the utterance "what city was abraham lincoln born in". Figure 1 illustrates the number of possible semantic parses that can be constructed over some of the utterance spans. Just by combining semantic parses over the spans "city", "lincoln" and "born" we already obtain 362·391·20 possible parses; at the root, we get over a million parses.1 The ambiguity of language thus results in a hard search problem.

1 Even when type constraints are used to prune parses, we still produce more than a million possible parses at the root.

Figure 2: An example semantic parse, or derivation, for the utterance "what city was abraham lincoln born in". Each node in the tree has a category (e.g., ENTITY) and a logical form (e.g., AbeLincoln).
To manage this combinatorial explosion, past approaches (Krishnamurthy and Mitchell, 2012; Kwiatkowski et al., 2013; Berant et al., 2013) used beam search, where the number of parses (see Figure 2) for each chart cell (e.g., (SET, 3:5)) is capped at K. Typical bottom-up parsing is employed, where we build all parses for spans of length n before spans of length n+1, etc. This fixed-order parsing strategy constructs many unnecessary parses, though. For example, it would create K parses for the category ENTITY and the span over "lincoln", generating the logical form USSLincoln, although it is unlikely that this entity would be in the final logical form.

To overcome the problems with fixed-order parsing, we turn to agenda-based parsing (Kay, 1986; Caraballo and Charniak, 1998; Klein and Manning, 2003; Pauls and Klein, 2009; Auli and Lopez, 2011). In agenda-based parsing, an agenda (priority queue) holds partial parses that can be constructed next. At each step, the parse with the highest priority is popped from the agenda and put into the chart. This gives the parser full control over the sequence of parses constructed. But importantly, agenda-based parsing requires a good scoring function that can rank not just full parses but also partial parses on the agenda.

How do we obtain such a scoring function? To this end, we borrow ideas from imitation learning for structured prediction (Daume et al., 2009; Ross et al., 2011; Goldberg and Nivre, 2013; Chang et al., 2015). Specifically, we cast agenda-based semantic parsing as a Markov decision process, where the goal is to learn a policy that, given a state (i.e., the current chart and agenda), chooses the best next action (i.e., the parse to pop from the agenda). The supervision signal is used to generate a sequence of oracle actions, from which the model is trained.

Our work bears a strong resemblance to Jiang et al. (2012), who applied imitation learning to agenda-based parsing, but in the context of syntactic parsing. However, two new challenges arise in semantic parsing. First, syntactic parsing assumes gold parses, from which it is easy to derive an oracle action sequence. In contrast, we train from question-answer pairs only (rather than parse trees or even logical forms), so generating an oracle sequence is more challenging. Second, semantic parsers explore a much larger search space than syntactic parsers, due to the high level of uncertainty when translating to logical form. Thus, we hold a beam of parses for each chart cell, and modify learning for this setup.

We gain further efficiency by introducing a lazy agenda, which reduces the number of parses that need to be scored. For example, the single action of processing "born" requires placing 391 logical forms on the agenda, although only a few of them will be used. Our lazy agenda holds derivation streams, which implicitly represent a (possibly infinite!) group of related parses as a single agenda item, and lazily materialize parses as needed. Empirically, this reduces the number of parses that are scored at training time by 35%.

Last, we make modeling contributions by augmenting the feature set presented by Berant et al. (2013) with new features that improve the mapping of phrases to KB predicates.
We evaluate our agenda-based parser on the WEBQUESTIONS dataset (Berant et al., 2013) against a fixed-order parser, and observe that our parser reduces the number of parsing actions by an order of magnitude, achieves a 6x-9x speedup, and obtains a comparable accuracy of 49.7%.

To conclude, this paper describes three contributions: First, a novel agenda-based semantic parser that learns to choose good parsing actions, training from question-answer pairs only; Second, a lazy agenda that packs parses in streams and reduces the number of generated parses; Last, modeling changes that substantially improve accuracy.

2 Semantic Parsing Task

While our agenda-based semantic parser applies more broadly, our exposition will be based on our primary motivation, question answering on a knowledge base. The semantic parsing task is defined as follows: Given (i) a knowledge base (KB) $\mathcal{K}$, (ii) a grammar $G$ (defined shortly), and (iii) a training set of question-answer pairs $\{(x_i, y_i)\}_{i=1}^{n}$, output a semantic parser that maps new questions $x$ to answers $y$ via latent logical forms $z$.

We now briefly describe the KB and logical forms used in this paper. Let $\mathcal{E}$ denote a set of entities (e.g., AbeLincoln), and let $\mathcal{P}$ denote a set of properties (e.g., PlaceOfBirthOf). A knowledge base $\mathcal{K}$ is a set of assertions $(e_1, p, e_2) \in \mathcal{E} \times \mathcal{P} \times \mathcal{E}$ (e.g., (Hodgenville, PlaceOfBirthOf, AbeLincoln)). We use the Freebase KB (Google, 2013), which has 41M entities, 19K properties, and 596M assertions.

To query the KB, we use the logical language simple λ-DCS. In simple λ-DCS, an entity (e.g., AbeLincoln) denotes the singleton set containing that entity; this is a special case of a unary predicate. A property (a special case of a binary predicate) can be joined with a unary predicate; e.g., PlaceOfBirthOf.AbeLincoln denotes all entities that are the place of birth of Abraham Lincoln. We also have intersection: Type.City ⊓ PlaceOfBirthOf.AbeLincoln denotes the set of entities that are both cities and the place of birth of Abraham Lincoln. We write $[\![z]\!]_{\mathcal{K}}$ for the denotation of a logical form $z$ with respect to a KB $\mathcal{K}$. For a formal description of λ-DCS, see Liang (2013).

3 Grammars and Semantic Functions

Since we are learning semantic parsers from denotations, we cannot induce a grammar from provided logical forms (Kwiatkowski et al., 2010). Instead, we assume a small and flexible grammar that specifies the space of logical forms. The grammar consists of a backbone CFG, but is atypical in that each rule is augmented with a semantic (composition) function that produces a varying number of derivations using arbitrary context. This flexibility provides procedural control over the generation of logical forms.

Formally, a grammar is a tuple $(\mathcal{V}, \mathcal{N}, \mathcal{R})$, where $\mathcal{V}$ is a set of terminals (words), $\mathcal{N}$ is a set of categories (such as BINARY, ENTITY, SET and ROOT in Figure 2, where ROOT is the root category), and $\mathcal{R}$ is a rule set of binary and unary rules, explained below. A binary rule $r \in \mathcal{R}$ has the form $A \rightarrow B\ C\ [f]$, where $A \in \mathcal{N}$ is the left-hand-side, $B\ C \in \mathcal{N}^2$ is the right-hand-side (RHS), and $f$ is a semantic function, explained below.

Given an utterance $x$, the grammar defines a set of derivations (semantic parse trees) over every span $x_{i:j} = (w_i, w_{i+1}, \ldots, w_{j-1})$. Define $\mathcal{D}$ to be the set of all derivations, and let $d^A_{i:j}$ be a derivation over the span $x_{i:j}$ of category $A$.
Given the derivations $d^B_{i:k}$ and $d^C_{k:j}$ and the rule $r = A \rightarrow B\ C\ [f]$, the semantic function $f : \mathcal{D} \times \mathcal{D} \rightarrow 2^{\mathcal{D}}$ produces a set of derivations $f(d^B_{i:k}, d^C_{k:j})$ over $x_{i:j}$ with category $A$. In words, the semantic function takes two child derivations as input and produces a set of candidate output derivations. For each output derivation $d$, let $d.r$ be the rule used (SET → ENTITY BINARY [JOIN]) and $d.z$ be the logical form constructed by $f$, usually created by combining the logical forms of the child derivations (PlaceOfBirthOf.AbeLincoln). This completes our description of binary rules; unary rules $A \rightarrow B\ [f]$ and lexical rules $A \rightarrow w\ [f]$ are handled similarly, where $w \in \mathcal{V}^+$ is a sequence of terminals.

Figure 3 demonstrates the flexibility of semantic functions. The JOIN semantic function takes a derivation whose logical form is a binary predicate, and a derivation whose logical form is a unary predicate, and performs a join operation. LEX takes a derivation representing a phrase and outputs many candidate derivations. INTERSECT takes two derivations and attempts to intersect their logical forms (as defined in Section 2). In this specific case, no output derivations are produced because the KB types for Type.City and ReleaseDateOf.LincolnFilm do not match.

In contrast with CFG rules for syntactic parsing, rules with semantic functions generate sets of derivations rather than a single derivation. We allow semantic functions to perform arbitrary operations on the child derivations, and to access external resources such as the Freebase search API and the KB. In practice, our grammar employs 11 semantic functions; in addition to JOIN, LEX, and INTERSECT, we use BRIDGE, which implements the bridging operation (see Section 8) from Berant et al. (2013), as well as ones that recognize dates and filter derivations based on part-of-speech tags, named entity tags, etc.

Figure 3: A semantic function (we show JOIN, LEX and INTERSECT) takes one or two child derivations and returns a set of possible derivations. For example, JOIN applied to the ENTITY derivation AbeLincoln ("abraham lincoln") and the BINARY derivation PlaceOfBirthOf ("born") returns the SET derivation PlaceOfBirthOf.AbeLincoln; LEX applied to "lincoln" returns many candidate entities (AbeLincoln, LincolnFilm, ...); INTERSECT applied to Type.City and ReleaseDateOf.LincolnFilm returns the empty set.
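As a concrete (and much simplified) illustration of a semantic function, the following Python sketch shows a JOIN-style function that maps two child derivations to a set of candidate output derivations; the Derivation class, the string-based logical forms, and the category names are our own simplifications for illustration, not the grammar formalism used by the actual system.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Derivation:
    """A node in a semantic parse tree (simplified)."""
    category: str           # e.g., "ENTITY", "BINARY", "SET", "ROOT"
    span: Tuple[int, int]   # span (i, j) over the utterance
    logical_form: str       # e.g., "PlaceOfBirthOf.AbeLincoln"
    rule: str = ""          # the grammar rule that produced this derivation
    children: List["Derivation"] = field(default_factory=list)

def join(left: Derivation, right: Derivation) -> List[Derivation]:
    """JOIN semantic function: combine an ENTITY child (a unary) with a
    BINARY child into a SET derivation, as in Figure 3. It returns a set
    (here, a list) of candidate derivations, which may be empty."""
    if left.category != "ENTITY" or right.category != "BINARY":
        return []  # arbitrary checks are allowed; no derivation is produced
    lf = f"{right.logical_form}.{left.logical_form}"
    return [Derivation(category="SET",
                       span=(left.span[0], right.span[1]),
                       logical_form=lf,
                       rule="SET -> ENTITY BINARY [JOIN]",
                       children=[left, right])]

# Example mirroring Figure 3:
abe = Derivation("ENTITY", (3, 5), "AbeLincoln")
born = Derivation("BINARY", (5, 6), "PlaceOfBirthOf")
print([d.logical_form for d in join(abe, born)])
# -> ['PlaceOfBirthOf.AbeLincoln']
```

A function like LEX would instead ignore its child's logical form and return one derivation per matching lexicon entry, which is what makes the number of output derivations per rule application vary so widely.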
4 Fixed-order Parsing

We now describe fixed-order parsing with beam search, which has been the common practice in past work (Krishnamurthy and Mitchell, 2012; Kwiatkowski et al., 2013; Berant et al., 2013). Let $x$ be the input utterance. We call derivations $d^{\text{ROOT}}_{0:|x|}$, spanning the utterance $x$ and with the root category, root derivations, and all other derivations partial derivations. Given a scoring function $s : \mathcal{D} \rightarrow \mathbb{R}$, a bottom-up fixed-order parser iterates over spans $x_{i:j}$ of increasing length $n$ and categories $A \in \mathcal{N}$, and uses the grammar to generate derivations based on derivations of subspans. We use beam search, in which for every span $x_{i:j}$ and every category $A$ we keep a beam that stores up to $K$ derivations in a chart cell $H^A_{i:j}$ (where different derivations usually correspond to different logical forms). We denote by $H$ the set of derivations in any chart cell.

A fixed-order parser is guaranteed to compute the $K$ highest-scoring derivations if the following two conditions hold: (i) all semantic functions return exactly one derivation, and (ii) the scoring function decomposes, that is, there is a function $s_{\text{rule}} : \mathcal{R} \rightarrow \mathbb{R}$ such that for every rule $r = A \rightarrow B\ C\ [f]$, the score of a derivation produced by the rule is $s(d^A_{i:j}) = s(d^B_{i:k}) + s(d^C_{k:j}) + s_{\text{rule}}(r)$. Unfortunately, the two conditions generally do not hold in semantic parsing. For example, the INTERSECT function returns an empty set when type-checking fails, violating condition (i), and the scoring function $s$ often depends on the denotation size of the constructed logical form, violating condition (ii). In general, we want the flexibility of having the scoring function depend on the logical forms and sub-derivations, and therefore we will not be concerned with exactness in this paper. Note that we could augment the categories $\mathcal{N}$ with the logical form, but this would increase the number of categories exponentially.

Model. We focus on linear scoring functions: $s(d) = \phi(d)^\top \theta$, where $\phi(d) \in \mathbb{R}^F$ is the feature vector and $\theta \in \mathbb{R}^F$ is the parameter vector to be learned. Given any set of derivations $D \subseteq \mathcal{D}$, we can define the corresponding log-linear distribution:

$$p_\theta(d \mid D) = \frac{\exp\{\phi(d)^\top \theta\}}{\sum_{d' \in D} \exp\{\phi(d')^\top \theta\}}. \quad (1)$$

Learning. The training data consists of a set of utterance-denotation (question-answer) pairs $\{(x_i, y_i)\}_{i=1}^{n}$. To learn $\theta$, we use an online learning algorithm, where on each $(x_i, y_i)$, we use beam search based on the current parameters to construct a set of root derivations $D_i = H^{\text{ROOT}}_{0:|x|}$, and then take a gradient step on the following objective:

$$O_i(\theta) = \log p(y_i \mid x_i) \quad (2)$$
$$= \log \sum_{d \in D_i} p_\theta(d \mid D_i)\, R(d) + \lambda \|\theta\|_1, \quad (3)$$

where $R(d) \in [0, 1]$ is a reward function that measures the compatibility of the predicted denotation $[\![d.z]\!]_{\mathcal{K}}$ and the true denotation $y_i$.2 We marginalize over latent derivations, which are weighted by their compatibility with the observed denotation $y_i$.

2 $[\![d.z]\!]_{\mathcal{K}}$ and $y_i$ are both sets of entities, so $R$ is the F1 score.

The main drawback of fixed-order parsing is that to obtain the $K$ root derivations $D_i$, the parser must first construct $K$ derivations for all spans and all categories, many of which will not make it into any root derivation $d \in D_i$. Next, we describe agenda-based parsing, whose goal is to give the parser better control over the constructed derivations.
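For concreteness, here is a small numpy sketch (ours, not the system's code) of the F1 reward from footnote 2 and of the gradient of the unregularized part of Equation (3); the array layout and function names are assumptions made for illustration.

```python
import numpy as np

def f1_reward(predicted: set, gold: set) -> float:
    """R(d): F1 between the predicted denotation and the true answer set."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def loglinear(features: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """p_theta(d | D) from Equation (1): softmax of phi(d)^T theta over D."""
    scores = features @ theta
    scores -= scores.max()              # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

def objective_gradient(features: np.ndarray, rewards: np.ndarray,
                       theta: np.ndarray) -> np.ndarray:
    """Gradient of log sum_d p_theta(d | D) R(d) over the K root derivations
    (the unregularized part of Equation (3)): E_q[phi] - E_p[phi], where
    q(d) is proportional to p_theta(d) R(d)."""
    p = loglinear(features, theta)
    if (p * rewards).sum() == 0.0:
        return np.zeros_like(theta)     # no root derivation has any reward
    q = p * rewards / (p * rewards).sum()
    return features.T @ q - features.T @ p

# Toy usage: 3 root derivations with 4 features each.
phi = np.array([[1., 0., 1., 0.], [0., 1., 0., 1.], [1., 1., 0., 0.]])
R = np.array([f1_reward({"Hodgenville"}, {"Hodgenville"}), 0.0, 0.5])
theta = np.zeros(4)
theta += 0.1 * objective_gradient(phi, R, theta)   # one (unregularized) step
```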
Figure 4: A schematic illustration of executing a parsing action, specified by a derivation on the agenda. First, we remove the derivation from the agenda and put it in the chart. Then, we combine it with other chart derivations to produce new derivations, which are added back to the agenda.

Algorithm 1 Agenda-based parsing
1: procedure PARSE(x)
2:   INITAGENDA()
3:   while $|Q| > 0 \wedge |H^{\text{ROOT}}_{0:|x|}| < K$ do
4:     $d^A_{i:j}$ ← choose derivation from $Q$
5:     EXECUTEACTION($d^A_{i:j}$)
6:   choose and return derivation from $H^{\text{ROOT}}_{0:|x|}$
7: function EXECUTEACTION($d^A_{i:j}$)
8:   remove $d^A_{i:j}$ from $Q$
9:   if $|H^A_{i:j}| < K$ then
10:     $H^A_{i:j}$.add($d^A_{i:j}$)
11:     COMBINE($d^A_{i:j}$)
12: function COMBINE($d^A_{i:j}$)
13:   for $k > j$ and $r = B \rightarrow A\ C\ [f] \in \mathcal{R}$ do
14:     for $d^C_{j:k} \in H^C_{j:k}$ do
15:       $Q$.addAll($f(d^A_{i:j}, d^C_{j:k})$)
16:   for $k < i$ and $r = B \rightarrow C\ A\ [f] \in \mathcal{R}$ do
17:     for $d^C_{k:i} \in H^C_{k:i}$ do
18:       $Q$.addAll($f(d^C_{k:i}, d^A_{i:j})$)
19: function INITAGENDA()
20:   for $A \rightarrow x_{i:j}\ [f] \in \mathcal{R}$ do
21:     $Q$.addAll($f(x_{i:j})$)

5 Agenda-based Parsing

The idea of using an agenda for parsing has a long history (Kay, 1986; Caraballo and Charniak, 1998; Pauls and Klein, 2009). An agenda-based parser controls the order in which derivations are constructed using an agenda $Q$, which contains a set of derivations to be processed. At each point in time the state of the parser consists of two sets of derivations, the chart $H$ and the agenda $Q$. Each parsing action chooses a derivation from the agenda, moves it to the chart, combines it with other chart derivations, and adds new derivations to the agenda (Figure 4).

Algorithm 1 describes agenda-based parsing. The algorithm shows binary rules; unary rules are treated similarly. First, we initialize the agenda by applying all rules whose RHS has only terminals, adding the resulting derivations to the agenda. Then, we perform parsing actions until either the agenda is empty or we obtain $K$ root derivations. On each action, we first choose a derivation $d^A_{i:j}$ to remove from $Q$ and add it to $H^A_{i:j}$, unless $H^A_{i:j}$ already has $K$ derivations. Then, we combine $d^A_{i:j}$ with all derivations $d_{j:k}$ to the right and $d_{k:i}$ to the left. Upon termination, we perform a final action, in which we return a single derivation from all constructed root derivations.

The most natural way to choose an agenda derivation (and the root derivation in the final action) is by taking the highest scoring derivation $d = \arg\max_{d \in Q} s(d)$. Most work on agenda-based parsing generally assumed that the scoring function $s$ is learned separately (e.g., from maximum likelihood estimation of a generative PCFG). Furthermore, they assumed that $s$ satisfies the decomposition property (Section 4), which guarantees obtaining the highest scoring root derivation in the end. We, on the other hand, make no assumptions on $s$, and following Jiang et al. (2012), we learn a scoring function that is tightly coupled with agenda-based parsing. This is the topic of the next section.
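Before turning to learning, the parsing loop itself can be sketched in Python. The skeleton below mirrors Algorithm 1 with a priority-queue agenda and chart cells capped at K, reusing the simplified Derivation objects from the earlier sketch; the `grammar.initial` and `grammar.combinations` interfaces are our own stand-ins for rule application, not the actual SEMPRE API.

```python
import heapq
from collections import defaultdict

class AgendaParser:
    """Skeleton of Algorithm 1 (assumed, simplified interfaces)."""

    def __init__(self, grammar, score, K=200):
        self.grammar, self.score, self.K = grammar, score, K

    def parse(self, x):
        chart = defaultdict(list)             # (category, i, j) -> derivations
        agenda = []                           # max-heap via negated scores
        for d in self.grammar.initial(x):     # INITAGENDA: lexical rules
            heapq.heappush(agenda, (-self.score(d), id(d), d))
        root_cell = ("ROOT", 0, len(x))
        while agenda and len(chart[root_cell]) < self.K:
            _, _, d = heapq.heappop(agenda)   # choose the next parsing action
            cell = (d.category, *d.span)
            if len(chart[cell]) >= self.K:    # cell already full: discard
                continue
            chart[cell].append(d)             # move the derivation to the chart
            # COMBINE: apply rules against neighboring chart derivations.
            for new in self.grammar.combinations(d, chart):
                heapq.heappush(agenda, (-self.score(new), id(new), new))
        roots = chart[root_cell]
        # Final action: return one root derivation (here, the highest scoring).
        return max(roots, key=self.score) if roots else None
```

Because actions simply pop the highest-priority item, following a greedy policy costs one heap operation per action, which is the property the next section exploits.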
6 Learning a Scoring Function

The objective in (2) is based on only a distribution over root derivations. Thus, by optimizing it, we do not explicitly learn anything about partial derivations that never make it to the root. Consider the derivation in Figure 1 over the phrase "lincoln" with the logical form USSLincoln. If none of the K root derivations contains this partial derivation, (2) will not penalize it, and we might repeatedly construct it even though it is useless. To discourage this, we need to be sensitive to intermediate parsing stages.

6.1 Imitation learning

We adapt the approach of Jiang et al. (2012) for agenda-based syntactic parsing to semantic parsing. Recall that a parsing state is $s = (H, Q)$, where $H \subseteq \mathcal{D}$ is the chart and $Q \subseteq \mathcal{D}$ is the agenda.3 The available actions are exactly the derivations on the agenda $Q$, and the successor state $s'$ is computed via EXECUTEACTION() from Algorithm 1.

3 To keep the state space discrete, states do not include derivation scores. This is why in Algorithm 1 we keep a list of up to K derivations in every chart cell rather than a beam, which would require actions to depend on derivation scores.

We model the policy as a log-linear distribution over (partial) agenda derivations $Q$: $p_\theta(a \mid s) = p_\theta(d = a \mid Q)$, according to (1). Note that the state $s$ only provides the support of the distribution; the shape depends only on the features $\phi(a)$ of the chosen action $a$, not on other aspects of $s$. This simple parameterization allows us to follow a policy efficiently: when we add a derivation $a$ to the agenda, we insert it with priority equal to its score $s(a) = \phi(a)^\top \theta$. Computing the best action $\arg\max_a p_\theta(a \mid s)$ simply involves popping from the priority queue.

A history $h = (s_1, a_1, \ldots, a_T, s_{T+1})$ (see Figure 5) is a sequence of states and actions, such that $s_1$ has an empty chart and an initial agenda, and $s_{T+1}$ is a terminal state reached after performing the chart action in which we choose a root derivation $a_T$ from $H^{\text{ROOT}}_{0:|x|}$ (Algorithm 1). The policy for choosing parsing actions induces a distribution over histories $p_\theta(h) = \prod_{t=1}^{T} p_\theta(a_t \mid s_t)$.

Figure 5: A schematic illustration of a (partial) history of states and actions. Each ellipse represents a state (chart and agenda), and the red path marks the actions chosen.

At a high level, our policy is trained using imitation learning to mimic an oracle that takes an optimal action at every step (Daume et al., 2009; Ross et al., 2011). Because in semantic parsing we train from questions and answers, we do not have access to an oracle. Instead, we first parse $x$ by sampling a history from the current policy $p_\theta$; let $d^*$ be the root derivation with highest reward out of the $K$ root derivations constructed (see (2)). We then generate a target history $h_{\text{target}}$ from $d^*$ using two ideas, local reweighting and history compression, which we explain shortly. The policy parameters $\theta$ are then updated as follows:

$$\theta \leftarrow \theta + \eta\, R(h_{\text{target}}) \sum_{t=1}^{T} \delta_t(h_{\text{target}}), \quad (4)$$
$$\delta_t(h) = \nabla_\theta \log p_\theta(a_t \mid s_t) = \phi(a_t) - \mathbb{E}_{p_\theta(a'_t \mid s_t)}[\phi(a'_t)]. \quad (5)$$

The reward $R(h) = R(a_T) \in [0, 1]$ measures the compatibility of the returned derivation (see (2)), and $\eta$ is the learning rate.4 Note that while our features $\phi(a)$ depend on the action only, the update rule takes into account all actions that are on the agenda.

4 Note that unlike standard policy gradient, our updates are not invariant (even in expectation) to shifting the reward by a constant. Our updates do not maximize reward, but the reward merely provides a way to modulate the magnitude of the updates.

Local reweighting. Given the reference $d^*$, let $\mathbb{I}[a \in d^*]$ indicate whether an action $a$ is a sub-derivation of $d^*$. We sample $h_{\text{target}}$ from the locally reweighted distribution $p^{+w}_\theta(a \mid s) \propto p_\theta(a \mid s) \cdot \exp\{\beta\, \mathbb{I}[a \in d^*]\}$ for some $\beta > 0$. This is a multiplicative interpolation of the model distribution $p_\theta$ and the oracle. When $\beta$ is high, this reduces to sampling from the available actions in $d^*$. When no oracle actions are available, this reduces to sampling from $p_\theta$. The probability of a history is defined as $p^{+w}_\theta(h) = \prod_{t=1}^{T} p^{+w}_\theta(a_t \mid s_t)$.
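The update in Equations (4)-(5) and the locally reweighted action distribution can be sketched as follows; this is a simplified numpy illustration with our own data layout (one feature matrix per agenda snapshot), not the system's implementation.

```python
import numpy as np

def action_distribution(agenda_feats: np.ndarray, theta: np.ndarray,
                        in_oracle: np.ndarray, beta: float = 0.0) -> np.ndarray:
    """p_theta(a | s) over agenda actions, optionally locally reweighted:
    p^{+w}(a | s) is proportional to p_theta(a | s) * exp(beta * I[a in d*])."""
    logits = agenda_feats @ theta + beta * in_oracle
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

def imitation_update(history, theta: np.ndarray, reward: float,
                     lr: float = 0.1) -> np.ndarray:
    """Update (4)-(5): theta += lr * R(h_target) * sum_t [phi(a_t) - E_p[phi(a'_t)]].
    `history` is a list of (agenda_feats, chosen_index) pairs, one per action;
    agenda_feats[k] is phi of the k-th derivation on the agenda at that step."""
    grad = np.zeros_like(theta)
    for agenda_feats, chosen in history:
        # Expectation is under the unweighted model distribution p_theta.
        p = action_distribution(agenda_feats, theta, np.zeros(len(agenda_feats)))
        grad += agenda_feats[chosen] - agenda_feats.T @ p
    return theta + lr * reward * grad
```

Setting `beta` very large in `action_distribution` makes sampling concentrate on oracle actions whenever any are present, which matches the behavior described above.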
Recall we construct $K$ root derivations. A problem with local reweighting is that after adding $d^*$ to the chart, there are no more oracle actions on the agenda and all subsequent actions are simply sampled from the model. We found that updating towards these actions hurts accuracy. To avoid this problem, we propose performing history compression, described next.

History compression. Given $d^*$, we can define for every history $h$ a sequence of indices $(t_1, t_2, \ldots)$ such that $\mathbb{I}[a_{t_i} \in d^*] = 1$ for every $i$. Then, the compressed history $c(h) = (s_{t_1}, a_{t_1}, s_{t_2}, a_{t_2}, \ldots)$ is a sequence of states and actions such that all actions choose sub-derivations of $d^*$. Note that $c(h)$ is not a "real history" in the sense that taking action $a_{t_i}$ does not necessarily result in state $s_{t_{i+1}}$. In Figure 6, the compressed history is $c(h) = (s_1, a_1, s_3, a_3, s_4, a_4, s_5)$.

Figure 6: An example history of states and actions, where actions that are part of the reference derivation $d^* = a_4$ are in red. The compressed history is $c(h) = (s_1, a_1, s_3, a_3, s_4, a_4, s_5)$.

We can now sample a target history $h_{\text{target}}$ for (4) from a distribution over compressed histories, $p^{+c}_\theta(h) = \sum_{h' : c(h') = h} p_\theta(h')$, where we marginalize over all histories that have the same compressed history. To sample from $p^{+c}_\theta$, we sample $h' \sim p_\theta$ and return $h_{\text{target}} = c(h')$. This will provide a history containing only actions leading to the oracle $d^*$. In our full model, we sample a history from $p^{+cw}_\theta$, which combines local reweighting and history compression: we sample $h' \sim p^{+w}_\theta$ and return $h_{\text{target}} = c(h')$. We empirically analyze local reweighting and history compression in Section 9.

In practice, we set $\beta$ large enough so that the behavior of $p^{+cw}_\theta$ is as follows: we first construct the reference $d^*$ by sampling oracle actions. After constructing $d^*$, no oracle actions are on the agenda, so we construct $K - 1$ more root derivations, sampling from $p_\theta$ (but note these actions are not part of the returned compressed history). Finally, the last action chooses $d^*$ from the $K$ derivations.

Algorithm 2 summarizes learning. We initialize our parameters to zero, and then parse each example by sampling a history from $p_\theta$. We choose the derivation with highest reward in $H^{\text{ROOT}}_{0:|x|}$ as the reference derivation $d^*$. This defines $p^{+cw}_\theta$, which we sample from to update parameters. The learning rate $\eta_{\tau,i}$ is set using AdaGrad (Duchi et al., 2010).

Algorithm 2 Learning algorithm
procedure LEARN($\{x_i, y_i\}_{i=1}^{n}$)
  $\theta \leftarrow 0$
  for each iteration $\tau$ and example $i$ do
    $h_0$ ← PARSE($p_\theta$, $x_i$)
    $d^*$ ← CHOOSEORACLE($h_0$)
    $h_{\text{target}}$ ← PARSE($p^{+cw}_\theta$, $x_i$)
    $\theta \leftarrow \theta + \eta_{\tau,i} \cdot R(h_{\text{target}}) \sum_{t=1}^{T} \delta_t(h_{\text{target}})$
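Putting the pieces together, a rough sketch of Algorithm 2 with history compression might look as follows. It builds on `imitation_update` from the previous listing; `parser.sample_history`, the `num_features` argument, and `d.subderivations()` are hypothetical helpers standing in for machinery the paper does not spell out.

```python
import numpy as np

def compress_history(history, oracle_ids):
    """c(h): keep only the steps whose chosen action is a sub-derivation of
    the reference d*. Each step is assumed to be a triple
    (agenda_feats, chosen_index, chosen_derivation)."""
    return [(feats, chosen) for feats, chosen, deriv in history
            if id(deriv) in oracle_ids]

def train(examples, parser, reward, num_features, n_iters=3, beta=1000.0):
    """Rough sketch of Algorithm 2. `parser.sample_history(x, theta, ...)` is
    assumed to run Algorithm 1 while sampling actions from p_theta (optionally
    locally reweighted toward an oracle derivation) and to return the per-step
    history together with the K root derivations."""
    theta = np.zeros(num_features)
    for _ in range(n_iters):
        for x, y in examples:
            # Parse with the current policy and pick the reference derivation.
            _, roots = parser.sample_history(x, theta)
            if not roots:
                continue
            d_star = max(roots, key=lambda d: reward(d, y))      # CHOOSEORACLE
            # Re-parse, sampling from p^{+w} (reweighted toward d*), then
            # compress the history so only oracle actions remain.
            hist, _ = parser.sample_history(x, theta, oracle=d_star, beta=beta)
            target = compress_history(hist,
                                      {id(a) for a in d_star.subderivations()})
            # Update of Equation (4), reusing imitation_update from above.
            theta = imitation_update(target, theta, reward(d_star, y))
    return theta
```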
6.2 Related approaches

Our method is related to policy gradient in reinforcement learning (Sutton et al., 1999): if in (4) we sample from the model distribution $p_\theta$ without an oracle, then our update is exactly the policy gradient update, which maximizes the expected reward $\mathbb{E}_{p_\theta(h)}[R(h)]$. We do not use policy gradient since the gradient is almost zero during the beginning of training, leading to slow convergence. This corroborates Jiang et al. (2012).

Our method extends Jiang et al. (2012) to semantic parsing, which poses the following challenges: (a) We train from denotations, and must obtain a reference to guide learning. (b) To combat lexical uncertainty we maintain a beam of size $K$ in each parsing state (we show this is important in Section 9). (c) We introduce history compression, which focuses the learner on the actions that produce the correct derivation rather than incorrect ones on the beam. Interestingly, Jiang et al. (2012) found that imitation learning did not work well, and obtained improvements from interpolating with policy gradient. We found that imitation learning worked well, and interpolating with policy gradient did not offer further improvements. A possible explanation is that the uncertainty preserved in the $K$ derivations in each chart cell allowed imitation learning to generalize properly, compared to Jiang et al. (2012), who had just a single item in each chart cell.

7 Lazy Agenda

As we saw in Section 3, a single semantic function (e.g., LEX, BRIDGE) can create hundreds of derivations. Scoring all these derivations when adding them to the agenda is wasteful, because most have low probability. In this section, we assume semantic functions return a derivation stream, i.e., an iterator that lazily computes derivations on demand. Our lazy agenda $G$ will hold derivation streams rather than derivations, and the actual agenda $Q$ will be defined only implicitly. The intuition is similar to lazy K-best parsing (Huang and Chiang, 2005), but is applied to agenda-based semantic parsing.

Our main assumption is that every derivation stream $g = [d_1, d_2, \ldots]$ is sorted by decreasing score: $s(d_1) \geq s(d_2) \geq \cdots$ (in practice, this is only approximated, as we explain at the end of this section). We define the score of a derivation stream as $s(g) = s(d_1)$. At test time the only change to Algorithm 1 is in line 4, where instead of popping the highest scoring derivation, we pop the highest scoring derivation stream and process the first derivation on the stream. Then, we featurize and score the next derivation on the stream if the stream is not empty, and push the stream back to the agenda. This guarantees we will obtain the highest scoring derivation in every parsing action.

However, during training we sample from a distribution over derivations, not just return the argmax. Sampling from the distribution over streams can be quite inaccurate. Suppose the agenda contains two derivation streams: $g_1$ contains one derivation with score 1 and $g_2$ contains 50 derivations with score 0. Then we would assign $g_1$ probability $\frac{e^1}{e^1 + e^0} = 0.73$ instead of the true model probability $\frac{e^1}{e^1 + 50e^0} = 0.05$. The issue is that the first derivation of $g$ is not indicative of the actual probability mass in $g$.

Our solution is simple: before sampling (line 4 in Algorithm 1), we process the agenda to guarantee that the sum of probabilities of all unscored derivations is smaller than $\epsilon$. Let $G$ be the lazy agenda and $G^+ \subseteq G$ be the subset of derivation streams that contain more than one derivation (where unscored derivations exist). If for every $g \in G^+$, $p_\theta(g) = \sum_{d \in g} p_\theta(d) \leq \frac{\epsilon}{|G^+|}$, then the probability sum of all unscored derivations is small: $\sum_{g \in G^+} p(g) \leq \epsilon$. To guarantee that $p_\theta(g) \leq \frac{\epsilon}{|G^+|}$, we unroll $g$ until this stopping condition is satisfied. Unrolling a stream $g = [d_1, d_2, \ldots]$ means popping $d_1$ from $g$, constructing a singleton derivation stream $g_{\text{new}} = [d_1]$, pushing $g_{\text{new}}$ to the agenda, and scoring the remaining stream based on the next derivation, $s(g) = s(d_2)$ (Figure 7).

To check whether $p(g) \leq \frac{\epsilon}{|G^+|}$, we define the following upper bound $U$ on $p(g)$, which is based on the number of derivations in the stream $|g|$:

$$p_\theta(g) = \frac{\sum_{d \in g} e^{s(d)}}{\sum_{g' \in G} \sum_{d' \in g'} e^{s(d')}} \leq \frac{|g|\, e^{s(g[1])}}{\sum_{g' \in G} e^{s(g'[1])}} = U,$$

where $g[1]$ is the first derivation in $g$. Checking that $U \leq \frac{\epsilon}{|G^+|}$ is easy, since it is based only on the first derivation of every stream. Once all streams meet this criterion, we know that the total unscored probability is less than $\epsilon$. As learning progresses, there are many low-probability derivations which we can skip entirely.

Figure 7: Unrolling a derivation stream where $\epsilon = 0.01$ and $|G^+| = 1$. The left table shows the lazy agenda before unrolling and the right table after: the stream $[d_2, d_3, d_4, \ldots]$ (in red) violates the stopping condition, so we unroll two derivations until all streams satisfy the condition.
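The unrolling procedure can be sketched as follows. For simplicity the stream below materializes its derivations eagerly in a list (a real implementation would compute them on demand), and the class and function names are ours, not the system's.

```python
import math

class DerivationStream:
    """A (approximately) score-sorted list of derivations produced by one
    semantic-function application; here materialized eagerly for clarity."""
    def __init__(self, derivations, score):
        self._items = list(derivations)
        self._score = score
    def __len__(self):
        return len(self._items)
    def head_score(self):
        return self._score(self._items[0])   # s(g) = s(g[1])
    def pop(self):
        return self._items.pop(0)

def unroll_until_bounded(streams, eps=0.01):
    """Unroll streams until, for every stream g with unscored derivations,
    the upper bound U on p_theta(g) is at most eps / |G+| (Section 7), so
    the total unscored probability mass is below eps."""
    while True:
        multi = [g for g in streams if len(g) > 1]           # G+
        if not multi:
            return streams
        Z = sum(math.exp(g.head_score()) for g in streams)   # denominator of U
        worst = max(multi, key=lambda g: len(g) * math.exp(g.head_score()))
        U = len(worst) * math.exp(worst.head_score()) / Z
        if U <= eps / len(multi):
            return streams
        # Unroll: split off the head as a singleton stream; the remaining
        # stream is now scored by its next derivation.
        head = worst.pop()
        streams.append(DerivationStream([head], worst._score))
```

Because U depends only on the first derivation of each stream, checking the stopping condition never forces us to featurize the derivations further down the stream.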
The last missing piece is ensuring that streams are sorted without explicitly scoring all derivations. We make a best effort to preserve this property.

Sorting derivation streams. All derivations in a stream $g$ have the same child derivations, as they were constructed by one application of a semantic function $f$. Thus, the difference in their scores is only due to new features created when applying $f$. We can decompose these new features into two disjoint feature sets. One set includes features that depend on the grammar rule only and are independent of the input utterance $x$, and another also depends on $x$. For example, the semantic function $f = $ LEX maps phrases, such as "born in", to logical forms, such as PlaceOfBirthOf. Most features extracted by LEX do not depend on $x$: the conjunction of "born in" and PlaceOfBirthOf, the frequency of the phrase "born in" in a corpus, etc. However, some features may depend on $x$ as well. For example, if $x$ is "what city was abraham lincoln born in", we can conjoin PlaceOfBirthOf with the first two words "what city". As another example, the semantic function BRIDGE takes unary predicates, such as AbeLincoln, and joins them with any type-compatible binary to produce logical forms, such as PlaceOfBirthOf.AbeLincoln. Again, a feature such as the number of assertions in $\mathcal{K}$ that contain PlaceOfBirthOf does not depend on $x$, while a feature that conjoins the introduced binary (PlaceOfBirthOf) with the main verb ("born") does depend on $x$ (see Section 8).

Our strategy is to pre-compute all features that are independent of $x$ before training,5 and sort streams based on these features only, as an approximation for the true order. Let us assume that derivations returned by an application of a semantic function $f$ are parameterized by an auxiliary set $\mathcal{B}$. For example, when applying LEX on "born in", $\mathcal{B}$ will include all lexical entries that map "born in" to a binary predicate. When applying BRIDGE on AbeLincoln, $\mathcal{B}$ will include all binary predicates that are type-compatible with AbeLincoln. We equip each $b \in \mathcal{B}$ with a feature vector $\phi_{\mathcal{B}}(b)$ (computed before training) of all features that are independent of $x$. This gives rise to a score $s_{\mathcal{B}}(b) = \phi_{\mathcal{B}}(b)^\top \theta$ that depends on the semantic function only. Thus, we can sort $\mathcal{B}$ before parsing, so that when the function $f$ is called, we do not need to instantiate the derivations. Note that the parameters $\theta$ and thus $s_{\mathcal{B}}$ change during learning, so we re-sort $\mathcal{B}$ after every iteration (of going through all training examples), yielding an approximation to the true ordering of $\mathcal{B}$. In practice, features extracted by LEX depend mostly on the lexical entry itself and our approximation is accurate, while for BRIDGE some features depend on $x$, as we explain next.

5 For LEX, this requires going over all lexicon entries once. For BRIDGE, this requires going once over the KB.
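A minimal sketch of this pre-computation, under the assumption that the utterance-independent features of each b in B are stored as a dense vector (our own layout; in the real system they are sparse feature templates):

```python
import numpy as np

class AuxiliarySet:
    """The set B that parameterizes the outputs of a semantic function
    (e.g., lexicon entries for LEX, type-compatible binaries for BRIDGE).
    phi_B holds only the utterance-independent features, precomputed once."""
    def __init__(self, items, phi_B: np.ndarray):
        self.items = list(items)
        self.phi_B = phi_B                   # shape: (|B|, num_features)
        self.order = np.arange(len(items))

    def resort(self, theta: np.ndarray):
        """Re-sort B by s_B(b) = phi_B(b)^T theta (done once per training
        iteration), approximating the true score order of the stream."""
        scores = self.phi_B @ theta
        self.order = np.argsort(-scores)     # decreasing score

    def sorted_items(self):
        return [self.items[i] for i in self.order]

# Toy usage: candidate binaries for the phrase "born in".
entries = ["PlaceOfBirthOf", "PlacesLived", "DateOfBirthOf"]
phi_B = np.array([[1.0, 0.2], [0.3, 0.9], [0.5, 0.1]])
B = AuxiliarySet(entries, phi_B)
B.resort(theta=np.array([1.0, -0.5]))
print(B.sorted_items())   # stream order used when the semantic function fires
```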
8 Features

The feature set in our model includes all features described in Berant et al. (2013).6 In addition, we add new lexicalized features that connect natural language phrases to binary predicates.

In Berant et al. (2013), a binary predicate is generated using a lexicon constructed offline via alignment, or through the bridging operation. As mentioned above, bridging allows us to join unary predicates with binary predicates that are type-compatible, even when no word in the utterance triggers the binary predicate. For example, given the utterance "what money to take to sri lanka", the parser will identify the entity SriLanka, and bridging will propose all possible binaries, including Currency. We add a feature template that conjoins binaries suggested by bridging (Currency) with all content word lemmas ("what", "money", "take"). After observing enough examples, we expect the feature corresponding to "money" and Currency to be upweighted. Generating freely and reweighting using features can be viewed as a soft way to expand the lexicon during training, similar to lexicon generation (Zettlemoyer and Collins, 2005). Note that this feature depends on the utterance $x$, and is not used for sorting streams (Section 7).

Finally, each feature is actually duplicated: one copy fires when choosing derivations on the agenda (Algorithm 1, line 4), and the other copy fires when choosing the final root derivation (line 6). We found that the increased expressivity from separating features improves accuracy.

6 As in previous work, some features use the fact that the spelling of KB predicates is often similar to English words.

9 Experiments

We evaluate our semantic parser on the WEBQUESTIONS dataset (Berant et al., 2013), which contains 5,810 question-answer pairs. The questions are about popular topics (e.g., "what movies does taylor lautner play in?") and answers are sets of entities obtained through crowdsourcing (all questions are answerable by Freebase). We use the provided train-test split and perform three random 80%-20% splits of the training data for development.

We perform lexical lookup for Freebase entities using the Freebase Search API and obtain 20 candidate entities for every named entity identified by Stanford CoreNLP (Manning et al., 2014). We use the lexicon released by Berant et al. (2013) to retrieve unary and binary predicates. We execute λ-DCS logical forms by converting them to SPARQL and querying our local Virtuoso-backed copy of Freebase. During training, we use L1 regularization, and crudely tune hyperparameters on the development set (beam size K = 200, tolerance for the lazy agenda ε = 0.01, local reweighting β = 1000, and L1 regularization strength λ = 10^-5).

We evaluated our semantic parser using the reward of the predictions, i.e., average F1 score on predicted vs. true entities over all test examples.7

7 We use the official evaluation script from http://www-nlp.stanford.edu/software/sempre/.

9.1 Main Results

Table 1 provides our key result comparing the fixed-order parser (FIXEDORDER) and our proposed agenda-based parser (AGENDAIL). In all subsequent tables, Train, Dev., and Test denote training, development and test accuracies, |Act.| denotes the average number of parsing actions (pops from the agenda in AGENDAIL and derivations placed on the chart in FIXEDORDER) per utterance, |Feat.| denotes the average number of featurized derivations per utterance, and Time is the average parsing time in milliseconds.

Table 1: Test set results for the standard fixed-order parser (FIXEDORDER) and our new agenda-based parser (AGENDAIL), which substantially reduces parsing time and the number of parsing actions at no cost to accuracy.

              Test   Train   |Act.|   |Feat.|   Time
FIXEDORDER    49.6   60.6    18,127   18,127    1,782
AGENDAIL      49.7   61.1    1,346    1,814     291

We found that AGENDAIL is 6x faster than FIXEDORDER, performs 13x fewer parsing actions, and reduces the number of featurized derivations by an order of magnitude, without loss of accuracy.

Table 2 presents test set results of our systems, compared to recently published results.

Table 2: Results on the WEBQUESTIONS test set.

System        Authors                     Acc.
YV14          Yao and Van-Durme (2014)    35.4
BCFL13        Berant et al. (2013)        35.7
BDZZ14        Bao et al. (2014)           37.5
BWC14         Bordes et al. (2014)        39.2
BL14          Berant and Liang (2014)     39.9
YDZR14        Yang et al. (2014)          41.3
BWC14 + BL14  Bordes et al. (2014)        41.8
WYWH14        Wang et al. (2014)          45.3
YCHG15        Yih et al. (2015)           52.5
FIXEDORDER    this work                   49.6
AGENDAIL      this work                   49.7
We note that most systems perform question answering without semantic parsing. Our fixed-order parser, FIXEDORDER, and agenda-based parser, AGENDAIL, obtain an accuracy of 49.6 and 49.7 respectively. This improves accuracy compared to all previous systems, except for a recently published semantic parser presented by Yih et al. (2015), whose accuracy is 52.5. We attribute our accuracy improvement compared to previous systems to the new features and changes to the model, as we discuss below.

BCFL13 also used a fixed-order parser, but obtained lower performance. The main differences between the systems are that (i) our model includes new features (Section 8) combined with L1 regularization, (ii) we use the Freebase search API rather than string matching, and (iii) our grammar generates a larger space of derivations.

Table 3: Development set results for variants of AGENDAIL.

                     Dev.   |Act.|   |Feat.|   Time
AGENDAIL             48.0   1,421    1,912     214
FIXEDORDER           49.1   18,259   18,259    1,972
AGENDA               45.9   6,211    6,320     419
FIXED+AGENDA         47.1   6,281    6,615     775
α = 1000             47.8   11,279   11,279    1,216
α = 100              35.6   3,858    3,858     174
α = 10               27.0   1,604    1,604     78
p+w_θ                43.3   1,706    2,121     238
p+c_θ                36.8   3,758    4,278     358
p_θ                  1.2    12,302   15,524    1,497
-BINARYANDLEMMA      40.5   1,561    2,110     167

9.2 Analysis

To gain insight into our system components, we perform extensive experiments on the development set.

Comparison with fixed-order parsing. Figure 8 compares accuracy, speed at test time, and number of derivations for AGENDAIL and FIXEDORDER. For AGENDAIL, we show both the number of derivations popped from the agenda and the number of derivations scored, which is slightly higher due to scored derivations remaining on the agenda. We observe that for small beam sizes, AGENDAIL substantially outperforms FIXEDORDER. This is because AGENDAIL exploits small beams more efficiently in intermediate parsing states. For large beams performance is similar. In terms of speed and number of derivations, we see that AGENDAIL is dramatically more efficient than FIXEDORDER: with beam size 200–400, it is roughly as efficient as FIXEDORDER with beam size 10–20. For the chosen beam size (K = 200), AGENDAIL is 9x faster than FIXEDORDER.

Figure 8: Comparing AGENDAIL and FIXEDORDER for various beam sizes from 1 to 400 (left: accuracy, middle: parsing time at test time in seconds, right: number of thousands of derivations scored and popped). The x-axis is on a logarithmic scale.

For K = 1, performance is poor for AGENDAIL and zero for FIXEDORDER. This highlights the inherent difficulty of mapping to logical forms compared to more shallow tasks, as maintaining just a single best derivation for each parsing state is not sufficient.

A common variant on beam parsing is to replace the fixed beam size K with a threshold α, and prune any derivation whose probability is at least α times smaller than the best derivation in that state (Zhang et al., 2010; Bodenstab et al., 2011). We implemented this baseline and compared it to AGENDAIL and FIXEDORDER in Table 3. We see that for α = 1000, we get a faster algorithm, but a minor drop in performance compared to FIXEDORDER.
However, this baseline still featurizes 6x more derivations and is 6x slower than AGENDAIL.

Impact of learning. The AGENDA baseline uses an agenda-based parser to approximate the gradients of (2). That is, we update parameters as in FIXEDORDER, but search for K root derivations using the agenda-based parser described in Algorithm 1 (where we pop the highest scoring derivation). We observe that AGENDA featurizes 3x more derivations compared to AGENDAIL, and results in a 2.1 point drop in accuracy. This demonstrates the importance of explicitly learning to choose correct actions during intermediate stages of parsing.

Since on the development set FIXEDORDER outperformed AGENDAIL by 1.1 points, we implemented FIXED+AGENDA, where a fixed-order parser is used at training time, but an agenda-based parser is used at test time. This parser featurized 3.5x more derivations compared to AGENDAIL, is 3.5x slower, and has slightly lower accuracy.

Recall that AGENDAIL samples a history from $p^{+cw}_\theta$, that is, using local reweighting and history compression. Table 3 shows the impact of sampling from $p^{+w}_\theta$ (local reweighting), $p^{+c}_\theta$ (history compression), and directly from $p_\theta$, which reduces to policy gradient. We observe that sampling from $p_\theta$ directly according to policy gradient results in very low accuracy, as this produces derivations with zero reward most of the time. Both local reweighting and history compression alone improve accuracy (local reweighting is more important), but both perform worse than AGENDAIL.

Impact of lazy agenda. We now examine the contribution of the lazy agenda. Note that the lazy agenda affects training time much more than test time, for two reasons: (a) at test time we only need to pop the highest scoring derivation, and the overhead of a priority queue only grows logarithmically with the size of the agenda, whereas during training we need to take a full pass over the agenda when sampling, so the number of items on the agenda matters; (b) at test time we never unroll derivation streams, and only pop the highest scoring derivation (see Section 7). In brief, using the lazy agenda results in a 1.5x speedup at training time.

To understand the savings of the lazy agenda, we vary the value of the tolerance parameter ε. When ε is very high, we will never unroll derivation streams, because for all derivation streams $U \leq \frac{\epsilon}{|G^+|}$ (Section 7). This will be fast, but sampling could be inaccurate. As ε decreases, we unroll more derivations. We also compared to the NOSTREAM baseline, where the agenda holds derivations rather than derivation streams. Table 4 shows the results of these experiments.

Table 4: Accuracy, number of featurized derivations, and parsing time on both the training set and development set when varying the value of the tolerance parameter ε.

              Acc.           |Feat.|          Time
              Tr.    Dev.    Tr.     Dev.     Tr.     Dev.
ε = 10^2      56.4   46.7    1,650   2,121    1,159   345
ε = 10^-1     59.5   47.0    2,043   1,890    1,425   279
ε = 10^-2     61.0   48.0    2,600   1,912    1,830   214
ε = 10^-3     61.4   48.5    3,063   1,740    2,110   220
NOSTREAM      60.0   47.6    4,049   4,931    3,155   293

Figure 9: Number of derivations in every chart cell for the utterance "what currency does jamaica accept?", for AGENDAIL (top) and FIXEDORDER (bottom). AGENDAIL reduces the number of derivations in chart cells compared to FIXEDORDER.
Naturally, the number of featurized derivations in training increases as ε decreases. In particular, NOSTREAM results in a 2.5x increase in the number of featurized derivations compared to no unrolling (ε = 10^2), and a 1.5x increase compared to ε = 10^-2, which is the chosen value. Similarly, average training time is about 1.5x slower for NOSTREAM compared to ε = 10^-2.

Accuracy does not change much for various values of ε. Even when ε = 10^2, accuracy decreases by only 1.8 points compared to ε = 10^-3. Unexpectedly, NOSTREAM yields a slight drop in accuracy.

Feature ablation. Table 3 shows an ablation test on the new feature template we introduced that conjoins binaries and lemmas during bridging (-BINARYANDLEMMA). Removing this feature template substantially reduces accuracy compared to AGENDAIL, highlighting the importance of learning new lexical associations during training.

Example. As a final example, Figure 9 shows typical parse charts for AGENDAIL and FIXEDORDER. AGENDAIL generates only 1,198 derivations, while FIXEDORDER constructs 15,543 derivations, many of which are unnecessary.

In summary, we demonstrated that training an agenda-based parser to choose good parsing actions through imitation learning dramatically improves efficiency and speed at test time, while maintaining comparable accuracy.

10 Discussion and Related Work

Learning. In this paper, we sampled histories from a distribution that tries to target the reference derivation $d^*$ whenever possible. Work in imitation learning (Abbeel and Ng, 2004; Daume et al., 2009; Ross et al., 2011; Goldberg and Nivre, 2013) has shown that interpolating with the model (corresponding to smaller β) can improve generalization. We were unable to improve accuracy by annealing β from 1000 to 0, so understanding this dynamic remains an open question.

Parsing. In this paper, we avoided computing K derivations in each chart cell by using an agenda and learning a scoring function for choosing agenda items. A complementary and purely algorithmic solution is lazy K-best parsing (Huang and Chiang, 2005), or cube growing (Huang and Chiang, 2007), which do not involve learning or an agenda. Similar to our work, cube growing approximates the best derivations in each chart cell in the case where features do not decompose.

Work in the past attempted to speed up inference using a simple model that is trained separately and used to prune the hypotheses considered by the main parsing model (Bodenstab et al., 2011; FitzGerald et al., 2013). We, on the other hand, speed up inference by training a single model that learns to follow good parsing actions.

Work in agenda-based syntactic parsing (Klein and Manning, 2003; Pauls and Klein, 2009) focused on A* algorithms, where each derivation has a priority based on the derivation score (inside score) and a completion estimate (outside score). Good estimates for the outside score result in a decrease in the number of derivations. Currently our actions depend on the inside score, but we could add features based on chart derivations to provide "outside" information. Adding such features would present computational challenges, as scores on the agenda would have to be updated as the agenda and chart are modified.

Semantic parsing has been gaining momentum in recent years, but still there has been relatively little work on developing faster algorithms, especially compared to syntactic parsing (Huang, 2008; Kummerfeld et al., 2010; Rush and Petrov, 2012; Lewis and Steedman, 2014).
While we have obtained sig- nificant speedups, we hope to encourage new ideas that exploit the structure of semantic parsing to yield better algorithms. Reproducibility. All code,8 data, and experiments for this paper are available on the CodaLab platform at https://www.codalab.org/worksheets/ 0x8fdfad310dd84b7baf683b520b4b64d5/. Acknowledgments We thank the anonymous reviewers and the action editor, Jason Eisner, for their thorough reviews and constructive feedback. We also gratefully acknowl- edge the support of the DARPA Communicating with Computers (CwC) program under ARO prime contract no. W911NF-15-1-0462. References P. Abbeel and A. Ng. 2004. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning (ICML). M. Auli and A. Lopez. 2011. Efficient CCG parsing: A* versus adaptive supertagging. In Association for Computational Linguistics (ACL). J. Bao, N. Duan, M. Zhou, and T. Zhao. 2014. Knowledge-based question answering as machine translation. In Association for Computational Linguis- tics (ACL). J. Berant and P. Liang. 2014. Semantic parsing via para- phrasing. In Association for Computational Linguis- tics (ACL). J. Berant, A. Chou, R. Frostig, and P. Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Empirical Methods in Natural Language Pro- cessing (EMNLP). N. Bodenstab, A. Dunlop, K. Hall, and B. Roark. 2011. Beam-width prediction for efficient context-free pars- ing. In Association for Computational Linguistics (ACL), pages 440–449. A. Bordes, S. Chopra, and J. Weston. 2014. Question answering with subgraph embeddings. In Empirical Methods in Natural Language Processing (EMNLP). Q. Cai and A. Yates. 2013. Large-scale semantic parsing via schema matching and lexicon extension. In Asso- ciation for Computational Linguistics (ACL). 8Our system uses the SEMPRE toolkit (http://nlp. stanford.edu/software/sempre). S. A. Caraballo and E. Charniak. 1998. New figures of merit for best-first probabilistic chart parsing. Compu- tational Linguistics, 24:275–298. K. Chang, A. Krishnamurthy, A. Agarwal, H. Daume, and J. Langford. 2015. Learning to search better than your teacher. arXiv. J. Clarke, D. Goldwasser, M. Chang, and D. Roth. 2010. Driving semantic parsing from the world’s re- sponse. In Computational Natural Language Learn- ing (CoNLL), pages 18–27. H. Daume, J. Langford, and D. Marcu. 2009. Search- based structured prediction. Machine Learning, 75:297–325. J. Duchi, E. Hazan, and Y. Singer. 2010. Adaptive subgradient methods for online learning and stochas- tic optimization. In Conference on Learning Theory (COLT). N. FitzGerald, Y. Artzi, and L. S. Zettlemoyer. 2013. Learning distributions over logical forms for refer- ring expression generation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1914– 1925. Y. Goldberg and J. Nivre. 2013. Training determinis- tic parsers with non-deterministic oracles. Transac- tions of the Association for Computational Linguistics (TACL), 1. Google. 2013. Freebase data dumps (2013-06- 09). https://developers.google.com/ freebase/data. L. Huang and D. Chiang. 2005. Better k-best parsing. In Proceedings of the Ninth International Workshop on Parsing Technology, pages 53–64. L. Huang and D. Chiang. 2007. Forest rescoring: Faster decoding with integrated language models. In Associ- ation for Computational Linguistics (ACL). L. Huang. 2008. Forest reranking: Discriminative pars- ing with non-local features. In Association for Com- putational Linguistics (ACL). J. 
Jiang, A. Teichert, J. Eisner, and H. Daume. 2012. Learned prioritization for trading off accuracy and speed. In Advances in Neural Information Processing Systems (NIPS). M. Kay. 1986. Algorithm Schemata and Data Structures in Syntactic Processing. Readings in Natural Language Processing. D. Klein and C. Manning. 2003. A* parsing: Fast exact Viterbi parse selection. In Human Language Technology and North American Association for Computational Linguistics (HLT/NAACL). J. Krishnamurthy and T. Mitchell. 2012. Weakly supervised training of semantic parsers. In Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP/CoNLL), pages 754–765. J. Kummerfeld, J. Roesner, T. Dawborn, J. Haggerty, J. Curran, and S. Clark. 2010. Faster parsing by supertagger adaptation. In Association for Computational Linguistics (ACL). T. Kwiatkowski, L. Zettlemoyer, S. Goldwater, and M. Steedman. 2010. Inducing probabilistic CCG grammars from logical form with higher-order unification. In Empirical Methods in Natural Language Processing (EMNLP), pages 1223–1233. T. Kwiatkowski, E. Choi, Y. Artzi, and L. Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In Empirical Methods in Natural Language Processing (EMNLP). M. Lewis and M. Steedman. 2014. A* CCG parsing with a supertag-factored model. In Empirical Methods in Natural Language Processing (EMNLP). P. Liang, M. I. Jordan, and D. Klein. 2011. Learning dependency-based compositional semantics. In Association for Computational Linguistics (ACL), pages 590–599. P. Liang. 2013. Lambda dependency-based compositional semantics. arXiv. C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL system demonstrations. C. Matuszek, N. FitzGerald, L. Zettlemoyer, L. Bo, and D. Fox. 2012. A joint model of language and perception for grounded attribute learning. In International Conference on Machine Learning (ICML), pages 1671–1678. A. Pauls and D. Klein. 2009. K-best A* parsing. In Association for Computational Linguistics (ACL), pages 958–966. S. Ross, G. Gordon, and A. Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Artificial Intelligence and Statistics (AISTATS). A. Rush and S. Petrov. 2012. Vine pruning for efficient multi-pass dependency parsing. In Human Language Technology and North American Association for Computational Linguistics (HLT/NAACL). R. Sutton, D. McAllester, S. Singh, and Y. Mansour. 1999. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS). S. Tellex, T. Kollar, S. Dickerson, M. R. Walter, A. G. Banerjee, S. J. Teller, and N. Roy. 2011. Understanding natural language commands for robotic navigation and mobile manipulation. In Association for the Advancement of Artificial Intelligence (AAAI). Z. Wang, S. Yan, H. Wang, and X. Huang. 2014. An overview of Microsoft deep QA system on Stanford WebQuestions benchmark. Technical report, Microsoft Research. Y. W. Wong and R. J. Mooney. 2007.
Learning syn- chronous grammars for semantic parsing with lambda calculus. In Association for Computational Linguis- tics (ACL), pages 960–967. M. Yang, N. Duan, M. Zhou, and H. Rim. 2014. Joint re- lational embeddings for knowledge-based question an- swering. In Empirical Methods in Natural Language Processing (EMNLP). X. Yao and B. Van-Durme. 2014. Information extraction over structured data: Question answering with Free- base. In Association for Computational Linguistics (ACL). W. Yih, M. Chang, X. He, and J. Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Association for Computational Linguistics (ACL). M. Zelle and R. J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Association for the Advancement of Artificial Intel- ligence (AAAI), pages 1050–1055. L. S. Zettlemoyer and M. Collins. 2005. Learning to map sentences to logical form: Structured classifica- tion with probabilistic categorial grammars. In Uncer- tainty in Artificial Intelligence (UAI), pages 658–666. L. S. Zettlemoyer and M. Collins. 2007. Online learn- ing of relaxed CCG grammars for parsing to logical form. In Empirical Methods in Natural Language Pro- cessing and Computational Natural Language Learn- ing (EMNLP/CoNLL), pages 678–687. Y. Zhang, B. Ahn, S. Clark, C. V. Wyk, J. R. Curran, and L. Rimell. 2010. Chart pruning for fast lexicalised- grammar parsing. In International Conference on Computational Linguistics (COLING).