Incremental Tree Substitution Grammar for Parsing and Sentence Prediction

Federico Sangati and Frank Keller
Institute for Language, Cognition, and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, UK
federico.sangati@gmail.com  keller@inf.ed.ac.uk

Abstract

In this paper, we present the first incremental parser for Tree Substitution Grammar (TSG). A TSG allows arbitrarily large syntactic fragments to be combined into complete trees; we show how constraints (including lexicalization) can be imposed on the shape of the TSG fragments to enable incremental processing. We propose an efficient Earley-based algorithm for incremental TSG parsing and report an F-score competitive with other incremental parsers. In addition to whole-sentence F-score, we also evaluate the partial trees that the parser constructs for sentence prefixes; partial trees play an important role in incremental interpretation, language modeling, and psycholinguistics. Unlike existing parsers, our incremental TSG parser can generate partial trees that include predictions about the upcoming words in a sentence. We show that it outperforms an n-gram model in predicting more than one upcoming word.

1 Introduction

When humans listen to speech, the input becomes available gradually as the speech signal unfolds. Reading happens in a similarly gradual manner when the eyes scan a text. There is good evidence that the human language processor is adapted to this and works incrementally, i.e., computes an interpretation for an incoming sentence on a word-by-word basis (Tanenhaus et al., 1995; Altmann and Kamide, 1999). Language processing systems also often deal with speech as it is spoken, or text as it is being typed. A dialogue system should start interpreting a sentence while it is being spoken, and a question answering system should start retrieving answers before the user has finished typing the question.

Incremental processing is therefore essential both for realistic models of human language processing and for NLP applications that react to user input in real time. In response to this, a number of incremental parsers have been developed, which use context-free grammar (Roark, 2001; Schuler et al., 2010), dependency grammar (Chelba and Jelinek, 2000; Nivre, 2007; Huang and Sagae, 2010), or tree-adjoining grammar (Demberg et al., 2014). Typical applications of incremental parsers include speech recognition (Chelba and Jelinek, 2000; Roark, 2001; Xu et al., 2002), machine translation (Schwartz et al., 2011; Tan et al., 2011), reading time modeling (Demberg and Keller, 2008), and dialogue systems (Stoness et al., 2004). Another potential use of incremental parsers is sentence prediction, i.e., the task of predicting upcoming words in a sentence given a prefix. However, so far only n-gram models and classifiers have been used for this task (Fazly and Hirst, 2003; Eng and Eisner, 2004; Grabski and Scheffer, 2004; Bickel et al., 2005; Li and Hirst, 2005).

In this paper, we present an incremental parser for Tree Substitution Grammar (TSG). A TSG contains a set of arbitrarily large tree fragments, which can be combined into new syntax trees by means of a substitution operation. An extensive tradition of parsing with TSG (also referred to as data-oriented parsing) exists (Bod, 1995; Bod et al., 2003), but none of the existing TSG parsers are incremental. We show how constraints can be imposed on the shape of the TSG fragments to enable incremental processing.
We propose an efficient Earley-based algorithm for incremental TSG parsing and report an F-score competitive with other incremental parsers.

TSG fragments can be arbitrarily large and can contain multiple lexical items. This property enables our incremental TSG parser to generate partial parse trees that include predictions about the upcoming words in a sentence. It can therefore be applied directly to the task of sentence prediction, simply by reading off the predicted items in a partial tree. We show that our parser outperforms an n-gram model in predicting more than one upcoming word.

The rest of the paper is structured as follows. In Section 2, we introduce the incremental TSG (ITSG) framework and relate it to the original TSG formalism. Section 3 describes the chart-parser algorithm, while Section 4 details the experimental setup and results. Sections 5 and 6 present related work and conclusions.

2 Incremental Tree Substitution Grammar

The current work is based on Tree Substitution Grammar (TSG, Schabes 1990; for a recent overview see Bod et al. 2003). A TSG is composed of (i) a set of arbitrarily large fragments, usually extracted from an annotated phrase-structure treebank, and (ii) the substitution operation by means of which fragments can be combined into complete syntactic analyses (derivations) of novel sentences.

Every fragment node is either a lexical node (word), a substitution site (a non-lexical node in the yield of the structure),1 or an internal node. An internal node must always keep the same daughter nodes as in the original tree. For an example of a binarized2 tree and a fragment extracted from it, see Figure 1.

1 For example, the nodes NP, VP, and S@ are the substitution sites of the right fragment in Figure 1.

2 The tree is right-binarized via artificial nodes with @ symbols, as explained in Section 4.1. The original tree is (S (NP (NNS Terms)) (VP (VBD were) (VP (VBN disclosed))) (. .)).

Figure 1: An example of a binarized parse tree and a lexicalized fragment extracted from it.

A TSG derivation is constructed in a top-down generative process starting from a fragment in the grammar rooted in S (the unique non-lexical node all syntactic analyses are rooted in). A partial derivation is extended by subsequently introducing more fragments: if X is the left-most substitution site in the yield of the current partial derivation, a fragment rooted in X is chosen from the grammar and substituted into it. When there are no more substitution sites (all nodes in the yield are lexical items), the generative process terminates.

2.1 Incrementality

In this work we are interested in defining an incremental TSG (ITSG for short). The new generative process, while retaining the original mechanism for combining fragments (by means of the substitution operation), must ensure a way of deriving syntactic analyses of novel sentences in an incremental manner, i.e., one word at a time, from left to right. More precisely, at each stage of the generative process, the partially derived structure must be connected (as in standard TSG) and have a prefix of the sentence at the beginning of its yield. A partial derivation is connected if it has tree shape, i.e., all the nodes are dominated by a common root node (which does not necessarily have to be the root node of the sentence).
For instance, the right fragment in Figure 1 shows a possible way of starting a standard TSG derivation which does not satisfy the incrementality constraint: the partial derivation has a substitution site as the first element in its yield.

In order to achieve incrementality while maintaining connectedness, we impose one further constraint on the type of fragments which are allowed in an ITSG: each fragment should be lexicalized, i.e., contain at least one word (the lexical anchor) at the first or the second position in its yield. Allowing more than one substitution site at the beginning of a fragment's yield would lead to a violation of the incrementality requirement (as will become clear in Section 2.2).

The generative process starts with a fragment anchored in the first word of the sentence being generated. At each subsequent step, a lexicalized fragment is introduced (by means of the substitution operation) to extend the current partial derivation in such a way that the prefix of the yield of the partial structure is lengthened by one word (the lexical anchor of the fragment being introduced). The lexicalization constraint allows a fragment to have multiple lexical items, not necessarily adjacent to one another. This is useful to capture the general ability of TSG to produce in one single step an arbitrarily large syntactic construction, ranging from phrasal verbs (e.g., ask someone out) to parallel constructions (e.g., either X or Y) and idiomatic expressions (e.g., took me to the cleaners). For an example of a fragment with multiple lexical anchors, see the fragment in the middle of Figure 2.

Figure 2: An example of an ITSG derivation yielding the tree on the left side of Figure 1. The second and third fragments are introduced by means of forward and backward substitution, respectively.

2.2 Symbolic Grammar

An ITSG is a tuple ⟨N, L, F, ◁, ▷, ⊙⟩, where N and L are the sets of non-lexical and lexical nodes respectively, F is a collection of lexicalized fragments, ◁ and ▷ are two variants of the substitution operation (backward and forward) used to combine fragments into derivations, and ⊙ is the stop operation which terminates the generative process.

Fragments.  A fragment f ∈ F belongs to one of the three sets F_init, F^X_lex, F^Y_sub:

• An initial fragment (f_init) has the lexical anchor in the first position of the yield, this being the initial word of a sentence (the left-most lexical node of the parse tree from which it was extracted).

• A lex-first fragment (f^X_lex) has the lexical anchor (non-sentence-initial) in the first position of the yield, and is rooted in X.3

• A sub-first fragment (f^Y_sub) has the lexical anchor in the second position of its yield, and a substitution site Y in the first.

3 A fragment can be both an initial and a lex-first fragment (e.g., if the lexical anchor is a proper noun). We make use of two separate instances of the same fragment in the two sets.

Fringes.  We will use fringes (Demberg et al., 2014) as a compressed representation of fragments, in which the internal structure is replaced by a triangle (△ or ▹) and only the root and the yield are visible. It is possible in a grammar for multiple fragments to map to the same fringe; we refer to those as ambiguous fringes. We use both vertical (△, e.g., in Figures 3 and 4) and horizontal (▹) fringe notation. The latter is used for describing the states in our chart-parsing algorithm in Section 3. For instance, the horizontal fringe representation of the right fragment in Figure 1 is S ▹ NP "were" VP S@.
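To make the fringe inventory concrete, the following Python sketch (our own illustration, not the authors' implementation; all names are invented) represents a fringe as a root plus a yield of lexical and non-lexical elements and assigns it to one of the three fragment sets defined above.

```python
from dataclasses import dataclass
from typing import Tuple

# A yield element is a (symbol, is_lexical) pair; the internal structure of the
# fragment is omitted, which is exactly the compression a fringe performs.
@dataclass(frozen=True)
class Fringe:
    root: str
    items: Tuple[Tuple[str, bool], ...]   # e.g. (("NP", False), ("were", True), ...)

def fragment_set(fringe: Fringe, sentence_initial: bool = False) -> str:
    """Assign a lexicalized fragment to one of the three ITSG sets (Section 2.2).
    A fragment whose anchor is a sentence-initial word can also be a lex-first
    fragment (footnote 3); here we simply return one label."""
    first_is_lex = fringe.items[0][1]
    second_is_lex = len(fringe.items) > 1 and fringe.items[1][1]
    if first_is_lex:
        return "initial" if sentence_initial else "lex-first"
    if second_is_lex:
        return "sub-first"
    raise ValueError("not a valid ITSG fragment: no anchor in the first two positions")

# The right fragment of Figure 1, whose horizontal fringe is S |> NP "were" VP S@:
f = Fringe("S", (("NP", False), ("were", True), ("VP", False), ("S@", False)))
print(fragment_set(f))   # -> "sub-first"
```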
Incremental Derivation.  An incremental derivation is a sequence of lexicalized fragments ⟨f_1, f_2, ..., f_n⟩ which, combined together in the specified order, give rise to a complete parse tree (see Figure 2 for an example). The first fragment f_1 introduced in the derivation must be an initial fragment, and its lexical anchor constitutes the one-word prefix of the sentence being generated. Subsequent fragments are introduced by means of the substitution operation, which has two variants: backward substitution (◁), which is used to substitute lex-first fragments into the partial derivation generated so far, and forward substitution (▷), which is used to substitute sub-first fragments into the partial derivation. After a number of fragments have been introduced, a stop operation (⊙) may terminate the generative process.

Operations.  The three ITSG operations take place under specific conditions within an incremental derivation, as illustrated in Figure 3 and explained hereafter. At a given stage of the generative process (after an initial fragment has been inserted), the connected partial structure may or may not have substitution sites present in its yield. In the first case, a backward substitution (◁) must take place in the following generative step: if X is the left-most substitution site, a new fragment of type f^X_lex is chosen from the grammar and substituted into X. If the partially derived structure has no substitution sites (all the nodes in its yield are lexical nodes) and it is rooted in Y, two choices exist: either the generative process terminates by means of the stop operation (⊙_Y), or the generative process continues. In the latter case a forward substitution (▷) is performed: a new f^Y_sub fragment is chosen from the grammar, and the partial structure is substituted into the left-most substitution site Y of the fragment.4

4 A stop operation can be viewed as a forward substitution using an artificial sub-first fragment ∅ ▹ Y # (a stop fragment), where # is an artificial lexical node indicating the termination of the sentence. For simplicity, stop fragments are omitted in Figures 2 and 4, and Y is attached to the stop symbol (⊙_Y).

Figure 3: Schemata of the three ITSG operations. All tree structures (partial structure and fragments) are represented in a compact notation which displays only the root nodes and the yields. The i-th word in the structure's yield is represented as ℓ_i, while α and β stand for any (possibly empty) sequence of words and substitution sites.

Multiple Derivations.  As in TSG, an ITSG may be able to generate the same parse tree in multiple ways, i.e., via multiple incremental derivations yielding the same tree. Figure 4 shows one such example.

Figure 4: Above: an example of a set of fragments extracted from the tree in Figure 1. Below: two incremental derivations that generate it. Colors (and line strokes) indicate which derivation the fragments belong to.

Generative Capacity.  It is useful to clarify the difference between ITSG and the more general TSG formalism in terms of generative capacity. Although both types of grammar make use of the substitution operation to combine fragments, an ITSG imposes more constraints on (i) the type of fragments which are allowed in the grammar (initial, lex-first,
and sub-first fragments), and (ii) the generative process with which fragments are combined (incrementally, left to right, instead of top-down). If we compare a TSG and an ITSG on the same set of (ITSG-compatible) fragments, there are cases in which the TSG can generate more tree structures than the ITSG.

In the following, we provide a more formal characterization of the strong and weak generative power of ITSG with respect to context-free grammar (CFG) and TSG. (However, a full investigation of this issue is beyond the scope of this paper.) We can limit our analysis to CFG, as TSG is strongly equivalent to CFG. The weak equivalence between ITSG and CFG is straightforward: for any CFG there is a way to produce a weakly equivalent grammar in Greibach Normal Form, in which every production has a right-hand side beginning with a lexical item (Aho and Ullman, 1972). The grammar that results from this transformation is an ITSG which uses only backward substitutions.

Figure 5: Left: an example of a CFG with left recursion. Right: one of the structures the CFG can generate.

Left recursion seems to be the main obstacle to strong equivalence between ITSG and CFG. As an example, the left side of Figure 5 shows a CFG that contains a left-recursive rule. The structures this grammar can generate (such as the one given on the right side of the same figure) are characterized by an arbitrarily long chain of rules that can intervene before the second word of the string, "b", is generated. Given the incrementality constraints, there is no ITSG that can generate the same set of structures as this CFG. However, it may be possible to circumvent this problem by applying the left-corner transform (Rosenkrantz and Lewis, 1970; Aho and Ullman, 1972) to generate an equivalent CFG without left-recursive rules.

2.3 Probabilistic Grammar

In the generative process presented above, a number of choices are left open, i.e., which fragment is introduced at a specific stage of a derivation, and when the generative process terminates. A symbolic ITSG can be equipped with a probabilistic component which deals with these choices. A proper probability model for an ITSG needs to define three probability distributions over the three types of fragments in the grammar, such that:

\sum_{f_{init} \in F_{init}} P(f_{init}) = 1    (1)

\sum_{f^X_{lex} \in F^X_{lex}} P(f^X_{lex}) = 1 \quad (\forall X \in N)    (2)

P(\odot_Y) + \sum_{f^Y_{sub} \in F^Y_{sub}} P(f^Y_{sub}) = 1 \quad (\forall Y \in N)    (3)

The probability that an ITSG generates a specific derivation d is obtained by multiplying the probabilities of the fragments taking part in the derivation:

P(d) = \prod_{f \in d} P(f)    (4)

Since the grammar may generate a tree t via multiple derivations D(t) = {d_1, d_2, ..., d_m}, the probability of the parse tree is the sum of the probabilities of the ITSG derivations in D(t):

P(t) = \sum_{d \in D(t)} P(d) = \sum_{d \in D(t)} \prod_{f \in d} P(f)    (5)
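To illustrate the normalization in Equations (1)–(3), here is a small relative-frequency sketch (our own; the count tables are hypothetical, and how stop events are counted is our assumption rather than something spelled out in the text): each fragment set is normalized separately, and for each root category Y the stop probability competes with the sub-first fragments whose first substitution site is Y.

```python
def normalize(counts):
    """Relative-frequency estimate within one fragment set."""
    total = float(sum(counts.values()))
    return {k: v / total for k, v in counts.items()}

# Hypothetical frequencies; in the paper they come from fragment extraction (Section 4.1).
init_counts = {'S |> "Terms" S@': 1, 'NP |> "Terms"': 1}
lex_first_counts = {"VP": {'VP |> "disclosed"': 2},
                    "S@": {'S@ |> "were" VP "."': 1}}
sub_first_counts = {"NP": {'S |> NP "were" VP "."': 1}}
stop_counts = {"S": 3}   # assumed: how often a completed derivation rooted in S stops

P_init = normalize(init_counts)                                   # Equation (1)
P_lex = {X: normalize(c) for X, c in lex_first_counts.items()}    # Equation (2)

P_sub, P_stop = {}, {}
for Y in set(sub_first_counts) | set(stop_counts):                # Equation (3)
    frags = sub_first_counts.get(Y, {})
    total = sum(frags.values()) + stop_counts.get(Y, 0)
    P_sub[Y] = {f: c / total for f, c in frags.items()}
    P_stop[Y] = stop_counts.get(Y, 0) / total

assert abs(sum(P_init.values()) - 1) < 1e-9
assert all(abs(P_stop[Y] + sum(P_sub[Y].values()) - 1) < 1e-9 for Y in P_stop)
```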
3 Probabilistic ITSG Parser

We introduce a probabilistic chart-parsing algorithm to efficiently compute all possible incremental derivations that an ITSG can generate given an input sentence (presented one word at a time). The parsing algorithm is an adaptation of the Earley algorithm (Earley, 1970) and its probabilistic instantiation (Stolcke, 1995).

3.1 Parsing Chart

An ITSG incremental derivation is represented in the chart as a sequence of chart states, i.e., a path. For a given fringe in an incremental derivation, there will be one or more states in the chart, depending on the length of the fringe's yield. This is because we need to keep track of the extent to which the yield of each fringe has been consumed within a derivation as the sentence is processed incrementally.5 At a given stage of the derivation, the states offer a compact representation of the partial structures generated so far.

5 A fringe (state) may occur in multiple derivations (paths): for instance, in Figure 4 the two derivations correspond to two separate paths that converge to the same fringe (state).

Each state is composed of a fringe and some additional information which keeps track of where the fringe is located within a path. A chart state can be generally represented as

i : {}_k X \rhd \lambda \bullet \mu    (6)

where X ▹ λµ is the state's fringe, the Greek letters are (possibly empty) sequences of words and substitution sites, and • is a placeholder indicating to what extent the fragment's yield has been consumed: all the elements in the yield preceding the dot have already been accepted. Finally, i and k are indices of words in the input sentence:

• i signifies that the current state is introduced after the first i words in the sentence have been scanned. All states in the chart are grouped according to this index, and constitute state-set i.

• k indicates that the fringe associated with the current state was first introduced into the chart after the first k words in the input sentence had been scanned. The index k is therefore called the start index.

For instance, when generating the first incremental derivation in Figure 4, the parser passes through the state 1 : _1 S ▹ NP • "were" VP ".", indicating that the second fringe is introduced right after the parser has scanned the first word in the sentence and before it has scanned the second word.
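The state representation in (6) can be captured by a small record type; the sketch below (our own naming, not the authors' code) stores the fringe, the dot position, and the indices i and k, and anticipates the four state types (scan, acceptor, donor, complete) that the operations in Section 3.2 distinguish.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class ChartState:
    """Chart state  i : kX |> lambda . mu  (Equation 6)."""
    i: int                                   # state-set index: words scanned so far
    k: int                                   # start index of the fringe
    root: str                                # root X of the fringe
    items: Tuple[Tuple[str, bool], ...]      # fringe yield, (symbol, is_lexical) pairs
    dot: int                                 # elements before the dot are accepted

    def after_dot(self) -> Optional[Tuple[str, bool]]:
        return self.items[self.dot] if self.dot < len(self.items) else None

    def kind(self) -> str:
        nxt = self.after_dot()
        if nxt is None:
            return "donor" if self.k == 0 else "complete"
        return "scan" if nxt[1] else "acceptor"

# The state 1 : 1 S |> NP . "were" VP "." from the example in the text:
s = ChartState(i=1, k=1, root="S",
               items=(("NP", False), ("were", True), ("VP", False), (".", True)), dot=1)
print(s.kind())   # -> "scan": the element after the dot is the word "were"
```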
3.2 Parsing Algorithm

We will first introduce the symbolic part of the parsing algorithm, and then discuss its probabilistic component in Section 3.3. In line with the generative process illustrated in Section 2.2, the parser operates on the chart states in order to keep track of all possible ITSG derivations as new words are fed in. It starts by reading the first word ℓ_0 and introducing new states into state-set 0 of the chart, namely those mapping to initial fragments in the grammar with ℓ_0 as lexical anchor. At a given stage, after i words have been scanned, the parser reads the next word (ℓ_i) and introduces new states in state-sets i and i+1 by applying specific operations to states present in the chart and fringes in the grammar.

Parser Operations.  The parser begins with the start operation just described, and continues with a cycle of four operations for every word ℓ_i in the input sentence (for i ≥ 0). The order of the four operations is the following: completion, backward substitution (◁), forward substitution (▷), and scan. When there are no more words in the input, the parser terminates with a stop operation. We will now describe the parser operations (see Figure 6 for their formal definition), ignoring the probabilities for now.

Figure 6: Chart operations with forward (α), inner (γ), and outer (β) probabilities.

Start(ℓ_0): For every initial fringe in the grammar anchored in ℓ_0, the parser inserts a (scan) state for that fringe in state-set 0.

Backward Substitution(ℓ_i) applies to acceptor states, i.e., those with a substitution site following the dot, say X. For each acceptor state in state-set i, and any lex-first fringe in the grammar rooted in X and anchored in ℓ_i, the parser inserts a (scan) state for that fringe in state-set i.

Forward Substitution(ℓ_i) applies to donor states, i.e., those that have no elements following the dot and whose start index is 0. For each donor state in state-set i, rooted in Y, and any sub-first fringe in the grammar with Y as the left-most element in its yield, the parser inserts a (scan) state for that fringe in state-set i, with the dot placed after Y.

Completion applies to complete states, i.e., those with no elements following the dot and with start index j > 0. For every complete state in state-set i, rooted in Y and with start index j, and every acceptor state in state-set j with Y following the dot, the parser inserts a copy of the acceptor state in state-set i and advances the dot.

Scan(ℓ_i) applies to scan states, i.e., those with a word after the dot. For every scan state in state-set i having ℓ_i after the dot, the parser inserts a copy of that state in state-set i+1 and advances the dot.

Stop(#) is a special type of forward substitution and applies to donor states, but only when the input word is the terminal symbol #. For every donor state in state-set n (the length of the sentence), if the root of the state's fringe is Y, the parser introduces a stop state whose fringe is a stop fringe with Y as the left-most substitution site.
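Putting the operations together, the control flow of the parser can be sketched as a simple driver loop. This is our own schematic rendering: the chart methods are assumed names rather than the authors' API, the probability bookkeeping of Figure 6 is omitted, and we assume that the final state-set is completed before the stop operation is applied.

```python
def parse(words, grammar, chart):
    """Schematic ITSG parser driver (Section 3.2); helper methods are assumed."""
    chart.start(words[0], grammar)                    # Start(l0): seed state-set 0
    for i, word in enumerate(words):                  # word is l_i
        chart.complete(i)                             # Completion (iterated to a fixed
                                                      # point, since it can cascade)
        chart.backward_substitute(i, word, grammar)   # acceptor states x lex-first fringes
        chart.forward_substitute(i, word, grammar)    # donor states x sub-first fringes
        chart.scan(i, word)                           # copy scan states into state-set i+1
    n = len(words)
    chart.complete(n)                                 # assumed: complete the final state-set
    chart.stop(n, grammar)                            # Stop(#): donor states x stop fringes
    return chart
```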
Comparison with the Earley Algorithm.  It is useful to clarify the differences between the proposed ITSG parsing algorithm and the original Earley algorithm. Primarily, the ITSG parser is based on a left-to-right generative process, whereas the Earley algorithm uses a top-down one. Moreover, our parser presupposes a restricted inventory of fragments in the grammar (the ones allowed by an ITSG), as opposed to the general CFG rules allowed by the Earley algorithm. In particular, the Backward Substitution operation is more limited than the corresponding Prediction step of the Earley algorithm: only lex-first fragments can be introduced using Backward Substitution, and therefore left recursion (allowed by the Earley algorithm) is not possible here.6 This restriction is compensated for by the existence of the Forward Substitution operation, which has no analog in the Earley algorithm.7 The worst-case complexity of the Earley algorithm is dominated by the Completion operation, which is identical to that in our parser, and therefore the original total time complexity applies, i.e., O(l^3) for an input sentence of length l, and O(n^3) in terms of the number of non-lexical nodes n in the grammar.

6 This further simplifies the probabilistic version of our parser, as there is no need to resort to the probabilistic reflexive, transitive left-corner relation described by Stolcke (1995).

7 This operation would violate Earley's top-down constraint; donor states are in fact the terminal states in the Earley algorithm.

Derivations.  Incremental (partial) derivations are represented in the chart as (partial) paths along states. Each state can lead to one or more successors, and come from one or more antecedents. Scan is the only operation which introduces, for every scan state, a single new successor state (which can be of any of the four types) in the following state-set. Complete states may lead to several states within the current state-set, which may belong to any of the four types. An acceptor state may lead to a number of scan states via backward substitution (depending on the number of lex-first fringes that can combine with it). Similarly, a donor state may lead to a number of scan states via forward substitution.

Figure 7: The parsing chart of the two derivations in Figure 4. Blue states and fringes (also marked with |) are the ones in the first derivation, red (||) in the second, and yellow (no marks) are the ones in common. Each state-set is represented as a separate block in the chart, headed by the state-set index and the next word. Each row maps to a chart operation (specified in the first column, with S and C standing for 'scan' and 'complete', respectively) and follows the same notation as Figure 6. Symbols ∗ are used as state placeholders.

After i words have been scanned, we can retrieve (partial) paths from the chart. This is done in a backward direction, starting from scan states in state-set i and going all the way back to the initial states. This is possible since all the operations are reversible, i.e., given a state it is possible to retrieve its antecedent state(s). As an example, consider the ITSG grammar consisting of the fragments in Figure 4 and the two derivations of the same parse tree in the same figure; Figure 7 shows the parsing chart of this grammar, containing the two corresponding paths.
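The reversibility of the operations is what makes path retrieval possible; the toy sketch below (our own code, with fringes written as plain strings using |> for the fringe separator) walks antecedent links back from a state and reads off the fringes introduced along the way.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class State:
    fringe: str                            # the state's fringe, e.g. 'S |> NP "were" VP "."'
    antecedents: List["State"] = field(default_factory=list)

def partial_derivation(state: State) -> List[str]:
    """Walk antecedent links back to an initial state and return the fringes in the
    order they were introduced. Ties between several antecedents would be resolved
    by Viterbi score (Section 3.4); here we simply take the first one."""
    path = []
    while state is not None:
        path.append(state.fringe)
        state = state.antecedents[0] if state.antecedents else None
    path.reverse()
    # Scan and Completion copy a state without introducing a new fringe, so the same
    # fringe repeats in consecutive states; keep only the first element of each run.
    return [f for i, f in enumerate(path) if i == 0 or f != path[i - 1]]

# A toy path for the first derivation in Figure 4 (dots omitted from the fringes):
s0 = State('NP |> "Terms"')                      # start state
s1 = State('NP |> "Terms"', [s0])                # after scanning "Terms"
s2 = State('S |> NP "were" VP "."', [s1])        # forward substitution
print(partial_derivation(s2))   # ['NP |> "Terms"', 'S |> NP "were" VP "."']
```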
3.3 Probabilistic Parser

In the probabilistic version of the parser, each fringe in the grammar has a given probability, such that Equations (1)–(3) are satisfied.8 In the probabilistic chart, every state i : _k X ▹ λ • µ is decorated with three probabilities [α, γ, β], as shown in the chart example in Figure 7.

8 The probability of an ambiguous fringe is the marginal probability of the fragments mapping to it.

• The forward probability α is the marginal probability of all the paths starting with an initial state, scanning all initial words in the sentence up to and including ℓ_{i−1}, and passing through the current state.

• The inner probability γ is the marginal probability of all the paths passing through the state k : _k X ▹ • λµ, scanning the words ℓ_k, ..., ℓ_{i−1}, and passing through the current state.

• The outer probability β is the marginal probability of all the paths starting with an initial state, scanning all initial words in the sentence up to and including ℓ_{k−1}, passing through the current state, and reaching a stop state.

Forward (α) and inner (γ) probabilities are propagated while filling the chart incrementally, whereas outer probabilities (β) are back-propagated from the stop states, for which β = 1 (see Figure 6). These probabilities are used for computing prefix and sentence probabilities, and for obtaining the most probable partial derivation (MPD) of a prefix, the MPD of a sentence, its minimum risk parse (MRP), and an approximation of its most probable parse (MPP). Prefix probabilities are obtained by summing the forward probabilities of all scan states in state-set i having ℓ_i after the dot:9

P(\ell_0, \ldots, \ell_i) = \sum_{s \,=\, i : {}_k X \rhd \lambda \bullet \ell_i \mu} \alpha(s)    (7)

9 The sentence probability is obtained by marginalizing the forward probabilities of the stop states in the last state-set n.

3.4 Most Probable Derivation (MPD)

The most probable (partial) derivation (MPD) can be obtained from the chart by backtracking the Viterbi path. Viterbi forward and inner probabilities (α∗, γ∗) are propagated like the standard forward and inner probabilities, except that summation is replaced by maximization, and the probability of an ambiguous fringe is the maximum probability among all the fragments mapping into it (instead of the marginal one). The Viterbi partial path for the prefix ℓ_0, ..., ℓ_i can then be retrieved by backtracking from the scan state in state-set i with maximum α∗: for each state, the most probable preceding state is retrieved, i.e., the state among its antecedents with maximum α∗. The Viterbi complete path of a sentence can be obtained by backtracking the Viterbi path from the stop state with maximum α∗. Given a Viterbi path, it is possible to obtain the corresponding MPD by retrieving the associated sequence of fragments10 and connecting them.

10 For each scan state in the path, we obtain the fragment in the grammar that maps into the state's fringe. For ambiguous fringes, the most probable fragment mapping into the fringe is selected.

3.5 Most Probable Parse (MPP)

According to Equation (5), if we want to compute the MPP we need to retrieve all possible derivations of the current sentence, sum up the probabilities of those generating the same tree, and return the tree with maximum marginal probability. Unfortunately, the number of possible derivations grows exponentially with the length of the sentence, and computing the exact MPP is NP-hard (Sima'an, 1996). In our implementation, we approximate the MPP by performing this marginalization over the Viterbi-best derivations obtained from all stop states in the chart.
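The only difference between the standard quantities and their Viterbi counterparts used in Sections 3.4 and 3.5 is how the probabilities of competing paths are combined when they reach the same state; a minimal numerical sketch (our own) in log space:

```python
import math

def combine(path_logprobs, fringe_logprob, viterbi=False):
    """Combine the scores of all paths reaching a state with the probability of the
    fringe being introduced: standard forward/inner probabilities marginalize over
    the paths, while the Viterbi quantities alpha*, gamma* take the maximum."""
    scores = [p + fringe_logprob for p in path_logprobs]
    if viterbi:
        return max(scores)
    return math.log(sum(math.exp(s) for s in scores))   # naive log-sum-exp

paths = [math.log(0.25), math.log(0.05)]   # two paths reaching the same state
fringe = math.log(0.5)                     # probability of the fringe introduced
print(round(math.exp(combine(paths, fringe)), 3))         # marginal: 0.15
print(round(math.exp(combine(paths, fringe, True)), 3))   # Viterbi:  0.125
```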
3.6 Minimum Risk Parse (MRP)

MPD and MPP aim at obtaining the structure of a sentence which is most likely as a whole under the current probabilistic model. Alternatively, we may want to focus on the individual components of a tree structure, e.g., the CFG rules covering a certain span of the sentence, and search for the structure which has the highest expected number of correct constituents, as proposed by Goodman (1996). Such a structure is more likely to obtain higher scores according to standard parsing evaluations, as the objective being maximized is closely related to the metric used for evaluation (recall/precision on the number of correct labeled constituents).

In order to obtain the minimum risk parse (MRP), we utilize both inner (γ) and outer (β) probabilities. The product of these two probabilities equals the marginal probability of all paths generating the entire current sentence and passing through the current state. We can therefore compute the probability of a fringe f = X ▹ λ•µ covering a specific span [s,t] of the sentence:

P(f, [s,t]) = \gamma(t : {}_s f_\bullet) \cdot \beta(t : {}_s f_\bullet)    (8)

We can then compute the probability of each fragment spanning [s,t],11 and the probability P(r, [s,t]) of a CFG rule r spanning [s,t].12 Finally, the MRP is computed as

MRP = \arg\max_T \prod_{r \in T} P(r, [s,t])    (9)

11 For an ambiguous fringe, the spanning probability of each fragment mapping into it is the fringe's spanning probability multiplied by the fragment's fraction of the marginal fringe probability.

12 Obtained by marginalizing the probabilities of all fragments having r spanning [s,t].

4 Experiments

For training and evaluating the ITSG parser, we employ the Penn WSJ Treebank (Marcus et al., 1993). We use sections 2–21 for training, sections 22 and 24 for development, and section 23 for testing.

4.1 Grammar Extraction

Following standard practice, we start with some preprocessing of the treebank. After removing traces and functional tags, we apply right binarization to the training treebank (Klein and Manning, 2003), with no horizontal or vertical conditioning. This means that when a node X has more than two children, new artificial constituents labeled X@ are created in a right-recursive fashion (see Figure 1).13 We then replace words appearing less than five times in the training data by one of 50 unknown word categories based on the presence of lexical features, as described in Petrov (2009).

13 This shallow binarization (H0V1) was chosen based on the gold coverage of the unsmoothed grammar (extracted from the training set) on the trees in section 22: H0V1 binarization results in a coverage of 88.0% of the trees, compared to 79.2% for H1V1.
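As an illustration of the right binarization described above, the following minimal sketch (our own code; it is not the authors' preprocessing pipeline and ignores unary chains and unknown-word handling) folds any node with more than two children into a chain of artificial X@ constituents.

```python
from typing import Union, List

Tree = Union[str, List]   # a tree is [label, child1, child2, ...]; leaves are strings

def right_binarize(tree: Tree) -> Tree:
    """Right-binarize a tree with artificial X@ nodes and no horizontal or vertical
    conditioning (the H0V1 scheme of Section 4.1), in a right-recursive fashion."""
    if isinstance(tree, str):
        return tree
    label, children = tree[0], [right_binarize(c) for c in tree[1:]]
    while len(children) > 2:
        # fold the two rightmost children under a new artificial node labelled X@
        children = children[:-2] + [[label + "@", children[-2], children[-1]]]
    return [label] + children

t = ["S", ["NP", ["NNS", "Terms"]],
          ["VP", ["VBD", "were"], ["VP", ["VBN", "disclosed"]]],
          [".", "."]]
print(right_binarize(t))
# -> ['S', ['NP', ['NNS', 'Terms']], ['S@', ['VP', ...], ['.', '.']]]
```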
Finally, we remove all fragments which do not comply with the restrictions presented in Section 2.1.14 For each extracted fragment we keep track of its frequency, i.e., the number of times it occurs in the training corpus. Each fragment’s probability is then derived according to its relative frequency in the corresponding set of fragments ( finit, f Xlex, f Y sub), so that equations(1)–(3) are satisfied. The final gram- mar consists of 2.2M fragments mapping to 2.0M fringes. Smoothing Two types of smoothing are per- formed over the grammar’s fragments: Open class smoothing adds simple CFG rewriting rules to the grammar for open-class15 〈PoS,word〉 pairs not en- countered in the training corpus, with frequency 10−6. Initial fragments smoothing adds each lex-first fragment f to the initial fragment set with frequency 10−2 ·freq( f ).16 All ITSG experiments we report used exhaustive search (no beam was used to prune the search space). 4.2 Evaluation In addition to standard full-sentence parsing results, we propose a novel way of evaluating our ITSG on partial trees, i.e., those that the parser constructs for sentence prefixes. More precisely, for each prefix of the input sentence (length two words or longer) we compute the parsing accuracy on the minimal struc- ture spanning that prefix. The minimal structure is obtained from the subtree rooted in the minimum 14The fragment with no lexical items, and those with more than one substitution site at the beginning of the yield. 15A PoS belongs to the open class if it rewrites to at least 50 different words in the training corpus. A word belongs to the open class if it has been seen only with open-class PoS tags. 16The parameters were tuned on section 24 of the WSJ. common ancestor of the prefix nodes, after pruning those nodes not yielding any word in the prefix. As observed in the example derivations of Fig- ure 4, our ITSG generates partial trees for a given prefix which may include predictions about unseen parts of the sentence. We propose three new mea- sures for evaluating sentence prediction:17 Word prediction PRD(m): For every prefix of each test sentence, if the model predicts m′ ≥ m words, the prediction is correct if the first m pre- dicted words are identical to the m words following the prefix in the original sentence. Word presence PRS(m): For every prefix of each test sentence, if the model predicts m′ ≥ m words, the prediction is correct if the first m predicted words are present, in the same order, in the words following the prefix in the original sentence (i.e., the predicted word sequence is a subsequence of the sequence of words following the prefix).18 Longest common subsequence LCS: For every prefix of each test sentence, it computes the longest common subsequence between the sequence of pre- dicted words (possibly none) and the words follow- ing the prefix in the original sentence. Recall and precision can be computed in the usual way for these three measures. Recall is the total number (over all prefixes) of correctly predicted words (as defined by PRD(m), PRS(m), or LCS) over the total number of words expected to be pre- dicted (according to m), while precision is the num- ber of correctly predicted words over the number of words predicted by the model. We compare the ITSG parser with the incremental parsers of Schuler et al. (2010) and Demberg et al. 
We compare the ITSG parser with the incremental parsers of Schuler et al. (2010) and Demberg et al. (2014) for full-sentence parsing, with the Roark (2001) parser19 for full-sentence and partial parsing, and with a language model built using SRILM (Stolcke, 2002) for sentence prediction. We used a standard 3-gram model trained on the sentences of the training set, using the default settings and smoothing (Kneser-Ney) provided by the SRILM package. (Higher-order n-gram models do not seem appropriate, given the small size of the training corpus.) For every prefix in the test set we compute the most probable continuation predicted by the n-gram model.20

19 Apart from reporting the results in Roark (2001), we also ran the latest version of Roark's parser, used in Roark et al. (2009), which obtains higher results than the original work.

20 We used a modified version of a script by Nathaniel Smith available at https://github.com/njsmith/pysrilm.

                        R     P     F1
Demberg et al. (2014)  79.4  79.4  79.4
Schuler et al. (2010)  83.4  83.7  83.5
Roark (2001)           86.6  86.5  86.5
Roark et al. (2009)    87.7  87.5  87.6
ITSG (MPD)             81.5  83.5  82.5
ITSG (MPP)             81.6  83.6  82.6
ITSG (MRP)             82.6  85.8  84.1
ITSG Smoothing (MPD)   83.0  83.5  83.2
ITSG Smoothing (MPP)   83.2  83.6  83.4
ITSG Smoothing (MRP)   83.9  85.6  84.8

Table 1: Full-sentence parsing results for sentences in the test set of length up to 40 words.

4.3 Results

Table 1 reports full-sentence parsing results for our parser and three comparable incremental parsers from the literature. While Roark (2001) obtains the best results, the ITSG parser without smoothing performs on a par with Schuler et al. (2010) and outperforms Demberg et al. (2014).21 Adding smoothing results in a gain of 1.2 points F-score over the Schuler parser. When we compare the different parsing objectives of the ITSG parser, MRP is the best one, followed by MPP and MPD.

21 Note that the scores reported by Demberg et al. (2014) are for TAG structures, not for the original Penn Treebank trees.

Incremental Parsing.  The graphs in Figure 8 compare the ITSG and Roark's parser on the incremental parsing evaluation, when parsing sentences of length 10, 20, 30, and 40. The performance of both models declines as the length of the prefix increases, with Roark's parser outperforming the ITSG parser on average, although the ITSG parser seems more
competitive when parsing prefixes for longer (and therefore more difficult) sentences.

Figure 8: Partial parsing results for sentences of length 10, 20, 30, and 40 (from upper left to lower right).

Sentence Prediction.  Table 2 compares the sentence prediction results of the ITSG and the language model (SRILM). The latter outperforms the former when predicting the next word of a prefix, i.e., PRD(1), whereas the ITSG is better than the language model at predicting a single future word, i.e., PRS(1). When more than one (consecutive) word is considered, the SRILM model exhibits a slightly better recall, while the ITSG achieves a large gain in precision. This illustrates the complementary nature of the two models: while the language model is better at predicting the next word, the ITSG predicts future words (rarely adjacent to the prefix) with high confidence (89.4% LCS precision). However, it makes predictions for only a small number of words (5.9% LCS recall). Examples of sentence predictions can be found in Table 3.

            ITSG                      SRILM
            Correct    R      P       Correct    R      P
PRD(1)       4,637    8.7   12.5      11,430   21.5   21.6
PRD(2)         864    1.7   13.9       2,686    5.3    5.7
PRD(3)         414    0.9   20.9         911    1.9    2.1
PRD(4)         236    0.5   23.4         387    0.8    1.0
PRS(1)      34,831   65.4   93.9      21,954   41.2   41.5
PRS(2)       4,062    8.0   65.3       5,726   11.3   12.2
PRS(3)       1,066    2.2   53.7       1,636    3.4    3.8
PRS(4)         541    1.2   53.7         654    1.4    1.7
LCS         44,454    5.9   89.4      92,587   12.2   18.4

Table 2: Sentence prediction results.

Prefix:   Shares of UAL , the parent
                                                      PRD(3)  PRS(3)
ITSG:     company of United Airlines ,                   −       −
SRILM:    company , which is the                         −       −
Goldstd:  of United Airlines , were extremely active all day Friday .

Prefix:   PSE said it expects to report earnings of $ 1.3 million to $ 1.7 million , or 14
ITSG:     cents a share ,                                −       +
SRILM:    % to $ UNK                                     −       −
Goldstd:  cents to 18 cents a share .

Table 3: Examples comparing sentence predictions for ITSG and SRILM (UNK: unknown word).

5 Related Work

To the best of our knowledge, there are no other incremental TSG parsers in the literature. The parser of Demberg et al. (2014) is closely related, but uses tree-adjoining grammar, which includes both substitution and adjunction. That parser makes predictions, but only for upcoming structure, not for upcoming words, and thus cannot be used directly for sentence prediction. The incremental parser of Roark (2001) uses a top-down algorithm and works on the basis of context-free rules. These are augmented with a large number of non-local features (e.g., grandparent categories). Our approach avoids the need for such additional features, as TSG fragments naturally contain non-local information. Roark's parser outperforms ours in both full-sentence and incremental F-score (see Section 4), but cannot be used for sentence prediction straightforwardly: to obtain a prediction for the next word, we would need to compute an argmax over the whole vocabulary, and then iterate this for each word after that (the same is true for the parsers of Schuler et al., 2010 and Demberg et al., 2014). Most incremental dependency parsers use a discriminative model over parse actions (Nivre, 2007), and therefore cannot predict upcoming words either (but see Huang and Sagae 2010).

Turning to the literature on sentence prediction, we note that ours is the first attempt to use a parser for this task. Existing approaches either use n-gram models (Eng and Eisner, 2004; Bickel et al., 2005) or a retrieval approach in which the best matching sentence is identified from a sentence collection given a set of features (Grabski and Scheffer, 2004).
There is also work combining n-gram models with lexical semantics (Li and Hirst, 2005) or part-of-speech information (Fazly and Hirst, 2003).

In the language modeling literature, more sophisticated models than simple n-gram models have been developed in the past few years, and these could potentially improve sentence prediction. Examples include syntactic language models, which have been applied successfully to speech recognition (Chelba and Jelinek, 2000; Xu et al., 2002) and machine translation (Schwartz et al., 2011; Tan et al., 2011), as well as discriminative language models (Mikolov et al., 2010; Roark et al., 2007). Future work should evaluate these approaches against the ITSG model proposed here.

6 Conclusions

We have presented the first incremental parser for tree substitution grammar. Incrementality is motivated by psycholinguistic findings, and by the need for real-time interpretation in NLP. We have shown that our parser performs competitively on both full-sentence and sentence-prefix F-score. We also introduced sentence prediction as a new way of evaluating incremental parsers, and demonstrated that our parser outperforms an n-gram model in predicting more than one upcoming word.

The performance of our approach is likely to improve with better binarization and more advanced smoothing. Moreover, our model currently contains no conditioning on lexical information, which is also likely to yield a performance gain. Finally, future work could involve replacing the relative frequency estimator that we use with more sophisticated estimation schemes.

Acknowledgments

This work was funded by EPSRC grant EP/I032916/1 "An integrated model of syntactic and semantic prediction in human language processing". We are grateful to Brian Roark for clarifying correspondence and for guidance in using his incremental parser. We would also like to thank Katja Abramova, Vera Demberg, Mirella Lapata, Andreas van Cranenburgh, and three anonymous reviewers for useful comments.

References

Alfred V. Aho and Jeffrey D. Ullman. 1972. The theory of parsing, translation, and compiling. Prentice-Hall, Upper Saddle River, NJ.

Gerry T. M. Altmann and Yuki Kamide. 1999. Incremental interpretation at verbs: Restricting the domain of subsequent reference. Cognition, 73:247–264.

Steffen Bickel, Peter Haider, and Tobias Scheffer. 2005. Predicting sentences using n-gram language models. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 193–200. Vancouver.

Rens Bod. 1995. The problem of computing the most probable tree in data-oriented parsing and stochastic tree grammars. In Proceedings of the 7th Conference of the European Chapter of the Association for Computational Linguistics, pages 104–111. Association for Computational Linguistics, Dublin.

Rens Bod, Khalil Sima'an, and Remko Scha. 2003. Data-Oriented Parsing. University of Chicago Press, Chicago, IL.

Ciprian Chelba and Frederick Jelinek. 2000. Structured language modeling. Computer Speech and Language, 14:283–332.

Vera Demberg and Frank Keller. 2008. Data from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition, 101(2):193–210.

Vera Demberg, Frank Keller, and Alexander Koller. 2014. Parsing with psycholinguistically motivated tree-adjoining grammar. Computational Linguistics, 40(1). In press.

Jay Earley. 1970. An efficient context-free parsing algorithm. Communications of the ACM, 13(2):94–102.
John Eng and Jason M. Eisner. 2004. Radiology report entry with automatic phrase completion driven by language modeling. Radiographics, 24(5):1493–1501.

Afsaneh Fazly and Graeme Hirst. 2003. Testing the efficacy of part-of-speech information in word completion. In Proceedings of the EACL Workshop on Language Modeling for Text Entry Methods, pages 9–16. Budapest.

Joshua Goodman. 1996. Parsing algorithms and metrics. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 177–183. Association for Computational Linguistics, Santa Cruz.

Korinna Grabski and Tobias Scheffer. 2004. Sentence completion. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 433–439. Sheffield.

Liang Huang and Kenji Sagae. 2010. Dynamic programming for linear-time incremental parsing. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1077–1086. Association for Computational Linguistics, Uppsala.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 423–430. Association for Computational Linguistics, Sapporo.

Jianhua Li and Graeme Hirst. 2005. Semantic knowledge in a word completion task. In Proceedings of the 7th International ACM SIGACCESS Conference on Computers and Accessibility, pages 121–128. Baltimore.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Tomas Mikolov, Martin Karafiat, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of the 11th Annual Conference of the International Speech Communication Association, pages 2877–2880. Florence.

Joakim Nivre. 2007. Incremental non-projective dependency parsing. In Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 396–403. Association for Computational Linguistics, Rochester.

Slav Petrov. 2009. Coarse-to-Fine Natural Language Processing. Ph.D. thesis, University of California at Berkeley, Berkeley, CA.

Brian Roark. 2001. Probabilistic top-down parsing and language modeling. Computational Linguistics, 27:249–276.

Brian Roark, Asaf Bachrach, Carlos Cardenas, and Christophe Pallier. 2009. Deriving lexical and syntactic expectation-based measures for psycholinguistic modeling via incremental top-down parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 324–333. Association for Computational Linguistics, Singapore.

Brian Roark, Murat Saraclar, and Michael Collins. 2007. Discriminative n-gram language modeling. Computer Speech and Language, 21(2):373–392.

D. J. Rosenkrantz and P. M. Lewis. 1970. Deterministic left corner parsing. In Proceedings of the 11th Annual Symposium on Switching and Automata Theory, pages 139–152. IEEE Computer Society, Washington, DC.

Federico Sangati, Willem Zuidema, and Rens Bod. 2010. Efficiently extract recurring tree fragments from large treebanks. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias, editors, Proceedings of the 7th International Conference on Language Resources and Evaluation. European Language Resources Association, Valletta, Malta.
Yves Schabes. 1990. Mathematical and Computational Aspects of Lexicalized Grammars. Ph.D. thesis, University of Pennsylvania, Philadelphia, PA.

William Schuler, Samir AbdelRahman, Tim Miller, and Lane Schwartz. 2010. Broad-coverage parsing using human-like memory constraints. Computational Linguistics, 36(1):1–30.

Lane Schwartz, Chris Callison-Burch, William Schuler, and Stephen Wu. 2011. Incremental syntactic language models for phrase-based translation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 620–631. Association for Computational Linguistics, Portland, OR.

Khalil Sima'an. 1996. Computational complexity of probabilistic disambiguation by means of tree-grammars. In Proceedings of the 16th Conference on Computational Linguistics, pages 1175–1180. Association for Computational Linguistics, Copenhagen.

Andreas Stolcke. 1995. An efficient probabilistic context-free parsing algorithm that computes prefix probabilities. Computational Linguistics, 21(2):165–201.

Andreas Stolcke. 2002. SRILM – an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing, pages 257–286. Denver, CO.

Scott C. Stoness, Joel Tetreault, and James Allen. 2004. Incremental parsing with reference interaction. In Frank Keller, Stephen Clark, Matthew Crocker, and Mark Steedman, editors, Proceedings of the ACL Workshop Incremental Parsing: Bringing Engineering and Cognition Together, pages 18–25. Association for Computational Linguistics, Barcelona.

Ming Tan, Wenli Zhou, Lei Zheng, and Shaojun Wang. 2011. A large scale distributed syntactic, semantic and lexical language model for machine translation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 201–210. Association for Computational Linguistics, Portland, OR.

Michael K. Tanenhaus, Michael J. Spivey-Knowlton, Kathleen M. Eberhard, and Julie C. Sedivy. 1995. Integration of visual and linguistic information in spoken language comprehension. Science, 268:1632–1634.

Peng Xu, Ciprian Chelba, and Frederick Jelinek. 2002. A study on richer syntactic dependencies for structured language modeling. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 191–198. Association for Computational Linguistics, Philadelphia.