TREETALK: Composition and Compression of Trees for Image Descriptions

Polina Kuznetsova†
† Stony Brook University

Stony Brook, NY
pkuznetsova@cs.stonybrook.edu

Vicente Ordonez‡ Tamara L. Berg‡
‡ UNC Chapel Hill

Chapel Hill, NC
{vicente,tlberg}@cs.unc.edu

Yejin Choi††
††University of Washington

Seattle, WA
yejin@cs.washington.edu

Abstract

We present a new tree-based approach to
composing expressive image descriptions that
makes use of naturally occurring web images
with captions. We investigate two related
tasks: image caption generalization and gen-
eration, where the former is an optional sub-
task of the latter. The high-level idea of our
approach is to harvest expressive phrases (as
tree fragments) from existing image descrip-
tions, then to compose a new description by
selectively combining the extracted (and op-
tionally pruned) tree fragments. Key algo-
rithmic components are tree composition and
compression, both integrating tree structure
with sequence structure. Our proposed system
attains significantly better performance than
previous approaches for both image caption
generalization and generation. In addition,
our work is the first to show the empirical ben-
efit of automatically generalized captions for
composing natural image descriptions.

1 Introduction
The web is increasingly visual, with hundreds of bil-
lions of user contributed photographs hosted online.
A substantial portion of these images have some sort
of accompanying text, ranging from keywords, to
free text on web pages, to textual descriptions di-
rectly describing depicted image content (i.e., cap-
tions). We tap into the last kind of text, using natu-
rally occurring pairs of images with natural language
descriptions to compose expressive descriptions for
query images via tree composition and compression.

Such automatic image captioning efforts could
potentially be useful for many applications: from

automatic organization of photo collections, to facil-
itating image search with complex natural language
queries, to enhancing web accessibility for the vi-
sually impaired. On the intellectual side, by learn-
ing to describe the visual world from naturally exist-
ing web data, our study extends the domains of lan-
guage grounding to the highly expressive language
that people use in their everyday online activities.

There has been a recent spike in efforts to au-
tomatically describe visual content in natural lan-
guage (Yang et al., 2011; Kulkarni et al., 2011; Li
et al., 2011; Farhadi et al., 2010; Krishnamoorthy et
al., 2013; Elliott and Keller, 2013; Yu and Siskind,
2013; Socher et al., 2014). This reflects the long
standing understanding that encoding the complex-
ities and subtleties of image content often requires
more expressive language constructs than a set of
tags. Now that visual recognition algorithms are be-
ginning to produce reliable estimates of image con-
tent (Perronnin et al., 2012; Deng et al., 2012a; Deng
et al., 2010; Krizhevsky et al., 2012), the time seems
ripe to begin exploring higher level semantic tasks.

There have been two main complementary direc-
tions explored for automatic image captioning. The
first focuses on describing exactly those items (e.g.,
objects, attributes) that are detected by vision recog-
nition, which subsequently confines what should be
described and how (Yao et al., 2010; Kulkarni et al.,
2011; Kojima et al., 2002). Approaches in this direc-
tion could be ideal for various practical applications
such as image description for the visually impaired.
However, it is not clear whether the semantic expres-
siveness of these approaches can eventually scale up
to the casual, but highly expressive language peo-


Transactions of the Association for Computational Linguistics, 2 (2014) 351–362. Action Editor: Hal Daumé III.
Submitted 2/2014; Revised 5/2014; Published 10/2014. ©2014 Association for Computational Linguistics.


[Figure 1 graphic: a target image and four retrieved captions, one per phrase type (Object, Action, Stuff, Scene): “A cow standing in the water”, “I noticed that this funny cow was staring at me”, “A bird hovering in the grass”, “You can see these beautiful hills only in the countryside”.]

Figure 1: Harvesting phrases (as tree fragments) for the target image based on (partial) visual match.

ple naturally use in their online activities. In Fig-
ure 1, for example, it would be hard to compose “I
noticed that this funny cow was staring at me” or
“You can see these beautiful hills only in the coun-
tryside” in a purely bottom-up manner based on the
exact content detected. The key technical bottleneck
is that the range of describable content (i.e., objects,
attributes, actions) is ultimately confined by the set
of items that can be reliably recognized by state-of-
the-art vision techniques.

The second direction, in a complementary avenue
to the first, has explored ways to make use of the
rich spectrum of visual descriptions contributed by
online citizens (Kuznetsova et al., 2012; Feng and
Lapata, 2013; Mason, 2013; Ordonez et al., 2011).
In these approaches, the set of what can be described
can be substantially larger than the set of what can be
recognized, where the former is shaped and defined
by the data, rather than by humans. This allows the
resulting descriptions to be substantially more ex-
pressive, elaborate, and interesting than what would
be possible in a purely bottom-up manner. Our work
contributes to this second line of research.

One challenge in utilizing naturally existing mul-
timodal data, however, is the noisy semantic align-
ment between images and text (Dodge et al., 2012;
Berg et al., 2010). Therefore, we also investi-
gate a related task of image caption generalization
(Kuznetsova et al., 2013), which aims to improve
the semantic image-text alignment by removing bits
of text from existing captions that are less likely to
be transferable to other images.

The high-level idea of our system is to harvest
useful bits of text (as tree fragments) from exist-
ing image descriptions using detected visual content
similarity, and then to compose a new description
by selectively combining these extracted (and op-
tionally pruned) tree fragments. This overall idea

of composition based on extracted phrases is not
new in itself (Kuznetsova et al., 2012); however, we
make several technical and empirical contributions.

First, we propose a novel stochastic tree compo-
sition algorithm based on extracted tree fragments
that integrates both tree structure and sequence co-
hesion into structural inference. Our algorithm per-
mits a substantially higher level of linguistic expres-
siveness, flexibility, and creativity than those based
on rules or templates (Kulkarni et al., 2011; Yang et
al., 2011; Mitchell et al., 2012), while also address-
ing long-distance grammatical relations in a more
principled way than those based on hand-coded con-
straints (Kuznetsova et al., 2012).

Second, we address image caption generalization
as an optional subtask of image caption generation,
and propose a tree compression algorithm that per-
forms a light-weight parsing to search for the op-
timal set of tree branches to prune. Our work is
the first to report empirical benefits of automatically
compressed captions for image captioning.

The proposed approaches attain significantly bet-
ter performance for both image caption generaliza-
tion and generation tasks over competitive baselines
and previous approaches. Our work results in an im-
proved image caption corpus with automatic gener-
alization, which is publicly available.1

2 Harvesting Tree Fragments
Given a query image, we retrieve images that are vi-
sually similar to the query image, then extract po-
tentially useful segments (i.e., phrases) from their
corresponding image descriptions. We then com-
pose a new image description using these retrieved
text fragments (§3). Extraction of useful phrases
is guided by both visual similarity and the syn-
tactic parse of the corresponding textual descrip-

1http://ilp-cky.appspot.com/



tion. This extraction strategy, originally proposed
by Kuznetsova et al. (2012), attempts to make the
best use of linguistic regularities with respect to
objects, actions, and scenes, making it possible to
obtain richer textual descriptions than what cur-
rent state-of-the-art vision techniques can provide
in isolation. In all of our experiments we use the
captioned image corpus of Ordonez et al. (2011),
first pre-processing the corpus for relevant content
by running deformable part model object detec-
tors (Felzenszwalb et al., 2010). For our study, we
run detectors for 89 object classes and set a high confi-
dence threshold for detection.

As illustrated in Figure 1, for a query image de-
tection, we extract four types of phrases (as tree
fragments). First, we retrieve relevant noun phrases
from images with visually similar object detections.
We use color, texture (Leung and Malik, 1999), and
shape (Dalal and Triggs, 2005; Lowe, 2004) based
features encoded in a histogram of vector quantized
responses to measure visual similarity. Second, we
extract verb phrases for which the corresponding
noun phrase takes the subject role. Third, from
those images with “stuff” detections, e.g., “water”,
or “sky” (typically mass nouns), we extract preposi-
tional phrases based on similarity of both visual ap-
pearance and relative spatial relationships between
detected objects and “stuff”. Finally, we use global
“scene” similarity2 to extract prepositional phrases
referring to the overall scene, e.g., “at the confer-
ence,” or “in the market”.
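As a rough illustration of the retrieval step, the minimal Python sketch below ranks captioned corpus entries by their distance to a query feature vector and returns the phrases attached to the closest ones. The function name `retrieve_phrases`, the toy two-dimensional feature vectors, and the use of plain Euclidean distance for every phrase type are assumptions made for illustration only; the features actually used differ by phrase type as described above.

```python
import math

def retrieve_phrases(query_vec, corpus, top_k=2):
    """Return the phrases of the top_k corpus entries closest to query_vec.

    corpus is a list of (feature_vector, phrase) pairs (hypothetical layout).
    """
    ranked = sorted(corpus, key=lambda item: math.dist(query_vec, item[0]))
    return [phrase for _, phrase in ranked[:top_k]]

# Toy corpus: the feature vectors are invented, the phrases are illustrative.
corpus = [
    ([0.0, 1.0], "in the market"),
    ([1.0, 0.0], "at the conference"),
    ([0.9, 0.1], "in the hall"),
]
nearest = retrieve_phrases([1.0, 0.0], corpus)
print(nearest)  # ['at the conference', 'in the hall']
```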

We perform this phrase retrieval process for each
detected object in the query image and generate one
sentence for each object. All sentences are then
combined together to produce the final description.
Optionally, we apply image caption generalization
(via compression) (§4) to all captions in the corpus
prior to the phrase extraction and composition.

3 Tree Composition
We model tree composition as constraint optimiza-
tion. The input to our algorithm is the set of re-
trieved phrases (i.e., tree fragments), as illustrated
in §2. Let P = {p0, ...,pL−1} be the set of all
phrases across the four phrase types (objects, ac-
tions, stuff and scene). We assume a mapping func-

2L2 distance between classification score vectors (Xiao et
al., 2010)

tion pt : [0,L) → T , where T is the set of phrase
types, so that the phrase type of pi is pt(i). In ad-
dition, let R be the set of PCFG production rules
and NT be the set of nonterminal symbols of the
PCFG. The goal is to find and combine a good se-
quence of phrases G, |G| ≤ |T| = N = 4, drawn
from P , into a final sentence. More concretely, we
want to select and order a subset of phrases (at most
one phrase of each phrase type) while considering
both the parse structure and n-gram cohesion across
phrasal boundaries.
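To make the size of this search space concrete, the following Python sketch enumerates by brute force every ordered sequence that uses at most one phrase per type. It is only an illustration of the selection and ordering decisions, not the inference procedure used in this work; the dictionary `CANDIDATES` and the function `enumerate_sequences` are hypothetical names, and the phrases are toy examples.

```python
from itertools import combinations, permutations, product

# Hypothetical candidate phrases, keyed by phrase type.
CANDIDATES = {
    "object": ["a cow", "the brown cow"],
    "action": ["was staring at me"],
    "stuff": ["in the grass"],
    "scene": ["in the countryside"],
}

def enumerate_sequences(candidates):
    """Yield every ordered phrase sequence with at most one phrase per type."""
    types = sorted(candidates)
    for size in range(1, len(types) + 1):
        for chosen_types in combinations(types, size):
            # Pick one phrase for each chosen type ...
            for phrases in product(*(candidates[t] for t in chosen_types)):
                # ... and try every ordering of the picked phrases.
                for order in permutations(phrases):
                    yield order

sequences = list(enumerate_sequences(CANDIDATES))
print(len(sequences))
```

Even with only five candidate phrases the enumeration already produces over a hundred sequences, and each of them would still need a parse, which is why an exhaustive search is not attractive for realistic candidate sets.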

Figure 2 shows a simplified example of a com-
posed sentence with its corresponding parse struc-
ture. For brevity, the figure shows only one phrase
for each phrase type, but in actuality there would be
a set of candidate phrases for each type. Figure 3
shows the CKY-style representation of the internal
mechanics of constraint optimization for the exam-
ple composition from Figure 2. Each cell ij of the
CKY matrix corresponds to Gij, a subsequence of
G starting at position i and ending at position j. If
a cell in the CKY matrix is labeled with a nontermi-
nal symbol s, it means that the corresponding tree of
Gij has s as its root.

Although we visualize the operation using a CKY-
style representation in Figure 3, note that composi-
tion requires more complex combinatorial decisions
than CKY parsing due to two additional considera-
tions. We are: (1) selecting a subset of candidate
phrases, and (2) re-ordering the selected phrases
(hence making the problem NP-hard). Therefore,
we encode our problem using Integer Linear Pro-
gramming (ILP) (Roth and Yih, 2004; Clarke
and Lapata, 2008) and use the CPLEX (ILOG, Inc,
2006) solver.

3.1 ILP Variables

Variables for Sequence Structure: Variables α en-
code phrase selection and ordering:

$$\alpha_{ik} = 1 \ \text{iff phrase } i \in P \text{ is selected for position } k \in [0, N) \tag{1}$$

Where k is one of the N=4 positions in a sentence.3

Additionally, we define variables for each pair of ad-
jacent phrases to capture sequence cohesion:

3The number of positions is equal to the number of phrase
types, since we select at most one from each type.



[Figure 2 graphic: parse tree over the candidate phrases “A cow” (NP), “in the countryside” (PP), “was staring at me” (VP), and “in the grass” (PP), placed at positions 0 to 3; the first three phrases are joined through an intermediate NP under the root S, with indices i=0, k=1, j=2 marked.]

Figure 2: An example scenario of tree composition. Only
the first three phrases are chosen for the composition.

$$\alpha_{ijk} = 1 \ \text{iff} \ \alpha_{ik} = \alpha_{j(k+1)} = 1 \tag{2}$$
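Equations 1 and 2 can be read as a one-hot encoding of an ordered phrase selection. The sketch below, in Python, builds both sets of indicators from a given ordering; the function name `encode_alpha` and the example indices are hypothetical, and in the actual ILP these values are decided by the solver rather than computed from a known sequence.

```python
def encode_alpha(sequence, num_phrases, num_positions=4):
    """Encode an ordered selection of phrase indices as 0/1 indicators.

    Returns (alpha, pair) where alpha[i][k] = 1 iff phrase i sits at
    position k, and pair[(i, j, k)] = 1 iff phrase i is at position k and
    phrase j is at position k + 1.
    """
    alpha = [[0] * num_positions for _ in range(num_phrases)]
    for k, i in enumerate(sequence):
        alpha[i][k] = 1
    pair = {}
    for k in range(len(sequence) - 1):
        pair[(sequence[k], sequence[k + 1], k)] = 1
    return alpha, pair

# Phrases 2, 0 and 3 (out of 5 candidates) are selected, in that order.
alpha, pair = encode_alpha([2, 0, 3], num_phrases=5)
print(pair)  # {(2, 0, 0): 1, (0, 3, 1): 1}
```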

Variables for Tree Structure: Variables β encode
the parse structure:
$$\beta_{ijs} = 1 \ \text{iff the phrase sequence } G_{ij} \text{ maps to the nonterminal symbol } s \in NT \tag{3}$$
Where i ∈ [0,N) and j ∈ [i,N) index rows and
columns of the CKY-style matrix in Figure 3. A cor-
responding example tree is shown in Figure 2, where
the phrase sequence G02 corresponds to the cell la-
beled with S. We also define variables to indicate
selected PCFG rules in the resulting parse:

$$\beta_{ijkr} = 1 \ \text{iff} \ \beta_{ijh} = \beta_{ikp} = \beta_{(k+1)jq} = 1, \tag{4}$$

Where r = h → pq ∈ R and k ∈ [i,j). Index k
points to the boundary of split between two children
as shown in Figure 2 for the sequence G02.
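The role of the cells and of the split index k can be illustrated with a small CKY-style chart over an already ordered phrase sequence. The Python sketch below is a simplification under stated assumptions: the three rules in `RULES` form a hypothetical toy grammar with a single head per child pair and no probabilities, and the chart records every derivable symbol per cell, whereas the β variables select at most one symbol per cell.

```python
# Hypothetical binary rules h -> p q, stored as (p, q) -> h.
RULES = {
    ("NP", "PP"): "NP",
    ("NP", "VP"): "S",
    ("VP", "PP"): "VP",
}

def cky_chart(leaf_symbols, rules):
    """Return {(i, j): set of symbols} for all spans i <= j (inclusive)."""
    n = len(leaf_symbols)
    chart = {(k, k): {s} for k, s in enumerate(leaf_symbols)}
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            cell = set()
            # k is the last position of the left child, so k ranges over [i, j).
            for k in range(i, j):
                for p in chart[(i, k)]:
                    for q in chart[(k + 1, j)]:
                        head = rules.get((p, q))
                        if head is not None:
                            cell.add(head)
            chart[(i, j)] = cell
    return chart

# "A cow" (NP), "in the countryside" (PP), "was staring at me" (VP).
chart = cky_chart(["NP", "PP", "VP"], RULES)
print(chart[(0, 2)])  # {'S'}
```

In this toy run the cell for G02 is derived with the split at k=1, combining the NP in cell 01 with the VP in cell 22.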
Auxiliary Variables: For notational convenience,
we also include:

$$\gamma_{ijk} = 1 \ \text{iff} \ \sum_{s \in NT} \beta_{ijs} = \sum_{s \in NT} \beta_{iks} = \sum_{s \in NT} \beta_{(k+1)js} = 1 \tag{5}$$

3.2 ILP Objective Function
We model tree composition as maximization of the
following objective function:

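$$F = \sum_{i} F_i \times \sum_{k=0}^{N-1} \alpha_{ik} \; + \; \sum_{ij} F_{ij} \times \sum_{k=0}^{N-2} \alpha_{ijk} \; + \; \sum_{ij} \sum_{k=i}^{j-1} \sum_{r \in R} F_r \times \beta_{ijkr} \tag{6}$$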
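For a fixed candidate composition, the three sums of Equation 6 reduce to adding up the weights of the selected phrases, of the adjacent phrase pairs, and of the PCFG rules used in the parse. The Python sketch below evaluates F in that way, assuming each selected phrase occupies exactly one position; the function name `objective` and all numeric scores are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def objective(sequence, rules_used, phrase_score, pair_score, rule_score):
    """Evaluate F for one candidate composition.

    sequence:   phrase indices in sentence order (alpha_ik = 1 for entry k)
    rules_used: PCFG rules r whose beta_ijkr = 1 in the chosen parse
    """
    selection_term = sum(phrase_score[i] for i in sequence)
    cohesion_term = sum(pair_score[(i, j)] for i, j in zip(sequence, sequence[1:]))
    parse_term = sum(rule_score[r] for r in rules_used)
    return selection_term + cohesion_term + parse_term

# Hypothetical weights F_i, F_ij and F_r.
phrase_score = {0: 0.5, 1: 0.25, 2: 1.0}
pair_score = {(0, 1): 0.5, (1, 2): 0.25}
rule_score = {"NP -> NP PP": 0.125, "S -> NP VP": 0.5}

value = objective([0, 1, 2], ["NP -> NP PP", "S -> NP VP"],
                  phrase_score, pair_score, rule_score)
print(value)  # 3.125
```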

[Figure 3 graphic: CKY-style matrix for the example of Figure 2, with cells 00 to 03, 11 to 13, 22 to 23, and 33; the leaf cells hold the phrases “A cow” (NP), “in the countryside” (PP), “was staring at me” (VP), and “in the grass” (PP); the remaining cell labels include NP, PP-VP, VP, and S, and the split points k=0 and k=1 are marked.]

of two variables have been discussed by Clarke and
Lapata (2008). For Equation 2, we add the follow-
ing constraints (similar constraints are also added for
Equations 4,5).

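$$\forall i, j, k: \quad \alpha_{ijk} \le \alpha_{ik}, \qquad \alpha_{ijk} \le \alpha_{j(k+1)}, \qquad \alpha_{ijk} + (1 - \alpha_{ik}) + (1 - \alpha_{j(k+1)}) \ge 1 \tag{7}$$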
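That the three inequalities of Equation 7 force the pair variable to equal the product of the two single-phrase variables can be checked by enumerating all 0/1 assignments. The following Python sketch does this; the names `satisfies_eq7` and `feasible` are illustrative.

```python
from itertools import product

def satisfies_eq7(a_ijk, a_ik, a_jk1):
    """True when the three linear inequalities of Equation 7 all hold."""
    return (
        a_ijk <= a_ik
        and a_ijk <= a_jk1
        and a_ijk + (1 - a_ik) + (1 - a_jk1) >= 1
    )

# For each 0/1 assignment of the two single-phrase variables, list the
# values of the pair variable that the inequalities still allow.
feasible = {
    (a_ik, a_jk1): [a for a in (0, 1) if satisfies_eq7(a, a_ik, a_jk1)]
    for a_ik, a_jk1 in product((0, 1), repeat=2)
}
for (a_ik, a_jk1), allowed in feasible.items():
    print(a_ik, a_jk1, allowed)
```

In every case exactly one value remains feasible, and it is 1 only when both single-phrase variables are 1, which is the condition stated in Equation 2.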

Consistency between Tree Leaves and Sequences:
The ordering of phrases implied by αijk must be
consistent with the ordering of phrases implied by
the β variables. This can be achieved by aligning the
leaf cells (i.e., βkks) in the CKY-style matrix with α
variables as follows:

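$$\forall i, k: \quad \alpha_{ik} \le \sum_{s \in S_i} \beta_{kks} \tag{8}$$

$$\forall k: \quad \sum_{i} \alpha_{ik} = \sum_{s \in NT} \beta_{kks} \tag{9}$$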

Where Si refers to the set of PCFG nonterminals
that are compatible with the phrase type of pi. For
example, Si = {NN,NP, ...} if pi corresponds to
an “object” (noun-phrase). Thus, Equation 8 en-
forces the correspondence between phrase types and
nonterminal symbols at the tree leaves. Equation 9
enforces the constraint that the number of selected
phrases and instantiated tree leaves must be the same.
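The two leaf-alignment conditions can be checked directly on a candidate assignment. The Python sketch below does so for toy inputs; the function name `leaves_consistent` and the data layout (a list of rows for α, one symbol dictionary per leaf cell for β, and one set per phrase for the compatible symbols) are assumptions made for illustration.

```python
def leaves_consistent(alpha, beta_leaf, compatible):
    """Check Equations 8 and 9 for one assignment.

    alpha[i][k]   : 1 iff phrase i is placed at position k
    beta_leaf[k]  : dict symbol -> 0/1 for leaf cell kk
    compatible[i] : set of nonterminals compatible with phrase i
    """
    for k in range(len(beta_leaf)):
        for i, row in enumerate(alpha):
            allowed = sum(beta_leaf[k].get(s, 0) for s in compatible[i])
            if row[k] > allowed:
                return False  # Equation 8 violated
        if sum(row[k] for row in alpha) != sum(beta_leaf[k].values()):
            return False  # Equation 9 violated
    return True

# Phrase 0 is an object phrase at position 0, phrase 1 an action phrase at 1.
alpha = [[1, 0], [0, 1]]
compatible = [{"NN", "NP"}, {"VP"}]
print(leaves_consistent(alpha, [{"NP": 1}, {"VP": 1}], compatible))  # True
print(leaves_consistent(alpha, [{"VP": 1}, {"NP": 1}], compatible))  # False
```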

Tree Congruence Constraints: To ensure that
each CKY cell has at most one symbol we require

$$\forall i, j: \quad \sum_{s \in NT} \beta_{ijs} \le 1 \tag{10}$$

We also require that

$$\forall i, \ j > i, \ h: \quad \beta_{ijh} = \sum_{k=i}^{j-1} \sum_{r \in R_h} \beta_{ijkr} \tag{11}$$

Where Rh = {r ∈ R : r = h → pq}. We enforce
these constraints only for non-leaves. This constraint
forbids instantiations where a nonterminal symbol h
is selected for cell ij without selecting a correspond-
ing PCFG rule.
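For a single non-leaf cell, Equations 10 and 11 together say that at most one symbol is active and that an active symbol is backed by exactly one rule with that symbol as its head. A minimal Python check of this condition is sketched below; the function name `cell_is_congruent` and the encoding of a rule as a tuple (h, p, q) are hypothetical.

```python
def cell_is_congruent(beta_sym, beta_rule):
    """Check Equations 10 and 11 for one non-leaf cell ij.

    beta_sym  : dict h -> 0/1, the symbol indicators of the cell
    beta_rule : dict (k, (h, p, q)) -> 0/1, the rule indicators of the cell
    """
    if sum(beta_sym.values()) > 1:
        return False  # Equation 10: at most one symbol per cell
    heads = set(beta_sym) | {rule[0] for _, rule in beta_rule}
    for h in heads:
        backing = sum(v for (_, rule), v in beta_rule.items() if rule[0] == h)
        if beta_sym.get(h, 0) != backing:
            return False  # Equation 11: symbol and rule choices must agree
    return True

# Cell 02 is labeled S and derived by S -> NP VP with split k = 1.
print(cell_is_congruent({"S": 1}, {(1, ("S", "NP", "VP")): 1}))  # True
# A symbol without a supporting rule is rejected.
print(cell_is_congruent({"S": 1}, {}))  # False
```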

We also ensure that we produce a valid tree struc-
ture. For instance, if we select 3 phrases as shown
in Figure 3, we must have the root of the tree at the
corresponding cell 02.

$$\forall k \in [1, N): \quad \sum_{s \in NT} \beta_{kks} \le \sum_{t=k}^{N-1} \sum_{s \in NT} \beta_{0ts} \tag{12}$$

We also require cells that are not selected for the
resulting parse structure to be empty:

$$\forall i, j: \quad \sum_{k} \gamma_{ijk} \le 1 \tag{13}$$

$$\alpha_{i0} = 1 \tag{14}$$

$$\alpha_{ij1} = 1 \tag{15}$$

Additionally, we penalize solutions without the S
tag at the parse root as a soft-constraint.

Miscellaneous Constraints: Finally, we include
several constraints to avoid degenerate solutions or
otherwise to enhance the composed output: (1) en-
force that a noun-phrase is selected (to ensure se-
mantic relevance to the image content), (2) allow at
most one phrase of each type, (3) do not allow mul-
tiple phrases with identical headwords (to avoid re-
dundancy), (4) allow at most one scene phrase for
all sentences in the description. We find that han-
dling of sentence boundaries is important if the ILP
formulation is based only on sequence structure, but
with the integration of tree-based structure, we need
not handle sentence boundaries.
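The four miscellaneous constraints amount to simple filters on a candidate selection. The Python sketch below expresses them as predicates over (phrase type, headword) pairs; the function names, the tuple layout, and the assumption that the required noun phrase is the phrase of type "object" are illustrative choices, and in the ILP these conditions are linear constraints rather than post-hoc filters.

```python
def violates_sentence_constraints(selection):
    """Constraints (1) to (3) for one sentence.

    selection is a list of (phrase_type, headword) pairs.
    """
    types = [t for t, _ in selection]
    heads = [h for _, h in selection]
    if "object" not in types:
        return True  # (1) a noun phrase must be selected
    if len(types) != len(set(types)):
        return True  # (2) at most one phrase of each type
    if len(heads) != len(set(heads)):
        return True  # (3) no repeated headwords
    return False

def too_many_scene_phrases(description):
    """Constraint (4): at most one scene phrase over all sentences."""
    count = sum(t == "scene" for sentence in description for t, _ in sentence)
    return count > 1

ok = [("object", "cow"), ("action", "staring"), ("scene", "countryside")]
print(violates_sentence_constraints(ok))  # False
print(violates_sentence_constraints([("action", "staring")]))  # True
```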

3.4 Discussion

An interesting aspect of description generation ex-
plored in this paper is that building blocks of com-
position are tree fragments, rather than individual
words. There are three practical benefits: (1) syn-
tactic and semantic expressiveness, (2) correctness,
and (3) computational efficiency. Because we ex-
tract well-formed segments from human-written captions, we
are able to use expressive language and are less likely
to make syntactic or semantic errors. Our phrase
extraction process can be viewed at a high level as
visually-grounded or visually-situated paraphrasing.

Also, because the unit of operation is tree frag-
ments, the ILP formulation encoded in this work is
computationally lightweight. If the unit of compo-
sition were words, the ILP instances would be sig-
nificantly more computationally intensive, and more
likely to suffer from grammatical and semantic er-
rors.




Figure 3: CKY-style representation of decision variables
as defined in §3.1 for the tree example in Fig 2. Non-
terminal symbols in boldface (in blue) and solid arrows
(also in blue) represent the chosen PCFG rules to com-
bine the selected set of phrases. Nonterminal symbols in
smaller font (in red) and dotted arrows (also in red) rep-
resent possible other choices that are not selected.

This objective is comprised of three types of weights
(confidence scores): Fi,Fij,Fr.4 Fi represents the
phrase selection score based on visual similarity, de-
scribed in §2. Fij quantifies the sequence cohe-
sion across phrase boundaries. For this, we use n-
gram scores (n ∈ [2, 5]) between adjacent phrases
computed using the Google Web 1-T corpus (Brants
and Franz, 2006). Finally, Fr quantifies PCFG rule
scores (log probabilities) estimated from the 1M im-
age caption corpus (Ordonez et al., 2011) parsed us-
ing the Stanford parser (Klein and Manning, 2003).

One can view Fi as a content selection score,
while Fij and Fr correspond to linguistic fluency
scores capturing sequence and tree structure respec-
tively. If we set positive values for all of these
weights, the optimization function would be biased
toward verbose production, since selecting an addi-
tional phrase will increase the objective function. To
control for verbosity, we set scores corresponding
to linguistic fluency, i.e., Fij and Fr using negative
values (smaller absolute values for higher fluency),
to balance dynamics between content selection and
linguistic fluency.

4 All weights are normalized using z-score.

3.3 ILP Constraints
Soundness Constraints: We need constraints to enforce consistency between different types of variables (Equations 2, 4, 5). Constraints for a product of two variables have been discussed by Clarke and Lapata (2008). For Equation 2, we add the following constraints (similar constraints are also added for Equations 4 and 5).

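$$
\begin{aligned}
\forall i,j,k:\quad & \alpha_{ijk} \le \alpha_{ik} \\
& \alpha_{ijk} \le \alpha_{j(k+1)} \\
& \alpha_{ijk} + (1 - \alpha_{ik}) + (1 - \alpha_{j(k+1)}) \ge 1
\end{aligned}
\tag{7}
$$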
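The three inequalities of Equation 7 follow the standard linearization of a product of two binary variables. The sketch below assumes that αijk is meant to equal the product αik · αj(k+1) (as suggested by the phrase "product of two variables"); the function names are hypothetical and chosen only for illustration. It enumerates all binary assignments and checks that the only feasible value of αijk is that product.

```python
from itertools import product


def linearization_holds(a_ijk: int, a_ik: int, a_jk1: int) -> bool:
    """Return True if the three inequalities of Equation 7 are satisfied."""
    return (
        a_ijk <= a_ik
        and a_ijk <= a_jk1
        and a_ijk + (1 - a_ik) + (1 - a_jk1) >= 1
    )


def feasible_values(a_ik: int, a_jk1: int) -> list:
    """All binary values of alpha_ijk allowed by Equation 7 for fixed inputs."""
    return [v for v in (0, 1) if linearization_holds(v, a_ik, a_jk1)]


# For every binary assignment, the only feasible value is the product.
for a_ik, a_jk1 in product((0, 1), repeat=2):
    assert feasible_values(a_ik, a_jk1) == [a_ik * a_jk1]
```

The first two inequalities force αijk to 0 whenever either factor is 0, and the third forces it to 1 when both factors are 1.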

Consistency between Tree Leafs and Sequences:
The ordering of phrases implied by αijk must be
consistent with the ordering of phrases implied by
the β variables. This can be achieved by aligning the
leaf cells (i.e., βkks) in the CKY-style matrix with α
variables as follows:

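$$
\forall i,k:\quad \alpha_{ik} \le \sum_{s \in NT_i} \beta_{kks} \tag{8}
$$

$$
\forall k:\quad \sum_{i} \alpha_{ik} = \sum_{s \in NT} \beta_{kks} \tag{9}
$$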

Where NT i refers to the set of PCFG nonterminals
that are compatible with a phrase type pt(i) of pi.
For example, NT i = {NN,NP, ...} if pi corresponds
to an “object” (noun-phrase). Thus, Equation 8 en-
forces the correspondence between phrase types and
nonterminal symbols at the tree leafs. Equation 9
enforces the constraint that the number of selected
phrases and instantiated tree leafs must be the same.
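To make Equations 8 and 9 concrete, the following sketch checks them for a given 0/1 assignment. It is only an illustration: the dictionary layout, the function name, and the set of nonterminals assumed compatible with an "action" phrase ({VP}) are assumptions made for this example, not part of the formulation above.

```python
def leaves_consistent(alpha, beta_leaf, compatible, nonterminals, num_positions):
    """Check Equations 8 and 9 for a candidate 0/1 assignment.

    alpha[(i, k)]     = 1 if phrase i is placed at position k
    beta_leaf[(k, s)] = 1 if leaf cell kk carries nonterminal s
    compatible[i]     = NT_i, the nonterminals compatible with phrase i
    """
    num_phrases = len(compatible)
    for k in range(num_positions):
        # Equation 9: as many selected phrases as instantiated leaves at k.
        selected = sum(alpha.get((i, k), 0) for i in range(num_phrases))
        leaves = sum(beta_leaf.get((k, s), 0) for s in nonterminals)
        if selected != leaves:
            return False
        # Equation 8: a selected phrase needs a compatible leaf symbol.
        for i in range(num_phrases):
            allowed = sum(beta_leaf.get((k, s), 0) for s in compatible[i])
            if alpha.get((i, k), 0) > allowed:
                return False
    return True


# Hypothetical example: phrase 0 is an "object", phrase 1 an "action".
compatible = {0: {"NN", "NP"}, 1: {"VP"}}
nonterminals = {"NN", "NP", "VP"}
alpha = {(0, 0): 1, (1, 1): 1}
good_leaves = {(0, "NP"): 1, (1, "VP"): 1}
bad_leaves = {(0, "VP"): 1, (1, "VP"): 1}  # object phrase under a VP leaf
```

With `good_leaves` both equations hold; with `bad_leaves` Equation 9 still holds at every position, but Equation 8 fails for the object phrase.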

Tree Congruence Constraints: To ensure that
each CKY cell has at most one symbol we require

$$
\forall i,j:\quad \sum_{s \in NT} \beta_{ijs} \le 1 \tag{10}
$$

We also require that

$$
\forall i,\; j > i,\; h:\quad \beta_{ijh} = \sum_{k=i}^{j-1} \sum_{r \in R_h} \beta_{ijkr} \tag{11}
$$

Where Rh = {r ∈ R : r = h → pq}. We enforce
these constraints only for non-leafs. This constraint
forbids instantiations where a nonterminal symbol h
is selected for cell ij without selecting a correspond-
ing PCFG rule.
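As a rough illustration of Equations 10 and 11, the sketch below verifies that each cell holds at most one symbol and that every symbol in a non-leaf cell is backed by exactly one (split point, rule) choice. The dictionary encoding and the representation of a rule as a tuple (h, p, q) are assumptions made for this example.

```python
def congruent(beta_sym, beta_rule, n, nonterminals):
    """Check Equations 10 and 11 for a candidate 0/1 assignment.

    beta_sym[(i, j, h)]     = 1 if symbol h is selected for cell ij
    beta_rule[(i, j, k, r)] = 1 if rule r = (h, p, q) is applied in cell ij
                              with split point k
    """
    for i in range(n):
        for j in range(i, n):
            # Equation 10: at most one symbol per cell.
            if sum(beta_sym.get((i, j, s), 0) for s in nonterminals) > 1:
                return False
            if j == i:
                continue  # Equation 11 is enforced only for non-leaf cells
            for h in nonterminals:
                backing = sum(
                    value
                    for (a, b, k, rule), value in beta_rule.items()
                    if (a, b) == (i, j) and i <= k < j and rule[0] == h
                )
                # Equation 11: symbol h is selected iff one rule with head h is.
                if beta_sym.get((i, j, h), 0) != backing:
                    return False
    return True


nonterminals = {"NP", "VP", "S"}
beta_sym = {(0, 0, "NP"): 1, (1, 1, "VP"): 1, (0, 1, "S"): 1}
beta_rule = {(0, 1, 0, ("S", "NP", "VP")): 1}
```

Dropping the single rule from `beta_rule` leaves the symbol S in cell 01 without a supporting rule, which is exactly the situation Equation 11 forbids.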

We also ensure that we produce a valid tree struc-
ture. For instance, if we select 3 phrases as shown
in Figure 3, we must have the root of the tree at the
corresponding cell 02.

$$
\forall k \in [1, N):\quad \sum_{s \in NT} \beta_{kks} \le \sum_{t=k}^{N-1} \sum_{s \in NT} \beta_{0ts} \tag{12}
$$

We also require cells that are not selected for the
resulting parse structure to be empty:

$$
\forall i,j:\quad \sum_{k} \gamma_{ijk} \le 1 \tag{13}
$$

Additionally, we penalize solutions without the S
tag at the parse root as a soft-constraint.

Miscellaneous Constraints: Finally, we include
several constraints to avoid degenerate solutions or
to otherwise enhance the composed output. We: (1)
enforce that a noun-phrase is selected (to ensure se-
mantic relevance to the image content), (2) allow at
most one phrase of each type, (3) do not allow mul-
tiple phrases with identical headwords (to avoid re-
dundancy), (4) allow at most one scene phrase for
all sentences in the description. We find that han-
dling of sentence boundaries is important if the ILP
formulation is based only on sequence structure, but
with the integration of tree-based structure, we do
not need to specifically handle sentence boundaries.
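The first three miscellaneous constraints can be pictured as simple checks on a selected set of phrases. The sketch below is a post-hoc checker rather than an ILP encoding; the `Phrase` class, the type labels, and the function name are hypothetical. Constraint (4) is omitted because, for a single sentence, it is already implied by constraint (2).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Phrase:
    text: str
    kind: str      # hypothetical type labels, e.g. "object", "action", "scene"
    headword: str


def misc_constraint_problems(selected):
    """Return a list of violated constraints (empty if the selection is valid)."""
    kinds = Counter(p.kind for p in selected)
    heads = Counter(p.headword for p in selected)
    problems = []
    if kinds["object"] == 0:                      # constraint (1)
        problems.append("no noun phrase selected")
    if any(c > 1 for c in kinds.values()):        # constraint (2)
        problems.append("more than one phrase of a type")
    if any(c > 1 for c in heads.values()):        # constraint (3)
        problems.append("repeated headword")
    return problems


good = [Phrase("the duck", "object", "duck"),
        Phrase("was having a feast", "action", "having")]
bad = [Phrase("the duck", "object", "duck"),
       Phrase("a duck", "object", "duck")]
```

Here `good` passes all three checks, while `bad` violates constraints (2) and (3).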

3.4 Discussion
An interesting aspect of description generation explored in this paper is the use of tree fragments, rather than individual words, as the building blocks of composition. There are three practical benefits: (1) syntactic and semantic expressiveness, (2) correctness, and (3) computational efficiency. Because we extract phrases from human-written captions, we are able to use expressive language and are less likely to make syntactic or semantic errors. Our phrase extraction process can be viewed at a high level as visually-grounded or visually-situated paraphrasing. Also, because the unit of operation is tree fragments, the ILP formulation encoded in this work is computationally lightweight. If the unit of composition were words, the ILP instances would be significantly more computationally intensive and more likely to suffer from grammatical and semantic errors.

4 Tree Compression
As noted by recent studies (Mason and Charniak, 2013; Kuznetsova et al., 2013; Jamieson et al., 2010), naturally existing image captions often include contextual information that does not directly describe visual content, which ultimately hinders their usefulness for describing other images. Therefore, to improve the fidelity of the generated descriptions, we explore image caption generalization as an optional pre-processing step. Figure 4 illustrates a concrete example of image caption generalization in the context of image caption generation.

[Figure 4 text. Original caption: "Late in the day, after my sunset shot attempts, my cat strolled along the fence and posed for this classic profile". After generalization: "Late in the day", "cat", "posed for this profile". Other captions shown in the figure: "This bridge stands late in the day, after my sunset shot attempts"; "A cat strolled along the fence and posed for this classic profile".]

Figure 4: Compressed captions (on the left) are more applicable for describing new images (on the right).

We cast caption generalization as sentence com-
pression. We encode the problem as tree pruning via
lightweight CKY parsing, while also incorporating
several other considerations such as leaf-level ngram
cohesion scores and visually informed content selec-
tion. Figure 5 shows an example compression, and
Figure 6 shows the corresponding CKY matrix.

At a high level, the compression operation resem-
bles bottom-up CKY parsing, but in addition to pars-
ing, we also consider deletion of parts of the trees.
When deleting parts of the original tree, we might
need to re-parse the remainder of the tree. Note that
we consider re-parsing only with respect to the orig-
inal parse tree produced by a state-of-the-art parser,
hence it is only a light-weight parsing.5

5 Integrating full parsing into the original sentence would be a straightforward extension conceptually, but may not be an empirically better choice when parsing for compression is based on vanilla unlexicalized parsing.

4.1 Dynamic Programming

Input to the algorithm is a sentence, represented as a vector x = x0...xn−1 = x[0 : n − 1], and its PCFG parse π(x) obtained from the Stanford parser. For simplicity of notation, we assume that both the parse tree and the word sequence are encoded in x. Then, the compression can be formalized as:

$$
\hat{y} = \arg\max_{y} \prod_{i} \phi_i(x, y) \tag{14}
$$

Where each φi is a potential function, corresponding to a criterion of the desired compression:

$$
\phi_i(x, y) = \exp(\theta_i \cdot f_i(x, y)) \tag{15}
$$

Where θi is the weight for a particular criterion (described in §4.2), whose scoring function is fi.
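Because each potential in Equation 15 is an exponential, maximizing the product in Equation 14 is equivalent to maximizing the weighted sum of the feature scores, which is what makes an additive dynamic program possible. The sketch below illustrates this with made-up feature values and weights; the candidate names and numbers are purely hypothetical.

```python
import math


def potential_product(weights, features):
    """Product of exp(theta_i * f_i), as in Equations 14 and 15."""
    return math.prod(math.exp(w * f) for w, f in zip(weights, features))


def log_score(weights, features):
    """Logarithm of the product: a plain weighted sum of feature scores."""
    return sum(w * f for w, f in zip(weights, features))


# Hypothetical candidates, each with two feature scores (e.g. log probabilities).
weights = [1.0, 0.5]
candidates = {
    "keep all": [-4.0, -1.0],
    "drop modifier": [-2.5, -1.5],
    "drop everything": [-6.0, 0.0],
}

best_by_product = max(candidates, key=lambda y: potential_product(weights, candidates[y]))
best_by_sum = max(candidates, key=lambda y: log_score(weights, candidates[y]))
```

Both criteria select the same candidate, since the logarithm is monotonic.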

We solve the decoding problem (Equation 14) using dynamic programming. For this, we need to solve the compression sub-problems for sequences x[i : j], which can be viewed as branches ŷ[i : j] of the final tree ŷ[0 : n − 1]. For example, in Figure 5, the final solution is ŷ[0 : 7], while a sub-solution of x[4 : 7] corresponds to a tree branch PP. Notice that sub-solution ŷ[3 : 7] represents the same branch as ŷ[4 : 7] due to branch deletion. Some computed sub-solutions, e.g., ŷ[1 : 4], get dropped from the final compressed tree.

We define a matrix of scores D[i,j,h] (Equation 17), where h is one of the nonterminal symbols being considered for a cell indexed by i,j, i.e., a candidate for the root symbol of a branch ŷ[i : j]. When all values D[i,j,h] are computed, we take

$$
\hat{h} = \arg\max_{h} D[0, n-1, h] \tag{16}
$$

and backtrack to reconstruct the final compression (the exact solution to Equation 14).

$$
D[i,j,h] = \max_{\substack{k \in [i,j) \\ r \in R_h}}
\begin{cases}
D[i,k,p] + D[k+1,j,q] + \Delta\phi[r,ij] & (1) \\
D[i,k,p] + \Delta\phi[r,ij] & (2) \\
D[k+1,j,p] + \Delta\phi[r,ij] & (3)
\end{cases}
\tag{17}
$$

Where Rh = {r ∈ R : r = h → pq ∨ r = h → p}. Index k determines a split point for child branches of a subtree ŷ[i : j]. For example, in Figure 5 the split point for children of the subtree ŷ[0 : 7] is k = 2. The three cases ((1) – (3)) of the above equation correspond to the following tree pruning cases:

Pruning Case (1): None of the children of the cur-
rent node is deleted. For example, in Figures 5 and
6, the PCFG rule PP → IN PP , corresponding
to the sequence “in black and white”, is retained.
Another situation that can be encountered is tree re-
parsing.

[Figure 5 text. Sentence (word positions 0–7, split point k = 2): Vintage/JJ motorcycle/NN shot/NN done/VBN in/IN black/JJ and/CC white/JJ. Tree node labels: NP, NN; NP; CC-JJ; VP, PP; NP; PP; S. Annotations: deletion probability, rule probability, vision confidence, ngram cohesion; two deletions are marked (case 1 and case 2).]

Figure 5: CKY compression. Both the chosen rules and phrases (blue bold font and blue solid arrows) and not chosen rules and phrases (red italic smaller font and red dashed lines) are shown.

Pruning Case (2)/(3): Deletion of the left/right
child respectively. There are two types of deletion,
as illustrated in Figures 5 and 6. The first corre-
sponds to deletion of a child node. For example,
the second child NN of rule NP → NP NN is
deleted, which yields deletion of “shot”. The sec-
ond type is a special case of propagating a node
to a higher-level of the tree. In Figure 6, this sit-
uation occurs when deleting JJ “Vintage”, which
causes the propagation of NN from cell 11 to cell
01. For this purpose, we expand the set of rules R
with additional special rules of the form h → h,
e.g., NN → NN, which allows propagation of tree
nodes to higher levels of the compressed tree.6
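The recurrence of Equation 17 and the three pruning cases can be sketched as a small CKY-style dynamic program. This is a simplified illustration, not the system described here: ∆φ is reduced to a constant score per rule (the full model also uses ngram cohesion and vision scores), all rule scores are invented, and the function and variable names are hypothetical.

```python
NEG_INF = float("-inf")


def compress(words, tags, binary_rules, unary_rules):
    """Toy version of Equations 16 and 17.

    binary_rules maps (h, p, q) -> score for h -> p q  (pruning case 1)
    unary_rules  maps (h, p)    -> score for h -> p    (pruning cases 2 and 3)
    Returns (root symbol, kept words), or (None, words) if no parse is found.
    """
    n = len(words)
    D, back = {}, {}
    for i, tag in enumerate(tags):
        D[i, i, tag] = 0.0
        back[i, i, tag] = ("leaf",)

    def update(i, j, h, score, pointer):
        # Keep the best-scoring derivation of symbol h over span [i, j].
        if score > D.get((i, j, h), NEG_INF):
            D[i, j, h] = score
            back[i, j, h] = pointer

    for width in range(1, n):
        for i in range(n - width):
            j = i + width
            for k in range(i, j):  # split point
                for (h, p, q), gain in binary_rules.items():
                    left = D.get((i, k, p), NEG_INF)
                    right = D.get((k + 1, j, q), NEG_INF)
                    update(i, j, h, left + right + gain, ("both", k, p, q))
                for (h, p), gain in unary_rules.items():
                    # Case 2 keeps only the left branch, case 3 only the right.
                    update(i, j, h, D.get((i, k, p), NEG_INF) + gain, ("left", k, p))
                    update(i, j, h, D.get((k + 1, j, p), NEG_INF) + gain, ("right", k, p))

    def kept(i, j, h):
        pointer = back[i, j, h]
        if pointer[0] == "leaf":
            return [words[i]]
        if pointer[0] == "both":
            _, k, p, q = pointer
            return kept(i, k, p) + kept(k + 1, j, q)
        _, k, p = pointer
        return kept(i, k, p) if pointer[0] == "left" else kept(k + 1, j, p)

    roots = [h for (i, j, h) in D if i == 0 and j == n - 1]
    if not roots:
        return None, list(words)
    best = max(roots, key=lambda h: D[0, n - 1, h])  # Equation 16
    return best, kept(0, n - 1, best)


# Hypothetical scores for a fragment of the running example.
words = ["vintage", "motorcycle", "shot"]
tags = ["JJ", "NN", "NN"]
binary_rules = {("NP", "JJ", "NN"): -1.0, ("NP", "NP", "NN"): -2.0}
unary_rules = {("NP", "NP"): -0.5, ("NN", "NN"): -3.0}
result = compress(words, tags, binary_rules, unary_rules)
```

With these invented scores, deleting the second child of NP → NP NN is cheaper than keeping it, so the sketch drops "shot" and returns `("NP", ["vintage", "motorcycle"])`.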

4.2 Modeling Compression Criteria
The ∆φ term7 in Equation 17 denotes the sum of the log potential functions over the criteria q:

$$
\Delta\phi[r, ij] = \sum_{q} \theta_q \cdot \Delta f_q(r, ij) \tag{18}
$$

Note that ∆φ depends on the current rule r, along
with the historical information before the current
step ij, such as the original rule rij, and ngrams on
the border between left and right child branches of
rule rij. We use the following four criteria fq in our
model, which are demonstrated in Figures 5 and 6.
I. Tree Structure: We capture PCFG rule prob-
abilities estimated from the corpus as ∆fpcfg =
log Ppcfg(r).

6 We assign probabilities of these special propagation rules to 1 so that they will not affect the final parse tree score. Turner and Charniak (2005) handled propagation cases similarly.

7 We use ∆ to distinguish the potential value for the whole sentence from the gain of the potential during a single step of the algorithm.

[Figure 6 text. CKY matrix (rows i, columns j) for "Vintage motorcycle shot done in black and white". Leaf cells on the diagonal: JJ (Vintage), NN (motorcycle), NN (shot), VBN (done), IN (in), JJ (black), CC (and), JJ (white). Other cell labels: NP, NN; NP; S; VP, PP; PP; NP; CC-JJ. Cells 00, 11, and 01 are marked. Annotations: rule probability, ngram cohesion, deletion probability, vision confidence.]

Figure 6: CKY compression. Both the chosen rules and phrases (blue bold font and blue solid arrows) and not chosen rules and phrases (red italic smaller font and red dashed lines) are shown.

II. Sequence Structure: We incorporate ngram
cohesion scores only across the border between two
branches of a subtree.
III. Branch Deletion Probabilities: We compute
probabilities of deletion for children as:

$$
\Delta f_{del} = \log P(r_t \mid r_{ij}) = \log \frac{\mathrm{count}(r_t, r_{ij})}{\mathrm{count}(r_{ij})} \tag{19}
$$

Where count(rt,rij) is the frequency with which rij is transformed to rt by deletion of one of the children. We estimate this probability from a training corpus, described in §4.3. count(rij) is the count of rij in uncompressed sentences.
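Equation 19 is a relative-frequency estimate, which can be sketched as follows. The input format is an assumption made for illustration: each observation pairs an original rule with the rule it became after deletion, or with `None` if the rule was left intact. The rule strings and the function name are hypothetical.

```python
import math
from collections import Counter


def deletion_log_probs(observations):
    """Estimate log P(r_t | r_ij) as in Equation 19.

    observations: iterable of (original_rule, transformed_rule_or_None) pairs
    collected from aligned original/compressed parses.
    """
    observations = list(observations)
    original_counts = Counter(orig for orig, _ in observations)
    pair_counts = Counter(
        (orig, new) for orig, new in observations if new is not None
    )
    return {
        (orig, new): math.log(count / original_counts[orig])
        for (orig, new), count in pair_counts.items()
    }


# Hypothetical aligned observations for the rule NP -> NP NN.
observations = [
    ("NP -> NP NN", "NP -> NP"),
    ("NP -> NP NN", None),
    ("NP -> NP NN", "NP -> NP"),
    ("NP -> NP NN", "NP -> NN"),
]
log_probs = deletion_log_probs(observations)
```

In this toy sample the rule occurs four times and loses its second child twice, giving P = 2/4 for that transformation.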

IV. Vision Detection (Content Selection): We want to keep words referring to actual objects in the image. Thus, we use V (xj), a visual similarity score, as our confidence of an object corresponding to word xj. This similarity is obtained from the visual recognition predictions of Deng et al. (2012b).

Note that some test instances include rules that
we have not observed during training. We default
to the original caption in those cases. The weights
θi are set using a tuning dataset. We control over-
compression by setting the weight for fdel to a small
value relative to the other weights.

4.3 Human Compressed Captions
Although we model image caption generalization as sentence compression, in practical applications we may want the outputs of these two tasks to be different. For example, there may be differences in what should be deleted (named entities in newswire summaries could be important to keep, while they may be extraneous for image caption generalization). To learn the syntactic patterns for caption generalization, we collect a small set of example compressed captions (380 in total) using Amazon Mechanical Turk (AMT) (Snow et al., 2008). For each image, we asked 3 turkers to first list all visible objects in the image and then to write a compressed caption by removing bits of text that are not visually verifiable. We then align the original and compressed captions to measure rule deletion probabilities, excluding misalignments, similar to Knight and Marcu (2000). Note that we remove this dataset from the 1M caption corpus when we perform description generation.

[Figure 7 text.]
Orig: Note the pillows, they match the chair that goes with it, plus the table in the picture is included.
SeqCompression: The table in the picture.
TreePruning: The chair with the table in the picture.

Orig: Only in wintertime we see these birds here in the river.
SeqCompression: See these birds in the river.
TreePruning: These birds in the river.

Orig: The world's most powerful lighthouse sitting beside the house with the world's thickest curtains.
SeqCompression: Sitting beside the house
TreePruning: Powerful lighthouse beside the house with the curtains.

Orig: Orange cloud on street light - near Lanakila Street (phone camera).
SeqCompression: Orange street
TreePruning: Phone camera.
(Relevance problem)

Orig: There's something about having 5 trucks parked in front of my house that makes me feel all important-like.
SeqCompression: Front of my house.
TreePruning: Trucks in front my house.
(Grammar mistakes)

Figure 7: Caption generalization: good/bad examples.

5 Experiments
We use the 1M captioned image corpus of Ordonez
et al. (2011). We reserve 1K images as a test set, and
use the rest of the corpus for phrase extraction. We
experiment with the following approaches:
Proposed Approaches:
• TREEPRUNING: Our tree compression approach as described in §4.
• SEQ+TREE: Our tree composition approach as described in §3.
• SEQ+TREE+PRUNING: SEQ+TREE using compressed captions of TREEPRUNING as building blocks.

Baselines for Composition:
• SEQ+LINGRULE: The closest equivalent to the older sequence-driven system (Kuznetsova et al., 2012). Uses a few minor enhancements, such as sentence-boundary statistics, to improve grammaticality.
• SEQ: The §3 system without tree models and without the enhancements of SEQ+LINGRULE mentioned above.
• SEQ+PRUNING: SEQ using compressed captions of TREEPRUNING as building blocks.

Method              BLEU w/ (w/o) penalty   METEOR P   METEOR R   METEOR M
SEQ+LINGRULE        0.152 (0.152)           0.13       0.17       0.095
SEQ                 0.138 (0.138)           0.12       0.18       0.094
SEQ+TREE            0.149 (0.149)           0.13       0.14       0.082
SEQ+PRUNING         0.177 (0.177)           0.15       0.16       0.101
SEQ+TREE+PRUNING    0.140 (0.189)           0.16       0.12       0.088

Table 1: Automatic Evaluation

We also experiment with the compression of human
written captions, which are used to generate image
descriptions for the new target images.
Baselines for Compression:
• SEQCOMPRESSION (Kuznetsova et al., 2013): Inference operates over the sequence structure. Although optimization is subject to constraints derived from dependency parse, parsing is not an explicit part of the inference structure. Example outputs are shown in Figure 7.

5.1 Automatic Evaluation
We perform automatic evaluation using two measures widely used in machine translation: BLEU (Papineni et al., 2002)8 and METEOR (Denkowski and Lavie, 2011).9 We remove all punctuation and convert captions to lower case. We use 1K test images from the captioned image corpus,10 and assume the original captions as the gold standard captions to compare against. The results in Table 1 show that both the integration of the tree structure (+TREE) and the generalization of captions using tree compression (+PRUNING) improve the BLEU score without brevity penalty significantly,11 while improving METEOR only moderately (due to an improvement in precision with a decrease in recall).

8 We use the unigram NIST implementation: ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v13a-20091001.tar.gz

9 With equal weight between precision and recall in Table 1.

10 Except for those for which image URLs are broken, or CPLEX did not return a solution.

                                                   Method-1 preferred over Method-2 (%)
Method-1            Method-2           Criteria    all turkers   turkers w/ κ > 0.55   turkers w/ κ > 0.6

Image Description Generation
SEQ+TREE            SEQ                Rel         72            72                    72
SEQ+TREE            SEQ                Gmar        83            83                    83
SEQ+TREE            SEQ                All         68            69                    66
SEQ+TREE+PRUNING    SEQ+TREE           Rel         68            72                    72
SEQ+TREE+PRUNING    SEQ+TREE           Gmar        41            38                    41
SEQ+TREE+PRUNING    SEQ+TREE           All         63            64                    66
SEQ+TREE            SEQ+LINGRULE       All         62            64                    62
SEQ+TREE+PRUNING    SEQ+LINGRULE       All         67            75                    77
SEQ+TREE+PRUNING    SEQ+PRUNING        All         73            75                    75
SEQ+TREE+PRUNING    HUMAN              All         24            19                    19

Image Caption Generalization
TREEPRUNING         SEQCOMPRESSION∗    Rel         65            65                    66

Table 2: Human Evaluation: posed as a binary question “which of the two options is better?” with respect to Relevance (Rel), Grammar (Gmar), and Overall (All). According to Pearson’s χ2 test, all results are statistically significant.
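The difference between the two BLEU columns of Table 1 comes from the brevity penalty, which lowers the score of candidates shorter than the reference. The sketch below is a minimal sentence-level, single-reference unigram version following the definition of Papineni et al. (2002); it is not the NIST script used for the reported numbers, and the function names are hypothetical. The preprocessing mirrors the description above (punctuation removed, lower-cased).

```python
import math
import string
from collections import Counter


def normalize(caption):
    """Strip punctuation, lower-case, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return caption.translate(table).lower().split()


def unigram_bleu(candidate, reference, use_penalty=True):
    """Clipped unigram precision, optionally scaled by the brevity penalty."""
    cand, ref = normalize(candidate), normalize(reference)
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = clipped / len(cand)
    if not use_penalty or len(cand) > len(ref):
        return precision
    # Brevity penalty: exp(1 - r / c) when the candidate is not longer.
    return precision * math.exp(1 - len(ref) / len(cand))


reference = "The butterflies are attracted to the colourful flowers in Hope Gardens."
candidate = "The butterflies are attracted to the colourful flowers."
without_penalty = unigram_bleu(candidate, reference, use_penalty=False)
with_penalty = unigram_bleu(candidate, reference, use_penalty=True)
```

For this pair the pruned caption has perfect unigram precision but is three words shorter than the reference, so the penalty scales its score by exp(1 − 11/8). This is the kind of effect that separates the penalized and unpenalized scores for compressed captions.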

5.2 Human Evaluation
Neither BLEU nor METEOR directly measures grammatical correctness over long distances, and neither may correspond perfectly to human judgments. Therefore, we supplement automatic evaluation with human evaluation. For human evaluations, we present two options generated from two competing systems, and ask turkers to choose the one that is better with respect to relevance, grammar, and overall quality. Results are shown in Table 2 with 3 turker ratings per image. We filter out turkers based on a control question. We then compute the selection rate (%) of preferring method-1 over method-2. The agreement among turkers is a frequent concern. Therefore, we vary the set of dependable users based on their Cohen’s kappa score (κ) against other users. It turns out that filtering users based on κ does not make a big difference in determining the winning method.
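The two quantities used in this evaluation, the selection rate and Cohen's kappa, can be sketched as follows. The kappa computed here is the standard two-rater version; how each turker's score "against other users" is aggregated is not specified above, so the pairwise form and the example votes are assumptions made for illustration.

```python
from collections import Counter


def selection_rate(votes, method=1):
    """Percentage of votes that prefer the given method."""
    return 100.0 * sum(v == method for v in votes) / len(votes)


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    labels = freq_a.keys() | freq_b.keys()
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in labels) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical votes: 1 = prefers method-1, 2 = prefers method-2.
turker_a = [1, 1, 2, 1, 2, 2, 1, 1]
turker_b = [1, 1, 2, 2, 2, 2, 1, 2]
rate_a = selection_rate(turker_a)
kappa = cohens_kappa(turker_a, turker_b)
```

In this toy example the raters agree on 6 of 8 items, chance agreement is 30/64, and kappa is therefore about 0.53, which would fall just below the 0.55 threshold used in Table 2.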

As expected, tree-based systems significantly outperform sequence-based counterparts. For example, SEQ+TREE is strongly preferred over SEQ, with a selection rate of 83% on grammar. Somewhat surprisingly, improved grammaticality also seems to improve relevance scores (72%), possibly because it is harder to appreciate the semantic relevance of automatic captions when they are less comprehensible. Also as expected, compositions based on pruned tree fragments significantly improve relevance (68–72%), while slightly deteriorating grammar (38–41%).

11 While 4-gram BLEU with brevity penalty is found to correlate better with human judges by recent studies (Elliott and Keller, 2014), we found that this is not the case for our task. This may be due to the differences in the gold standard captions. We use naturally existing ones, which include a wider range of content and style than crowd-sourced captions.

[Figure 8 text.]
Caption Generalization
Orig: The butterflies are attracted to the colourful flowers in Hope Gardens.
SeqCompression: The colourful flowers.
TreePruning: The butterflies are attracted to the colourful flowers.

Image Description Generation
Seq: A butterfly to the car was spotted by my nine year old cousin.
Seq+Pruning: The butterflies are attracted to the colourful flowers to the car.
Seq+Tree: The butterflies are attracted to the colourful flowers in Hope Gardens.
Seq+Tree+Pruning: The butterflies are attracted to the colourful flowers.

Figure 8: An example of a description preferred over human gold standard. Image description is improved due to caption generalization.

Notably, the captions generated by our system are
preferred over the original (owner generated) cap-
tions 19–24% of the time. One such example is in-
cluded in Figure 8: “The butterflies are attracted to
the colorful flowers.”

Additional examples (good and bad) are provided in Figures 9 and 10. Many of these captions are highly expressive while remaining semantically plausible, thanks to the expressive but somewhat predictable descriptions online users write about their photos. Even among the bad examples (Figure 10), one can find highly creative captions with metaphorical rather than literal relevance: “Monarch in her bedroom before the wedding ceremony”.12 The complete system captions and the original captions are available at http://ilp-cky.appspot.com/

[Figure 9 text, in extraction order; category labels are shown in parentheses.]
Human: Some flower on a bar in a hotel in Grapevine, TX.
Seq+Tree+Pruning: The flower was so vivid and attractive.
(Highly expressive)

H: Maybe the most common bird in the neighborhood, not just the most common water fowl in the neighborhood!
Seq+Tree+Pruning: The duck was having a feast.

H: Spring in a white dress.
Seq+Tree+Pruning: Blue flowers have no scent. Small white flowers have no idea what they are.
(Poetic)

H: Tower bridge London in black and white.
Seq+Tree+Pruning: The tower built on each side.

H: This stained glass window is in the porch of Kilcash Church, Slieve na Mon.
Seq+Tree+Pruning: This window depicts the church.
(Interesting choice of verb phrases)

H: Shot in Blackpool on Tescos car park with use of a 5M candle power light.
Seq+Tree+Pruning: Red car in the middle of the road.
(Informative)

Figure 9: Description generation: good examples. Descriptions preferred over human gold standard are highlighted.

[Figure 10 text, in extraction order; category labels are shown in parentheses.]
Human: The floor of the market area in Tirumala was decorated with these rangolis.
Seq+Tree+Pruning: In a tree ball from the ground train station.

H: Our cat sleeping in the cot.
Seq+Tree+Pruning: Our cat is sitting in the bird feeder and actually eats the sun flower seed.

H: My orange is in a very blue state.
Seq+Tree+Pruning: Just an apple in the sky.

H: In the flower bed by the large gate, and various other places in the garden.
Seq+Tree+Pruning: Random flowers offered to me by two little girls.

(Semantic dissonance due to generalization error) (Completely wrong) (Extraneous information) (Vision detection error)

H: A delightful clock in the town centre of St Helier with the iconic Jersey cow at the base.
Seq+Tree+Pruning: Not the clock face in the world.
(Grammar problems)

H: A butterfly in a field in the Santa Monica mountains.
Seq+Tree+Pruning: Monarch in her bedroom before the wedding ceremony.
(Literally not relevant, but metaphorically creative!)

Figure 10: Description generation: bad examples.

6 Related Work

Sentence Fusion Sentence fusion has been studied
mostly for multi-document summarization
(Barzilay and McKeown, 2005), where redundancy
across multiple sentences serves as a guideline for
syntactic and semantic validity of generation. In
contrast, our task offers no such natural redundancy
to rely upon, so the composition algorithm must itself
be constrained well enough to produce correct
sentence structures.

12 “Monarch” can be a type of butterfly.

Sentence Compression At the core of the image
caption generalization task is sentence compression.
Much work has considered deletion-only edits like
ours (Knight and Marcu, 2000; Turner and Charniak,
2005; Cohn and Lapata, 2007; Filippova and
Altun, 2013), while more recent work explores more
complex edits, such as substitutions, insertions, and
reordering (Cohn and Lapata, 2008). The latter generally
requires a larger training corpus. We leave more
expressive compression to future work.
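
To make the distinction concrete, the following minimal Python sketch checks whether a shortened caption is a deletion-only edit of its source, that is, whether its words form an in-order subsequence of the original words. The function names, the regex tokenizer, and the three shortened captions are illustrative assumptions and not part of the system described in this paper; the source caption is the human caption from Figure 10.

```python
import re


def tokens(text: str) -> list[str]:
    """Lowercase word tokens; punctuation is dropped (a simplification)."""
    return re.findall(r"\w+", text.lower())


def is_deletion_only(original: str, compressed: str) -> bool:
    """True if `compressed` keeps a subset of the words of `original`, in order."""
    remaining = iter(tokens(original))
    # `word in remaining` consumes the iterator up to the first match,
    # so each kept word must occur after the previously kept one.
    return all(word in remaining for word in tokens(compressed))


caption = "A butterfly in a field in the Santa Monica mountains."

print(is_deletion_only(caption, "A butterfly in a field."))   # True: words deleted only
print(is_deletion_only(caption, "A moth in a field."))        # False: substitution
print(is_deletion_only(caption, "In a field, a butterfly."))  # False: reordering
```

A check of this kind only tests the edit type; it says nothing about whether the shortened caption is grammatical or still relevant to the image.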

7 Conclusion
In this paper, we have presented a novel tree composition
approach for generating expressive image
descriptions. As an optional preprocessing step, we
also presented a tree compression approach and reported
the empirical benefit of using automatically
compressed captions to improve image description
generation. By integrating both the tree structure
and the sequence structure, we have significantly improved
the quality of composed image captions over
several competitive baselines.

References

Regina Barzilay and Kathleen McKeown. 2005. Sen-
tence fusion for multidocument news summarization.
Computational Linguistics, 31(3):297–328.

Tamara L. Berg, Alexander C. Berg, and Jonathan Shih.
2010. Automatic attribute discovery and character-
ization from noisy web data. In Proceedings of
the 11th European Conference on Computer Vision:
Part I, ECCV’10, pages 663–676, Berlin, Heidelberg.
Springer-Verlag.

Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram
version 1. In Linguistic Data Consortium.

James Clarke and Mirella Lapata. 2008. Global inference
for sentence compression: an integer linear programming
approach. Journal of Artificial Intelligence
Research, 31:399–429.

Trevor Cohn and Mirella Lapata. 2007. Large margin
synchronous generation and its application to sentence
compression. In Proceedings of the 2007 Joint Confer-
ence on Empirical Methods in Natural Language Pro-
cessing and Computational Natural Language Learn-
ing (EMNLP-CoNLL), pages 73–82, Prague, Czech
Republic, June. Association for Computational Lin-
guistics.

Trevor Cohn and Mirella Lapata. 2008. Sentence com-
pression beyond word deletion. In Proceedings of the
22nd International Conference on Computational Lin-
guistics (Coling 2008), pages 137–144, Manchester,
UK, August. Coling 2008 Organizing Committee.

Navneet Dalal and Bill Triggs. 2005. Histograms of oriented
gradients for human detection. In Proceedings
of the 2005 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (CVPR’05),
Volume 1, CVPR ’05, pages 886–893,
Washington, DC, USA. IEEE Computer Society.

Jia Deng, Alexander C. Berg, Kai Li, and Fei-Fei Li.
2010. What does classifying more than 10,000 image
categories tell us? In ECCV.

Jia Deng, Alexander C. Berg, Sanjeev Satheesh, Hao Su,
Aditya Khosla, and Fei-Fei Li. 2012a. Large scale
visual recognition challenge. In http://www.image-net.org/challenges/LSVRC/2012/index.

Jia Deng, Jonathan Krause, Alexander C. Berg, and
Fei-Fei Li. 2012b. Hedging your bets: Optimizing
accuracy-specificity trade-offs in large scale visual
recognition. In Conference on Computer Vision and
Pattern Recognition.

Michael Denkowski and Alon Lavie. 2011. Meteor 1.3:
Automatic Metric for Reliable Optimization and Eval-
uation of Machine Translation Systems. In Proceed-
ings of the EMNLP 2011 Workshop on Statistical Ma-
chine Translation.

Jesse Dodge, Amit Goyal, Xufeng Han, Alyssa Men-
sch, Margaret Mitchell, Karl Stratos, Kota Yamaguchi,
Yejin Choi, Hal Daumé III, Alexander C. Berg, and
Tamara L. Berg. 2012. Detecting visual text. In Pro-
ceedings of the 2012 Conference of the North Ameri-
can Chapter of the Association for Computational Lin-
guistics: Human Language Technologies, pages 762–
772, Montréal, Canada, June. Association for Compu-
tational Linguistics.

Desmond Elliott and Frank Keller. 2013. Image de-
scription using visual dependency representations. In
EMNLP, pages 1292–1302.

Desmond Elliott and Frank Keller. 2014. Comparing
automatic evaluation measures for image description.
In ACL (2), pages 452–457.

Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi,
Peter Young, Cyrus Rashtchian, Julia Hockenmaier,
and David Forsyth. 2010. Every picture tells a story:
generating sentences for images. In European Confer-
ence on Computer Vision.

Pedro F. Felzenszwalb, Ross B. Girshick, David
McAllester, and Deva Ramanan. 2010. Object detec-
tion with discriminatively trained part based models.
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 32(9):1627–1645.

Yansong Feng and Mirella Lapata. 2013. Automatic
caption generation for news images. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence,
35(4):797–812.

Katja Filippova and Yasemin Altun. 2013. Overcoming
the lack of parallel data in sentence compression. In
EMNLP, pages 1481–1491.

ILOG, Inc. 2006. ILOG CPLEX: High-performance
software for mathematical programming and optimization.
See http://www.ilog.com/products/cplex/.

Michael Jamieson, Afsaneh Fazly, Suzanne Stevenson,
Sven J. Dickinson, and Sven Wachsmuth. 2010. Us-
ing language to learn structured appearance models for
image annotation. IEEE Trans. Pattern Anal. Mach.
Intell., 32(1):148–164.

Dan Klein and Christopher D. Manning. 2003. Accurate
unlexicalized parsing. In Proceedings of the 41st An-
nual Meeting on Association for Computational Lin-
guistics, pages 423–430. Association for Computa-
tional Linguistics.

Kevin Knight and Daniel Marcu. 2000. Statistics-based
summarization - step one: Sentence compression. In
AAAI/IAAI, pages 703–710.

Atsuhiro Kojima, Takeshi Tamura, and Kunio Fukunaga.
2002. Natural language description of human activi-
ties from video images based on concept hierarchy of
actions. IJCV, 50.

Niveda Krishnamoorthy, Girish Malkarnenkar, Ray-
mond J. Mooney, Kate Saenko, and Sergio Guadar-
rama. 2013. Generating natural-language video de-
scriptions using text-mined knowledge. In AAAI.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton.
2012. ImageNet classification with deep convolutional
neural networks. In NIPS.

Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming
Li, Yejin Choi, Alexander C. Berg, and Tamara L.
Berg. 2011. BabyTalk: Understanding and generat-
ing simple image descriptions. In Conference on Com-
puter Vision and Pattern Recognition.

Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg,
Tamara L. Berg, and Yejin Choi. 2012. Collective
generation of natural image descriptions. In Proceed-
ings of the 50th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers),
pages 359–368, Jeju Island, Korea, July. Association
for Computational Linguistics.

Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg,
Tamara L. Berg, and Yejin Choi. 2013. Generaliz-
ing image captions for image-text parallel corpus. In
The 51st Annual Meeting of the Association for Com-
putational Linguistics - Short Papers, Sofia, Bulgaria,
August. Association for Computational Linguistics.

Thomas K. Leung and Jitendra Malik. 1999. Recog-
nizing surfaces using three-dimensional textons. In
ICCV, pages 1010–1017.

Siming Li, Girish Kulkarni, Tamara L. Berg, Alexan-
der C. Berg, and Yejin Choi. 2011. Composing sim-
ple image descriptions using web-scale n-grams. In
Proceedings of the Fifteenth Conference on Compu-
tational Natural Language Learning, pages 220–228,
Portland, Oregon, USA, June. Association for Compu-
tational Linguistics.

David G. Lowe. 2004. Distinctive image features
from scale-invariant keypoints. Int. J. Comput. Vision,
60:91–110, November.

Rebecca Mason and Eugene Charniak. 2013. Annota-
tion of online shopping images without labeled train-
ing examples. In Proceedings of Workshop on Vision
and Language, Atlanta, Georgia, June. Association for
Computational Linguistics.

Rebecca Mason. 2013. Domain-independent caption-
ing of domain-specific images. In Proceedings of the
2013 NAACL HLT Student Research Workshop, pages
69–76, Atlanta, Georgia, June. Association for Com-
putational Linguistics.

Margaret Mitchell, Jesse Dodge, Amit Goyal, Kota Ya-
maguchi, Karl Stratos, Xufeng Han, Alyssa Mensch,
Alexander C. Berg, Tamara L. Berg, and Hal Daumé
III. 2012. Midge: Generating image descriptions from
computer vision detections. In EACL, pages 747–756.

Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg.
2011. Im2Text: Describing images using 1 million
captioned photographs. In Neural Information Pro-
cessing Systems (NIPS).

Kishore Papineni, Salim Roukos, Todd Ward, and
Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation
of machine translation. In ACL.

Florent Perronnin, Zeynep Akata, Zaid Harchaoui, and
Cordelia Schmid. 2012. Towards good practice in
large-scale learning for image classification. In CVPR.

Dan Roth and Wen-tau Yih. 2004. A linear programming
formulation for global inference in natural language
tasks. In Proc. of the Annual Conference on Computa-
tional Natural Language Learning (CoNLL).

Rion Snow, Brendan O’Connor, Daniel Jurafsky, and
Andrew Y. Ng. 2008. Cheap and fast—but is it
good?: Evaluating non-expert annotations for natural
language tasks. In Proceedings of the Conference on
Empirical Methods in Natural Language Processing,
EMNLP ’08, pages 254–263, Stroudsburg, PA, USA.
Association for Computational Linguistics.

Richard Socher, Andrej Karpathy, Quoc V. Le, Christo-
pher D. Manning, and Andrew Y. Ng. 2014.
Grounded compositional semantics for finding and de-
scribing images with sentences. In Transactions of the
Association for Computational Linguistics, pages
207–218, April.

Jenine Turner and Eugene Charniak. 2005. Supervised
and unsupervised learning for sentence compression.
In Proceedings of the 43rd Annual Meeting of the
Association for Computational Linguistics (ACL’05),
pages 290–297, Ann Arbor, Michigan, June. Associa-
tion for Computational Linguistics.

Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude
Oliva, and Antonio Torralba. 2010. SUN database:
Large-scale scene recognition from abbey to zoo. In
CVPR.

Yezhou Yang, Ching Teo, Hal Daumé III, and Yiannis
Aloimonos. 2011. Corpus-guided sentence generation
of natural images. In Proceedings of the 2011
Conference on Empirical Methods in Natural Language
Processing, pages 444–454, Edinburgh, Scotland,
UK, July. Association for Computational Linguistics.

Benjamin Z. Yao, Xiong Yang, Liang Lin, Mun Wai Lee,
and Song-Chun Zhu. 2010. I2T: Image parsing to text
description. Proc. IEEE, 98(8).

Haonan Yu and Jeffrey Mark Siskind. 2013. Grounded
language learning from video described with sen-
tences. In Proceedings of the 51st Annual Meeting
of the Association for Computational Linguistics (Vol-
ume 1: Long Papers), pages 53–63, Sofia, Bulgaria,
August. Association for Computational Linguistics.
