Transactions of the Association for Computational Linguistics, 1 (2013) 193–206. Action Editor: Jason Eisner. Submitted 10/2012; Revised 3/2013; Published 5/2013. © 2013 Association for Computational Linguistics.

Jointly Learning to Parse and Perceive: Connecting Natural Language to the Physical World

Jayant Krishnamurthy, Computer Science Department, Carnegie Mellon University, jayantk@cs.cmu.edu
Thomas Kollar, Computer Science Department, Carnegie Mellon University, tkollar@andrew.cmu.edu

Abstract

This paper introduces Logical Semantics with Perception (LSP), a model for grounded language acquisition that learns to map natural language statements to their referents in a physical environment. For example, given an image, LSP can map the statement “blue mug on the table” to the set of image segments showing blue mugs on tables. LSP learns physical representations for both categorical (“blue,” “mug”) and relational (“on”) language, and also learns to compose these representations to produce the referents of entire statements. We further introduce a weakly supervised training procedure that estimates LSP’s parameters using annotated referents for entire statements, without annotated referents for individual words or the parse structure of the statement. We perform experiments on two applications: scene understanding and geographical question answering. We find that LSP outperforms existing, less expressive models that cannot represent relational language. We further find that weakly supervised training is competitive with fully supervised training while requiring significantly less annotation effort.

1 Introduction

Learning the mapping from natural language to physical environments is a central problem for natural language semantics. Understanding this mapping is necessary to enable natural language interactions with robots and other embodied systems. For example, for an autonomous robot to understand the sentence “The blue mug is on the table,” it must be able to identify (1) the objects in its environment corresponding to “blue mug” and “table,” and (2) the objects which participate in the spatial relation denoted by “on.” If the robot can successfully identify these objects, it understands the meaning of the sentence.

The problem of learning to map from natural language expressions to their referents in an environment is known as grounded language acquisition. In embodied settings, environments consist of raw sensor data – for example, an environment could be an image collected from a robot’s camera. In such applications, grounded language acquisition has two subproblems: parsing, learning the compositional structure of natural language; and perception, learning the environmental referents of individual words. Acquiring both kinds of knowledge is necessary to understand novel language in novel environments. Unfortunately, perception is often ignored in work on language acquisition. Other variants of grounded language acquisition eliminate the need for perception by assuming access to a logical representation of the environment (Zettlemoyer and Collins, 2005; Wong and Mooney, 2006; Matuszek et al., 2010; Chen and Mooney, 2011; Liang et al., 2011). The existing work which has jointly addressed both parsing and perception has significant drawbacks, including: (1) fully supervised models requiring large amounts of manual annotation and (2) limited semantic representations (Kollar et al., 2010; Tellex et al., 2011; Matuszek et al., 2012).
This paper introduces Logical Semantics with Perception (LSP), a model for grounded language acquisition that jointly learns to semantically parse language and perceive the world. LSP models a mapping from natural language queries to sets of objects in a real-world environment. The input to LSP is an environment containing objects, such as a segmented image (Figure 1a), and a natural language query, such as “the things to the right of the blue mug.” Given these inputs, LSP produces (1) a logical knowledge base describing objects and relationships in the environment and (2) a semantic parse of the query capturing its compositional structure. LSP combines these two outputs to produce the query’s grounding, which is the set of object referents of the query’s noun phrases, and its denotation, which is the query’s answer (Figure 1b).¹ Weakly supervised training estimates parameters for LSP using queries annotated with their denotations in an environment (Figure 1c).

Footnote 1: We treat declarative sentences as if they were queries about their subject, e.g., the denotation of “the mug is on the table” is the set of mugs on tables. Typically, the denotation of a sentence is either true or false; our treatment is strictly more general, as a sentence’s denotation is nonempty if and only if the sentence is true.

Figure 1: LSP applied to scene understanding. Given an environment containing a set of objects (left), and a natural language query, LSP produces a semantic parse, logical knowledge base, grounding and denotation (middle), using only language/denotation pairs (right) for training. (a) An environment containing 4 objects (image segments). (b) LSP predicting the environmental referents of the query “things to the right of the blue mug”: grounding {(2, 1), (3, 1)}, denotation {2, 3}. (c) Training examples for weakly supervised training, e.g., “The mugs” → {1, 3}; “The objects on the table” → {1, 2, 3}; “There is an LCD monitor” → {2}; “Is the blue mug right of the monitor?” → {}; “The monitor is behind the blue cup.” → {2}.

This work has two contributions. The first contribution is LSP, which is more expressive than previous models, representing both one-argument categories and two-argument relations over sets of objects in the environment. The second contribution is a weakly supervised training procedure that estimates LSP’s parameters without annotated semantic parses, noun phrase/object mappings, or manually-constructed knowledge bases.

We perform experiments on two different applications. The first application is scene understanding, where LSP grounds descriptions of images in image segments. The second application is geographical question answering, where LSP learns to answer questions about locations, represented as polygons on a map. In geographical question answering, LSP correctly answers 34% more questions than the most comparable state-of-the-art model (Matuszek et al., 2012). In scene understanding, accuracy similarly improves by 16%. Furthermore, weakly supervised training achieves an accuracy within 6% of that achieved by fully supervised training, while requiring significantly less annotation effort.

2 Prior Work

Logical Semantics with Perception (LSP) is related to work from planning, natural language processing, computer vision and robotics. Much of the related work focuses on interpreting natural language using a fixed formal representation.
Some work constructs integrated systems which execute plans in response to natural language commands (Winograd, 1970; Hsiao et al., 2003; Roy et al., 2003; Skubic et al., 2004; MacMahon et al., 2006; Levit and Roy, 2007; Kruijff et al., 2007). These systems parse natural language to a formal representation which can be executed using a set of fixed control programs. Similarly, work on semantic parsing learns to map natural language to a given formal representation. Semantic parsers can be trained using sentences annotated with their formal representation (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Kate and Mooney, 2006; Kwiatkowski et al., 2010) or various less restrictive annotations (Clarke et al., 2010; Liang et al., 2011; Krishnamurthy and Mitchell, 2012). Finally, work on grounded language acquisition leverages semantic parsing to map from natural language to a formal representation of an environment (Kate and Mooney, 2007; Chen and Mooney, 2008; Shimizu and Haas, 2009; Matuszek et al., 2010; Dzifcak et al., 2009; Cantrell et al., 2010; Chen and Mooney, 2011). All of this work assumes that the formal environment representation is given, while LSP learns to produce this formal representation from raw sensor input.

Most similar to LSP is work on simultaneously understanding natural language and perceiving the environment. This problem has been addressed in the context of robot direction following (Kollar et al., 2010; Tellex et al., 2011) and visual attribute learning (Matuszek et al., 2012). However, this work is less semantically expressive than LSP and trained using more supervision. The G3 model (Kollar et al., 2010; Tellex et al., 2011) assumes a one-to-one mapping from noun phrases to entities and is trained using full supervision, while LSP allows one-to-many mappings from noun phrases to entities and can be trained using minimal annotation. Matuszek et al. (2012) learns only one-argument categories (“attributes”) and requires a fully supervised initial training phase. In contrast, LSP models two-argument relations and allows for weakly supervised training throughout.

Figure 2: Overview of Logical Semantics with Perception (LSP). (a) Perception f_per produces a logical knowledge base Γ from the environment d using an independent classifier for each category and relation (e.g., mug(1), mug(3), blue(1), table(4), on-rel(1, 4), on-rel(3, 4), ...). (b) Semantic parsing f_prs maps language z (“blue mug on table”) to a logical form ℓ = λx.∃y.blue(x) ∧ mug(x) ∧ on-rel(x, y) ∧ table(y). (c) Evaluation f_eval evaluates a logical form ℓ on a logical knowledge base Γ to produce a grounding g = {(1, 4)} and denotation γ = {1}.

3 Logical Semantics with Perception

Logical Semantics with Perception (LSP) is a model for grounded language acquisition. LSP accepts as input a natural language statement and an environment and outputs the objects in the environment denoted by the statement. The LSP model has three components: perception, parsing and evaluation (see Figure 2). The perception component constructs logical knowledge bases from low-level feature-based representations of environments. The parsing component semantically parses natural language into lambda calculus queries against the constructed knowledge base.
Finally, the evaluation component deterministically executes this query against the knowledge base to produce LSP’s output.

The output of LSP can be either a denotation or a grounding. A denotation is the set of entity referents for the phrase as a whole, while a grounding is the set of entity referents for each component of the phrase. The distinction between these two outputs is shown in Figure 1b. In this example, the denotation is the set of “things to the right of the blue mug,” which does not include the blue mug itself. On the other hand, the grounding includes both the referents of “things” and “blue mug.” Only denotations are used during training, so we ignore groundings in the following model description. However, groundings are used in our evaluation, as they are a more complete description of the model’s understanding.

Formally, LSP is a linear model f that predicts a denotation γ given a natural language statement z in an environment d. As shown in Figure 3, the structure of LSP factors into perception (f_per), semantic parsing (f_prs) and evaluation (f_eval) components using several latent variables:

f(γ, Γ, ℓ, t, z, d; θ) = f_per(Γ, d; θ_per) + f_prs(ℓ, t, z; θ_prs) + f_eval(γ, Γ, ℓ)

Figure 3: Factor graph of LSP. The environment d and language z are given as input, from which the model predicts a logical knowledge base Γ, logical form ℓ, syntactic tree t and denotation γ.

LSP assumes access to a set of predicates that take either one argument, called categories (c ∈ C), or two arguments, called relations (r ∈ R).² These predicates are the interface between LSP’s perception and parsing components. The perception function f_per takes an environment d and produces a logical knowledge base Γ that assigns truth values to instances of these predicates using parameters θ_per. This function uses an independent classifier to predict the instances of each predicate. The semantic parser f_prs takes a natural language statement z and produces a logical form ℓ and syntactic parse t using parameters θ_prs. The logical form ℓ is a database query expressed in lambda calculus notation, constructed by logically combining the given predicates. Finally, the evaluation function f_eval deterministically evaluates the logical form ℓ on the knowledge base Γ to produce a denotation γ. These components are illustrated in Figure 2.

Footnote 2: The set of predicates is derived from our training data. See Section 5.3.

The following sections describe the perception function (Section 3.1), semantic parser (Section 3.2), evaluation function (Section 3.3), and inference (Section 3.4) in more detail.
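To make the factorization concrete, the following Python sketch (our illustration, not the authors’ implementation; all names are invented) shows how the three terms combine, with f_eval acting as a hard constraint that forces the denotation to be the result of evaluating ℓ on Γ.

NEG_INF = float("-inf")

def lsp_score(denotation, f_per_score, f_prs_score, evaluated_denotation):
    # f = f_per + f_prs + f_eval; f_eval contributes 0 when evaluating the
    # logical form on the knowledge base yields exactly the proposed
    # denotation, and -infinity otherwise.
    f_eval = 0.0 if evaluated_denotation == denotation else NEG_INF
    return f_per_score + f_prs_score + f_eval

# Example: a candidate knowledge base and parse with scores 2.0 and 0.5 whose
# logical form evaluates to {1}.  Proposing denotation {1} keeps the score;
# proposing {2} is vetoed by f_eval.
print(lsp_score(frozenset({1}), 2.0, 0.5, frozenset({1})))  # 2.5
print(lsp_score(frozenset({2}), 2.0, 0.5, frozenset({1})))  # -inf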
3.1 Perception Function

The perception function f_per constructs a logical knowledge base Γ given an environment d. The perception function assumes that an environment contains a collection of entities e ∈ E_d. The knowledge base produced by perception is a collection of ground predicate instances using these entities. For example, in Figure 2a, the entire image is the environment, and each image segment is an entity. The logical knowledge base Γ contains the shown predicate instances, where the categories include blue, mug and table, and the relations include on-rel.

The perception function scores logical knowledge bases using a set of per-predicate binary classifiers. These classifiers independently assign a score to whether each entity (entity pair) is an element of each category (relation). Let γ^c ∈ Γ denote the set of entities which are elements of category c; similarly, let γ^r ∈ Γ denote the set of entity pairs which are elements of the relation r. Given these sets, the score of a logical knowledge base Γ factors into per-relation and per-category scores h:

f_per(Γ, d; θ_per) = Σ_{c ∈ C} h(γ^c, d; θ^c_per) + Σ_{r ∈ R} h(γ^r, d; θ^r_per)

The per-predicate scores are in turn given by a sum of per-element classification scores:

h(γ^c, d; θ^c_per) = Σ_{e ∈ E_d} γ^c(e) (θ^c_per)^T φ_cat(e)

h(γ^r, d; θ^r_per) = Σ_{(e1, e2) ∈ E_d × E_d} γ^r(e1, e2) (θ^r_per)^T φ_rel(e1, e2)

Each term in the above sums represents a single binary classification, determining the score for a single entity (entity pair) belonging to a particular category (relation). We treat γ^c and γ^r as indicator functions for the sets they denote, i.e., γ^c(e) = 1 for entities e in the set, and 0 otherwise. Similarly, γ^r(e1, e2) = 1 for entity pairs (e1, e2) in the set, and 0 otherwise. The features of these classifiers are given by φ_cat and φ_rel, which are feature functions that map entities and entity pairs to feature vectors. The parameters of these classifiers are given by θ^c_per and θ^r_per. The perception parameters θ_per contain one such set of parameters for every category and relation, i.e., θ_per = {θ^c_per : c ∈ C} ∪ {θ^r_per : r ∈ R}.
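As an illustration of this scoring scheme (a minimal sketch with invented names, not the paper’s code): because each γ^c(e) and γ^r(e1, e2) enters the score independently, the highest-scoring knowledge base simply contains every predicate instance whose classifier score is positive.

def dot(weights, features):
    return sum(weights.get(k, 0.0) * v for k, v in features.items())

def best_knowledge_base(entities, phi_cat, phi_rel, theta_cat, theta_rel):
    # theta_cat / theta_rel hold one weight dict per category / relation.
    kb = {"categories": {}, "relations": {}}
    for c, w in theta_cat.items():
        kb["categories"][c] = {e for e in entities if dot(w, phi_cat(e)) > 0}
    for r, w in theta_rel.items():
        kb["relations"][r] = {(e1, e2) for e1 in entities for e2 in entities
                              if dot(w, phi_rel(e1, e2)) > 0}
    return kb

This per-instance independence only holds at prediction time; during training (Section 4), weights placed on the denotation couple these decisions, which is why the argmax there requires the ILP of Section 4.2.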
3.2 Semantic Parser

The goal of semantic parsing is to identify which portions of the input natural language denote entities and relationships between entities in the environment. Semantic parsing accomplishes this goal by mapping from natural language to a logical form that explicitly describes the language’s entity referents using one- and two-argument predicates. The logical form is combined with instances of these predicates to produce the statement’s denotation.

LSP’s semantic parser is defined using Combinatory Categorial Grammar (CCG) (Steedman, 1996). The grammar of the parser is given by a lexicon Λ which maps words to syntactic categories and logical forms. For example, “mug” may have the syntactic category N for noun, and the logical form λx.mug(x), denoting the set of all entities x such that mug is true. During parsing, the logical forms for adjacent phrases are combined to produce the logical form for the complete statement.

Figure 4: Example parse of “the mugs are right of the monitor.” The first row of the derivation retrieves lexical categories from the lexicon (e.g., the := N/N : λf.f; mugs := N : λx.mug(x); are := (S\N)/N : λf.λg.λx.g(x) ∧ f(x); right := N/PP : λf.λx.∃y.right-rel(x, y) ∧ f(y); of := PP/N : λf.f; monitor := N : λx.monitor(x)), while the remaining rows represent applications of CCG combinators, yielding the final logical form S : λx.∃y.mug(x) ∧ right-rel(x, y) ∧ monitor(y).

Figure 4 illustrates how CCG parsing produces a syntactic tree t and a logical form ℓ. The top row of the parse represents retrieving a lexicon entry for each word. Each successive row combines a pair of entries by applying a logical form to an adjacent argument. A given sentence may have multiple parses like the one shown, using a different set of lexicon entries or a different order of function applications. The semantic parser scores each such parse, learning to distinguish correct and incorrect parses.

The semantic parser in LSP is a linear model over CCG parses (ℓ, t) given language z:

f_prs(ℓ, t, z; θ_prs) = θ_prs^T φ_prs(ℓ, t, z)

Here, φ_prs(ℓ, t, z) represents a feature function mapping CCG parses to vectors of feature values. φ_prs factorizes according to the tree structure of the CCG parse; it contains features for local parsing operations which are summed to produce the feature values for a tree. If the parse tree is a terminal, then:

φ_prs(ℓ, t, z) = 1(lexicon entry)

The notation 1(x) denotes a vector with a single one entry whose position is determined by x. The terminal features are indicator features for each lexicon entry, as shown in the top row of Figure 4. These features allow the model to learn the correct syntactic and semantic function of each word. If the parse tree is a nonterminal, then:

φ_prs(ℓ, t, z) = φ_prs(left(ℓ, t, z)) + φ_prs(right(ℓ, t, z)) + 1(combinator)

These nonterminal features are defined over combinator rules in the parse tree, as in the remaining rows of Figure 4. These features allow the model to learn which adjacent parse trees are likely to combine. We refer the reader to Zettlemoyer and Collins (2005) for more information about CCG semantic parsing.

3.3 Evaluation Function

The evaluation function f_eval deterministically scores denotations given a logical form ℓ and a logical knowledge base Γ. Intuitively, the evaluation function simply evaluates the query ℓ on the database Γ to produce a denotation. The evaluation function then assigns score 0 to this denotation, and score −∞ to all other denotations.

We describe f_eval by giving a recurrence for computing the denotation γ of a logical form ℓ on a logical knowledge base Γ. This evaluation takes the form of a tree, as in Figure 2c. The base cases are:

• If ℓ = λx.c(x), then γ = γ^c.
• If ℓ = λx.λy.r(x, y), then γ = γ^r.

The denotations for more complex logical forms are computed recursively by decomposing ℓ according to its logical structure. Our logical forms contain only conjunctions and existential quantifiers; the corresponding recursive computations are:

• If ℓ = λx.ℓ_1(x) ∧ ℓ_2(x), then γ(e) = 1 iff γ_1(e) = 1 ∧ γ_2(e) = 1.
• If ℓ = λx.∃y.ℓ_1(x, y), then γ(e_1) = 1 iff ∃e_2.γ_1(e_1, e_2) = 1.

Note that a similar recurrence can be used to compute groundings: simply retain the satisfying assignments to existentially-quantified variables.
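The recurrence is easy to run directly on the sets γ^c and γ^r. The sketch below is our own illustration: the logical-form encoding and names are invented, and the existential case folds r(x, y) ∧ ℓ_1(y) into a single step. It reproduces the example of Figure 2.

def denotation(lf, kb, entities):
    # Logical forms are nested tuples:
    #   ("cat", "mug")          -> lambda x. mug(x)
    #   ("and", lf1, lf2)       -> lambda x. lf1(x) and lf2(x)
    #   ("exists", "rel", lf1)  -> lambda x. exists y. rel(x, y) and lf1(y)
    kind = lf[0]
    if kind == "cat":
        return kb["categories"][lf[1]]
    if kind == "and":
        return denotation(lf[1], kb, entities) & denotation(lf[2], kb, entities)
    if kind == "exists":
        pairs = kb["relations"][lf[1]]
        inner = denotation(lf[2], kb, entities)
        return {x for x in entities if any((x, y) in pairs for y in inner)}
    raise ValueError(kind)

kb = {"categories": {"blue": {1}, "mug": {1, 3}, "table": {4}},
      "relations": {"on-rel": {(1, 4), (3, 4)}}}
# "blue mug on table": lambda x. exists y. blue(x) ^ mug(x) ^ on-rel(x,y) ^ table(y)
lf = ("and", ("and", ("cat", "blue"), ("cat", "mug")),
      ("exists", "on-rel", ("cat", "table")))
print(denotation(lf, kb, {1, 2, 3, 4}))  # {1}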
3.4 Inference

The basic inference problem in LSP is to predict a denotation γ given language z and an environment d. This inference problem is straightforward due to the deterministic structure of f_eval. The highest-scoring γ can be found by independently maximizing f_prs and f_per to find the highest-scoring logical form ℓ and logical knowledge base Γ. Deterministically evaluating the recurrence for f_eval using these values yields the highest-scoring denotation.

Another inference problem occurs during training: identify the highest-scoring logical form and knowledge base which produce a particular denotation. Our approximate inference algorithm for this problem is described in Section 4.2.

4 Weakly Supervised Parameter Estimation

This section describes a weakly supervised training procedure for LSP, which estimates parameters using a corpus of sentences with annotated denotations. The algorithm jointly trains both the parsing and the perception components of LSP to best predict the denotations of the observed training sentences. Our approach trains LSP as a maximum margin Markov network using the stochastic subgradient method. The main difficulty is computing the subgradient, which requires computing values for the model’s hidden variables, i.e., the logical knowledge base Γ and semantic parse ℓ that are responsible for the model’s prediction.

4.1 Stochastic Subgradient Method

The training procedure trains LSP as a maximum margin Markov network (Taskar et al., 2004), a structured analog of a support vector machine. The training data for our weakly supervised algorithm is a collection {(z_i, γ_i, d_i)}_{i=1}^n, consisting of language z_i paired with its denotation γ_i in environment d_i. Given this data, the parameters θ = [θ_prs, θ_per] are estimated by minimizing the following objective function:

O(θ) = (λ/2) ||θ||² + (1/n) Σ_{i=1}^n ζ_i     (1)

where λ is a regularization parameter that controls the trade-off between model complexity and slack penalties. The slack variable ζ_i represents a margin violation penalty for the ith training example, defined as:

ζ_i = max_{γ, Γ, ℓ, t} [ f(γ, Γ, ℓ, t, z_i, d_i; θ) + cost(γ, γ_i) ] − max_{Γ, ℓ, t} f(γ_i, Γ, ℓ, t, z_i, d_i; θ)

The above expression is the structured counterpart of the hinge loss, where cost(γ, γ_i) is the margin by which γ_i’s score must exceed γ’s score. We let cost(γ, γ_i) be the Hamming cost; it adds a cost of 1 for each entity e such that γ_i(e) ≠ γ(e).

We optimize this objective using the stochastic subgradient method (Ratliff et al., 2006). To compute the subgradient g^i, first compute the highest-scoring assignments to the model’s hidden variables:

γ̂, Γ̂, ℓ̂, t̂ ← argmax_{γ, Γ, ℓ, t} f(γ, Γ, ℓ, t, z_i, d_i; θ_j) + cost(γ, γ_i)     (2)

Γ*, ℓ*, t* ← argmax_{Γ, ℓ, t} f(γ_i, Γ, ℓ, t, z_i, d_i; θ_j)     (3)

The first set of values (e.g., ℓ̂) are the best explanation for the denotation γ̂ which most violates the margin constraint. The second set of values (e.g., ℓ*) are the best explanation for the true denotation γ_i. The subgradient update increases the weights of features that explain the true denotation, while decreasing the weights of features that explain the denotation violating the margin. The subgradient factors into parsing and perception components: g^i = [g^i_prs, g^i_per]. The parsing subgradient is:

g^i_prs = φ_prs(ℓ̂, t̂, z_i) − φ_prs(ℓ*, t*, z_i)

The subgradient of the perception parameters θ_per factors into subgradients of the category and relation classifier parameters. Recall that θ_per = {θ^c_per : c ∈ C} ∪ {θ^r_per : r ∈ R}. Let γ̂^c ∈ Γ̂ be the best margin-violating set of entities for c, and γ^{c*} ∈ Γ* be the best truth-explaining set of entities. Similarly define γ̂^r and γ^{r*}. The subgradients of the category and relation classifier parameters are:

g^{i,c}_per = Σ_{e ∈ E_{d_i}} (γ̂^c(e) − γ^{c*}(e)) φ_cat(e)

g^{i,r}_per = Σ_{(e1, e2) ∈ E_{d_i} × E_{d_i}} (γ̂^r(e1, e2) − γ^{r*}(e1, e2)) φ_rel(e1, e2)
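As a concrete, simplified illustration of the update (names and data layout are ours), the sketch below applies one subgradient step to a sparse parameter vector represented as a dict. The two feature vectors are assumed to come from the margin-violating and truth-explaining explanations of Equations 2 and 3, found in practice with the beam-search and ILP procedure of Section 4.2; the step also includes the λθ term contributed by the regularizer in Equation 1.

def subgradient_step(theta, phi_violating, phi_truth, lam, step_size):
    # theta <- theta - step_size * (lam * theta + phi_violating - phi_truth):
    # features explaining the true denotation gain weight, features explaining
    # the margin-violating denotation lose weight.
    for k in set(theta) | set(phi_violating) | set(phi_truth):
        g = (lam * theta.get(k, 0.0)
             + phi_violating.get(k, 0.0) - phi_truth.get(k, 0.0))
        theta[k] = theta.get(k, 0.0) - step_size * g
    return theta

theta = {}
subgradient_step(theta, {"mug:=N:right(x)": 1.0}, {"mug:=N:mug(x)": 1.0},
                 lam=0.03, step_size=0.1)
print(theta)  # e.g. {'mug:=N:right(x)': -0.1, 'mug:=N:mug(x)': 0.1}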
Given a logical form ` output by beam search, the second step of inference computes the best values for the logical knowledge base Γ and denotation γ: max Γ,γ fper(Γ,d; θper) + feval(γ,`, Γ) + ψ(γ) (4) Here, ψ(γ) = ∑ e∈Ed ψeγ(e) represents a set of weights on the entities in the predicted denotation γ. For Equation 2, ψ represents cost(γ,γi). For Equa- tion 3, ψ is a hard constraint encoding γ = γi (i.e., ψ(γ) = −∞ when γ 6= γi and 0 otherwise). We encode the maximization in Equation 4 as an ILP. For each category c and relation r, we create bi- nary variables γc(e1) and γr(e1,e2) for each entity in the environment, e1,e2 ∈ Ed. We similarly create binary variables γ(e) for the denotation γ. Using the fact that fper is a linear function of these variables, we write the ILP objective as: fper(Γ,d; θper) + ψ(γ) = ∑ e1∈Ed ∑ c∈C wce1γ c(e1) + ∑ e1,e2∈Ed ∑ r∈R wre1,e2γ r(e1,e2) + ∑ e1∈Ed ψe1γ(e1) where the weights wce1 and w r e1,e2 determine how likely it is that each entity or entity pair belongs to the predicates c and r: wce1 = (θ c per) Tφcat(e1) wre1,e2 = (θ r per) Tφrel(e1,e2) The ILP also includes constraints and additional auxiliary variables that represent feval. These con- straints couple the denotation γ and the logical knowledge base Γ such that γ is the result of evaluat- ing ` on Γ. ` is recursively decomposed as in Section 3.3, and each intermediate set of entities in the recur- rence is given its own set of |Ed| (or |Ed|2) variables. These variables are then logically constrained to en- force `’s structure. 5 Evaluation Our evaluation performs three major comparisons. First, we examine the performance impact of weakly supervised training by comparing weakly and fully supervised variants of LSP. Second, we examine the performance impact of modelling relations by com- paring against a category-only baseline, which is an ablated version of LSP similar to the model of Ma- tuszek et al. (2012). Finally, we examine the causes of errors by performing an error analysis of LSP’s semantic parser and perceptual classifiers. Before describing our results, we first describe some necessary set-up for the experiments. These sections describe the data sets, features, construc- tion of the CCG lexicon, and details of the models. Our data sets and additional evaluation resources are available online from http://rtw.ml.cmu.edu/ tacl2013_lsp/. 5.1 Data Sets We evaluate LSP on two applications: scene un- derstanding (SCENE) and geographical question an- swering (GEOQA). These data sets are collections {(zi,γi,di,`i, Γi)}ni=1, consisting of a number of natural language statements zi with annotated deno- tations γi in environments di. For fully supervised training, each statement is annotated with a gold standard logical form `i, and each environment with a gold standard logical knowledge base Γi. Statistics of these data sets are shown in Table 1, and example environments and statements are shown in Figure 5. The SCENE data set consists of segmented images of indoor environments containing a number of or- dinary objects such as mugs and monitors. Each image is an environment, and each image segment (bounding box) is an entity. We collected natural language descriptions of each scene from members of our lab and Amazon Mechanical Turk, asking subjects to describe the objects in each scene. The authors then manually annotated the collected state- ments with their denotations and logical forms. 
In this data set, each image contains the same set of objects; note that this does not trivialize the task, as the model only observes visual features of the objects, which are not consistent across environments.

The GEOQA data set consists of several maps containing entities such as cities, states, national parks, lakes and oceans. Each map is an environment, and its component entities are given by polygons of latitude/longitude coordinates marking their boundaries.³ Furthermore, each entity has one known name (e.g., “Greenville”). In this data set, distinct entities occur on average in 1.25 environments; repeated entities are mostly large locations, such as states and oceans. The language for this data set was contributed by members of our research group, who were instructed to provide a mixture of simple and complex geographical questions. The authors then manually annotated each question with a denotation (its answer) and a logical form.

Footnote 3: Polygons were extracted from OpenStreetMap, http://www.openstreetmap.org/.

Table 1: Statistics of the two data sets used in our evaluation, and of the generated lexicons.

Data Set Statistics                  SCENE   GEOQA
# of environments                      15      10
Mean entities / environment d          4.1     6.9
Mean # of entities in denotation γ     1.5     1.2
# of statements                        284     263
Mean words / statement                 6.3     6.3
Mean predicates / log. form            2.6     2.8
# of preds. in annotated worlds        46      38

Lexicon Statistics                   SCENE   GEOQA
# of words in lexicon                  169     288
# of lexicon entries                   712     876
Mean parses / statement                15.0    8.9

5.2 Features

The features of both applications are intended to capture properties of entities and relations between them. As such, both applications share a set of physical features which are functions of the bounding polygons of each entity. Example category features (φ_cat) are the area and perimeter of the entity, and an example relation feature (φ_rel) is the distance between entity centroids.

The SCENE data set additionally includes visual appearance features in φ_cat to capture visual properties of objects. These features include a Histogram of Oriented Gradients (HOG) (Dalal and Triggs, 2005) and an RGB color histogram.

The GEOQA data set additionally includes distributional semantic features to distinguish between different types of entities (e.g., states vs. lakes) and to capture non-spatial relations (e.g., capitals of states). These features are derived from phrase co-occurrences with entity names in the Clueweb09 web corpus.⁴ The category features φ_cat include indicators for the 20 contexts which most frequently occur with an entity’s name (e.g., “X is a city”). Similarly, the relation features φ_rel include indicators for the 20 most frequent contexts between two entities’ names (e.g., “X in eastern Y”).

Footnote 4: http://www.lemurproject.org/clueweb09/
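To give a flavor of the shared physical features (an illustrative sketch only; the paper’s exact feature set is richer and also includes the HOG, color-histogram and distributional features described above), φ_cat and φ_rel might be computed from bounding polygons roughly as follows. A polygon is a list of (x, y) vertices; the centroid here is a simple vertex average rather than the true area centroid.

def polygon_area(poly):
    # Shoelace formula.
    n = len(poly)
    s = sum(poly[i][0] * poly[(i + 1) % n][1] - poly[(i + 1) % n][0] * poly[i][1]
            for i in range(n))
    return abs(s) / 2.0

def perimeter(poly):
    n = len(poly)
    return sum(((poly[i][0] - poly[(i + 1) % n][0]) ** 2 +
                (poly[i][1] - poly[(i + 1) % n][1]) ** 2) ** 0.5 for i in range(n))

def centroid(poly):
    xs = [p[0] for p in poly]
    ys = [p[1] for p in poly]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def phi_cat(poly):
    return {"area": polygon_area(poly), "perimeter": perimeter(poly), "bias": 1.0}

def phi_rel(poly1, poly2):
    (x1, y1), (x2, y2) = centroid(poly1), centroid(poly2)
    return {"centroid_dist": ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5,
            "dx": x2 - x1, "dy": y2 - y1, "bias": 1.0}

print(phi_cat([(0, 0), (2, 0), (2, 1), (0, 1)]))  # {'area': 2.0, 'perimeter': 6.0, 'bias': 1.0}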
Nouns and adjectives pro- duce lexicon entries containing either categories or relations (as shown above for “right”). Mapping these parts-of-speech to relations is necessary for phrases like “to the right of,” where the noun “right” denotes a relation. Verbs and prepositions pro- duce lexicon entries containing relations. Additional heuristics generate semantically-empty lexicon en- tries, allowing words like determiners to have no physical interpretation. Finally, there are special heuristics for forms of “to be” and, in GEOQA, to handle known entity names. The complete set of lexicon induction heuristics is available online. The automatically generated lexicon makes it dif- ficult to compare semantic parses across models, since the correctness of a semantic parse depends on the learned perceptual classifiers. To facilitate such a comparison (Section 5.6), we filtered out lexicon entries containing predicates which were not used in any of the annotated logical forms. Statistics of the filtered lexicons are shown in Table 1. 4http://www.lemurproject.org/clueweb09/ 5We used the Stanford POS tagger (Toutanova et al., 2003). 200 Environment d Language z and predicted logical form ` Predicted grounding True grounding monitor to the left of the mugs {(2,1), (2,3)} {(2,1), (2,3)} λx.∃y.monitor(x) ∧ left-rel(x,y) ∧ mug(y) mug to the left of the other mug {(3,1)} {(3,1)} λx.∃y.mug(x) ∧ left-rel(x,y) ∧ mug(y) objects on the table {(1,4), (2,4) {(1,4), (2,4), λx.∃y.object(x) ∧ on-rel(x,y) ∧ table(y) (3,4)} (3,4)} two blue cups are placed near to the computer screen {(1)} {(1,2), (3,2)} λx.blue(x) ∧ cup(x) ∧ comp.(x) ∧ screen(x) What cities are in North Carolina? {(CH,NC), (GB,NC) {(CH,NC), (GB,NC) λx.∃y.city(x) ∧ in-rel(x,y) ∧ y = NC (RA,NC)} (RA,NC)} What city is east of Greensboro in North Carolina? {(RA,GB,NC), {(RA,GB,NC)} λx.∃y,z.city(x) ∧ east-rel(x,y) (MB,GB,NC)} ∧ y = GB ∧ in-rel(y,z) ∧ z = NC What cities are on the ocean? {(CH,AO), (GB,AO), {(MB,AO)} λx.∃y.city(x) ∧ on-rel(x,y) ∧ ocean(y) (MB,AO), (RA,AO)} Figure 5: Example environments, statements, and model predictions from the SCENE and GEOQA data sets. 5.4 Models and Training The evaluation compares three models. The first model is LSP-W, which is LSP trained using the weakly supervised algorithm described in Section 4. The second model, LSP-CAT, replicates the model of Matuszek et al. (2012) by restricting LSP to use category predicates. LSP-CAT is constructed by removing all relation predicates in lexicon entries, mapping entries like λf.λg.λx.∃y.r(x,y) ∧ g(x) ∧ f(y) to λf.λg.λx.∃y.g(x) ∧ f(y). This model is also trained using our weakly supervised algorithm. The third model, LSP-F, is LSP trained with full supervision, using the manually annotated semantic parses and logical knowledge bases in our data sets. Given these annotations, training LSP amounts to independently training a semantic parser (using sen- tences with annotated logical forms, {(zi,`i)}) and a set of perceptual classifiers (using environments with annotated logical knowledge bases, {(di, Γi)}). This model measures the performance achievable with LSP given significantly more supervision. All three variants of LSP were trained using the same hyperparameters. For SCENE, we computed subgradients in 5 example minibatches and per- formed 100 passes over the data using λ = 0.03. For GEOQA, we computed subgradients in 8 example minibatches, again performing 100 passes over the data using λ = 0.02. 
Figure 5: Example environments, statements, and model predictions from the SCENE and GEOQA data sets. For each statement, the predicted logical form ℓ is shown with its predicted and true groundings:
1. “monitor to the left of the mugs”, λx.∃y.monitor(x) ∧ left-rel(x, y) ∧ mug(y); predicted {(2, 1), (2, 3)}, true {(2, 1), (2, 3)}.
2. “mug to the left of the other mug”, λx.∃y.mug(x) ∧ left-rel(x, y) ∧ mug(y); predicted {(3, 1)}, true {(3, 1)}.
3. “objects on the table”, λx.∃y.object(x) ∧ on-rel(x, y) ∧ table(y); predicted {(1, 4), (2, 4), (3, 4)}, true {(1, 4), (2, 4), (3, 4)}.
4. “two blue cups are placed near to the computer screen”, λx.blue(x) ∧ cup(x) ∧ comp.(x) ∧ screen(x); predicted {(1)}, true {(1, 2), (3, 2)}.
5. “What cities are in North Carolina?”, λx.∃y.city(x) ∧ in-rel(x, y) ∧ y = NC; predicted {(CH, NC), (GB, NC), (RA, NC)}, true {(CH, NC), (GB, NC), (RA, NC)}.
6. “What city is east of Greensboro in North Carolina?”, λx.∃y,z.city(x) ∧ east-rel(x, y) ∧ y = GB ∧ in-rel(y, z) ∧ z = NC; predicted {(RA, GB, NC), (MB, GB, NC)}, true {(RA, GB, NC)}.
7. “What cities are on the ocean?”, λx.∃y.city(x) ∧ on-rel(x, y) ∧ ocean(y); predicted {(CH, AO), (GB, AO), (MB, AO), (RA, AO)}, true {(MB, AO)}.

5.4 Models and Training

The evaluation compares three models. The first model is LSP-W, which is LSP trained using the weakly supervised algorithm described in Section 4. The second model, LSP-CAT, replicates the model of Matuszek et al. (2012) by restricting LSP to use category predicates. LSP-CAT is constructed by removing all relation predicates in lexicon entries, mapping entries like λf.λg.λx.∃y.r(x, y) ∧ g(x) ∧ f(y) to λf.λg.λx.∃y.g(x) ∧ f(y). This model is also trained using our weakly supervised algorithm. The third model, LSP-F, is LSP trained with full supervision, using the manually annotated semantic parses and logical knowledge bases in our data sets. Given these annotations, training LSP amounts to independently training a semantic parser (using sentences with annotated logical forms, {(z_i, ℓ_i)}) and a set of perceptual classifiers (using environments with annotated logical knowledge bases, {(d_i, Γ_i)}). This model measures the performance achievable with LSP given significantly more supervision.

All three variants of LSP were trained using the same hyperparameters. For SCENE, we computed subgradients in 5 example minibatches and performed 100 passes over the data using λ = 0.03. For GEOQA, we computed subgradients in 8 example minibatches, again performing 100 passes over the data using λ = 0.02. We tried varying the regularization parameter, but found that performance was relatively stable under λ ≤ 0.05. All experiments use leave-one-environment-out cross-validation to estimate model performance. We hold out each environment in turn, train each model on the remaining environments, then test on the held-out environment.

5.5 Results

We consider two prediction problems in the evaluation. The first problem is to predict the correct denotation γ_i for a statement z_i in an environment d_i. A correct prediction on this task corresponds to a correctly answered question. A weakness of this task is that it is possible to guess the right denotation without fully understanding the language. For example, given a query like “mugs on the table,” it might be possible to guess the denotation based solely on “mugs,” ignoring “table” altogether. The grounding prediction task corrects for this problem. Here, each model predicts a grounding, which is the set of all satisfying assignments to the variables in a logical form. For example, for the logical form λx.∃y.left-rel(x, y) ∧ mug(y), the grounding is the set of (x, y) tuples for which both left-rel(x, y) and mug(y) return true. Note that, if the predicted semantic parse is incorrect, the predicted grounding for a statement may contain a different number of variables than the true grounding; such groundings are incorrect. Figure 5 shows model predictions for the grounding task.

Performance on both tasks is measured using exact match accuracy. This metric is the fraction of examples for which the predicted set of entities (be it the denotation or grounding) exactly equals the annotated set. This is a challenging metric, as the number of possible sets grows exponentially in the number of entities in the environment. Say an environment has 5 entities and a logical form has two variables; then there are 2^5 possible denotations and 2^25 possible groundings. To quantify this difficulty, note that selecting a denotation uniformly at random achieves 6% accuracy on SCENE, and 1% accuracy on GEOQA; selecting a random grounding achieves 1% and 0% accuracy, respectively.

Table 2: Model performance on the SCENE and GEOQA datasets. LSP-CAT is an ablated version of LSP that only learns categories (similar to Matuszek et al. (2012)), LSP-F is LSP trained with annotated semantic parses and logical knowledge bases, and LSP-W is LSP trained on sentences with annotated denotations. Results are separated by the number of relations in each test natural language statement.

(a) Results on the SCENE data set:
Denotation γ   0 rel.  1 rel.  other  total
LSP-CAT         0.94    0.45   0.20   0.51
LSP-F           0.89    0.81   0.20   0.70
LSP-W           0.89    0.77   0.16   0.67
Grounding g    0 rel.  1 rel.  other  total
LSP-CAT         0.94    0.37   0.00   0.42
LSP-F           0.89    0.80   0.00   0.65
LSP-W           0.89    0.70   0.00   0.59
% of data         23      56     21    100

(b) Results on the GEOQA data set:
Denotation γ   0 rel.  1 rel.  other  total
LSP-CAT         0.22    0.19   0.07   0.17
LSP-F           0.64    0.53   0.21   0.48
LSP-W           0.64    0.58   0.21   0.51
Grounding g    0 rel.  1 rel.  other  total
LSP-CAT         0.22    0.19   0.00   0.16
LSP-F           0.64    0.53   0.17   0.47
LSP-W           0.64    0.58   0.15   0.50
% of data          8      72     20    100

Table 2 shows results for both applications using exact match accuracy. To better understand the performance of each model, we break down performance according to linguistic complexity. We compute the number of relations in the annotated logical form for each statement, and show separate results for 0 and 1 relations.
We also include an “other” category to capture sentences with more than one relation (very infrequent), or that include quantifiers, comparatives, or other linguistic phenomena not captured by LSP.

The results from these experiments suggest three conclusions. First, we find that modelling relations is important for both applications, as (1) the majority of examples contain relational language, and (2) LSP-W and LSP-F dramatically outperform LSP-CAT on these examples. The low performance of LSP-CAT suggests that many denotations cannot be predicted from only the first noun phrase in a statement, demonstrating that both applications require an understanding of relations. Second, we find that weakly supervised training and fully supervised training perform similarly, with accuracy differences in the range of 3%-6%. Finally, we find that LSP-W performs similarly on both the denotation and complete grounding tasks; this result suggests that when LSP-W predicts a correct denotation, it does so because it has identified the correct entity referents of each portion of the statement.

5.6 Component Error Analysis

We performed an error analysis of each model component to better understand the causes of errors. Table 3 shows the accuracy of the semantic parser from each trained model. Each held-out sentence z_i is parsed to produce a logical form ℓ, which is marked correct if it exactly matches our manual annotation ℓ_i. A correct logical form implies a correct grounding for the statement when the parse is evaluated in the gold standard logical knowledge base. These results show that both LSP-W and LSP-F have reasonably accurate semantic parsers, given the restrictions on possible logical forms. Common mistakes include missing lexicon entries (e.g., “borders” is POS-tagged as a noun, so the GEOQA lexicon does not include a verb for it) and prepositional phrase attachments (e.g., 6th example in Figure 5).

Table 3: Accuracy of the semantic parser from each trained model. Upper bound is the highest accuracy achievable without modelling comparatives and other linguistic phenomena not captured by LSP.

              SCENE   GEOQA
LSP-CAT        0.21    0.17
LSP-W          0.72    0.71
LSP-F          0.73    0.75
Upper Bound    0.79    0.87

Table 4 shows the precision and recall of the individual perceptual classifiers. We computed these metrics by comparing each annotated predicate in the held-out environment with the model’s predictions for the same predicate, treating each entity (or entity pair) as an independent example for classification. Fully supervised training appears to produce better perceptual classifiers than weakly supervised training; however, this result conflicts with the full system evaluation in Table 2, where both systems perform equally well. There are two causes for this phenomenon: uninformative adjectives and unimportant relation instances.

Uninformative adjective predicates are responsible for the low performance of the category classifiers in SCENE. Phrases like “LCD screen” in this domain are annotated with logical forms such as λx.lcd(x) ∧ screen(x). Here, lcd is uninformative, since screen already denotes a unique object in the environment. Therefore, it is not important to learn an accurate classifier for lcd. Weakly supervised training learns that lcd is meaningless, yet predicts the correct denotation for λx.lcd(x) ∧ screen(x) using its screen classifier.
The discrepancy in relation performance occurs because the relation evaluation weights every relation equally, whereas in reality some relations are more frequent. Furthermore, even within a single relation, each entity pair is not equally important – for example, people tend to ask what is in a state, but not what is in an ocean. To account for these factors, we define a reweighted relation metric using the annotated logical forms containing only one relation, of the form λx.∃y.c_1(x) ∧ r(x, y) ∧ c_2(y). Using these logical forms, we measure the performance of r on the set of (x, y) pairs such that c_1(x) ∧ c_2(y), then average this over all examples. Table 4 shows that, using this metric, both training regimes have similar performance. This result suggests that weakly supervised training adapts LSP’s relation classifiers to the relation instances which are empirically important for grounding natural language.

Table 4: Perceptual classifier performance, measured against the gold standard logical knowledge bases. LSP-CAT is excluded from the relation evaluations, since it does not learn relations. Relations (rw) is the reweighted metric (see text for details).

                        SCENE                GEOQA
Categories        P     R     F1       P     R     F1
LSP-CAT          0.40  0.86  0.55     0.78  0.25  0.38
LSP-W            0.40  0.84  0.54     0.85  0.63  0.73
LSP-F            0.98  0.96  0.97     0.89  0.63  0.74
Relations         P     R     F1       P     R     F1
LSP-W            0.40  0.42  0.41     0.34  0.51  0.41
LSP-F            0.99  0.87  0.92     0.70  0.46  0.55
Relations (rw)    P     R     F1       P     R     F1
LSP-W            0.98  0.98  0.98     0.86  0.72  0.79
LSP-F            0.98  0.95  0.96     0.89  0.66  0.76

6 Conclusions

This paper introduces Logical Semantics with Perception (LSP), a model for mapping natural language statements to their referents in a physical environment. LSP jointly models perception and language understanding, simultaneously learning (1) to map from environments to logical knowledge bases containing instances of both one-argument categories and two-argument relations, and (2) to semantically parse natural language. Furthermore, we introduce a weakly supervised training procedure that trains LSP using only sentences with annotated denotations, without annotated semantic parses or noun phrase/entity mappings. An experimental evaluation reveals that this procedure performs nearly as well as fully supervised training, while requiring significantly less annotation effort. Our experiments also find that LSP’s ability to learn relations improves performance over comparable prior work (Matuszek et al., 2012).

Acknowledgments

This research has been supported in part by DARPA under award FA8750-13-2-0005, and in part by a gift from Google. We also gratefully acknowledge the CMU Read the Web group for assistance with data set construction, and Tom Mitchell, Manuela Veloso, Brendan O’Connor, Felix Duvallet, Robert Fisher and the anonymous reviewers for helpful discussions and comments on the paper.

References

Rehj Cantrell, Matthias Scheutz, Paul Schermerhorn, and Xuan Wu. 2010. Robust spoken instruction understanding for HRI. In Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction.

David L. Chen and Raymond J. Mooney. 2008. Learning to sportscast: a test of grounded language acquisition. In Proceedings of the 25th International Conference on Machine Learning.

David L. Chen and Raymond J. Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In Proceedings of the 25th AAAI Conference on Artificial Intelligence.
James Clarke, Dan Goldwasser, Ming-Wei Chang, and Dan Roth. 2010. Driving semantic parsing from the world’s response. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning.

Navneet Dalal and Bill Triggs. 2005. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

Juraj Dzifcak, Matthias Scheutz, Chitta Baral, and Paul Schermerhorn. 2009. What to do and how to do it: translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In Proceedings of the 2009 IEEE International Conference on Robotics and Automation.

Kai-Yuh Hsiao, Nikolaos Mavridis, and Deb Roy. 2003. Coupling perception and simulation: Steps towards conversational robotics. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems.

Rohit J. Kate and Raymond J. Mooney. 2006. Using string-kernels for learning semantic parsers. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the ACL.

Rohit J. Kate and Raymond J. Mooney. 2007. Learning language semantics from ambiguous supervision. In Proceedings of the 22nd Conference on Artificial Intelligence.

Thomas Kollar, Stefanie Tellex, Deb Roy, and Nicholas Roy. 2010. Toward understanding natural language directions. In Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction.

Jayant Krishnamurthy and Tom M. Mitchell. 2012. Weakly supervised training of semantic parsers. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Geert-Jan M. Kruijff, Hendrik Zender, Patric Jensfelt, and Henrik I. Christensen. 2007. Situated dialogue and spatial organization: What, where... and why. International Journal of Advanced Robotic Systems.

Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2010. Inducing probabilistic CCG grammars from logical form with higher-order unification. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing.

Michael Levit and Deb Roy. 2007. Interpretation of spatial language in a map navigation task. IEEE Transactions on Systems, Man, and Cybernetics, Part B.

Percy Liang, Michael I. Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In Proceedings of the Association for Computational Linguistics.

Matthew MacMahon, Brian Stankiewicz, and Benjamin Kuipers. 2006. Walk the talk: connecting language, knowledge, and action in route instructions. In Proceedings of the 21st National Conference on Artificial Intelligence.

Cynthia Matuszek, Dieter Fox, and Karl Koscher. 2010. Following directions using statistical machine translation. In Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction.

Cynthia Matuszek, Nicholas FitzGerald, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox. 2012. A joint model of language and perception for grounded attribute learning. In Proceedings of the 29th International Conference on Machine Learning.

Nathan D. Ratliff, J. Andrew Bagnell, and Martin A. Zinkevich. 2006. (Online) subgradient methods for structured prediction. In Artificial Intelligence and Statistics.
Deb Roy, Kai-Yuh Hsiao, and Nikolaos Mavridis. 2003. Conversational robots: building blocks for grounding word meaning. In Proceedings of the HLT-NAACL 2003 Workshop on Learning Word Meaning from Non-linguistic Data.

Nobuyuki Shimizu and Andrew Haas. 2009. Learning to follow navigational route instructions. In Proceedings of the 21st International Joint Conference on Artificial Intelligence.

Marjorie Skubic, Dennis Perzanowski, Sam Blisard, Alan Schultz, William Adams, Magda Bugajska, and Derek Brock. 2004. Spatial language for human-robot dialogs. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews.

Mark Steedman. 1996. Surface Structure and Interpretation.

Ben Taskar, Carlos Guestrin, and Daphne Koller. 2004. Max-margin Markov networks. In Advances in Neural Information Processing Systems.

Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew Walter, Ashis Banerjee, Seth Teller, and Nicholas Roy. 2011. Understanding natural language commands for robotic navigation and mobile manipulation. In AAAI Conference on Artificial Intelligence.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology.

Terry Winograd. 1970. Procedures as a representation for data in a computer program for understanding natural language. Ph.D. thesis, Massachusetts Institute of Technology.

Yuk Wah Wong and Raymond J. Mooney. 2006. Learning for semantic parsing with statistical machine translation. In Proceedings of the main conference on Human Language Technology Conference of NAACL.

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Proceedings of the 21st Conference in Uncertainty in Artificial Intelligence.