key: cord-0616231-wjfvr7pa
authors: Kane, Benjamin; Platonov, Georgiy; Schubert, Lenhart K.
title: History-Aware Question Answering in a Blocks World Dialogue System
date: 2020-05-26
journal: nan
DOI: nan
sha: 4e4e4ba5e2a1a9269d70ead456aa0883211b8349
doc_id: 616231
cord_uid: wjfvr7pa

It is essential for dialogue-based spatial reasoning systems to maintain memory of historical states of the world. In addition to conveying that the dialogue agent is mentally present and engaged with the task, referring to historical states may be crucial for enabling collaborative planning (e.g., for planning to return to a previous state, or diagnosing a past misstep). In this paper, we approach the problem of spatial memory in a multi-modal spoken dialogue system capable of answering questions about interaction history in a physical blocks world setting. This work builds upon a full spatial question-answering pipeline consisting of a vision system, speech input and output mediated by an animated avatar, a dialogue system that robustly interprets spatial queries, and a constraint solver that derives answers based on 3-D spatial modelling. The contributions of this work include a symbolic dialogue context registering knowledge about discourse history and changes in the world, as well as a natural language understanding module capable of interpreting free-form historical questions and querying the dialogue context to form an answer.

AI systems have seen impressive growth in the last 10 or 20 years with respect to various specific, narrow tasks in spatial cognition and natural language processing. However, there is still a shortage of multimodal interactive systems capable of performing high-level tasks requiring understanding and reasoning. In particular, many such dialogue systems lack episodic memory, i.e., historical recall of earlier discourse and perceived "world" situations and events. Episodic memory is needed for supporting a sense of shared contextual awareness, and to set the stage for potential development of intelligent collaborative systems such as ones that allow diagnostic discussion of past actions, planning to re-achieve an earlier situation, repeating a past action sequence, etc.

This work was supported by DARPA grant W911NF-15-1-0542, NSF NRT Graduate Training Grant 2019-2020, and NSF EAGER Award IIS-1940981. We thank our team of volunteers for their suggestions and contributions to system evaluation.

Because of the complexity of most real-life tasks, the blocks world domain provides an ideal experimental setting for developing prototypes with such capabilities. Interest in the blocks world as a domain for AI research goes back as far as the 1970s, with Winograd's thesis [24] being one of the earliest studies, featuring a virtual environment along with text-based interaction. Winograd's system understood basic spatial relations such as one block being on another or in the box, and also maintained a record of block "put-on" actions (with their preconditions and effects), enabling it to answer questions about past actions and their purpose.

In recent years there has been a resurgence of interest in solving problems in such limited domains using modern techniques. Despite its relative simplicity, the blocks world domain motivates implementation of diverse capabilities in a virtual interactive agent aware of physical blocks on a table, including visual scene analysis, spatial reasoning, planning, learning of new concepts, dialogue management, voice interaction, and more. Recent studies in this domain have focused on learning spatial concepts in a physical blocks world setting [17, 16] and applying deep learning techniques to a virtual blocks world environment [2]. However, unlike Winograd's earlier study, these recent systems lack episodic memory and cannot reason about historical states of the world.

In this work, we extend a spatial question-answering system in a physical blocks world to allow for registering historical context and answering questions about the session history, such as "Which block did I just move?", "Where was the Toyota block before I moved it?", "Did the Target block ever touch the Texaco block?", "Was the Twitter block always between two red blocks?", "When did I last move the Starbucks block?", etc. Our modelling of spatial relations is based on 3-D Blender graphics representations of the objects in the blocks world; thus a straightforward approach to providing a historical record of the world would be to store successive states in this "imagistic" form. However, this would be intractable in terms of computation and storage in more general scenarios, where there may be object-rich scenes or indefinitely long histories. Moreover, this would appear to be cognitively implausible: Detailed visual memory of scenes in humans is quite short-lived, and a few higher-level properties suffice for humans to swiftly reconstruct more detailed representations of a scene [21]. Therefore we approach this problem by maintaining a dialogue context containing symbolic knowledge about changes in the world. Used jointly with current spatial observations, this symbolic knowledge enables reconstruction of past states of the world, sufficient for answering historical questions.

Early studies featuring the blocks world include [24] (as already mentioned) and [3], both of which relied on a simulated environment. The latter was focused on construction planning, rather than user interaction, and as such incorporated extensive reasoning about geometric consistency and structural stability, more than descriptive aspects of the block configurations. Both of the systems maintained memory of historical states of the blocks world: the former kept a record of "put-on" actions, with which it could answer historical questions by reconstructing necessary knowledge, while the latter maintained a cache of known facts that it could query to improve performance. These systems demonstrated impressive planning capabilities, but their worlds were simulated, interaction was text-based, and they lacked realistic models for spatial relations.

Modern efforts in blocks worlds include work by Perera et al. [17, 16], which is focused on learning spatial concepts (such as staircases, towers, etc.) based on verbally-conveyed structural constraints, e.g., "The height is at most 3", as well as explicit examples and counterexamples, given by the user. Their spatial modelling is mostly concerned with positioning of substructures such as stacks or rows with respect to each other, and sizes of rows and columns, whereas our focus is more on realistic modelling of prepositional relations, and on answering free-form questions.

In a rather different vein, the work by Bisk et al. [2] is concerned with learning to transduce verbal instructions, e.g., "Move McDonald's so its just to the right (not touching) the Twitter block" into block displacements in a simulated environment. This system, unlike ours, relies on deep learning and does not use high-level cognitively motivated spatial relation models. The CLEVR dataset [8] and its modified versions, such as [13], lay out an explicit spatial question answering challenge that has inspired a flurry of visual reasoning studies, e.g., [12] and [15], the latter of which achieves near-perfect scores on the CLEVR questions.

Common shortcomings of the deep learning approaches include a reliance on image data projected from synthetic scenes of limited variety, simplified ground-truth models of spatial relations (e.g., left means any amount laterally to the left, regardless of depth or intervening objects, etc.), and use of domain-specific procedural formalisms for linguistic semantics. Also, while the system architectures employed could probably be carried over to other domains, there would be no carry-over of conceptual understanding or language understanding: each new domain would require creation of new large training corpora for both the spatiophysical and linguistic aspects of the domain, with ever-increasing demands for data as the complexity of scenes being considered grows.

Finally, in relation to our focus here, we note that the recent blocks world systems do not maintain an episodic memory or attempt reasoning about historical states of the world.

Outside of the blocks world domain, several AI systems have made use of some form of episodic memory. The TRAINS system [5] and the subsequent TRIPS system [4] were interactive dialogue-based problem solving systems in a virtual map environment. These systems maintained temporal knowledge containing facts about the planning environment, and the planning agents were able to reason about temporal aspects of plans using Allen Interval Logic [1]. The work reported in [14] implements a spatial working memory using LIDA, a symbolic cognitive architecture, in a virtual reality environment. This work represents spatial context using a combination of a grid representation of the world, and hierarchical "place nodes" with individual activations, which are updated based on phase changes of the grid. The performance of this system was compared to human performance on a map recall task; however, this study did not involve any ability to reason about historical spatial relations between objects. Recent deep-learning-based approaches to modelling spatial episodic memory include [22] and [6]. The former uses an unsupervised encoder-decoder model to represent episodic memory as latent embeddings, and shows that this model can allow a robot to recall previous visual episodes in a physical scene. The latter introduces a neuro-symbolic Structured Event Memory (SEM) model that is capable of segmenting events in video data and reconstructing past memory items.

The goal of this work is to enable dialogue-based question answering about historical states of blocks arranged on a table. This includes both questions about spatially-relevant actions by the user (e.g., "Which block did I just move?", "What blocks did I put near the Twitter block?", "When did I last move the Starbucks block?", "What was the first block that I moved?", etc.), as well as questions about past spatial relations between blocks (e.g., "Where was the Toyota block before I moved it?", "Did the Target block ever touch the Texaco block?", "Was the Twitter block always on the Starbucks block?", etc.). Moreover, the historical module for answering these questions should be sufficiently general to be extended to more realistic domains, such as a "room world" containing everyday items.

This task serves two purposes. First, it motivates the augmentation of the dialogue manager with an episodic memory, and guides its integration with the overall pipeline (including dialogue management, audio-visual input/output, etc.). Experience with episodic memory design for the blocks world will allow extensions to more advanced functionalities in more general settings. Second, our overarching goal is to build a collaborative blocks world agent, capable of interactively learning structural concepts and building examples of them, relying on natural language communication with the user. Spatial episodic memory is necessary for allowing diagnostic discussion of past actions (e.g. "You mean, next to the previous block?", "Where it was before, right?", etc.), repeating past action sequences, and generally supporting a sense of shared contextual awareness.

The following example interaction between the user and the system demonstrates the kind of back-and-forth exchange our system is capable of: 

The capacity for answering historical questions is built on top of an existing dialogue-based blocks world system and physical apparatus, which we describe in this section.

The physical apparatus (see Fig. 1a) comprises a square table surface, approximately 1.5 m x 1.5 m in size, several cubical blocks with 0.15 m sides, two Microsoft Kinect sensors to track the state of the world, and a display for user interaction. The blocks are marked with corporate logos, such as McDonald's, Toyota, Texaco, etc., which serve as block names and allow the user and the system to uniquely identify and refer to individual blocks. The blocks are also color-coded as either red, green, or blue, using the colored stripes running along the edges of the blocks (see Fig. 1a).

The architecture of the software component is shown in Fig. 1b. The system uses audio-visual input and output. The block detection and tracking module periodically reads the input from the Kinect cameras and updates the block positioning information. Based on the information from the block tracking module, the physical block arrangement is modeled as a 3-D scene in Blender. All the spatial processing is performed on that model. The automatic speech recognition module, based on the Google Cloud Speech-To-Text API, is responsible for generating the transcripts of user utterances.

For communicating back to the user, we employ an interactive avatar, David. It is capable of vocalizing the text and displaying facial expressions, making the flow of conversation more natural than with textual I/O.

The spatial component module together with a constraint solver is responsible for analyzing the block configuration with respect to the conditions implicit in the user's utterance. The Eta dialogue manager is responsible for unscoped logical form (ULF) generation (see subsection below) and controlling the dialogue flow and transition between phases, such as greeting, ending the session, etc.

The Eta dialogue manager (DM) is responsible for semantic parsing and dialogue control. Eta is designed to follow a modifiable dialogue schema, the contents of which are formulas in episodic logic [23] with open variables describing successive steps (events) expected in the course of the interaction, typically speech acts by the system or the user. These are either realized directly as instantiated actions, or expanded into sub-schemas for further processing as the interaction proceeds.

A key mechanism used in the course of instantiating schema steps, including interpretation of user inputs, is hierarchical pattern transduction. Transduction hierarchies specify patterns at their nodes, with branches from a node providing alternative continuations as a hierarchical match proceeds. Terminal nodes provide result templates, or specify a subschema, a subordinate transduction tree, or some other result. The patterns are simple template-like ones that look for particular words or word features, and allow for "match-anything", length-bounded word spans. For example, a feature-annotated word might be (spring season time-period noun name), while "any number of words" is indicated by 0 and "at most two words" is indicated by 2.
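The template matching described above can be illustrated with a minimal sketch. It assumes a pattern is a flat list whose items are either a word or feature name, 0 for a span of any length, or a positive integer n for a span of at most n words. The `FEATURES` lexicon, the function names, and the shortest-span-first search order are hypothetical and are not taken from Eta.

```python
from typing import Dict, List, Optional, Set, Union

# Hypothetical feature lexicon: each word maps to the features it carries.
FEATURES: Dict[str, Set[str]] = {
    "spring": {"season", "time-period", "noun", "name"},
    "block": {"noun"},
    "move": {"verb"},
}

PatternItem = Union[str, int]


def word_matches(item: str, word: str) -> bool:
    """A pattern item matches the word itself or any one of its features."""
    return item == word or item in FEATURES.get(word, set())


def match(pattern: List[PatternItem], words: List[str]) -> Optional[List[List[str]]]:
    """Return one matched segment per pattern item, or None if there is no match."""
    if not pattern:
        return [] if not words else None
    head, rest = pattern[0], pattern[1:]
    if isinstance(head, int):
        # 0 = any number of words; n > 0 = at most n words.
        limit = len(words) if head == 0 else min(head, len(words))
        for k in range(limit + 1):  # try the shortest span first
            tail = match(rest, words[k:])
            if tail is not None:
                return [words[:k]] + tail
        return None
    if words and word_matches(head, words[0]):
        tail = match(rest, words[1:])
        if tail is not None:
            return [[words[0]]] + tail
    return None
```

For instance, the pattern `[0, "verb", 2, "noun"]` segments "which block did i just move the twitter block" into a leading span, the verb "move", the span "the twitter", and the noun "block".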

As described so far, the DM resembles the dialogue manager used by the LISSA system [19, 20]. However, interpretation in that system was designed for casual conversation, and was limited to context-dependent derivation of English gist clauses from user inputs. The gist clauses were derived using transduction trees that take account of prior utterances, largely eliminating context dependence in the process. This greatly simplifies the process of finding appropriate responses to inputs, again via transduction trees. Eta likewise uses gist clause derivation, but only for handling casual aspects of dialogue such as greetings, and for "tidying up" some inputs in preparation for further processing.

A simplified generic example of a gist clause transduction tree is shown in Figure 2. The gist clause of Eta's previous utterance (shown in red) is used to select an appropriate subtree, which is next used to extract a gist clause from the user's utterance (shown in green).

After extracting gist clauses, Eta also derives an unscoped logical form (ULF) [11] from the tidied-up input. ULF is closely related to the logical syntax used in schemas -it is a preliminary form of that syntax, when mapping English to logic. ULF differs from similar semantic representations, e.g., AMR, in that it is close to the surface form of English, covers a richer set of semantic phenomena, and does so in a type-consistent way. To illustrate the approach, consider the example "Which blocks are on two other blocks?". The resulting ULF will be (((Which.d (plur block.n)) ((pres be.v) (on.p (two.d (other.a (plur block.n)))))) ?). As can be seen from this example, the resulting ULF retains much of the surface structure, but uses semantic typing and adds operators to indicate plurality, tense, aspect, and other linguistic phenomena.
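Since ULFs are written as parenthesized expressions, the example above can be loaded into a nested list structure for inspection. The following is a minimal sketch of a generic S-expression reader applied to that ULF. The helper names are hypothetical, and the reader does not handle atoms delimited by vertical bars that contain spaces.

```python
from typing import List, Union

Sexp = Union[str, List["Sexp"]]


def tokenize(text: str) -> List[str]:
    """Split a parenthesized expression into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens: List[str]) -> Sexp:
    """Consume tokens (in place) and return one atom or one nested list."""
    token = tokens.pop(0)
    if token == "(":
        items: List[Sexp] = []
        while tokens[0] != ")":
            items.append(parse(tokens))
        tokens.pop(0)  # discard the closing parenthesis
        return items
    return token


def atoms_with_suffix(ulf: Sexp, suffix: str) -> List[str]:
    """Collect atoms whose type suffix (e.g. '.n', '.p') matches, left to right."""
    if isinstance(ulf, str):
        return [ulf] if ulf.endswith(suffix) else []
    return [atom for part in ulf for atom in atoms_with_suffix(part, suffix)]


ULF_TEXT = ("(((Which.d (plur block.n)) "
            "((pres be.v) (on.p (two.d (other.a (plur block.n)))))) ?)")
ulf = parse(tokenize(ULF_TEXT))
```

Here `ulf[0][0]` is the subject noun phrase `["Which.d", ["plur", "block.n"]]`, and `atoms_with_suffix(ulf, ".p")` picks out the preposition `on.p`, which shows how the type suffixes keep the semantic category of each word accessible.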

We extended the semantic parsing mechanism, originally aimed at deriving gist clauses, by introducing phrase-based recursion into hierarchical transduction trees. This enabled a rather novel form of compositional interpretation that is quite efficient and accurate for the domain at hand, and has proved to be readily extensible. A top-level transduction tree identifies different types of input sentences and accordingly sends them off to more specialized trees. These trees again use hierarchical pattern matching based on words and their features to identify meaningful (generally phrasal) segments of the input, such as an NP segment or a VP segment. They then dispatch the corresponding (feature-annotated) word sequences to transduction hierarchies appropriate for their phrasal types; these recursively derive and return ULF formula constituents, which are then composed into larger expressions by the "calling" tree, and returned. At the level of individual words (or certain phrases), a lexicon and lexical routines supply word ULFs. The efficiency and accuracy of the approach lies in the fact that hierarchical pattern matching can quite accurately segment utterances into meaningful parts, often relying on automatically added syntactic and semantic features, so that the need for recursive backtracking rarely arises. Also some transductions may remove or ignore extraneous words (such as the first two words of "OK, so, when did the ..."), improving robustness.

An example of a transduction tree being used for parsing a historical question into ULF is shown and described in Figure 3. As in the example mentioned above, the resulting ULF retains much of the surface structure, but uses semantic typing and adds operators to indicate plurality, tense, aspect, and other linguistic phenomena. Additional regularization is done with a limited coreference module, which can resolve anaphora and referring expressions such as "it", "that block", etc., by detecting and storing discourse entities in context and employing recency and syntactic salience heuristics.

To answer historical questions, the blocks world agent requires two related functionalities: First, the DM must maintain a dialogue context, including (besides basic indexical knowledge such as current time, location, and dialogue participants) the discourse history, a list of past referents for reference resolution, and spatial episodic memory. Secondly, the DM must be able to robustly parse historical questions into the logical form described above, and consequently resolve the resulting semantic interpretation into operations over the episodic memory.

As noted earlier, the spatial component of an interaction memory might consist of detailed visual or vector-based memory representations that the system can query, or else it might be implemented as a high-level symbolic memory enabling reconstruction of past scenes. Inspired by Winograd's early work in [24], and mindful of the cognitive considerations already cited [21], we chose to adopt the latter approach.

As the spatial question answering session progresses, the vision system records the centroid coordinates of blocks and block moves in real time, thresholded to avoid registering noise as block moves. On the DM side, a "perceive-world" action in the schema causes the DM to request perceptions (represented in ULF) from the vision system. These perceptions currently consist of block location propositions of the form (|Twitter| at-loc.p ($ loc ?x ?y ?z)) and block move propositions of the form (|Twitter| ((past move.v) (from.p-arg ($ loc ?x1 ?y1 ?z1)) (to.p-arg ($ loc ?x2 ?y2 ?z2)))). In principle our formalism also allows named locations, e.g., (|Twitter| at-loc.p |Loc1|), though this is not yet implemented.

We rely on a simple linear, discrete time representation (possible future modifications are discussed in Section 6). The DM stores a symbol denoting the current time, with |Now0| representing the time at which the dialogue is initialized. Each sequential action in the world causes the DM to "update" its time twice, corresponding to the time during which the move is in progress and the time at which the move has finished. That is, if the DM denoted the initial time with |Now0|, a block move would cause it to update its time to |Now1| (the in-progress time), and then to |Now2| once the move has finished. These temporal symbols are related to each other via propositions of the form (|Now1| before.p |Now2|) and (|Now2| after.p |Now1|) stored in the context. The fact ((|Twitter| ((past move.v) (from.p-arg ($ loc ?x1 ?y1 ?z1)) (to.p-arg ($ loc ?x2 ?y2 ?z2)))) * |Now1|) is stored in the dialogue context, where '*' is the episodic "true in" operator described in [23]. User utterance actions are similarly stored in the context.
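The double time update per move can be sketched as follows. This is a toy illustration under stated assumptions, not Eta's implementation: the `DialogueContext` class, its method names, and the encoding of ULF propositions as nested Python tuples are all hypothetical.

```python
from typing import Any, List, Tuple

Loc = Tuple[float, float, float]


class DialogueContext:
    """Toy context: a time counter plus a flat list of stored propositions."""

    def __init__(self) -> None:
        self.index = 0
        self.facts: List[Any] = []

    def now(self) -> str:
        """Symbol for the current time, e.g. |Now0| at initialization."""
        return f"|Now{self.index}|"

    def advance(self) -> str:
        """Move to the next time symbol and relate it to the previous one."""
        previous = self.now()
        self.index += 1
        current = self.now()
        self.facts.append((previous, "before.p", current))
        self.facts.append((current, "after.p", previous))
        return current

    def record_move(self, block: str, src: Loc, dst: Loc) -> str:
        """Store a move as true in the in-progress time, then step past it."""
        in_progress = self.advance()
        move = (block, (("past", "move.v"),
                        ("from.p-arg", ("$", "loc") + src),
                        ("to.p-arg", ("$", "loc") + dst)))
        self.facts.append((move, "*", in_progress))
        self.advance()  # the time at which the move has finished
        return in_progress
```

Starting from |Now0|, a single call to `record_move` attaches the move fact to |Now1| and leaves the context at |Now2|, matching the update pattern described above.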

Based on this context, the DM can efficiently reconstruct a scene at any past time by backtracking from currently observed block locations, as well as use these reconstructed scenes to evaluate spatial relationships between blocks in a "rough-and-ready" way, i.e., using approximate calculations of spatial relations based on block centroid coordinates, as opposed to the detailed spatial computations supported by the visual blocks world system.
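A minimal sketch of the backtracking idea follows, assuming that each stored move is reduced to a tuple of its in-progress time index, the block name, and the origin and destination centroids. The function name and the convention that a block counts as still being at its origin during the in-progress time are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Loc = Tuple[float, float, float]
Move = Tuple[int, str, Loc, Loc]  # (in-progress time index, block, from, to)


def reconstruct(current: Dict[str, Loc], moves: List[Move], time: int) -> Dict[str, Loc]:
    """Scene at `time`, obtained by undoing every move not yet finished then."""
    scene = dict(current)
    # Undo moves from newest to oldest so that the earliest origin wins.
    for t, block, src, _dst in sorted(moves, reverse=True):
        if t >= time:
            scene[block] = src
    return scene
```

Only the current observations and the list of moves are needed, so storage grows with the number of moves rather than with the number of blocks times the number of time steps.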

Following a successful parse of a historical question by the semantic parser described previously, historical modifiers in a ULF will be indicated by constituents of type "adv-e" (event adverbial, e.g., (adv-e (during.p (the.d move.n)))), "adv-f" (frequency adverbial, e.g., (adv-f (three.a (plur time.n)))), or "adv-s" (sentence adverbial, e.g., (adv-s (after.ps (|Twitter| (past move.v))))).

The algorithm the DM uses to answer historical questions is as follows: starting from the present time, the algorithm iterates over past times, reconstructing the scene at each one using stored knowledge about moves. At each time, the algorithm computes and stores a list of salient facts (i.e. propositions about spatial relations or actions which held at that time) depending on the subject, object, predicate, question category, and polarity of the query sentence. Furthermore, temporal constraints are applied to filter these times (in the manner described below) to obtain a final list of times with relevant attached facts.
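The iteration over past times can be sketched as below, assuming the reconstructed scenes are already available as a mapping from time index to block centroids. The function `collect_facts` and the toy relation `left_of` are hypothetical, and the dependence on question category and polarity is omitted; the relation in particular is a deliberately crude stand-in for the approximate centroid-based relations used by the DM.

```python
from typing import Callable, Dict, List, Tuple

Loc = Tuple[float, float, float]
Scene = Dict[str, Loc]
Fact = Tuple[str, str, str]  # (subject, relation name, object)


def collect_facts(scenes: Dict[int, Scene], subject: str, rel_name: str,
                  holds: Callable[[Loc, Loc], bool]) -> Dict[int, List[Fact]]:
    """Map each time, newest first, to the facts about `subject` that held then."""
    facts: Dict[int, List[Fact]] = {}
    for t in sorted(scenes, reverse=True):  # start from the present
        scene = scenes[t]
        facts[t] = [(subject, rel_name, other)
                    for other in sorted(scene)
                    if other != subject and holds(scene[subject], scene[other])]
    return facts


def left_of(a: Loc, b: Loc) -> bool:
    """Toy relation for illustration only: a has a smaller x coordinate than b."""
    return a[0] < b[0]
```

The resulting mapping from times to facts is the structure on which temporal constraints can then operate.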

The semantic types of these temporal and frequency modifiers allow them to be lifted to the sentence level [10] . Temporal constraints expressed by modifiers may be binary, e.g., (adv-e (before.p |Now4|)), or unary, e.g. (adv-e recent.a). A binary constraint takes a temporal entity as an argument and maps it to a truth value, depending on whether the given relation holds with the object of the constraint. This is used by the algorithm described above to filter out each time at which the binary constraint does not evaluate to true. However, first the binary constraint needs to resolve its object constituent (which could be a simple noun phrase or an embedded clause). This is done using a recursive call of the algorithm described above, which maps the object ULF to a list of times, treating any modifiers in the noun phrase or embedded clause as temporal constraints.

A unary constraint takes a set of times and maps it to a subset (possibly null) of these times. For example, the "recent" constraint above picks out the subset of times that are within some fixed threshold to the present time. Frequency constraints such as twice.adv-f or (adv-f (three.a (plur time.n))) are similar to unary constraints in that they take a set of times and return a subset of these, though their behavior is slightly more complicated: they pick out all times for which the salient facts attached to that time are also attached to at least N unique times, inclusive. For (adv-f always.a), N is taken to be the size of the set of times being filtered (so that only facts that are attached to every time in the set are obtained).

Fig. 4: A simplified example of how the context is represented and how the DM uses the context to compute relations given temporal constraints (top half), and an example of the DM determining an answer from a specific historical query (bottom half). Note that, although the visual scenes are shown for reference, the DM does not actually store detailed visual memories; it only stores the symbolic facts in the "Memory" column.

Each constraint may also be modified by a "mod-a" modifier, e.g., (adv-e (just.mod-a recent.a)), which modifies how that constraint is applied. In the case of "just recently", the singular most recent time is picked out.
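The binary, unary, frequency, and "mod-a" constraints can be sketched as simple filters over integer time indices. All function names, the representation of facts as strings, and the default threshold value are assumptions for illustration; the recursive resolution of a binary constraint's object to a reference time is not shown.

```python
from typing import Callable, Dict, List, Set

Times = List[int]
FactsAt = Dict[int, Set[str]]


def before(reference: int) -> Callable[[int], bool]:
    """Binary constraint: true of a time that precedes the reference time."""
    return lambda t: t < reference


def apply_binary(times: Times, constraint: Callable[[int], bool]) -> Times:
    """Filter out each time at which the binary constraint is not true."""
    return [t for t in times if constraint(t)]


def recent(times: Times, now: int, threshold: int = 4) -> Times:
    """Unary constraint: times within a fixed (hypothetical) threshold of now."""
    return [t for t in times if now - t <= threshold]


def just_recent(times: Times) -> Times:
    """'just recently': the single most recent time, if any."""
    return [max(times)] if times else []


def at_least(times: Times, facts: FactsAt, n: int) -> FactsAt:
    """Frequency constraint: keep facts attached to at least n distinct times."""
    counts: Dict[str, int] = {}
    for t in times:
        for fact in facts.get(t, set()):
            counts[fact] = counts.get(fact, 0) + 1
    kept = {t: {f for f in facts.get(t, set()) if counts[f] >= n} for t in times}
    return {t: fs for t, fs in kept.items() if fs}


def always(times: Times, facts: FactsAt) -> FactsAt:
    """'always': N is the size of the set of times being filtered."""
    return at_least(times, facts, len(times))
```

With four times and a fact that holds at all of them, `always` keeps that fact at every time and drops any fact that holds at only some of them.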

Historical questions don't necessarily involve sentence-level adverbial modifiers, as the temporal content could be embedded within a noun phrase, as in "What was the first block that I moved?". In this case, the DM will resolve this reference to a particular block by calling the above algorithm recursively, treating the noun pre- and post-modifiers as temporal constraints, and using the facts attached to the resulting times.

Once a list of final times and the corresponding facts/relations have been obtained, an answer is generated by making the appropriate substitutions in the query ULF (e.g. a wh-pronoun for the subject of a relation), applying syntactic transformations (e.g., uninverting questions and removing auxiliary verbs such as "do"), and converting this to surface form.

The DM additionally has a limited module for generating and responding to pragmatic inferences, based on the work in [9] . In the case where either no times or no relations are obtained, and the question contains a presupposition (e.g. "What block was the Twitter block on?" carries the presupposition that the Twitter block was on some block), the DM will attempt to respond by negating the inferred presupposition (e.g. "The Twitter block wasn't on any block.").

A full example of answering a historical question (using a simplified scene) is shown in Figure 4. The extraction of answer relations, given the query ULF and the generation of the answer ULF, are shown in the bottom half of the figure, while the scene reconstruction and computation of relevant facts/relations is depicted in the top half of the figure.

Note that the example in Figure 4 is actually fairly ambiguous; the answer could be "A, D, C" or "A, C" depending on whether someone reads the query as meaning any blocks that B ever touched before it was moved, or only the blocks that it touched directly before the move. In fact, we found that many natural historical questions are similarly under-specified, presenting a major source of difficulty. To deal with this issue, the DM's pragmatic module attempts to infer temporal constraints in these ambiguous cases; in this particular example, Eta would infer the constraint "most recently", unless the user explicitly specifies otherwise in their query.

Though our work is grounded in a physical blocks world system, the COVID-19 pandemic made an on-site user study impossible. Therefore we resorted to developing a virtualized environment that mirrors our setup, and used it to collect the evaluation data. Only the physical blocks tracker and the audio I/O were disabled in the modified system. All the crucial components evaluated in this work, namely the parser, the dialogue manager, and the historical question-answering subsystem based on the world state memory, were not changed, so the results of the user study are not invalidated.

We enlisted the help of 4 student volunteers to test the capabilities of the system, including both native and non-native English speakers. The participants were instructed to move the blocks around and ask general questions about relationships and changes in the world; no restrictions on wording were imposed. After the system displayed its answer, the participants were asked to provide feedback on the quality of the answer, by marking the system's answer as correct, partially correct or incorrect. Each participant contributed at least 100 questions.

Each session started with the blocks positioned in a row at the front of the table. The participants were instructed to reposition or stack up the blocks arbitrarily in the course of the question-answer session, to test the robustness and consistency of the spatial models. The data is presented in Table 1. A few malformed questions were excluded when computing accuracy. We find these preliminary results encouraging, given the complexity of the task and the unrestricted form of the questions, though there is still much room for future improvement. A little above half of Eta's answers were judged to be fully correct, with accuracy rising to 58% when including partially correct answers. We find that the semantic parser itself is very reliable, with 94% of grammatical sentences being parsed correctly.

We observe that historical questions are, in general, far more pragmatically loaded than simple spatial questions, and judgements involve high degrees of subjectivity. A major source of error is in the handling of under-specified historical questions, as described in Section 5.2. There are many nuances to how humans naturally interpret these that are difficult to capture consistently with simple pragmatic rules. For example, Eta will plausibly interpret "What blocks did I move before the Twitter block?" as meaning "What blocks did I move shortly before I moved the Twitter block?" (especially if the move of the Twitter block was very recent); however, if the user instead asks "How many blocks did I move before the Twitter block?", it seems that the questioner probably means "How many blocks did I ever move before I moved the Twitter block?". Currently, Eta would add "recently" for the latter case, which would be incorrect.

In future work, we aim to investigate the pragmatic phenomena discussed in Section 6 in more detail and to improve the pragmatic inference module to handle these cases correctly, as well as carrying out more detailed analyses of other sources of error.

In addition, as questions in the blocks world domain tend to exhibit a fairly simple tense structure, we encountered issues with some of the more complex questions as a result of our simplifying assumption of discrete linear time. In future work, we plan to look into the use of more general temporal reasoning systems such as the tense trees described in [7] to enable the system to handle different aspects and more complex embedded clauses more robustly.

Finally, as described in Section 5.1, our system approximates each object in memory in terms of the position of its centroid with a cubical bounding box around it. Although this approach is justifiable in view of people's generally rather vague and unreliable recall of spatial relations, it can also lead to deviations from human judgments, especially for configurations conceptualized by people in terms of larger shapes. For example, if multiple blocks are arranged in a crescent shape, where that crescent surrounds an additional block nearly but not quite in contact with it, a person would remember that the interior block was not actually touching the nearest block of the crescent, whereas our centroid-based computation might well decide that they were touching. In our continuing work on natural language interaction in the blocks world to allow for teaching and learning larger-scale structural concepts, and also generalizing to a more realistic "room world" (see [18]), we are developing a set of object schemas for objects in the domain, using much the same formalism as for the dialogue schemas described above (but augmented with 3-D prototypes). Dealing with historical questions in such settings will require enrichment of episodic memory representations and of the linguistic and spatial reasoning mechanisms for interacting intelligently with a user.
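The kind of deviation described above can be illustrated with a minimal sketch of a centroid-based contact test. It assumes axis-aligned cubical bounding boxes with the 0.15 m side of the physical blocks; the tolerance value and the specific test (largest per-axis centroid offset compared against the side length plus tolerance) are hypothetical and are not the system's actual computation.

```python
from typing import Tuple

Loc = Tuple[float, float, float]

BLOCK_SIDE = 0.15  # metres, the side length of the physical blocks
TOLERANCE = 0.02   # metres; hypothetical margin for sensor noise


def touching(a: Loc, b: Loc, side: float = BLOCK_SIDE, tol: float = TOLERANCE) -> bool:
    """Approximate contact test on axis-aligned cubes centred on the centroids."""
    # The cubes are judged to be in contact when the largest per-axis
    # offset between centroids does not exceed the side plus the tolerance.
    return max(abs(p - q) for p, q in zip(a, b)) <= side + tol
```

Under these assumptions, two blocks whose centroids are 0.165 m apart along one axis (a 1.5 cm gap between faces) are judged to be touching, which is exactly the "nearly but not quite in contact" case that a person would remember differently.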

We have augmented a spatial question answering system in a physical blocks world system with the ability to answer free-form historical questions using a symbolic dialogue context, keeping track of a record of block moves and other actions. A pattern-driven, compositional semantic parser allows historical questions to be parsed into a logical form, which is then used in conjunction with the historical context model to derive and generate answers. We obtained an accuracy of 58%, which we believe is a reasonable preliminary result given the free-form and often under-specified nature of the historical questions that users asked, though it also leaves much room for improvement. Overall, the pragmatic richness and complexity that we've observed in historical question-answering suggests that further work in this under-studied area is likely to be fruitful.

[1] Actions and Events in Interval Temporal Logic

[2] Learning interpretable spatial operations in a rich 3D blocks world

[3] A planning system for robot construction tasks

[4] TRIPS: An integrated intelligent problem-solving assistant

[5] TRAINS-95: Towards a mixed-initiative planning assistant

[6] Structured event memory: A neuro-symbolic model of event cognition

[7] Interpreting tense, aspect and time adverbials: A compositional, unified approach

[8] CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning

[9] Generating discourse inferences from unscoped episodic logical formulas

[10] Intension, attitude, and tense annotation in a high-fidelity semantic representation

[11] A type-coherent, expressive representation as an initial step to language understanding

[12] CLEVR-Dialog: A diagnostic dataset for multi-round reasoning in visual dialog

[13] CLEVR-Ref+: Diagnosing visual reasoning with referring expressions

[14] Spatial working memory in the LIDA cognitive architecture

[15] The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision

[16] Building and learning structures in a situated blocks world through deep language understanding

[17] A situated dialogue system for learning structural concepts in blocks world

[18] Computational models for spatial prepositions

[19] The LISSA virtual human and ASD teens: An overview of initial experiments

[20] Managing casual spoken dialogue using flexible schemas, pattern transduction trees, and gist clauses

[21] Scene Perception

[22] Deep episodic memory: Encoding, recalling, and predicting episodic experiences for robot action execution

[23] Episodic logic meets little red riding hood: A comprehensive, natural representation for language understanding. Natural Language Processing and Knowledge Representation: Language for Knowledge and Knowledge for Language

[24] Understanding natural language