Whodunnit? Crime Drama as a Case for Natural Language Understanding

Lea Frermann, Shay B. Cohen, Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB
l.frermann@ed.ac.uk, scohen@inf.ed.ac.uk, mlap@inf.ed.ac.uk

Abstract

In this paper we argue that crime drama exemplified in television programs such as CSI: Crime Scene Investigation is an ideal testbed for approximating real-world natural language understanding and the complex inferences associated with it. We propose to treat crime drama as a new inference task, capitalizing on the fact that each episode poses the same basic question (i.e., who committed the crime) and naturally provides the answer when the perpetrator is revealed. We develop a new dataset based on CSI episodes (available at https://github.com/EdinburghNLP/csi-corpus), formalize perpetrator identification as a sequence labeling problem, and develop an LSTM-based model which learns from multi-modal data. Experimental results show that an incremental inference strategy is key to making accurate guesses, as is learning from representations fusing textual, visual, and acoustic input.

1 Introduction

The success of neural networks in a variety of applications (Sutskever et al., 2014; Vinyals et al., 2015) and the creation of large-scale datasets have played a critical role in advancing machine understanding of natural language on its own or together with other modalities. The problem has assumed several guises in the literature, such as reading comprehension (Richardson et al., 2013; Rajpurkar et al., 2016), recognizing textual entailment (Bowman et al., 2015; Rocktäschel et al., 2016), and notably question answering based on text (Hermann et al., 2015; Weston et al., 2015), images (Antol et al., 2015), or video (Tapaswi et al., 2016).

In order to make the problem tractable and amenable to computational modeling, existing approaches study isolated aspects of natural language understanding. For example, it is assumed that understanding is an offline process, and models are expected to digest large amounts of data before being able to answer a question or make inferences. They are typically exposed to non-conversational texts, or to still images when focusing on the visual modality, ignoring the fact that understanding is situated in time and space and involves interactions between speakers. In this work we relax some of these simplifications by advocating a new task for natural language understanding which is multi-modal, exhibits spoken conversation, and is incremental, i.e., unfolds sequentially in time.

Specifically, we argue that crime drama exemplified in television programs such as CSI: Crime Scene Investigation can be used to approximate real-world natural language understanding and the complex inferences associated with it. CSI revolves around a team of forensic investigators trained to solve criminal cases by scouring the crime scene, collecting irrefutable evidence, and finding the missing pieces that solve the mystery. Each episode poses the same "whodunnit" question and naturally provides the answer when the perpetrator is revealed. Speculation about the identity of the perpetrator is an integral part of watching CSI and an incremental process: viewers revise their hypotheses based on new evidence gathered around the suspect/s or on new inferences which they make as the episode evolves.
We formalize the task of identifying the perpetrator in a crime series as a sequence labeling problem. Like humans watching an episode, we assume the model is presented with a sequence of inputs comprising information from different modalities such as text, video, or audio (see Section 4 for details). The model predicts for each input whether the perpetrator is mentioned or not. Our formulation generalizes over episodes and crime series: it is not specific to the identity or number of persons committing the crime, nor to the type of police drama under consideration. Advantageously, it is incremental: we can track model predictions from the beginning of the episode and examine the model's behavior, e.g., how often it changes its mind, whether it is consistent in its predictions, and when the perpetrator is identified.

We develop a new dataset based on 39 CSI episodes which contains gold-standard perpetrator mentions as well as viewers' guesses about the perpetrator while each episode unfolds. The sequential nature of the inference task lends itself naturally to recurrent network modeling. We adopt a generic architecture which combines a unidirectional long short-term memory network (Hochreiter and Schmidhuber, 1997) with a softmax output layer over binary labels indicating whether the perpetrator is mentioned. Based on this architecture, we investigate the following questions:

1. What type of knowledge is necessary for performing the perpetrator inference task? Is the textual modality sufficient or do other modalities (i.e., visual and auditory input) also play a role?

2. What type of inference strategy is appropriate? In other words, does access to past information matter for making accurate inferences?

3. To what extent does model behavior simulate humans? Does performance improve over time and how much of an episode does the model need to process in order to make accurate guesses?

Experimental results on our new dataset reveal that multi-modal representations are essential for the task at hand, boding well for real-world natural language understanding. We also show that an incremental inference strategy is key to guessing the perpetrator accurately, although the model tends to be less consistent compared to humans. In the remainder, we first discuss related work (Section 2), then present our dataset (Section 3) and formalize the modeling problem (Section 4). We describe our experiments in Section 5.

2 Related Work

Our research has connections to several lines of work in natural language processing, computer vision, and more generally multi-modal learning. We review related literature in these areas below.

Language Grounding. Recent years have seen increased interest in the problem of grounding language in the physical world. Various semantic space models have been proposed which learn the meaning of words based on linguistic and visual or acoustic input (Bruni et al., 2014; Silberer et al., 2016; Lazaridou et al., 2015; Kiela and Bottou, 2014).
A variety of cross-modal methods which fuse techniques from image and text processing have also been applied to the tasks of generating image descriptions and retrieving images given a natural language query (Vinyals et al., 2015; Xu et al., 2015; Karpathy and Fei-Fei, 2015). Another strand of research focuses on how to explicitly encode the underlying semantics of images, making use of structural representations (Ortiz et al., 2015; Elliott and Keller, 2013; Yatskar et al., 2016; Johnson et al., 2015). Our work shares the common goal of grounding language in additional modalities. Our model, however, is not static; it learns representations which evolve over time.

Video Understanding. Work on video understanding has assumed several guises, such as generating descriptions for video clips (Venugopalan et al., 2015a; Venugopalan et al., 2015b), retrieving video clips with natural language queries (Lin et al., 2014), learning actions in video (Bojanowski et al., 2013), and tracking characters (Sivic et al., 2009). Movies have also been aligned to screenplays (Cour et al., 2008), plot synopses (Tapaswi et al., 2015), and books (Zhu et al., 2015) with the aim of improving scene prediction and semantic browsing. Other work uses low-level features (e.g., based on face detection) to establish social networks of main characters in order to summarize movies or perform genre classification (Rasheed et al., 2005; Sang and Xu, 2010; Dimitrova et al., 2000). Although visual features are used mostly in isolation, in some cases they are combined with audio in order to perform video segmentation (Boreczky and Wilcox, 1998) or semantic movie indexing (Naphide and Huang, 2001).

Peter Berglund: You're still going to have to convince a jury that I killed two strangers for no reason.
(Grissom doesn't look worried. He takes his gloves off and puts them on the table.)
Grissom: You ever been to the theater Peter? There's a play called six degrees of separation. It's about how all the people in the world are connected to each other by no more than six people. All it takes to connect you to the victims is one degree.
(Camera holds on Peter Berglund's worried look.)

Figure 1: Excerpt from a CSI script (Episode 03, Season 03: "Let the Seller Beware"). Speakers are shown in bold, spoken dialog in normal font, and scene descriptions in italics. Gold-standard entity mention annotations are in color: perpetrator mentions (e.g., Peter Berglund) are in green, while words referring to other entities are in red.

A few datasets have been released recently which include movies and textual data. MovieQA (Tapaswi et al., 2016) is a large-scale dataset which contains 408 movies and 14,944 questions, each accompanied by five candidate answers, one of which is correct. For some movies, the dataset also contains subtitles, video clips, scripts, plots, and text from the Described Video Service (DVS), a narration service for the visually impaired. MovieDescription (Rohrbach et al., 2017) is a related dataset which contains sentences aligned to video clips from 200 movies. Scriptbase (Gorinski and Lapata, 2015) is another movie database which consists of movie screenplays (without video) and has been used to generate script summaries.
In contrast to the story comprehension tasks envisaged in MovieQA and MovieDescription, we focus on a single cinematic genre (i.e., crime series), and have access to entire episodes (and their corresponding screenplays) as opposed to video clips or DVSs for some of the data. Rather than answering multiple factoid questions, we aim to solve a single problem, albeit one that is inherently challenging to both humans and machines.

Question Answering. A variety of question answering tasks (and datasets) have risen in popularity in recent years. Examples include reading comprehension, i.e., reading text and answering questions about it (Richardson et al., 2013; Rajpurkar et al., 2016), open-domain question answering, i.e., finding the answer to a question from a large collection of documents (Voorhees and Tice, 2000; Yang et al., 2015), and cloze question completion, i.e., predicting a blanked-out word of a sentence (Hill et al., 2015; Hermann et al., 2015). Visual question answering (VQA; Antol et al. (2015)) is another related task where the aim is to provide a natural language answer to a question about an image.

Our inference task can be viewed as a form of question answering over multi-modal data, focusing on one type of question. Compared to previous work on machine reading or visual question answering, we are interested in the temporal characteristics of the inference process, and study how understanding evolves incrementally with the contribution of various modalities (text, audio, video). Importantly, our formulation of the inference task as a sequence labeling problem departs from conventional question answering, allowing us to study how humans and models alike make decisions over time.

3 The CSI Dataset

In this work, we make use of episodes of the U.S. TV show "Crime Scene Investigation Las Vegas" (henceforth CSI), one of the most successful crime series ever made. Fifteen seasons with a total of 337 episodes were produced over the course of fifteen years. CSI is a procedural crime series: it follows a team of investigators employed by the Las Vegas Police Department as they collect and evaluate evidence to solve murders, combining forensic police work with the investigation of suspects.

episodes with one case        19
episodes with two cases       20
total number of cases         59

per case                      min    max    avg
sentences                     228   1209    689
sentences with perpetrator      0    267     89
scene descriptions             64    538    245
spoken utterances             144    778    444
characters                      8     38     20

type of crime
murder                         51
accident                        4
suicide                         2
other                           2

Table 1: Statistics on the CSI dataset. The type of crime was identified by our annotators via a multiple-choice questionnaire (which included the option "other"). Note that accidents may also involve perpetrators.

We paired official CSI videos (from seasons 1–5) with screenplays which we downloaded from a website hosting TV show transcripts (http://transcripts.foreverdreaming.org/). Our dataset comprises 39 CSI episodes, each approximately 43 minutes long. Episodes follow a regular plot: they begin with the display of a crime (typically without revealing the perpetrator) or a crime scene. A team of five recurring police investigators attempt to reconstruct the crime and find the perpetrator. During the investigation, multiple (innocent) suspects emerge, while the crime is often committed by a single person, who is eventually identified and convicted. Some CSI episodes may feature two or more unrelated cases. At the beginning of the episode the CSI team is split and each investigator is assigned a single case.
The episode then alternates between scenes covering each case, and the stories typically do not overlap.

Figure 1 displays a small excerpt from a CSI screenplay. Readers unfamiliar with script writing conventions should note that scripts typically consist of scenes, which have headings indicating where the scene is shot (e.g., inside someone's house). Character cues preface the lines the actors speak (see boldface in Figure 1), and scene descriptions explain what the camera sees (see second and fifth panel in Figure 1).

Screenplays were further synchronized with the video using closed captions, which are time-stamped and provided in the form of subtitles as part of the video data. The alignment between screenplay and closed captions is non-trivial, since the latter only contain dialogue, omitting speaker information and scene descriptions. We first used dynamic time warping (DTW; Myers and Rabiner (1981)) to approximately align closed captions with the dialogue in the scripts, and then heuristically time-stamped the remaining elements of the screenplay (e.g., scene descriptions), allocating them to time spans between spoken utterances. Table 1 shows some descriptive statistics on our dataset, featuring the number of cases per episode, its length (in terms of number of sentences), and the type of crime, among other information.
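For concreteness, the DTW-based caption-to-script alignment can be sketched as follows. This is a minimal Python illustration using a token-overlap cost between caption lines and script dialogue lines; the actual cost function and post-processing used to build the corpus are not specified here, so the function names and the cost are illustrative only.

```python
# Rough sketch of DTW alignment between time-stamped caption lines and
# screenplay dialogue lines, using a Jaccard-style token-overlap cost.
import numpy as np

def cost(caption, script_line):
    a, b = set(caption.lower().split()), set(script_line.lower().split())
    return 1.0 - len(a & b) / max(len(a | b), 1)   # 0 = identical, 1 = disjoint

def dtw_align(captions, script_lines):
    n, m = len(captions), len(script_lines)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost(captions[i - 1], script_lines[j - 1]) + min(
                D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # backtrace to recover the warping path (pairs of aligned indices)
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)], key=lambda t: D[t])
    return list(reversed(path))

captions = ["you ever been to the theater peter",
            "it's about how people are connected"]
script = ["You ever been to the theater Peter?",
          "It's about how all the people in the world are connected to each other"]
print(dtw_align(captions, script))   # [(0, 0), (1, 1)]
```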
The data was further annotated with two goals in mind. Firstly, in order to capture the characteristics of the human inference process, we recorded how participants incrementally update their beliefs about the perpetrator. Secondly, we collected gold-standard labels indicating whether the perpetrator is mentioned. Specifically, while a participant watches an episode, we record their guesses about who the perpetrator is (Section 3.1). Once the episode is finished and the perpetrator is revealed, the same participant annotates entities in the screenplay referring to the true perpetrator (Section 3.2).

3.1 Eliciting Behavioral Data

All annotations were collected through a web interface. We recruited three annotators, all postgraduate students and proficient in English, none of them regular CSI viewers. We obtained annotations for 39 episodes (comprising 59 cases).

A snapshot of the annotation interface is presented in Figure 2. The top of the interface provides a short description of the episode in the form of a one-sentence summary (carefully designed not to give away any clues about the perpetrator). Summaries were adapted from the CSI season summaries available in Wikipedia (see, e.g., https://en.wikipedia.org/wiki/CSI:_Crime_Scene_Investigation_(season_1)). The annotator watches the episode (i.e., the video without closed captions) as a sequence of three-minute intervals. Every three minutes, the video halts, and the annotator is presented with the screenplay corresponding to the part of the episode they have just watched. While reading through the screenplay, they must indicate for every sentence whether they believe the perpetrator is mentioned. This way, we are able to monitor how humans create and discard hypotheses about perpetrators incrementally. As mentioned earlier, some episodes may feature more than one case. Annotators signal for each sentence which case it belongs to or whether it is irrelevant (see the radio buttons in Figure 2). In order to obtain a more fine-grained picture of the human guesses, annotators are additionally asked to press a large red button (below the video screen) as soon as they "think they know who the perpetrator is", i.e., at any time while they are watching the video. They are allowed to press the button multiple times throughout the episode in case they change their mind.

Number of cases: 2
Case 1: Grissom, Catherine, Nick and Warrick investigate when a wealthy couple is murdered at their house.
Case 2: Meanwhile Sara is sent to a local high school where a cheerleader was found eviscerated on the football field.

Screenplay | Perpetrator mentioned? | Relates to case 1/2/none?
(Nick cuts the canopy around MONICA NEWMAN.)
Nick: okay, Warrick, hit it
(WARRICK starts the crane support under the awning to remove the body and the canopy area that NICK cut.)
Nick: white female, multiple bruising ... bullet hole to the temple doesn't help
Nick: .380 auto on the side
Warrick: yeah, somebody manhandled her pretty good before they killed her

Figure 2: Annotation interface (first pass): after watching three minutes of the episode, the annotator indicates whether she believes the perpetrator has been mentioned.

Even though the annotation task just described reflects individual rather than gold-standard behavior, we report inter-annotator agreement (IAA) as a means of estimating variance amongst participants. We computed IAA using Cohen's (1960) Kappa based on three episodes annotated by two participants. Overall agreement on this task (second column in Figure 2) is 0.74. We also measured percent agreement on the minority class (i.e., sentences tagged as "perpetrator mentioned") and found it to be reasonably good at 0.62, indicating that despite individual differences, the process of guessing the perpetrator is broadly comparable across participants. Finally, annotators had no trouble distinguishing which utterances refer to which case (when the episode revolves around several), achieving an IAA of κ = 0.96.

3.2 Gold Standard Mention Annotation

After watching the entire episode, the annotator reads through the screenplay for a second time and tags entity mentions, now knowing the perpetrator. Each word in the script has three radio buttons attached to it, and the annotator selects one only if a word refers to a perpetrator, a suspect, or a character who falls into neither of these classes (e.g., a police investigator or a victim). For the majority of words, no button will be selected. A snapshot of our interface for this second layer of annotations is shown in Figure 3.

(It's a shell casing.)                                Perpetrator | Suspect | Other
GRISSOM moves his light to the canopy below           Perpetrator | Suspect | Other

Figure 3: Annotation interface (second pass): after watching the episode, the annotator indicates for each word whether it refers to the perpetrator.

To ensure consistency, annotators were given detailed guidelines about what constitutes an entity. Examples include proper names and their titles (e.g., Mr Collins, Sgt. O'Reilly), pronouns (e.g., he, we), and other referring expressions including nominal mentions (e.g., let's arrest the guy with the black hat).

Inter-annotator agreement based on three episodes and two annotators was κ = 0.90 on the perpetrator class and κ = 0.89 on other entity annotations (grouping together suspects with other entities). Percent agreement was 0.824 for perpetrators and 0.823 for other entities. The high agreement indicates that the task is well-defined and the elicited annotations reliable.
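The agreement figures above can, in principle, be reproduced with standard tooling. The sketch below assumes per-sentence binary labels from two annotators; the restriction used for "percent agreement on the minority class" is one plausible reading rather than a documented procedure, and the toy labels are purely illustrative.

```python
# Sketch of the agreement computation: Cohen's kappa over per-sentence binary
# labels from two annotators, plus percent agreement restricted to sentences
# that either annotator tagged as positive (assumed reading of the metric).
from sklearn.metrics import cohen_kappa_score

ann1 = [0, 0, 1, 1, 0, 1, 0, 0]   # toy labels: 1 = "perpetrator mentioned"
ann2 = [0, 0, 1, 0, 0, 1, 0, 1]

kappa = cohen_kappa_score(ann1, ann2)

positive = [i for i, (a, b) in enumerate(zip(ann1, ann2)) if a == 1 or b == 1]
minority_agreement = sum(ann1[i] == ann2[i] for i in positive) / len(positive)

print(round(kappa, 2), round(minority_agreement, 2))   # 0.47 0.5
```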
After the second pass, various entities in the script are disambiguated in terms of whether they refer to the perpetrator or to other individuals. Note that in this work we do not use the token-level gold-standard annotations directly. Our model is trained on sentence-level annotations, which we obtain from the token-level annotations under the assumption that a sentence mentions the perpetrator if it contains a token that does.

4 Model Description

We formalize the problem of identifying the perpetrator in a crime series episode as a sequence labeling task. Like humans watching an episode, our model is presented with a sequence of (possibly multi-modal) inputs, each corresponding to a sentence in the script, and assigns a label l indicating whether the perpetrator is mentioned in the sentence (l = 1) or not (l = 0). The model is fully incremental: each labeling decision is based solely on information derived from previously seen inputs.

We could have formalized our inference task as a multi-class classification problem where labels correspond to characters in the script. Although perhaps more intuitive, the multi-class framework results in an output label space which differs for each episode and renders comparison of model performance across episodes problematic. In contrast, our formulation has the advantage of being directly applicable to any episode or indeed any crime series.

A sketch of our inference task is shown in Figure 4. The core of our model (see Figure 5) is a unidirectional long short-term memory network (LSTM; Hochreiter and Schmidhuber (1997); Zaremba et al. (2014)). LSTM cells are a variant of recurrent neural networks with a more complex computational unit, which have emerged as a popular architecture due to their representational power and effectiveness at capturing long-term dependencies. LSTMs provide ways to selectively store and forget aspects of previously seen inputs, and as a consequence can memorize information over longer time periods. Through input, output, and forget gates, they can flexibly regulate the extent to which inputs are stored, used, and forgotten.

Figure 4: Overview of the perpetrator prediction task. The model receives input in the form of text, images, and audio. Each modality is mapped to a feature representation. Feature representations are fused and passed to an LSTM which predicts whether a perpetrator is mentioned (label l = 1) or not (l = 0).

Figure 5: Illustration of the input/output structure of our LSTM model for two time steps.

The LSTM processes a sequence of (possibly multi-modal) inputs s = \{x^h_1, x^h_2, \ldots, x^h_N\}. It utilizes a memory slot c_t and a hidden state h_t which are incrementally updated at each time step t. Given input x_t, the previous latent state h_{t-1}, and the previous memory state c_{t-1}, the latent state h_t and the updated memory state c_t for time t are computed as follows:

\begin{bmatrix} i_t \\ f_t \\ o_t \\ \hat{c}_t \end{bmatrix} =
\begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix}
W \begin{bmatrix} h_{t-1} \\ x_t \end{bmatrix}

c_t = f_t \odot c_{t-1} + i_t \odot \hat{c}_t

h_t = o_t \odot \tanh(c_t)

The weight matrix W is estimated during training, and i, o, and f are the input, output, and forget gates. As mentioned earlier, the input to our model consists of a sequence of sentences, either spoken utterances or scene descriptions (we do not use speaker information). We further augment the textual input with multi-modal information obtained from the alignment of screenplays to video (see Section 3).
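For readers who prefer code to equations, the cell updates above translate directly into the following minimal NumPy sketch. The dimensions are toy values chosen to match Section 5.1, and, like the displayed equations, the sketch omits bias terms; it is an illustration, not the training implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM step following the equations above.
    W has shape (4 * d_h, d_h + d_x): it maps [h_{t-1}; x_t] to the stacked
    pre-activations of the input, forget, output, and candidate gates."""
    d_h = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t])     # stacked pre-activations
    i = sigmoid(z[0 * d_h:1 * d_h])           # input gate
    f = sigmoid(z[1 * d_h:2 * d_h])           # forget gate
    o = sigmoid(z[2 * d_h:3 * d_h])           # output gate
    c_hat = np.tanh(z[3 * d_h:4 * d_h])       # candidate memory
    c = f * c_prev + i * c_hat                # updated memory state
    h = o * np.tanh(c)                        # updated hidden state
    return h, c

# toy dimensions: 300-d fused input, 128-d hidden state (cf. Section 5.1)
rng = np.random.default_rng(0)
d_x, d_h = 300, 128
W = rng.normal(scale=0.1, size=(4 * d_h, d_h + d_x))
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_x), h, c, W)
```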
Textual modality. Words in each sentence are mapped to 50-dimensional GloVe embeddings, pre-trained on Wikipedia and Gigaword (Pennington et al., 2014). Word embeddings are subsequently concatenated and padded to the maximum sentence length observed in our dataset in order to obtain fixed-length input vectors. The resulting vector is passed through a convolutional layer with max-pooling to obtain a sentence-level representation x^s. Word embeddings are fine-tuned during training.

Visual modality. We obtain the video corresponding to the time span covered by each sentence and sample one frame per sentence from the center of the associated period. (We also experimented with multiple frames per sentence but did not observe any improvement in performance.) We then map each frame to a 1,536-dimensional visual feature vector x^v using the final hidden layer of a pre-trained convolutional network which was optimized for object classification (Inception-v4; Szegedy et al. (2016)).

Acoustic modality. For each sentence, we extract the audio track from the video, which includes all sounds and background music but no spoken dialog. We then obtain Mel-frequency cepstral coefficient (MFCC) features from the continuous signal. MFCC features were originally developed in the context of speech recognition (Davis and Mermelstein, 1990; Sahidullah and Saha, 2012), but have also been shown to work well for more general sound classification (Chachada and Kuo, 2014). We extract a 13-dimensional MFCC feature vector for every five milliseconds of video. For each input sentence, we sample five MFCC feature vectors from its associated time interval and concatenate them in chronological order into the acoustic input x^a. (Preliminary experiments showed that concatenation outperforms averaging or relying on a single feature vector.)

Modality fusion. Our model learns to fuse multi-modal input as part of its overall architecture. We use a general method to obtain any combination of input modalities (i.e., not necessarily all three). Single-modality inputs are concatenated into an m-dimensional vector (where m is the sum of the dimensionalities of all input modalities). We then multiply this vector with a weight matrix W^h of dimension m × n, add an n-dimensional bias b^h, and pass the result through a rectified linear unit (ReLU):

x^h = ReLU([x^s; x^v; x^a] W^h + b^h)

The resulting multi-modal representation x^h is of dimension n and is passed to the LSTM (see Figure 5).
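To illustrate how the pieces fit together, the PyTorch sketch below combines a convolutional sentence encoder, modality fusion, and a unidirectional LSTM tagger along the lines described above. Layer sizes follow Section 5.1; the class and variable names are ours, and details such as initialization from GloVe and padding are omitted, so this is a sketch of the architecture as we read it, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerpetratorTagger(nn.Module):
    """Sketch: conv sentence encoder over word embeddings, late fusion of
    textual/visual/acoustic features, unidirectional LSTM, per-sentence label."""

    def __init__(self, vocab_size, emb_dim=50, n_filters=75,
                 filter_sizes=(3, 4, 5), visual_dim=1536, acoustic_dim=65,
                 fused_dim=300, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # init from GloVe in practice
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, k) for k in filter_sizes])
        text_dim = n_filters * len(filter_sizes)          # 225 as in Section 5.1
        self.fuse = nn.Linear(text_dim + visual_dim + acoustic_dim, fused_dim)
        self.lstm = nn.LSTM(fused_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 2)               # perpetrator mentioned or not

    def encode_sentences(self, tokens):                   # tokens: (batch, seq, words)
        b, s, w = tokens.shape
        e = self.embed(tokens.view(b * s, w)).transpose(1, 2)   # (b*s, emb, words)
        pooled = [F.relu(conv(e)).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1).view(b, s, -1)    # (batch, seq, 225)

    def forward(self, tokens, visual, acoustic):
        x_s = self.encode_sentences(tokens)
        x_h = F.relu(self.fuse(torch.cat([x_s, visual, acoustic], dim=-1)))
        h, _ = self.lstm(x_h)      # unidirectional: h_t depends only on x_1..x_t
        return self.out(h)         # per-sentence logits over the two labels
```

The acoustic dimension of 65 assumes the five concatenated 13-dimensional MFCC vectors described above; the unidirectional LSTM is what makes the labeling incremental, since each prediction conditions only on the inputs seen so far.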
5 Evaluation

In our experiments we investigate what type of knowledge and strategy are necessary for identifying the perpetrator in a CSI episode. In order to shed light on the former question, we compare variants of our model with access to information from different modalities. We examine different inference strategies by comparing the LSTM to three baselines. The first one lacks the ability to flexibly fuse multi-modal information (a CRF), while the second one does not have a notion of history, classifying inputs independently (a multilayer perceptron). Our third baseline is a rule-based system that neither uses multi-modal inputs nor has a notion of history. We also compare the LSTM to humans watching CSI. Before we report our results, we describe our setup and comparison models in more detail.

5.1 Experimental Settings

Our CSI data consists of 39 episodes giving rise to 59 cases (see Table 1). The model was trained on 53 cases using cross-validation (five splits with 47/6 training/test cases). The remaining 6 cases were used as truly held-out test data for final evaluation.

We trained our model using the Adam stochastic gradient descent optimizer with mini-batches of six episodes. Weights were initialized randomly, except for word embeddings, which were initialized with pre-trained 50-dimensional GloVe vectors (Pennington et al., 2014) and fine-tuned during training. We trained our networks for 100 epochs and report the best result obtained during training. All results are averages of five runs of the network. Parameters were optimized using two cross-validation splits. The sentence convolution layer has three filters of sizes 3, 4, and 5, each of which returns a 75-dimensional output after convolution. The final sentence representation x^s is obtained by concatenating the output of the three filters and is of dimension 225. We set the size of the hidden representation of the merged cross-modal inputs x^h to 300. The LSTM has one layer with 128 nodes. We set the learning rate to 0.001 and apply dropout with probability 0.5.

We compared model output against the gold standard of perpetrator mentions which we collected as part of our annotation effort (second pass).

5.2 Model Comparison

CRF. Conditional Random Fields (Lafferty et al., 2001) are probabilistic graphical models for sequence labeling. The comparison allows us to examine whether the LSTM's use of long-term memory and (non-linear) feature integration is beneficial for sequence prediction. We experimented with a variety of features for the CRF, and obtained best results when the input sentence is represented by concatenated word embeddings.

MLP. We also compared the LSTM against a multi-layer perceptron with two hidden layers and a softmax output layer. We replaced the LSTM in our overall network structure with the MLP, keeping the methodology for sentence convolution and modality fusion and all associated parameters fixed to the values described in Section 5.1. The hidden layers of the MLP have ReLU activations and a layer size of 128, as in the LSTM. We set the learning rate to 0.0001. The MLP makes independent predictions for each element in the sequence. This comparison sheds light on the importance of sequential information for the perpetrator identification task. All results are best checkpoints over 100 training epochs, averaged over five runs.

Model   T V A   Cross-val              Held-out
                pr    re    f1         pr    re    f1
PRO     + – –   19.3  76.3  31.6       19.5  77.2  31.1
CRF     + – –   33.1  15.4  20.5       30.2  16.1  21.0
MLP     + – –   36.7  32.5  33.7       35.9  36.8  36.3
        + + –   37.4  35.1  35.1       38.0  41.0  39.3
        + – +   39.6  34.2  35.7       38.7  36.5  37.5
        + + +   38.4  34.6  35.7       38.5  42.3  40.2
LSTM    + – –   39.2  45.7  41.3       36.9  50.4  42.3
        + + –   39.9  48.3  43.1       40.9  54.9  46.8
        + – +   39.2  52.0  44.0       36.8  56.3  44.5
        + + +   40.6  49.7  44.1       42.8  51.2  46.6
Humans          74.1  49.4  58.2       76.3  60.2  67.3

Table 2: Precision (pr), recall (re), and F1 for detecting the minority class (perpetrator mentioned) for humans (bottom) and various systems. We report results with cross-validation (center) and on a held-out dataset (right) using the textual (T), visual (V), and auditory (A) modalities.

PRO. Aside from the supervised models described so far, we developed a simple rule-based system which does not require access to labeled data. The system defaults to the perpetrator class for any sentence containing a personal (e.g., you), possessive (e.g., mine), or reflexive pronoun (e.g., ourselves). In other words, it assumes that every pronoun refers to the perpetrator. Pronoun mentions were identified using string matching and a precompiled list of 31 pronouns. This system cannot incorporate any acoustic or visual data.
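A minimal sketch of the PRO baseline is shown below. The precompiled list of 31 pronouns is not reproduced in the paper, so the pronoun set here is only an approximation, and the simple regex tokenization is our choice.

```python
import re

# Illustrative (approximate) pronoun list standing in for the precompiled
# 31-pronoun list used by the PRO baseline.
PRONOUNS = {
    "i", "you", "he", "she", "it", "we", "they", "me", "him", "her", "us", "them",
    "mine", "yours", "his", "hers", "ours", "theirs",
    "myself", "yourself", "himself", "herself", "itself",
    "ourselves", "yourselves", "themselves",
}

def pro_baseline(sentence):
    """Label a sentence 1 ('perpetrator mentioned') if it contains any pronoun."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return int(any(tok in PRONOUNS for tok in tokens))

print(pro_baseline("You're still going to have to convince a jury"))   # 1
print(pro_baseline("Grissom doesn't look worried."))                   # 0
```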
Human Upper Bound. Finally, we compared model performance against humans. In our annotation task (Section 3.1), participants annotate sentences incrementally, while watching an episode for the first time. The annotations express their belief as to whether the perpetrator is mentioned. We evaluate these first-pass guesses against the gold standard (obtained in the second-pass annotation).

Figure 6: Precision in the final 10% of an episode, for 30 test episodes from five cross-validation splits. We show scores per episode and global averages (horizontal bars) for the LSTM and for humans. Episodes are ordered by increasing model precision.

5.3 Which Model Is the Best Detective?

We report precision, recall, and F1 on the minority class, focusing on how accurately the models identify perpetrator mentions. Table 2 summarizes our results, averaged across five cross-validation splits (left), and on the truly held-out test episodes (right).

Overall, we observe that humans outperform all comparison models. In particular, human precision is superior, whereas recall is comparable, with the exception of PRO, which has high recall (at the expense of precision) since it assumes that all pronouns refer to perpetrators. We analyze the differences between model and human behavior in more detail in Section 5.5. With regard to the LSTM, both the visual and acoustic modalities bring improvements over the textual modality alone; however, their contribution appears to be complementary. We also experimented with acoustic and visual features on their own, but without high-level textual information the LSTM converges towards predicting only the majority class. Results on the held-out test set reveal that our model generalizes well to unseen episodes, despite being trained on a relatively small data sample compared to standards in deep learning.

The LSTM consistently outperforms the non-incremental MLP. This shows that the ability to utilize information from previous inputs is essential for this task. This is intuitively plausible: in order to identify the perpetrator, viewers must be aware of the plot's development and make inferences while the episode evolves. The CRF is outperformed by all other systems, including rule-based PRO. In contrast to the MLP and PRO, the CRF utilizes sequential information, but it cannot flexibly fuse information from different modalities or exploit non-linear mappings like the neural models. The only type of input which enabled the CRF to predict perpetrator mentions was concatenated word embeddings (see Table 2). We also trained CRFs on audio or visual features together with word embeddings, but these models converged to predicting only the majority class. This suggests that CRFs do not have the capacity to model long complex sequences and draw meaningful inferences based on them. PRO achieves a reasonable F1 score, but only because it trades very low precision for high recall. The precision-recall tradeoff is much more balanced for the neural systems.
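The minority-class scores reported in Table 2 correspond to the following computation; this is a sketch with toy labels, and averaging over cross-validation splits and runs is omitted.

```python
# Precision, recall, and F1 on the minority class
# (label 1 = "perpetrator mentioned"), over all test sentences of a split.
from sklearn.metrics import precision_recall_fscore_support

def minority_class_scores(gold, predicted):
    p, r, f, _ = precision_recall_fscore_support(
        gold, predicted, labels=[1], average=None)
    return p[0], r[0], f[0]

gold = [0, 0, 1, 1, 0, 1]
pred = [0, 1, 1, 0, 0, 1]
print(minority_class_scores(gold, pred))   # (0.667, 0.667, 0.667) up to rounding
```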
5.4 Can the Model Identify the Perpetrator?

In this section we assess more directly how the LSTM compares against humans when asked to identify the perpetrator by the end of a CSI episode. Specifically, we measure precision in the final 10% of an episode, and compare human performance (first-pass guesses) against an LSTM model which uses all three modalities. Figure 6 shows precision results for 30 test episodes (across five cross-validation splits) and average precision as horizontal bars. Perhaps unsurprisingly, human performance is superior; however, the model achieves an average precision of 60%, which is encouraging (compared to 85% achieved by humans). Our results also show a moderate correlation between the model and humans: episodes which are difficult for the LSTM (see the left side of the plot in Figure 6) also result in lower human precision. Two episodes on the very left of the plot have 0% precision and are special cases: the first one revolves around a suicide, which is not strictly speaking a crime, while the second one does not mention the perpetrator in the final 10%.

Figure 7: Human and LSTM behavior over the course of two episodes (left: Episode 12, Season 03, "Got Murder?"; right: Episode 19, Season 03, "A Night at the Movies"). Top plots show cumulative F1; true positives (tp) are shown cumulatively (center) and as individual counts for each interval (bottom). Statistics relating to gold perpetrator mentions are shown in black. Red vertical bars show when humans press the red button to indicate that they (think they) have identified the perpetrator.

5.5 How Is the Model Guessing?

We next analyze how the model's guessing ability compares to humans'. Figure 7 tracks model behavior over the course of two episodes, across 100 equally sized intervals. We show the cumulative development of F1 (top plot), cumulative true positive counts (center plot), and true positive counts within each interval (bottom plot). Red bars indicate times at which annotators pressed the red button.

Figure 7 (right) shows that humans may outperform the LSTM in precision (but not necessarily in recall). Humans are more cautious at guessing the perpetrator: the first human guess appears around sentence 300 (see the leftmost red vertical bars in Figure 7, right), the first model guess around sentence 190, and the first true mention around sentence 30. Once humans guess the perpetrator, however, they are very precise and consistent. Interestingly, model guesses at the start of the episode closely follow the pattern of gold perpetrator mentions (bottom plots in Figure 7). This indicates that early model guesses are not noise, but meaningful predictions.

Further analysis of human responses is illustrated in Figure 8. For each of our three annotators we plot the points in each episode where they press the red button to indicate that they know the perpetrator (bottom). We also show the number of times (all three) annotators pressed the red button, individually for each interval and cumulatively over the course of the episode.
Our analysis reveals that viewers tend to press the red button more towards the end, which is not unexpected since episodes are inherently designed to obfuscate the identification of the perpetrator. Moreover, Figure 8 suggests that there are two types of viewers: eager viewers who, like our model, guess early on, change their mind often, and therefore press the red button frequently (annotator 1 pressed the red button 6.1 times on average per episode), and conservative viewers who guess only late and press the red button less frequently (on average annotator 2 pressed the red button 2.9 times per episode, and annotator 3, 3.7 times). Notice that the statistics in Figure 8 are averages across the several episodes each annotator watched, and thus viewer behavior is unlikely to be an artifact of individual episodes (e.g., featuring more or fewer suspects).

Figure 8: Number of times the red button is pressed by each annotator individually (bottom) and by all three within each time interval and cumulatively (top). Times are normalized with respect to episode length. Statistics are averaged across 18/12/9 cases per annotator 1/2/3.

Table 3 provides further evidence that the LSTM behaves more like an eager viewer. It presents the time in the episode (by sentence count) where the model correctly identifies the perpetrator for the first time. As can be seen, the minimum and average identification times are lower for the LSTM compared to human viewers.

First correct perpetrator prediction
          min    max    avg
LSTM        2    554    141
Human      12   1014    423

Table 3: Sentence ID in the script where the LSTM and humans predict the true perpetrator for the first time. We show the earliest (min), latest (max), and average (avg) prediction time over 30 test episodes (five cross-validation splits).

Table 4 shows model predictions on two CSI screenplay excerpts. We illustrate the degree of the model's belief in a perpetrator being mentioned by color intensity. True perpetrator mentions are highlighted in blue. In the first example, the model mostly identifies perpetrator mentions correctly. In the second example, it identifies seemingly plausible sentences which, however, refer to a suspect and not the true perpetrator.

Episode 03 (Season 03): "Let the Seller Beware"
Grissom pulls out a small evidence bag with the filling
He puts it on the table
Tooth filling 0857 10-7-02
Brass: We also found your fingerprints and your hair
Peter B.: Look I'm sure you'll find me all over the house
Peter B.: I wanted to buy it
Peter B.: I was everywhere
Brass: well you made sure you were everywhere too didn't you?

Episode 21 (Season 05): "Committed"
Grissom: What's so amusing?
Adam Trent: So let's say you find out who did it and maybe it's me.
Adam Trent: What are you going to do?
Adam Trent: Are you going to convict me of murder and put me in a bad place?
Adam smirks and starts biting his nails.
Grissom: Is it you?
Adam Trent: Check the files sir.
Adam Trent: I'm a rapist not a murderer.

Table 4: Excerpts of CSI episodes together with model predictions. Model confidence (p(l = 1)) is illustrated in red, with darker shades corresponding to higher confidence. True perpetrator mentions are highlighted in blue. Top: a conversation involving the true perpetrator. Bottom: a conversation with a suspect who is not the perpetrator.

5.6 What if There Is No Perpetrator?
In our experiments, we trained our model on CSI episodes which typically involve a crime, committed by a perpetrator who is ultimately identified. How does the LSTM generalize to episodes without a crime, e.g., because the "victim" turns out to have committed suicide? To investigate how the model and humans alike respond to atypical input, we present both with an episode featuring a suicide, i.e., an episode which did not have any true positive perpetrator mentions.

Figure 9: Cumulative counts of false positives (fp) for the LSTM and a human viewer for an episode with no perpetrator (the victim committed suicide). Red vertical bars show the times at which the viewer pressed the red button indicating that they (think they) have identified the perpetrator.

Figure 9 tracks the incremental behavior of a human viewer and the model while watching the suicide episode. Both are primed by their experience with CSI episodes to identify characters in the plot as potential perpetrators, and consequently predict false positive perpetrator mentions. The human realizes after roughly two thirds of the episode that there is no perpetrator involved (he does not annotate any subsequent sentences as "perpetrator mentioned"), whereas the LSTM continues to make perpetrator predictions until the end of the episode. The LSTM's behavior is presumably an artifact of the recurring pattern of discussing the perpetrator at the very end of an episode.

6 Conclusions

In this paper we argued that crime drama is an ideal testbed for models of natural language understanding and their ability to draw inferences from complex, multi-modal data. The inference task is well-defined and relatively constrained: every episode poses and answers the same "whodunnit" question. We have formalized perpetrator identification as a sequence labeling problem and developed an LSTM-based model which learns incrementally from complex naturalistic data. We showed that multi-modal input is essential for our task, as is an incremental inference strategy with flexible access to previously observed information. Compared to our model, humans guess cautiously in the beginning, but are consistent in their predictions once they have a strong suspicion. The LSTM starts guessing earlier, leading to superior initial true-positive rates, however at the cost of consistency.

There are many directions for future work. Beyond perpetrators, we may consider how suspects emerge and disappear in the course of an episode. Note that we have obtained suspect annotations but did not use them in our experiments. It should also be interesting to examine how the model behaves out-of-domain, i.e., when tested on other crime series, e.g., "Law and Order". Finally, more detailed analysis of what happens in an episode (e.g., what actions are performed, by whom, when, and where) will give rise to deeper understanding, enabling applications like video summarization and skimming.

Acknowledgments. The authors gratefully acknowledge the support of the European Research Council (award number 681760; Frermann, Lapata) and the H2020 EU project SUMMA (award number 688139/H2020-ICT-2015; Cohen). We also thank our annotators, the TACL editors and anonymous reviewers whose feedback helped improve the present paper, and members of EdinburghNLP for helpful discussions and suggestions.
References

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2425–2433, Santiago, Chile.

Piotr Bojanowski, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid, and Josef Sivic. 2013. Finding actors and actions in movies. In The IEEE International Conference on Computer Vision (ICCV), pages 2280–2287, Sydney, Australia.

John S. Boreczky and Lynn D. Wilcox. 1998. A hidden Markov model framework for video segmentation using audio and image features. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3741–3744, Seattle, Washington, USA.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal.

Elia Bruni, Nam Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49(1):1–47, January.

Sachin Chachada and C.-C. Jay Kuo. 2014. Environmental sound recognition: A survey. APSIPA Transactions on Signal and Information Processing, 3.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Timothee Cour, Chris Jordan, Eleni Miltsakaki, and Ben Taskar. 2008. Movie/script: Alignment and parsing of video and text transcription. In Proceedings of the 10th European Conference on Computer Vision, pages 158–171, Marseille, France.

Steven B. Davis and Paul Mermelstein. 1990. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. In Alex Waibel and Kai-Fu Lee, editors, Readings in Speech Recognition, pages 65–74. Morgan Kaufmann Publishers Inc., San Francisco, California, USA.

Nevenka Dimitrova, Lalitha Agnihotri, and Gang Wei. 2000. Video classification based on HMM using text and faces. In Proceedings of the 10th European Signal Processing Conference (EUSIPCO), pages 1–4. IEEE.

Desmond Elliott and Frank Keller. 2013. Image description using visual dependency representations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1292–1302, Seattle, Washington, USA.

Philip John Gorinski and Mirella Lapata. 2015. Movie script summarization as graph-based scene extraction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1066–1076, Denver, Colorado, USA.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1693–1701. Curran Associates, Inc.

Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2015. The Goldilocks principle: Reading children's books with explicit memory representations. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, California, USA.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780, November.
Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2015. Image retrieval using scene graphs. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3668–3678, Boston, Massachusetts, USA.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, Boston, Massachusetts.

Douwe Kiela and Léon Bottou. 2014. Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 36–45, Doha, Qatar.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, pages 282–289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Angeliki Lazaridou, Nghia The Pham, and Marco Baroni. 2015. Combining language and vision with a multimodal skip-gram model. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 153–163, Denver, Colorado, USA.

Dahua Lin, Sanja Fidler, Chen Kong, and Raquel Urtasun. 2014. Visual semantic search: Retrieving videos via complex textual queries. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2657–2664, Columbus, Ohio, USA.

Cory S. Myers and Lawrence R. Rabiner. 1981. A comparative study of several dynamic time-warping algorithms for connected word recognition. The Bell System Technical Journal, 60(7):1389–1409.

Milind R. Naphide and Thomas S. Huang. 2001. A probabilistic framework for semantic video indexing, filtering, and retrieval. IEEE Transactions on Multimedia, 3(1):141–151.

Luis Gilberto Mateos Ortiz, Clemens Wolff, and Mirella Lapata. 2015. Learning to interpret and describe abstract scenes. In Proceedings of the 2015 NAACL: Human Language Technologies, pages 1505–1515, Denver, Colorado, USA.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas, USA.

Zeeshan Rasheed, Yaser Sheikh, and Mubarak Shah. 2005. On the use of computable features for film classification. IEEE Transactions on Circuits and Systems for Video Technology, 15(1):52–64.

Matthew Richardson, Christopher J.C. Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 193–203, Seattle, Washington, USA.

Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomas Kocisky, and Phil Blunsom. 2016. Reasoning about entailment with neural attention. In Proceedings of the 4th International Conference on Learning Representations (ICLR), San Juan, Puerto Rico.
Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. 2017. Movie description. International Journal of Computer Vision, 123(1):94–120.

Md Sahidullah and Goutam Saha. 2012. Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition. Speech Communication, 54(4):543–565.

Jitao Sang and Changsheng Xu. 2010. Character-based movie summarization. In Proceedings of the 18th ACM International Conference on Multimedia, pages 855–858, Firenze, Italy.

Carina Silberer, Vittorio Ferrari, and Mirella Lapata. 2016. Visually grounded meaning representations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 99.

Josef Sivic, Mark Everingham, and Andrew Zisserman. 2009. "Who are you?" – Learning person specific classifiers from video. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1145–1152, Miami, Florida, USA.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS'14, pages 3104–3112, Cambridge, MA, USA. MIT Press.

Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. 2016. Inception-v4, Inception-ResNet and the impact of residual connections on learning. CoRR, abs/1602.07261.

Makarand Tapaswi, Martin Bäuml, and Rainer Stiefelhagen. 2015. Aligning plot synopses to videos for story-based retrieval. International Journal of Multimedia Information Retrieval, (4):3–26.

Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2016. MovieQA: Understanding stories in movies through question-answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4631–4640, Las Vegas, Nevada.

Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond J. Mooney, Trevor Darrell, and Kate Saenko. 2015a. Sequence to sequence – Video to text. In Proceedings of the 2015 International Conference on Computer Vision (ICCV), pages 4534–4542, Santiago, Chile.

Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. 2015b. Translating videos to natural language using deep recurrent neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL HLT 2015), pages 1494–1504, Denver, Colorado, June.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3156–3164.

Ellen M. Voorhees and Dawn M. Tice. 2000. Building a question answering test collection. In ACM Special Interest Group on Information Retrieval (SIGIR), pages 200–207, Athens, Greece.

Jason Weston, Antoine Bordes, Sumit Chopra, and Tomas Mikolov. 2015. Towards AI-complete question answering: A set of prerequisite toy tasks. CoRR, abs/1502.05698.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, pages 2048–2057, Boston, Massachusetts, USA.

Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2013–2018, Lisbon, Portugal.
Mark Yatskar, Luke Zettlemoyer, and Ali Farhadi. 2016. Situation recognition: Visual semantic role labeling for image understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5534–5542, Zurich, Switzerland.

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. CoRR, abs/1409.2329.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In The IEEE International Conference on Computer Vision (ICCV), Santiago, Chile.