Dynamic Language Models for Streaming Text

Dani Yogatama∗  Chong Wang∗  Bryan R. Routledge†  Noah A. Smith∗  Eric P. Xing∗
∗School of Computer Science  †Tepper School of Business
Carnegie Mellon University, Pittsburgh, PA 15213, USA
∗{dyogatama,chongw,nasmith,epxing}@cs.cmu.edu, †routledge@cmu.edu

Abstract

We present a probabilistic language model that captures temporal dynamics and conditions on arbitrary non-linguistic context features. These context features serve as important indicators of language changes that are otherwise difficult to capture using text data by itself. We learn our model in an efficient online fashion that is scalable for large, streaming data. With five streaming datasets from two different genres—economics news articles and social media—we evaluate our model on the task of sequential language modeling. Our model consistently outperforms competing models.

1 Introduction

Language models are a key component in many NLP applications, such as machine translation and exploratory corpus analysis. Language models are typically assumed to be static—the word-given-context distributions do not change over time. Examples include n-gram models (Jelinek, 1997) and probabilistic topic models like latent Dirichlet allocation (Blei et al., 2003); we use the term "language model" to refer broadly to probabilistic models of text.

Recently, streaming datasets (e.g., social media) have attracted much interest in NLP. Since such data evolve rapidly based on events in the real world, assuming a static language model becomes unrealistic. In general, more data is seen as better, but treating all past data equally runs the risk of distracting a model with irrelevant evidence. On the other hand, cautiously using only the most recent data risks overfitting to short-term trends and missing important time-insensitive effects (Blei and Lafferty, 2006; Wang et al., 2008). Therefore, in this paper, we take steps toward methods for capturing long-range temporal dynamics in language use.

Our model also exploits observable context variables to capture temporal variation that is otherwise difficult to capture using only text. Specifically for the applications we consider, we use stock market data as exogenous evidence on which the language model depends. For example, when an important company's price moves suddenly, the language model should be based not on the very recent history, but should be similar to the language model for a day when a similar change happened, since people are likely to say similar things (either about that company, or about conditions relevant to the change). Non-linguistic contexts such as stock price changes provide useful auxiliary information that might indicate the similarity of language models across different timesteps.

We also turn to a fully online learning framework (Cesa-Bianchi and Lugosi, 2006) to deal with non-stationarity and dynamics in the data that necessitate adaptation of the model to data in real time. In online learning, streaming examples are processed only when they arrive. Online learning also eliminates the need to store large amounts of data in memory. Strictly speaking, online learning is distinct from stochastic learning, which for language models built on massive datasets has been explored by Hoffman et al. (2013) and Wang et al. (2011). Those techniques are still for static modeling.
Language modeling for streaming datasets in the context of machine translation was considered by Levenberg and Osborne (2009) and Levenberg et al. (2010). Goyal et al. (2009) introduced a streaming algorithm for large scale language modeling by approximating n-gram frequency counts. We propose a general online learning algorithm for language modeling that draws inspiration from regret minimization in sequential predictions (Cesa-Bianchi and Lugosi, 2006) and online variational algorithms (Sato, 2001; Honkela and Valpola, 2003).

To our knowledge, our model is the first to bring together temporal dynamics, conditioning on non-linguistic context, and scalable online learning suitable for streaming data and extensible to include topics and n-gram histories. The main idea of our model is independent of the choice of the base language model (e.g., unigrams, bigrams, topic models, etc.). In this paper, we focus on unigram and bigram language models in order to evaluate the basic idea on well understood models, and to show how it can be extended to higher-order n-grams. We leave extensions to topic models for future work.

We propose a novel task to evaluate our proposed language model. The task is to predict economics-related text at a given time, taking into account the changes in stock prices up to the corresponding day. This can be seen as an inverse of the setup considered by Lavrenko et al. (2000), where news is assumed to influence stock prices. We evaluate our model on economics news in various languages (English, German, and French), as well as Twitter data.

2 Background

In this section, we first discuss the background for sequential predictions, then describe how to formulate online language modeling as sequential predictions.

2.1 Sequential Predictions

Let w_1, w_2, ..., w_T be a sequence of response variables, revealed one at a time. The goal is to design a good learner to predict the next response, given previous responses and additional evidence, which we denote by x_t ∈ R^M (at time t). Throughout this paper, we use the term features for x. Specifically, at each round t, the learner receives x_t and makes a prediction ŵ_t, by choosing a parameter vector α_t ∈ R^M. In this paper, we refer to α as feature coefficients.

There has been an enormous amount of work on online learning for sequential predictions, much of it building on convex optimization. For a sequence of loss functions ℓ_1, ℓ_2, ..., ℓ_T (parameterized by α), an online learning algorithm is a strategy to minimize the regret, with respect to the best fixed α* in hindsight.[1] Regret guarantees assume a Lipschitz condition on the loss function ℓ that can be prohibitive for complex models. See Cesa-Bianchi and Lugosi (2006), Rakhlin (2009), Bubeck (2011), and Shalev-Shwartz (2012) for in-depth discussion and review.

[1] Formally, the regret is defined as $\mathrm{Regret}_T(\alpha^*) = \sum_{t=1}^{T} \ell_t(x_t, \alpha_t, w_t) - \inf_{\alpha^*} \sum_{t=1}^{T} \ell_t(x_t, \alpha^*, w_t)$.

There has also been work on online and stochastic learning for Bayesian models (Sato, 2001; Honkela and Valpola, 2003; Hoffman et al., 2013), based on variational inference. The goal is to approximate posterior distributions of latent variables when examples arrive one at a time. In this paper, we will use both kinds of techniques to learn language models for streaming datasets.
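To make the sequential-prediction protocol concrete, the following sketch runs online gradient descent (Zinkevich, 2003) over a toy stream of (x_t, w_t) pairs. The squared loss, the learning-rate schedule, and all names here are illustrative placeholders, not the loss or the model developed later in this paper.

```python
import numpy as np

def online_gradient_descent(stream, dim, eta=lambda t: 0.1 / np.sqrt(t)):
    """Generic sequential prediction: commit to alpha_t, observe (x_t, w_t),
    suffer a loss, then take a gradient step (Zinkevich, 2003). A squared
    loss is used purely as a stand-in for the negative log likelihood."""
    alpha = np.zeros(dim)
    cumulative_loss = 0.0
    for t, (x_t, w_t) in enumerate(stream, start=1):
        residual = alpha @ x_t - w_t            # prediction error under the current alpha_t
        cumulative_loss += 0.5 * residual ** 2  # loss suffered at round t
        alpha -= eta(t) * residual * x_t        # online gradient step
    return alpha, cumulative_loss

# Toy usage: noisy linear responses. The regret is cumulative_loss minus the
# cumulative loss of the best fixed alpha chosen in hindsight.
rng = np.random.default_rng(0)
true_alpha = rng.normal(size=5)
xs = rng.normal(size=(200, 5))
stream = [(x, x @ true_alpha + 0.1 * rng.normal()) for x in xs]
alpha_hat, loss = online_gradient_descent(stream, dim=5)
```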
2.2 Problem Formulation

Consider an online language modeling problem, in the spirit of sequential predictions. The task is to build a language model that accurately predicts the texts generated on day t, conditioned on observable features up to day t, x_{1:t}. Every day, after the model makes a prediction, the actual texts w_t are revealed and we suffer a loss. The loss is defined as the negative log likelihood of the model, ℓ_t = −log p(w_t | α, β_{1:t−1}, x_{1:t−1}, n_{1:t−1}), where α and β_{1:T} are the model parameters and n is a background distribution (details are given in §3.2). We can then update the model and proceed to day t + 1. Notice the similarity to the sequential prediction described above. Importantly, this is a realistic setup for building evolving language models from large-scale streaming datasets.

3 Model

3.1 Notation

We index timesteps by t ∈ {1, ..., T} and word types by v ∈ {1, ..., V}; both are always given as subscripts. We denote vectors in boldface and use 1:T as a shorthand for {1, 2, ..., T}. We assume words of the form {w_t}_{t=1}^{T} for w_t ∈ R^V, which is the vector of word frequencies at timestep t. Non-linguistic context features are {x_t}_{t=1}^{T} for x_t ∈ R^M. The goal is to learn parameters α and β_{1:T}, which will be described in detail next.

3.2 Generative Story

The main idea of our model is illustrated by the following generative story for the unigram language model. (We will discuss the extension to higher-order language models later.) A graphical representation of our proposed model is given in Figure 1.

1. Draw feature coefficients α ∼ N(0, λI).[2] Here α is a vector in R^M, where M is the dimensionality of the feature vector.

2. For each timestep t:
   (a) Observe non-linguistic context features x_t.
   (b) Draw $\beta_t \sim \mathcal{N}\!\left(\sum_{k=1}^{t-1} \frac{\delta_k \exp(\alpha^\top f(x_t, x_k))}{\sum_{j=1}^{t-1} \delta_j \exp(\alpha^\top f(x_t, x_j))}\, \beta_k,\ \varphi I\right)$. Here, β_t is a vector in R^V, where V is the size of the word vocabulary, φ is the variance parameter, and δ_k is a fixed hyperparameter; we discuss them below.
   (c) For each word w_{t,v}, draw $w_{t,v} \sim \mathrm{Categorical}\!\left(\frac{\exp(n_{1:t-1,v} + \beta_{t,v})}{\sum_{j \in V} \exp(n_{1:t-1,j} + \beta_{t,j})}\right)$.

In the last step, β_t and n are mapped to the V-dimensional simplex, forming a distribution over words. n_{1:t−1} ∈ R^V is a background (log) distribution, inspired by a similar idea in Eisenstein et al. (2011). In this paper, we set n_{1:t−1,v} to be the log-frequency of v up to time t−1. We can interpret β as a time-dependent deviation from the background log-frequencies that incorporates world-context. This deviation comes in the form of a weighted average of earlier deviation vectors.

The intuition behind the model is that the probability of a word appearing at day t depends on the background log-frequencies, the deviation coefficients of the word at previous timesteps β_{1:t−1}, and the similarity of the current conditions of the world (based on observable features x) to previous timesteps through f(x_t, x_k). That is, f is a function that takes d-dimensional feature vectors at two timesteps x_t and x_k and returns a similarity vector f(x_t, x_k) ∈ R^M (see §6.1.1 for an example of f that we use in our experiments). The similarity is parameterized by α, and decays over time with rate δ_k. In this work, we assume a fixed window size c (i.e., we consider the c most recent timesteps), so that δ_{1:t−c−1} = 0 and δ_{t−c:t−1} = 1. This allows up to cth-order dependencies.[3] Setting δ this way allows us to bound the number of past vectors β that need to be kept in memory. We set β_0 to 0.

[2] Feature coefficients α can also be drawn from other distributions such as α ∼ Laplace(0, λ).
[3] In online Bayesian learning, it is known that forgetting inaccurate estimates from earlier timesteps is important (Sato, 2001; Honkela and Valpola, 2003). Since we set δ_{1:t−c−1} = 0, at every timestep t, δ_k leads to forgetting older examples.
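To make the generative story concrete, here is a minimal forward-sampling sketch. The similarity function f, the context vectors x, and the hyperparameter values are placeholder inputs, and the background log-frequencies are computed from add-one counts, which differs from raw log counts only by an additive constant that cancels in the softmax.

```python
import numpy as np

def simulate(T, V, M, f, x, c=7, lam=1.0, phi=0.1, tokens_per_day=100, seed=0):
    """Forward-sample word counts from the unigram generative story.
    x[t] is the context vector for day t (index 0 is unused); f(x_t, x_k)
    returns an M-dimensional similarity vector."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(0.0, np.sqrt(lam), size=M)        # step 1: feature coefficients
    betas = [np.zeros(V)]                                # beta_0 = 0
    counts = np.ones(V)                                  # add-one background counts
    docs = []
    for t in range(1, T + 1):
        lo = max(1, t - c)                               # delta_k = 1 only inside the window
        ks = list(range(lo, t))
        if ks:
            scores = np.array([alpha @ f(x[t], x[k]) for k in ks])
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()                     # normalized weights on previous days
            mean = sum(w * betas[k] for w, k in zip(weights, ks))
        else:
            mean = np.zeros(V)
        beta_t = rng.normal(mean, np.sqrt(phi))          # step 2(b)
        logits = np.log(counts / counts.sum()) + beta_t  # n_{1:t-1} + beta_t
        p = np.exp(logits - logits.max())
        p /= p.sum()
        w_t = rng.multinomial(tokens_per_day, p)         # step 2(c), one draw per token
        counts += w_t
        betas.append(beta_t)
        docs.append(w_t)
    return docs

# Example inputs (purely illustrative): sign-match similarity over M synthetic "returns".
M = 10
rng = np.random.default_rng(1)
x = rng.normal(size=(31, M))
sign_match = lambda a, b: (np.sign(a) == np.sign(b)).astype(float)
daily_counts = simulate(T=30, V=50, M=M, f=sign_match, x=x)
```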
[Figure 1: Graphical representation of the model. The subscript indices q, r, s are shorthands for the previous timesteps t−3, t−2, t−1. Only four timesteps are shown here. There are arrows from previous β_{t−4}, β_{t−5}, ..., β_{t−c} to β_t, where c is the window size as described in §3.2; they are not shown here, for readability.]

Although the generative story described above is for unigram language models, extensions can be made to more complex models (e.g., mixture of unigrams, topic models, etc.) and to longer n-gram contexts. In the case of topic models, the model will be related to dynamic topic models (Blei and Lafferty, 2006) augmented by context features, and the learning procedure in §4 can be used to perform online learning of dynamic topic models. However, our model captures longer-range dependencies than dynamic topic models, and can condition on non-linguistic features or metadata. In the case of higher-order n-grams, one simple way is to draw more β, one for each history. For example, for a bigram model, β is in R^{V^2}, rather than R^V in the unigram model. We consider both unigram and bigram language models in our experiments in §6. However, the main idea presented in this paper is largely independent of the base model.

Related work. Mimno and McCallum (2008) and Eisenstein et al. (2010) similarly conditioned text on observable features (e.g., author, publication venue, geography, and other document-level metadata), but conducted inference in a batch setting, thus their approaches are not suitable for streaming data. It is not immediately clear how to generalize their approach to dynamic settings. Algorithmically, our work comes closest to the online dynamic topic model of Iwata et al. (2010), except that we also incorporate context features.

4 Learning and Inference

The goal of the learning procedure is to minimize the overall negative log likelihood,

$-\log L(\mathcal{D}) = -\log \int d\beta_{1:T}\, p(\beta_{1:T} \mid \alpha, x_{1:T})\, p(w_{1:T} \mid \beta_{1:T}, n).$

However, this quantity is intractable. Instead, we derive an upper bound for this quantity and minimize that upper bound. Using Jensen's inequality, the variational upper bound on the negative log likelihood is:

$-\log L(\mathcal{D}) \le -\int d\beta_{1:T}\, q(\beta_{1:T} \mid \gamma_{1:T}) \log \frac{p(\beta_{1:T} \mid \alpha, x_{1:T})\, p(w_{1:T} \mid \beta_{1:T}, n)}{q(\beta_{1:T} \mid \gamma_{1:T})}.$  (4)

Specifically, we use mean-field variational inference, where the variables in the variational distribution q are completely independent. We use Gaussian distributions as our variational distributions for β, denoted by γ in the bound in Eq. 4. We denote the parameters of the Gaussian variational distribution for β_{t,v} (word v at timestep t) by µ_{t,v} (mean) and σ_{t,v} (variance).

Figure 2 shows the functional form of the variational bound that we seek to minimize, denoted by B̂. The two main steps in the optimization of the bound are inferring β_t and updating feature coefficients α. We next describe each step in detail.

4.1 Learning

The goal of the learning procedure is to minimize the upper bound in Figure 2 with respect to α.
However, since the data arrives in an online fashion, and speed is very important for processing streaming datasets, the model needs to be updated at every timestep t (in our experiments, daily). Notice that at timestep t, we only have access to x_{1:t} and w_{1:t}, and we perform learning at every timestep after the text for the current timestep w_t is revealed. We do not know x_{t+1:T} and w_{t+1:T}. Nonetheless, we want to update our model so that it can make a better prediction at t + 1. Therefore, we can only minimize the bound until timestep t.

Let $C_k \triangleq \frac{\exp(\alpha^\top f(x_t,x_k))}{\sum_{j=t-c}^{t-1} \exp(\alpha^\top f(x_t,x_j))}$. Our learning algorithm is a variational Expectation-Maximization algorithm (Wainwright and Jordan, 2008).

E-step. Recall that we use variational inference and the variational parameters for β are µ and σ. As shown in Figure 2, since the log-sum-exp in the last term of B is problematic, we introduce additional variational parameters ζ to simplify B and obtain B̂ (Eqs. 2–3). The E-step deals with all the local variables µ, σ, and ζ. Fixing other variables, taking the derivative of the bound B̂ w.r.t. ζ_t, and setting it to zero, we obtain the closed-form update for ζ_t:

$\zeta_t = \sum_{v \in V} \exp(n_{1:t-1,v}) \exp\!\left(\mu_{t,v} + \frac{\sigma_{t,v}}{2}\right).$

To minimize with respect to µ_t and σ_t, we apply gradient-based methods since there are no closed-form solutions. The derivative w.r.t. µ_{t,v} is:

$\frac{\partial \hat{B}}{\partial \mu_{t,v}} = \frac{\mu_{t,v} - \sum_{k=t-c}^{t-1} C_k \mu_{k,v}}{\varphi} - n_{t,v} + \frac{n_t}{\zeta_t} \exp(n_{1:t-1,v}) \exp\!\left(\mu_{t,v} + \frac{\sigma_{t,v}}{2}\right),$

where $n_t = \sum_{v \in V} n_{t,v}$. The derivative w.r.t. σ_{t,v} is:

$\frac{\partial \hat{B}}{\partial \sigma_{t,v}} = \frac{1}{2\sigma_{t,v}} + \frac{1}{2\varphi} + \frac{n_t}{2\zeta_t} \exp(n_{1:t-1,v}) \exp\!\left(\mu_{t,v} + \frac{\sigma_{t,v}}{2}\right).$

Although we require iterative methods in the E-step, we find it to be reasonably fast in practice.[4] Specifically, we use the L-BFGS quasi-Newton algorithm (Liu and Nocedal, 1989).

We can further improve the bound by updating the variational parameters for timesteps 1:t−1, i.e., µ_{1:t−1} and σ_{1:t−1}, as well. However, this will require storing the texts from previous timesteps. Additionally, this will complicate the M-step update described below. Therefore, for each s < t, we choose to fix µ_s and σ_s once they are learned at timestep s.

[4] Approximately 16.5 seconds/day (walltime) to learn the model on the EN:NA dataset on a 2.40GHz CPU with 24GB memory.

$B = -\sum_{t=1}^{T} \mathbb{E}_q[\log p(\beta_t \mid \beta_k, \alpha, x_t)] - \sum_{t=1}^{T} \mathbb{E}_q[\log p(w_t \mid \beta_t, n_t)] - H(q)$  (1)

$= \sum_{t=1}^{T} \left\{ \frac{1}{2} \sum_{j \in V} \log\frac{\sigma_{t,j}}{\varphi} - \mathbb{E}_q\!\left[-\frac{\left(\beta_t - \sum_{k=t-c}^{t-1} C_k \beta_k\right)^2}{2\varphi}\right] - \mathbb{E}_q\!\left[\sum_{v \in w_t}\left(n_{1:t-1,v} + \beta_{t,v} - \log\sum_{j \in V}\exp(n_{1:t-1,j} + \beta_{t,j})\right)\right] \right\}$  (2)

$\le \sum_{t=1}^{T} \left\{ \frac{1}{2} \sum_{j \in V} \log\frac{\sigma_{t,j}}{\varphi} + \frac{\left(\mu_t - \sum_{k=t-c}^{t-1} C_k \mu_k\right)^2}{2\varphi} + \frac{\sigma_t + \sum_{k=t-c}^{t-1} C_k^2 \sigma_k}{2\varphi} - \sum_{v \in w_t}\left(\mu_{t,v} - \log\zeta_t - \frac{1}{\zeta_t}\sum_{j \in V}\exp(n_{1:t-1,j})\exp\!\left(\mu_{t,j} + \frac{\sigma_{t,j}}{2}\right)\right) \right\} + \mathrm{const}$  (3)

Figure 2: The variational bound that we seek to minimize, B. H(q) is the entropy of the variational distribution q. The derivation from line 1 to line 2 is done by replacing the probability distributions p(β_t | β_k, α, x_t) and p(w_t | β_t, n_t) by their respective functional forms. Notice that in line 3 we compute the expectations under the variational distributions and further bound B by introducing additional variational parameters ζ, using Jensen's inequality on the log-sum-exp in the last term. We denote the new bound B̂.

M-step. In the M-step, we update the global parameter α, fixing µ_{1:t}. Fixing other parameters and taking the derivative of B̂ w.r.t. α, we obtain:[5]

$\frac{\partial \hat{B}}{\partial \alpha} = \frac{\left(\mu_t - \sum_{k=t-c}^{t-1} C_k \mu_k\right)\left(-\sum_{k=t-c}^{t-1} \frac{\partial C_k}{\partial \alpha}\right)}{\varphi} + \frac{\sum_{k=t-c}^{t-1} C_k \sigma_k \frac{\partial C_k}{\partial \alpha}}{\varphi},$

where:

$\frac{\partial C_k}{\partial \alpha} = C_k f(x_t,x_k) - C_k\, \frac{\sum_{s=t-c}^{t-1} f(x_t,x_s) \exp(\alpha^\top f(x_t,x_s))}{\sum_{s=t-c}^{t-1} \exp(\alpha^\top f(x_t,x_s))}.$

[5] In our implementation, we augment α with a squared L2 regularization term (i.e., we assume that α is drawn from a normal distribution with mean zero and variance λ) and use the FOBOS algorithm (Duchi and Singer, 2009). The derivative of the regularization term is simple and is not shown here. Of course, other regularizers (e.g., the L1-norm, which we use for other parameters, or the L1/∞-norm) can also be explored.
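Both the M-step gradient above and the prediction rule in §5 reuse the normalized weights C_k and, for the M-step, their gradient with respect to α. The sketch below shows one way these might be computed; the function name and the array layout (one row of features per previous day) are illustrative choices, not part of our implementation.

```python
import numpy as np

def window_weights_and_grads(alpha, feats):
    """alpha: (M,) feature coefficients; feats: (c, M) array whose rows are
    f(x_t, x_k) for the c most recent days k = t-c, ..., t-1.
    Returns C (shape (c,)) and dC/dalpha (shape (c, M)), using the identity
    dC_k/dalpha = C_k * (f(x_t, x_k) - sum_s C_s f(x_t, x_s))."""
    scores = feats @ alpha
    C = np.exp(scores - scores.max())       # subtract max for numerical stability
    C /= C.sum()                            # softmax over the window
    expected_f = C @ feats                  # sum_s C_s f(x_t, x_s)
    dC = C[:, None] * (feats - expected_f)  # one gradient row per previous day
    return C, dC
```

The identity in the docstring is simply the softmax gradient written in a numerically convenient form; expanding it recovers the expression for ∂C_k/∂α above.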
We follow the convex optimization strategy and simply perform a stochastic gradient update: $\alpha_{t+1} = \alpha_t + \eta_t \frac{\partial \hat{B}}{\partial \alpha_t}$ (Zinkevich, 2003). While the variational bound B̂ is not convex, given the local variables µ_{1:t} and σ_{1:t}, optimizing α at timestep t without knowing the future becomes a convex problem.[6] Since we do not reestimate µ_{1:t−1} and σ_{1:t−1} in the E-step, the choice to perform online gradient descent instead of iteratively performing batch optimization at every timestep is theoretically justified. Notice that our overall learning procedure is still to minimize the variational upper bound B̂. All these choices are made to make the model suitable for learning in real time from large streaming datasets. Preliminary experiments showed that performing more than one EM iteration per day does not considerably improve performance, so in our experiments we perform one EM iteration per day.

[6] As a result, our algorithm is Hannan consistent w.r.t. the best fixed α (for B̂) in hindsight; i.e., the average regret goes to zero as T goes to ∞.

To learn the parameters of the model, we rely on approximations and optimize an upper bound B̂. We have opted for this approach over alternatives (such as MCMC methods) because of our interest in the online, large-data setting. Our experiments show that we are still able to learn reasonable parameter estimates by optimizing B̂. Like online variational methods for other latent-variable models such as LDA (Sato, 2001; Hoffman et al., 2013), open questions remain about the tightness of such approximations and the identifiability of model parameters. We note, however, that our model does not include latent mixtures of topics and may be generally easier to estimate.

5 Prediction

As described in §2.2, our model is evaluated by the loss suffered at every timestep, where the loss is defined as the negative log likelihood of the model on the text w_t at timestep t. Therefore, at each timestep t, we need to predict (the distribution of) w_t. In order to do this, for each word v ∈ V, we simply compute the deviation means β_{t,v} as weighted combinations of previous means, where the weights are determined by the world-context similarity encoded in x:

$\mathbb{E}_q[\beta_{t,v} \mid \mu_{t,v}] = \sum_{k=t-c}^{t-1} \frac{\exp(\alpha^\top f(x_t,x_k))}{\sum_{j=t-c}^{t-1} \exp(\alpha^\top f(x_t,x_j))}\, \mu_{k,v}.$

Recall that the word distribution that we use for prediction is obtained by applying the operator π that maps β_t and n to the V-dimensional simplex, forming a distribution over words:

$\pi(\beta_t, n_{1:t-1})_v = \frac{\exp(n_{1:t-1,v} + \beta_{t,v})}{\sum_{j \in V} \exp(n_{1:t-1,j} + \beta_{t,j})},$

where n_{1:t−1} ∈ R^V is a background distribution (n_{1:t−1,v} is the log-frequency of word v observed up to time t−1).
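As a minimal sketch of this prediction step (assuming the window weights are computed as in §4.1; the array names mus, for the variational means µ_k of the c most recent days, and log_bg, for n_{1:t−1}, are hypothetical):

```python
import numpy as np

def predict_word_distribution(alpha, feats, mus, log_bg):
    """alpha: (M,); feats: (c, M) rows f(x_t, x_k) for the c most recent days;
    mus: (c, V) variational means mu_k from those days; log_bg: (V,) background
    log-frequencies n_{1:t-1}. Returns the predictive distribution pi(beta_t, n)."""
    scores = feats @ alpha
    C = np.exp(scores - scores.max())
    C /= C.sum()                        # weights over the c previous days
    beta_hat = C @ mus                  # E_q[beta_t] = sum_k C_k mu_k
    logits = log_bg + beta_hat
    p = np.exp(logits - logits.max())   # softmax onto the V-dimensional simplex
    return p / p.sum()
```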
6 Experiments

In our experiments, we consider the problem of predicting economy-related text appearing in news and microblogs, based on observable features that reflect current economic conditions in the world at a given time. In the following, we describe our dataset in detail, then show experimental results on text prediction. In all experiments, we set the window size c = 7 (one week) or c = 14 (two weeks), λ = 1/(2|V|) (where V is the size of the vocabulary of the dataset under consideration), and φ = 1.

6.1 Dataset

Our data contains metadata and text corpora. The metadata is used as our features, whereas the text corpora are used for learning language models and predictions. The dataset (excluding Twitter) can be downloaded at http://www.ark.cs.cmu.edu/DynamicLM.

6.1.1 Metadata

We use end-of-day stock prices gathered from finance.yahoo.com for each stock included in the Standard & Poor's 500 index (S&P 500). The index includes large (by market value) companies listed on US stock exchanges.[7] We calculate daily (continuously compounded) returns for each stock o: $r_{o,t} = \log P_{o,t} - \log P_{o,t-1}$, where P_{o,t} is the closing stock price.[8] We make a simplifying assumption that text for day t is generated after P_{o,t} is observed.[9] In general, stocks trade Monday to Friday (except for federal holidays and natural disasters). For days when stocks do not trade, we set r_{o,t} = 0 for all stocks, since any price change is not observed.

We transform returns into similarity values as follows: f(x_{o,t}, x_{o,k}) = 1 iff sign(r_{o,t}) = sign(r_{o,k}) and 0 otherwise. While this limits the model by ignoring the magnitude of price changes, it is still reasonable to capture the similarity between two days.[10] There are 500 stocks in the S&P 500, so x_t ∈ R^500 and f(x_t, x_k) ∈ R^500.

[7] For a list of companies listed in the S&P 500 as of 2012, see http://en.wikipedia.org/wiki/List_of_S%26P_500_companies. This set was fixed during the time periods of all our experiments.
[8] We use the "adjusted close" on Yahoo that includes interim dividend cash flows and also adjusts for "splits" (changes in the number of outstanding shares).
[9] This is done in order to avoid having to deal with hourly timesteps. In addition, intraday price data is only available through commercial data providers.
[10] Note that daily stock returns are equally likely to be positive or negative and display little serial correlation.
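A small sketch of this feature construction is given below; the price matrix, its alignment to days, and the treatment of non-trading days are assumed inputs here, so this is an illustration of the definitions above rather than our data-processing code.

```python
import numpy as np

def daily_log_returns(prices):
    """prices: (T+1, S) adjusted closing prices for S stocks on days 0..T.
    Returns r of shape (T, S), where r[t-1, o] = log P_{o,t} - log P_{o,t-1}.
    Non-trading days are assumed to repeat the previous close, giving r = 0."""
    return np.diff(np.log(prices), axis=0)

def sign_similarity(r_t, r_k):
    """f(x_t, x_k): one indicator per stock, 1 iff the returns on days t and k
    have the same sign, 0 otherwise."""
    return (np.sign(r_t) == np.sign(r_k)).astype(float)
```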
6.1.2 Text data

We have five streams of text data. The first four corpora are news streams tracked through Reuters.[11] Two of them are written in English, North American Business Report (EN:NA) and Japanese Investment News (EN:JP). The remaining two are German Economic News Service (DE, in German) and French Economic News Service (FR, in French). For all four of the Reuters streams, we collected news data over a period of thirteen months (392 days), 2012-05-26 to 2013-06-21. See Table 1 for descriptive statistics of these datasets. Numerical terms are mapped to a single word, and all letters are downcased.

The last text stream comes from the Decahose/Gardenhose stream from Twitter. We collected public tweets that contain ticker symbols (i.e., symbols that are used to denote stocks of a particular company in a stock market), preceded by the dollar sign $ (e.g., $GOOG, $MSFT, $AAPL, etc.). These tags are generally used to indicate tweets about the stock market. We look at tweets from the period 2011-01-01 to 2012-09-30 (639 days). As a result, we have approximately 100–800 tweets per day. We tokenized the tweets using the CMU ARK TweetNLP tools;[12] numerical terms are mapped to a single word, and all letters are downcased.

[11] http://www.reuters.com
[12] https://www.ark.cs.cmu.edu/TweetNLP

Dataset   Total # Doc.   Avg. # Doc.   # Days   Unigram Tokens   Unigram Vocab.   Bigram Tokens   Bigram Vocab.
EN:NA     86,683         223           392      28,265,550       10,000           11,804,201      5,000
EN:JP     70,807         182           392      16,026,380       10,000           7,047,095       5,000
FR        62,355         160           392      11,942,271       10,000           3,773,517       5,000
DE        51,515         132           392      9,027,823        10,000           3,499,965       5,000
Twitter   214,794        336           639      1,660,874        10,000           551,768         5,000

Table 1: Statistics about the datasets. Average number of documents (third column) is per day.

We perform two experiments, using unigram and bigram language models as the base models. For each dataset, we consider the top 10,000 unigrams after removing corpus-specific stopwords (the top 100 words with highest frequencies). For the bigram experiments, we only use 5,000 words to limit the number of unique bigrams, so that we can simulate experiments for the entire time horizon in a reasonable amount of time. In standard open-vocabulary language modeling experiments, the treatment of unknown words deserves care. We have opted for a controlled, closed-vocabulary experiment, since standard smoothing techniques will almost surely interact with temporal dynamics and context in interesting ways that are out of scope in the present work.

6.2 Baselines

Since this is a forecasting task, at each timestep, we only have access to data from previous timesteps. Our model assumes that all words in all documents in a corpus come from a single multinomial distribution. Therefore, we compare our approach to the corresponding base models (standard unigram and bigram language models) over the same vocabulary (for each stream). The first one maintains counts of every word and updates the counts at each timestep. This corresponds to a base model that uses all of the available data up to the current timestep ("base all"). The second one replaces counts of every word with the counts from the previous timestep ("base one"). Additionally, we also compare with a base model whose counts decay exponentially ("base exp"). That is, the counts from previous timesteps decay by exp(−γs), where s is the distance between previous timesteps and the current timestep and γ is the decay constant. We set the decay constant γ = 1. We put a symmetric Dirichlet prior on the counts ("add-one" smoothing); this is analogous to our treatment of the background frequencies n in our model. Note that our model, similar to "base all," uses all available data up to timestep t−1 when making predictions for timestep t. The window size c only determines which previous timesteps' models can be chosen for making a prediction today. The past models themselves are estimated from all available data up to their respective timesteps.

We also compare with two strong baselines: a linear interpolation of "base one" models for the past week ("int. week") and a linear interpolation of "base all" and "base one" ("int. one all"). The interpolation weights are learned online using the normalized exponentiated gradient algorithm (Kivinen and Warmuth, 1997), which has been shown to enjoy a stronger regret guarantee compared to standard online gradient descent for learning a convex combination of weights.

6.3 Results

We evaluate perplexity on unseen data to measure the performance of our model. Specifically, we use per-word predictive perplexity:

$\mathrm{perplexity} = \exp\left(-\frac{\sum_{t=1}^{T} \log p(w_t \mid \alpha, x_{1:t}, n_{1:t-1})}{\sum_{t=1}^{T} \sum_{j \in V} w_{t,j}}\right).$

Note that the denominator is the number of tokens up to timestep T. Lower perplexity is better.
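A short sketch of this evaluation metric, assuming each day's predictive distribution has already been computed (for instance with the prediction step sketched in §5) and taking log p(w_t | ·) to be the token-level log likelihood Σ_v w_{t,v} log p_{t,v}:

```python
import numpy as np

def per_word_perplexity(daily_counts, daily_pred_dists):
    """daily_counts: list of (V,) count vectors w_t; daily_pred_dists: matching
    list of (V,) predictive distributions over the same vocabulary."""
    log_lik, n_tokens = 0.0, 0
    for w_t, p_t in zip(daily_counts, daily_pred_dists):
        log_lik += float(w_t @ np.log(p_t))   # sum_v w_{t,v} * log p_{t,v}
        n_tokens += int(w_t.sum())
    return float(np.exp(-log_lik / n_tokens))
```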
Table 2 and Table 3 show the perplexity results for each of the datasets for the unigram and bigram experiments, respectively.

Dataset   base all   base one   base exp   int. week   int. one all   c = 7    c = 14
EN:NA     3,341      3,677      3,486      3,403       3,271          *3,262   3,285
EN:JP     2,802      3,212      2,750      2,949       2,708          *2,656   2,689
FR        3,603      3,910      3,678      3,625       3,416          *3,404   3,438
DE        3,789      4,199      3,979      3,926       *3,634         3,649    3,687
Twitter   3,880      6,168      5,133      5,859       4,047          *3,801   3,819

Table 2: Perplexity results for our five data streams in the unigram experiments. The base models in "base all," "base one," and "base exp" are unigram language models. "int. week" is a linear interpolation of "base one" from the past week. "int. one all" is a linear interpolation of "base one" and "base all". The rightmost two columns are versions of our model. Best results are marked with an asterisk.

Dataset   base all   base one   base exp   int. week   int. one all   c = 7
EN:NA     242        2,229      1,880      2,200       244            *223
EN:JP     185        2,101      1,726      2,050       189            *167
FR        159        2,084      1,707      2,068       166            *139
DE        268        2,634      2,267      2,644       282            *243
Twitter   756        4,245      4,253      5,859       4,046          *739

Table 3: Perplexity results for our five data streams in the bigram experiments. The base models in "base all," "base one," and "base exp" are bigram language models. "int. week" is a linear interpolation of "base one" from the past week. "int. one all" is a linear interpolation of "base one" and "base all". The rightmost column is a version of our model with c = 7. Best results are marked with an asterisk.

Our model outperformed other competing models in all cases but one. Recall that we only define the similarity function of world context as f(x_{o,t}, x_{o,k}) = 1 iff sign(r_{o,t}) = sign(r_{o,k}) and 0 otherwise. A better similarity function (e.g., one that takes into account the market size of the company and the magnitude of the increase or decrease in the stock price) might be able to improve the performance further. We leave this for future work. Furthermore, most of the variation can be captured using models from the past week. We discuss why increasing c from 7 to 14 did not improve performance of the model in more detail in §6.4.

We can also see how the models performed over time. Figure 4 traces perplexity for four Reuters news stream datasets.[13] We can see that in some cases the performance of the "base all" model degraded over time, whereas our model is more robust to temporal shifts.

[13] In both experiments, in order to manage the time and space complexities of updating β, we apply a sparsity shrinkage technique by using OWL-QN (Andrew and Gao, 2007) when maximizing it, with the regularization constant set to 1. Intuitively, this is equivalent to encouraging the deviation vector to be sparse (Eisenstein et al., 2011).

[Figure 4 (four panels: EN:NA, EN:JP, FR, DE; x-axis: timestep, y-axis: perplexity; curves: "base all", "complete", "int. one all"): Perplexity over time for four Reuters news streams (c = 7) with bigram base models.]

In the bigram experiments, we only ran our model with c = 7, since we need to maintain β in R^{V^2}, instead of R^V in the unigram model. The goal of this experiment is to determine whether our method still adds benefit to more expressive language models. Note that the weights of the linear interpolation models are also learned in an online fashion, since there are no classical training, development, and test sets in our setting. Since the "base one" model performed poorly in this experiment, the performance of the interpolated models also suffered. For example, the "int. one all" model needed time to learn that the "base one" model has to be downweighted (we started with all interpolated models having uniform weights), so it was not able to outperform even the "base all" model.

6.4 Analysis and Discussion

It should not be surprising that conditioning on world-context reduces perplexity (Cover and Thomas, 1991).
A key attraction of our model, we believe, lies in the ability to inspect its parameters.

Deviation coefficients. Inspecting the model allows us to gain insight into temporal trends. We investigate the deviations learned by our model on the Twitter dataset. Examples are shown in Figure 3. The left plot shows β for four words related to Google: goog, #goog, @google, google+. For comparison, we also show the return of Google stock for the corresponding timestep (scaled by 50 and centered at 0.5 for readability, smoothed using loess (Cleveland, 1979), denoted by rGOOG in the plot). We can see that significant changes in the return of Google stock (e.g., the rGOOG spikes between timesteps 50–100, 150–200, and 490–550 in the plot) occurred alongside increases in β of Google-related words. Similar trends can also be observed for Microsoft-related words in the right plot. The most significant loss in the return of Microsoft stock (the downward spike near timestep 500 in the plot) is followed by a sudden sharp increase in β of the words #microsoft and microsoft.

[Figure 3 (two panels, Twitter:Google and Twitter:Microsoft; x-axis: timestep, y-axis: β; curves: goog, @google, google+, #goog, rGOOG and microsoft, msft, #microsoft, rMSFT): Deviation coefficients β over time for Google- and Microsoft-related words on Twitter with the unigram base model (c = 7). Significant changes (increases or decreases) in the returns of Google and Microsoft stocks are usually followed by increases in β of related words.]

Feature coefficients. We can also inspect the learned feature coefficients α to investigate which stocks have higher associations with the text that is generated. Our feature coefficients are designed to reflect which changes (or lack of changes) in stock prices influence the word distribution more, not which stocks are talked about more often. We find that the feature coefficients do not correlate with obvious company characteristics like market capitalization (firm size). For example, on the Twitter dataset with bigram base models, the five stocks with the highest weights are: ConAgra Foods Inc., Intel Corp., Bristol-Myers Squibb, Frontier Communications Corp., and Amazon.com Inc. Strongly negative weights tended to align with streams with less activity, suggesting that these were being used to smooth across all c days of history. A higher weight for stock o implies an increase in the probability of choosing models from previous timesteps s when the state of the world for the current timestep t and timestep s is the same (as represented by our similarity function) with respect to stock o (all other things being equal), and a decrease in probability for a lower weight.

Selected models. Besides feature coefficients, our model captures temporal shift by modeling similarity across the most recent c days. During inference, our model weights different word distributions from the past. The similarity is encoded in the pairwise features f(x_t, x_k) and the parameters α.
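The selection behavior reported in Figure 5 below can be approximated with a sketch like the following, which records, for each day, the lag of the previous day receiving the largest selection weight. We use the prior weights C_k here as a simple proxy for the E-step posterior modes shown in the figure, and feats_per_day is an assumed precomputed list of per-day feature matrices.

```python
import numpy as np

def lag_histogram(alpha, feats_per_day, c=14):
    """feats_per_day[i]: (c, M) array of f(x_t, x_k) for k = t-c, ..., t-1
    (oldest first) on the i-th day. Returns counts over lags 1..c of how often
    each lag receives the largest selection weight."""
    hist = np.zeros(c, dtype=int)
    for feats in feats_per_day:
        scores = feats @ alpha
        C = np.exp(scores - scores.max())
        C /= C.sum()
        lag = c - int(np.argmax(C))   # last row is yesterday, i.e., lag 1
        hist[lag - 1] += 1
    return hist
```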
Figure 5 shows the distributions of the strongest-posterior models from previous timesteps, based on how far in the past they are at the time of use, aggregated across rounds on the EN:NA dataset, for window size c = 14. It shows that the model tends to favor models from days closer to the current date, with the t−1 models selected the most, perhaps because the state of the world today is more similar to dates closer to today compared to more distant dates. The plot also explains why increasing c from 7 to 14 did not improve performance of the model, since most of the variation in our datasets can be captured with models from the past week.

[Figure 5 (histogram; x-axis: time lag, 1–14; y-axis: frequency): Distributions of the selection probabilities of models from the previous c = 14 timesteps, on the EN:NA dataset with the unigram base model. For simplicity, we show E-step modes. The histogram shows that the model tends to favor models from days closer to the current date.]

Topics. Latent topic variables have often figured heavily in approaches to dynamic language modeling. In preliminary experiments incorporating single-membership topic variables (i.e., each document belongs to a single topic, as in a mixture of unigrams), we saw no benefit to perplexity. Incorporating topics also increases computational cost, since we must maintain and estimate one language model per topic, per timestep. It is straightforward to design models that incorporate topics with single- or mixed-membership as in LDA (Blei et al., 2003), an interesting future direction.

Potential applications. Dynamic language models like ours can be potentially useful in many applications, either as a standalone language model, e.g., for predictive text input, whose performance may depend on the temporal dimension; or as a component in applications like machine translation or speech recognition. Additionally, the model can be seen as a step towards enhancing text understanding with numerical, contextual data.

7 Conclusion

We presented a dynamic language model for streaming datasets that allows conditioning on observable real-world context variables, exemplified in our experiments by stock market data. We showed how to perform learning and inference in an online fashion for this model. Our experiments showed the predictive benefit of such conditioning and online learning by comparing to similar models that ignore temporal dimensions and observable variables that influence the text.

Acknowledgements

The authors thank several anonymous reviewers for helpful feedback on earlier drafts of this paper and Brendan O'Connor for help with collecting Twitter data. This research was supported in part by Google, by computing resources at the Pittsburgh Supercomputing Center, by National Science Foundation grant IIS-1111142, AFOSR grant FA95501010247, ONR grant N000140910758, and by the Intelligence Advanced Research Projects Activity via Department of Interior National Business Center contract number D12PC00347. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.
The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.

References

Galen Andrew and Jianfeng Gao. 2007. Scalable training of l1-regularized log-linear models. In Proc. of ICML.
David M. Blei and John D. Lafferty. 2006. Dynamic topic models. In Proc. of ICML.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.
Sébastien Bubeck. 2011. Introduction to online optimization. Technical report, Department of Operations Research and Financial Engineering, Princeton University.
Nicolò Cesa-Bianchi and Gábor Lugosi. 2006. Prediction, Learning, and Games. Cambridge University Press.
William S. Cleveland. 1979. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74(368):829–836.
Thomas M. Cover and Joy A. Thomas. 1991. Elements of Information Theory. John Wiley & Sons.
John Duchi and Yoram Singer. 2009. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10(7):2899–2934.
Jacob Eisenstein, Brendan O'Connor, Noah A. Smith, and Eric P. Xing. 2010. A latent variable model for geographic lexical variation. In Proc. of EMNLP.
Jacob Eisenstein, Amr Ahmed, and Eric P. Xing. 2011. Sparse additive generative models of text. In Proc. of ICML.
Amit Goyal, Hal Daume III, and Suresh Venkatasubramanian. 2009. Streaming for large scale NLP: Language modeling. In Proc. of HLT-NAACL.
Matt Hoffman, David M. Blei, Chong Wang, and John Paisley. 2013. Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347.
Antti Honkela and Harri Valpola. 2003. On-line variational Bayesian learning. In Proc. of ICA.
Tomoharu Iwata, Takeshi Yamada, Yasushi Sakurai, and Naonori Ueda. 2010. Online multiscale dynamic topic models. In Proc. of KDD.
Frederick Jelinek. 1997. Statistical Methods for Speech Recognition. MIT Press.
Jyrki Kivinen and Manfred K. Warmuth. 1997. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132:1–63.
Victor Lavrenko, Matt Schmill, Dawn Lawrie, Paul Ogilvie, David Jensen, and James Allan. 2000. Mining of concurrent text and time series. In Proc. of KDD Workshop on Text Mining.
Abby Levenberg and Miles Osborne. 2009. Stream-based randomised language models for SMT. In Proc. of EMNLP.
Abby Levenberg, Chris Callison-Burch, and Miles Osborne. 2010. Stream-based translation models for statistical machine translation. In Proc. of HLT-NAACL.
Dong C. Liu and Jorge Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming B, 45(3):503–528.
David Mimno and Andrew McCallum. 2008. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In Proc. of UAI.
Alexander Rakhlin. 2009. Lecture notes on online learning. Technical report, Department of Statistics, The Wharton School, University of Pennsylvania.
Masaaki Sato. 2001. Online model selection based on the variational Bayes. Neural Computation, 13(7):1649–1681.
Shai Shalev-Shwartz. 2012. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194.
Martin J. Wainwright and Michael I. Jordan. 2008. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305.
Chong Wang, David M. Blei, and David Heckerman. 2008. Continuous time dynamic topic models. In Proc. of UAI.
Chong Wang, John Paisley, and David M. Blei. 2011. Online variational inference for the hierarchical Dirichlet process. In Proc. of AISTATS.
Martin Zinkevich. 2003. Online convex programming and generalized infinitesimal gradient ascent. In Proc. of ICML.