A Bayesian Model of Diachronic Meaning Change

Lea Frermann and Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB
l.frermann@ed.ac.uk, mlap@inf.ed.ac.uk

Abstract

Word meanings change over time and an automated procedure for extracting this information from text would be useful for historical exploratory studies, information retrieval or question answering. We present a dynamic Bayesian model of diachronic meaning change, which infers temporal word representations as a set of senses and their prevalence. Unlike previous work, we explicitly model language change as a smooth, gradual process. We experimentally show that this modeling decision is beneficial: our model performs competitively on meaning change detection tasks whilst inducing discernible word senses and their development over time. Application of our model to the SemEval-2015 temporal classification benchmark datasets further reveals that it performs on par with highly optimized task-specific systems.

1 Introduction

Language is a dynamic system, constantly evolving and adapting to the needs of its users and their environment (Aitchison, 2001). Words in all languages naturally exhibit a range of senses whose distribution or prevalence varies according to the genre and register of the discourse as well as its historical context. As an example, consider the word cute which according to the Oxford English Dictionary (OED, Stevenson 2010) first appeared in the early 18th century and originally meant clever or keen-witted.[1] By the late 19th century cute was used in the same sense as cunning. Today it mostly refers to objects or people perceived as attractive, pretty or sweet. Another example is the word mouse which initially was only used in the rodent sense. The OED dates the computer pointing device sense of mouse to 1965. The latter sense has become particularly dominant in recent decades due to the ever-increasing use of computer technology.

[1] Throughout this paper we denote words in true type, their senses in italics, and sense-specific context words as {lists}.
The arrival of large-scale collections of historic texts (Davies, 2010) and of online libraries such as the Internet Archive and Google Books has greatly facilitated computational investigations of language change. The ability to automatically detect how the meaning of words evolves over time is potentially of significant value to lexicographic and linguistic research but also to real-world applications. Time-specific knowledge would presumably render word meaning representations more accurate, and benefit several downstream tasks where semantic information is crucial. Examples include information retrieval and question answering, where time-related information could increase the precision of query disambiguation and document retrieval (e.g., by returning documents with newly created senses or filtering out documents with obsolete senses).

In this paper we present a dynamic Bayesian model of diachronic meaning change. Word meaning is modeled as a set of senses, which are tracked over a sequence of contiguous time intervals. We infer temporal meaning representations, consisting of a word's senses (as a probability distribution over words) and their relative prevalence. Our model is thus able to detect that mouse had one sense until the mid-20th century (characterized by words such as {cheese, tail, rat}) and subsequently acquired a second sense relating to computer device. Moreover, it infers subtle changes within a single sense. For instance, in the 1970s the words {cable, ball, mousepad} were typical for the computer device sense, whereas nowadays the terms {optical, laser, usb} are more typical. Contrary to previous work (Mitra et al., 2014; Mihalcea and Nastase, 2012; Gulordava and Baroni, 2011) where temporal representations are learnt in isolation, our model assumes that adjacent representations are co-dependent, thus capturing the fundamentally smooth and gradual nature of meaning change (McMahon, 1994). This also serves as a form of smoothing: temporally neighboring representations influence each other if the available data is sparse.

Experimental evaluation shows that our model (a) induces temporal representations which reflect word senses and their development over time, (b) is able to detect meaning change between two time periods, and (c) is expressive enough to obtain useful features for identifying the time interval in which a piece of text was written. Overall, our results indicate that an explicit model of temporal dynamics is advantageous for tracking meaning change. Comparisons across evaluations and against a variety of related systems show that despite not being designed with any particular task in mind, our model performs competitively across the board.

2 Related Work

Most work on diachronic language change has focused on detecting whether and to what extent a word's meaning changed (e.g., between two epochs) without identifying word senses and how these vary over time.
A variety of methods have been applied to the task, ranging from statistical tests for detecting significant changes in the distribution of terms between two time periods (Popescu and Strapparava, 2013; Cook and Stevenson, 2010), to distributional similarity models trained on time slices (Gulordava and Baroni, 2011; Sagi et al., 2009), and neural language models (Kim et al., 2014; Kulkarni et al., 2015). Other work (Mihalcea and Nastase, 2012) takes a supervised learning approach and predicts the time period to which a word belongs given its surrounding context.

Bayesian models have been previously developed for various tasks in lexical semantics (Brody and Lapata, 2009; Ó Séaghdha, 2010; Ritter et al., 2010) and word meaning change detection is no exception. Using techniques from non-parametric topic modeling, Lau et al. (2012) induce word senses (aka topics) for a given target word over two time periods. Novel senses are then detected based on the discrepancy between sense distributions in the two periods. Follow-up work (Cook et al., 2014; Lau et al., 2014) further explores methods for how best to measure this sense discrepancy. Rather than inferring word senses, Wijaya and Yeniterzi (2011) use a Topics-over-Time model and k-means clustering to identify the periods during which selected words move from one topic to another.

A non-Bayesian approach is put forward in Mitra et al. (2014, 2015) who adopt a graph-based framework for representing word meaning (see Tahmasebi et al. (2011) for a similar earlier proposal). In this model words correspond to nodes in a semantic network and edges are drawn between words sharing contextual features (extracted from a dependency parser). A graph is constructed for each time interval, and nodes are clustered into senses with Chinese Whispers (Biemann, 2006), a randomized graph clustering algorithm. By comparing the induced senses for each time slice and observing inter-cluster differences, their method can detect whether senses emerge or disappear.

Our work draws ideas from dynamic topic modeling (Blei and Lafferty, 2006b) where the evolution of topics is modeled via (smooth) changes in their associated distributions over the vocabulary. Although the dynamic component of our model is closely related to previous work in this area (Mimno et al., 2008), our model is specifically constructed for capturing sense rather than topic change. Our approach is conceptually similar to Lau et al. (2012). We also learn a joint sense representation for multiple time slices. However, in our case the number of time slices is not restricted to two and we explicitly model temporal dynamics. Like Mitra et al. (2014, 2015), we model how senses change over time. In our model, temporal representations are not independent, but influenced by their temporal neighbors, encouraging smooth change over time. We therefore induce a global and consistent set of temporal representations for each word. Our model is knowledge-lean (it does not make use of a parser) and language independent (all that is needed is a time-stamped corpus and tools for basic pre-processing). Contrary to Mitra et al. (2014, 2015), we do not treat the inference of semantic representations for words and for their senses as two separate processes.

Evaluation of models which detect meaning change is fraught with difficulties.
There is no standard set of words which have undergone meaning change, nor a benchmark corpus which represents a variety of time intervals and genres while being thematically consistent. Previous work has generally focused on a few hand-selected words, and models were evaluated qualitatively by inspecting their output, or by the extent to which they can detect meaning changes between two time periods. For example, Cook et al. (2014) manually identify 13 target words which undergo meaning change in a focus corpus with respect to a reference corpus (both news text). They then assess how their models fare at learning sense differences for these targets compared to distractors which did not undergo meaning change. They also underline the importance of using thematically comparable reference and focus corpora to avoid spurious differences in word representations.

In this work we evaluate our model's ability to detect and quantify meaning change across several time intervals (not just two). Instead of relying on a few hand-selected target words, we use larger sets sampled from our learning corpus or found to undergo meaning change in a judgment elicitation study (Gulordava and Baroni, 2011). In addition, we adopt the evaluation paradigm of Mitra et al. (2014) and validate our findings against WordNet. Finally, we apply our model to the recently established SemEval-2015 diachronic text evaluation subtasks (Popescu and Strapparava, 2015). In order to present a consistent set of experiments, we use our own corpus throughout, which covers a wider range of time intervals, is compiled from a variety of genres and sources, and is thus thematically coherent (see Section 4 for details). Wherever possible, we compare against prior art, with the caveat that the use of a different underlying corpus unavoidably influences the obtained semantic representations.

3 A Bayesian Model of Sense Change

In this section we introduce SCAN, our dynamic Bayesian model of Sense ChANge. SCAN captures how a word's senses evolve over time (e.g., whether new senses emerge), whether some senses become more or less prevalent, as well as phenomena pertaining to individual senses such as meaning extension, shift, or modification. We assume that time is discrete, divided into contiguous intervals. Given a word, our model infers its senses for each time interval and their probability. It captures the gradual nature of meaning change explicitly, through dependencies between temporally adjacent meaning representations. Senses themselves are expressed as a probability distribution over words, which can also change over time.

3.1 Model Description

We create a SCAN model for each target word c. The input to the model is a corpus of short text snippets, each consisting of a mention of the target word c and its local context w (in our experiments this is a symmetric context window of ±5 words). Each snippet is annotated with its year of origin. The model is parametrized with regard to the number of senses k ∈ [1...K] of the target word c, and the length of time intervals ∆T, which may be finely or coarsely defined (e.g., spanning a year or a decade). We conflate all documents originating from the same time interval t ∈ [1...T] and infer a temporal representation of the target word per interval. A temporal meaning representation for time t is (a) a K-dimensional multinomial distribution over word senses φt and (b) a V-dimensional distribution over the vocabulary ψt,k for each word sense k.
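For concreteness, the following sketch (ours, not part of the paper's implementation; all function and variable names are hypothetical) shows how such a snippet corpus might be assembled from a collection of (year, token-list) documents:

from collections import defaultdict

def build_snippets(documents, target, window=5, t0=1700, delta_t=20):
    """Collect context snippets for one target word, binned into
    contiguous time intervals of delta_t years (a sketch; `documents`
    is assumed to be an iterable of (year, tokens) pairs)."""
    snippets = defaultdict(list)  # interval index t -> list of context word lists
    for year, tokens in documents:
        t = (year - t0) // delta_t  # conflate documents by time interval
        for i, tok in enumerate(tokens):
            if tok == target:
                # symmetric context window of +/- `window` words
                context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                snippets[t].append(context)
    return snippets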
In addition, our model infers a precision parameter κφ, which controls the extent to which sense prevalence changes for word c over time (see Section 3.2 for details on how we model temporal dynamics).

We place individual logistic normal priors (Blei and Lafferty, 2006a) on our multinomial sense distributions φ and sense-word distributions ψk. A draw from the logistic normal distribution consists of (a) a draw of an n-dimensional random vector x from the multivariate normal distribution parametrized by an n-dimensional mean vector µ and an n × n variance-covariance matrix Σ, x ∼ N(x|µ, Σ); and (b) a mapping of the drawn parameters to the simplex through the logistic transformation φ_n = exp(x_n) / ∑_{n′} exp(x_{n′}), which ensures a draw of valid multinomial parameters. The normal distributions are parametrized to encourage smooth change in multinomial parameters over time (see Section 3.2 for details), and the extent of change is controlled through a precision parameter κ. We learn the value of κφ during inference, which allows us to model the extent of temporal change in sense prevalence individually for each target word. We draw κφ from a conjugate Gamma prior. We do not infer the sense-word precision parameter κψ on all ψk. Instead, we fix it at a high value, triggering little variation of word distributions within senses. This leads to senses being thematically coherent over time.

We now describe the generative story of our model, which is depicted in Figure 1 (right), alongside its plate diagram representation (left):

Draw κφ ∼ Gamma(a, b)
for time interval t = 1..T do
    Draw sense distribution φt | φ−t, κφ ∼ N(½(φt−1 + φt+1), κφ)
    for sense k = 1..K do
        Draw word distribution ψt,k | ψ−t, κψ ∼ N(½(ψt−1,k + ψt+1,k), κψ)
    for document d = 1..D do
        Draw sense zd ∼ Mult(φt)
        for context position i = 1..I do
            Draw word wd,i ∼ Mult(ψt,zd)

[Figure 1: Left: plate diagram for the dynamic sense model for three time steps {t−1, t, t+1}. Constant parameters are shown as dashed nodes, latent variables as clear nodes, and observed variables as gray nodes. Right: the corresponding generative story.]

First, we draw the sense precision parameter κφ from a Gamma prior. For each time interval t we draw (a) a multinomial distribution over senses φt from a logistic normal prior; and (b) a multinomial distribution over the vocabulary ψt,k for each sense k, from another logistic normal prior. Next, we generate time-specific text snippets. For each snippet d, we first observe the time interval t, and draw a sense zd from Mult(φt). Finally, we generate I context words wd,i independently from Mult(ψt,zd).

3.2 Background on iGMRFs

Let φ = {φ1 ... φT} denote a T-dimensional random vector, where each φt might for example correspond to a sense probability at time t. We define a prior which encourages smooth change of parameters at neighboring times, in terms of a first-order random walk on the line (graphically shown in Figure 2, and in the chains of φ and ψ in Figure 1, left). Specifically, we define this prior as an intrinsic Gaussian Markov Random Field (iGMRF; Rue and Held 2005), which allows us to model the change of adjacent parameters as drawn from a normal distribution, e.g.:

∆φt ∼ N(0, κ^{−1}).  (1)

The iGMRF is defined with respect to the graph in Figure 2; it is sparsely connected, with only first-order dependencies, which allows for efficient inference.
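To make the smoothness assumption concrete, here is a small simulation sketch (ours, purely illustrative: inference in Section 3.3 conditions on data rather than sampling forward like this). It draws a first-order random walk as in Equation (1) and maps each step to the simplex through the logistic transformation:

import numpy as np

def simulate_smooth_senses(T=16, K=8, kappa=4.0, seed=0):
    """Simulate K sense weights over T time steps under a first-order
    random walk and map each step to valid multinomial parameters."""
    rng = np.random.default_rng(seed)
    x = np.zeros((T, K))
    for t in range(1, T):
        # adjacent parameters differ by Gaussian noise with precision kappa
        x[t] = x[t - 1] + rng.normal(0.0, kappa ** -0.5, size=K)
    # phi_n = exp(x_n) / sum_n' exp(x_n'): the logistic transformation
    phi = np.exp(x) / np.exp(x).sum(axis=1, keepdims=True)
    return phi  # each row phi[t] stays close to its neighbor phi[t-1]

A larger kappa yields tightly coupled, slowly drifting distributions; a small kappa permits abrupt change between adjacent time steps.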
A second feature, which makes iGMRFs popular as priors in Bayesian modeling, is the fact that they can be defined purely in terms of the local changes between dependent (i.e., adjacent) variables, without the need to specify an overall mean of the model. The full conditionals explicitly capture these intuitions:

φt | φ−t, κ ∼ N(½(φt−1 + φt+1), 1/(2κ)),  (2)

for 1 < t < T, where φ−t is the vector φ except element φt, and κ is a precision parameter. The value of parameter φt is distributed normally, centered around the mean of the values of its neighbors, without reference to a global mean. The precision parameter κ controls the extent of variation: how tightly coupled are the neighboring parameters? Or, in our case: how tightly coupled are temporally adjacent meaning representations of a word c? We estimate the precision parameter κφ during inference. This allows us to flexibly capture sense variation over time individually for each target word.

[Figure 2: A linear chain iGMRF over φ1, ..., φT.]

For a detailed introduction to (i)GMRFs we refer the interested reader to Rue and Held (2005). For an application of iGMRFs to topic models see Mimno et al. (2008).

3.3 Inference

We use a blocked Gibbs sampler for approximate inference. The logistic normal prior is not conjugate to the multinomial distribution. This means that the straightforward parameter updates known for sampling standard Dirichlet-multinomial topic models do not apply. However, sampling-based methods for logistic normal topic models have been proposed in the literature (Mimno et al., 2008; Chen et al., 2013). At each iteration, we sample: (a) document-sense assignments, (b) multinomial parameters from the logistic normal prior, and (c) the sense precision parameter from a Gamma prior. Our blocked sampler first iterates over all input text snippets d with context w, and re-samples their sense assignments under the current model parameters {φ}^T and {ψ}^{K×T}:

p(zd | w, t, φ, ψ) ∝ p(zd | t) p(w | t, zd) = φ^t_{zd} ∏_{w∈w} ψ^{t,zd}_w.  (3)

Next, we re-sample parameters {φ}^T and {ψ}^{K×T} from the logistic normal prior, given the current sense assignments. We use the auxiliary variable method proposed in Mimno et al. (2008) (see also Groenewald and Mokgatlhe (2005)). Intuitively, each individual parameter (e.g., sense k's prevalence at time t, φ^t_k) is 'shifted' within a weighted region which is bounded by the number of times sense k was observed at time t. The weights of the region are determined by the prior, in our case the normal distributions defined by the iGMRF, which ensure an influence of the temporal neighbors φ^{t−1}_k and φ^{t+1}_k on the new parameter value φ^t_k, and smooth temporal variation as desired. The same procedure applies to each word parameter ψ^{t,k}_w under each {time, sense} pair (see Mimno et al. 2008 for a more detailed description of the sampler). Finally, we periodically re-sample the sense precision parameter κφ from its conjugate Gamma prior.
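The snippet-level step of the blocked sampler is straightforward to implement; the sketch below (ours; the logistic normal parameter updates via auxiliary variables are omitted) re-samples a single sense assignment according to Equation (3):

import numpy as np

def resample_sense(context_ids, t, log_phi, log_psi, rng):
    """One Gibbs step for a snippet from interval t, following eq. (3):
    p(z_d | w, t) is proportional to phi^t_z * prod_w psi^{t,z}_w.
    log_phi: (T, K) log sense distributions; log_psi: (T, K, V) log
    sense-word distributions; context_ids: vocabulary indices of the
    snippet's context words."""
    log_p = log_phi[t] + log_psi[t][:, context_ids].sum(axis=1)  # (K,)
    p = np.exp(log_p - log_p.max())  # normalize stably in log space
    return rng.choice(len(p), p=p / p.sum())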
4 The DATE Corpus

Before presenting our evaluation we describe the corpus used as a basis for the experiments performed in this work. We applied our model to a DiAchronic TExt corpus (DATE) which collates documents spanning the years 1700–2010 from three sources: (a) the COHA corpus[2] (Davies, 2010), a large collection of texts from various genres covering the years 1810–2010; (b) the training data provided by the DTE task[3] organizers (see Section 8); and (c) the portion of the CLMET3.0[4] corpus (Diller et al., 2011) corresponding to the period 1710–1810 (which is not covered by the COHA corpus and thus underrepresented in our training data). CLMET3.0 contains texts representative of a range of genres including narrative fiction, drama, and letters, and was collected from various online archives. Table 1 provides details on the size of our corpus.

Corpus     years covered   #words
COHA       1810–2009       142,587,656
DTE        1700–2010       124,771
CLMET3.0   1710–1810       4,531,505

Table 1: Size and coverage of our three training corpora (after pre-processing).

Documents were clustered by their year of publication as indicated in the original corpora. In the CLMET3.0 corpus, occasionally a range of years would be provided. In this case we used the final year of the range. We tokenized, lemmatized, and part-of-speech tagged DATE using the NLTK (Bird et al., 2009). We removed stopwords and function words. After preprocessing, we extracted target-word-specific input corpora for our models. These consisted of mentions of a target c and its surrounding context, a symmetric window of ±5 words.

[2] http://corpus.byu.edu/coha/
[3] http://alt.qcri.org/semeval2015/task7/index.php?id=data-and-tools
[4] http://www.kuleuven.be/~u0044428/clmet3_0.htm

5 Experiment 1: Temporal Dynamics

As discussed earlier, our model departs from previous approaches (e.g., Mitra et al. 2014) in that it learns globally consistent temporal representations for each word. In order to assess whether temporal dependencies are indeed beneficial, we implemented a stripped-down version of our model (SCAN-NOT) which does not have any temporal dependencies between individual time steps (i.e., without the chain iGMRF priors). Word meaning is still represented as senses and sense prevalence is modeled as a distribution over senses for each time interval. However, time intervals are now independent. Inference works as described in Section 3.3, without having to learn the κ precision parameters.

Models and Parameters We compared the two models in terms of their predictive power. We split the DATE corpus into a training period {d1 ... dt} of time slices 1 through t and computed the likelihood p(dt+1 | φt, ψt) of the data at test time slice t+1 under the parameters inferred for the previous time slice (a sketch of this computation is given below). The time slice size was set to ∆T = 20 years. We set the number of senses to K = 8 and the word precision parameter to κψ = 10, a high value which enforces individual senses to remain thematically consistent across time. We set the initial sense precision parameter κφ = 4, and the Gamma parameters a = 7 and b = 3. These parameters were optimized once on the development data used for the task-based evaluation discussed in Section 8. Unless otherwise specified all experiments use these values. No parameters were tuned on the test set for any task. In all experiments we ran the Gibbs sampler for 1,000 iterations, and resampled κφ every 50 iterations, starting from iteration 150. We used the final state of the sampler throughout.
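The predictive likelihood just described marginalizes over senses; a minimal sketch of its computation (ours; phi_t and psi_t are the (K,) and (K, V) parameter arrays inferred for the last training slice) is:

import numpy as np

def predictive_loglik(test_snippets, phi_t, psi_t):
    """Log likelihood of held-out snippets from slice t+1 under the
    parameters of slice t: p(d) = sum_k phi_k * prod_w psi_{k,w}."""
    total = 0.0
    for context_ids in test_snippets:
        # (K,): log phi_k + sum over context words of log psi_{k,w}
        log_joint = np.log(phi_t) + np.log(psi_t[:, context_ids]).sum(axis=1)
        total += np.logaddexp.reduce(log_joint)  # log-sum-exp over senses
    return total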
We randomly selected 50 mid-frequency target concepts from a larger set of target concepts described in Section 8. Predictive log-likelihood scores were averaged across concepts, and were calculated as the average under 10 parameter samples {φt, ψt} from the trained models.

[Figure 3: Predictive log likelihood of SCAN and a version without temporal dependencies (SCAN-NOT) across various test time periods (1920–39, 1940–59, 1960–79, 1980–99).]

Results Figure 3 displays predictive log-likelihood scores for four test time intervals. SCAN outperforms its stripped-down version throughout (higher is better). Since the representations learnt by SCAN are influenced (or smoothed) by neighboring representations, they overfit specific time intervals less, which leads to better predictive performance. Figure 4 further shows how SCAN models meaning change for the words band, power, transport and bank. The sense distributions over time are shown as a sequence of stacked histograms; senses themselves are color-coded (and enumerated) below, in the same order as in the histograms. Each sense k is illustrated as the 10 words w assigned the highest posterior probability, marginalizing over the time-specific representations, p(w|k) = ∑_t ψ^{t,k}_w. Words representative of prevalent senses are highlighted in bold face.
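The word lists shown in Figures 4 and 5 can be read off the inferred parameters directly. A minimal sketch of this aggregation (ours, assuming ψ is stored as a (T, K, V) array and vocab maps indices to word strings):

import numpy as np

def top_words_per_sense(psi, vocab, n=10):
    """Gloss each sense by its n most probable words, marginalizing
    over the time-specific representations: p(w|k) = sum_t psi^{t,k}_w."""
    scores = psi.sum(axis=0)  # (K, V): aggregate each sense over time
    return [[vocab[i] for i in np.argsort(-scores[k])[:n]]
            for k in range(scores.shape[0])]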
[Figure 4: Tracking meaning change for the words band, power, transport and bank over 20-year time intervals between 1700 and 2010. Each bar shows the proportion of each sense (color-coded) and is labeled with the start year of the respective time interval. Senses are shown as the 10 most probable words, and particularly representative words are highlighted for illustration.]

Figure 4 (top left) demonstrates that the model is able to capture various senses of the word band, such as strip used for binding (yellow bars/number 3 in the figure) or musical band (grey/1, orange/7). Our model predicts an increase in prevalence over the modeled time period for both senses. This is corroborated by the OED which provides the majority of references for the binding strip sense for the 20th century and dates the musical band sense to 1812. In addition a social band sense (violet/6, darkgreen/8; in the sense of bonding) emerges, which is present across time slices. The sense colored brown/2 refers to the British Band, a group of native Americans involved in the Black Hawk War in 1832, and the model indeed indicates a prevalence of this sense around this time (see bars 1800–1840 in the figure).

For the word power (Figure 4, top right),
three senses emerge: the institutional power sense (colors gray/1, brown/2, pink/5, orange/7 in the figure), mental power (yellow/3, lightgreen/4, darkgreen/8), and power as supply of energy (violet/6). The latter is an example of a "sense birth" (Mitra et al., 2014): the sense was hardly present before the mid-19th century. This is corroborated by the OED which dates the sense to 1889, whereas the OED contains references to the remaining senses for the whole modeled time period, as predicted by our model.

Similar trends of meaning change emerge for transport (Figure 4, bottom left). The bottom right plot shows the sense development for the word bank. Although the well-known senses river bank (brown/2, lightgreen/4) and monetary institution (rest) emerge clearly, the overall sense pattern appears comparatively stable across intervals, indicating that the meaning of the word has not changed much over time.

Besides tracking sense prevalence over time, our model can also detect changes within individual senses. Because we are interested in tracking semantically stable senses, we fixed the precision parameter κψ to a high value, to discourage too much variance within each sense. Figure 5 illustrates how the energy sense of the word power (violet/6 in Figure 4) has changed over time. Characteristic terms for a given sense are highlighted in bold face. For example, the term "water" is initially prevalent, while the term "steam" rises in prevalence towards the middle of the modeled period, and is superseded by the terms "plant" and "nuclear" towards the end.

[Figure 5: Sense-internal temporal dynamics for the energy sense of the word power (violet/6 in Figure 4). Columns show the ten most highly associated words for each time interval for the period between 1700 and 2010 (ordered by decreasing probability). We highlight how four terms characteristic of the sense develop over time (see {water, steam, plant, nuclear} in the figure).]

6 Experiment 2: Novel Sense Detection

In this section and the next we explicitly evaluate the temporal representations (i.e., probability distributions) induced by our model, and discuss its performance in the context of previous work. Large-scale evaluation of meaning change is notoriously difficult, and many evaluations are based on limited hand-annotated gold standard data sets.
Mitra et al. (2015), however, bypass this issue by evaluating the output of their system against WordNet (Fellbaum, 1998). Here, we consider their automatic evaluation of sense births, i.e., the emergence of novel senses. We assume that novel senses are detected at a focus time t2 whilst being compared to a reference time t1. WordNet is used to confirm that the proposed novel sense is indeed distinct from all other induced senses for a given word.

Method Mitra et al.'s (2015) evaluation method presupposes a system which is able to detect senses for a set of target words and identify which ones are novel. Our model does not automatically yield novelty scores for the induced senses. However, Cook et al. (2014) propose several ways to perform this task post hoc. We use their relevance score, which is based on the intuition that keywords (or collocations) which characterize the difference of a focus corpus from a reference corpus are indicative of word sense novelty.

We identify keywords for a focus corpus with respect to a reference corpus using Kilgarriff's (2009) method, which is based on smoothed relative frequencies.[5] The novelty of an induced sense s can then be defined in terms of the aggregate keyword probabilities given that sense (and the focus time of interest):

rel(s) = ∑_{w∈W} p(w|s, t2),  (4)

where W is a keyword list and t2 the focus time. Cook et al. (2014) suggest a straightforward extrapolation from sense novelty to word novelty:

rel(c) = max_s rel(s),  (5)

where rel(c) is the highest novelty score assigned to any of the target word's senses. A high rel(c) score suggests that a word has undergone meaning change.

[5] We set the smoothing parameter to n = 10, and like Cook et al. (2014) retrieve the top 1000 keywords.
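In code, the two scores reduce to a few lines. The sketch below is ours and assumes the keyword extraction step has already produced a keyword list W:

def relevance(sense_word_probs, keywords):
    """Sense novelty, Equation (4): total focus-time probability mass
    a sense assigns to the corpus-derived keywords. `sense_word_probs`
    maps each sense s to a dict holding p(w | s, t2)."""
    return {s: sum(p.get(w, 0.0) for w in keywords)
            for s, p in sense_word_probs.items()}

def word_novelty(sense_word_probs, keywords):
    """Word novelty, Equation (5): the highest relevance of any sense."""
    return max(relevance(sense_word_probs, keywords).values())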
We obtained candidate terms and their associated novel senses from the DATE corpus, using the relevance metric described above. The novel senses from the focus period and all senses induced for the reference period, except for the one corresponding to the novel sense, were passed on to Mitra et al.'s (2015) WordNet-based evaluator, which proceeds as follows. Firstly, each induced sense s is mapped to the WordNet synset u with the maximum overlap:

synset(s) = argmax_u overlap(s, u).  (6)

Next, a predicted novel sense n is deemed truly novel if its mapped synset is distinct from any synset mapped to a different induced sense:

∀s′: synset(s′) ≠ synset(n).  (7)

Finally, overall precision is calculated as the fraction of sense births confirmed by WordNet over all birth candidates proposed by the model. Like Mitra et al. (2015) we only report results on target words for which all induced senses could be successfully mapped to a synset.

Models and Parameters We obtained the broad set of target words used for the task-based evaluation (in Section 8) and trained models on the DATE corpus. We set the number of senses to K = 4, following Mitra et al. (2015) who note that the WordNet mapper works best for words with a small number of senses, and the time intervals to ∆T = 20 as in Experiment 1. We identified the 200 words[6] with the highest novelty score (Equation (5)) as sense birth candidates. We compared the performance of the full SCAN model against SCAN-NOT, which learns senses independently for time intervals. We trained both models on the same data with identical parameters. For SCAN-NOT, we must post hoc identify corresponding senses across time intervals. We used the Jensen-Shannon divergence between the reference- and focus-time specific word distributions, JS(p(w|s, t1) || p(w|s, t2)), and assigned each focus-time sense to the reference-time sense with the smallest divergence.

[6] This threshold was tuned on one reference-focus time pair.
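The sense alignment for SCAN-NOT can be sketched as follows (our code; psi_t1 and psi_t2 are assumed to be (K, V) arrays of sense-word probabilities at reference and focus time):

import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two word distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0  # 0 * log 0 is taken to be 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def align_senses(psi_t1, psi_t2):
    """Match each focus-time sense to the reference-time sense with
    the smallest JS divergence."""
    return [int(np.argmin([js_divergence(q, p) for p in psi_t1]))
            for q in psi_t2]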
Results Figure 6 shows the performance of our models on the task of sense birth detection. SCAN performs better than SCAN-NOT, underscoring the importance of jointly modeling senses across time slices and of incorporating temporal dynamics. Our accuracy scores are in the same ballpark as Mitra et al. (2014, 2015). Note, however, that the scores are not directly comparable due to differences in training corpora, focus and reference times, and candidate words. Mitra et al. (2015) use the larger Google syntactic n-gram corpus, as well as richer linguistic information in terms of syntactic dependencies. We show that our model, which does not rely on syntactic annotations, performs competitively even when trained on smaller data. Table 2 (top) displays examples of words assigned the highest novelty scores for the reference period 1900–1919 and focus period 1980–1999.

[Figure 6: Precision results for the SCAN and SCAN-NOT models on the WordNet-based novel sense detection (Experiment 2). Results are shown for a selection of reference times (t1) and focus times (t2).]

t1=1900–1919, t2=1980–1999
union          soviet united american union european war civil military people liberty
dos            system window disk pc operate program run computer de dos
entertainment  television industry program time business people world president entertainment company
station        radio station television local program network space tv broadcast air

t1=1960–1969, t2=1990–1999
environmental  supra note law protection id agency impact policy factor federal
users          computer window information software system wireless drive web building available
virtual        reality virtual computer center experience week community separation increase
disk           hard disk drive program computer file store ram business embolden

Table 2: Example target terms (left) with novel senses (right) as identified by SCAN in focus corpus t2 (when compared against reference corpus t1). Top: terms used in the novel sense detection study (Experiment 2). Bottom: terms from the Gulordava and Baroni (2011) gold standard (Experiment 3).

7 Experiment 3: Word Meaning Change

In this experiment we evaluate whether model-induced temporal word representations capture perceived word novelty. Specifically, we adopt the evaluation framework (and dataset) introduced in Gulordava and Baroni (2011)[7] and discussed below.

Method Gulordava and Baroni (2011) do not model word senses directly; instead they obtain distributional representations of words from the Google Books (bigram) data for two time slices, namely the 1960s (reference corpus) and the 1990s (focus corpus). To detect change in meaning, they measure the cosine similarity between the vector representations of a target word in the reference and focus corpus. It is assumed that low similarity indicates significant meaning change. To evaluate the output of their system, they created a test set of 100 target words (nouns, verbs, and adjectives), and asked five annotators to rate each word with respect to its degree of meaning change between the 1960s and the 1990s. The annotators used a 4-point ordinal scale (0: no change, 1: almost no change, 2: somewhat changed, 3: changed significantly). Words were subsequently ranked according to the mean rating given by the annotators. Inter-annotator agreement on the novel sense detection task was 0.51 (pairwise Pearson correlation) and can be regarded as an upper bound on model performance.

[7] We thank Kristina Gulordava for sharing their evaluation data set of target words and human judgments.

Models and Parameters We trained models for all words in Gulordava and Baroni's (2011) gold standard. We used the DATE subcorpus covering the years 1960 through 1999, partitioned by decade (∆T = 10). The first and last time intervals were defined as reference and focus time, respectively (t1=1960–1969, t2=1990–1999). As in Experiment 2, a novelty score was assigned to each target word (using Equation (5)). We computed Spearman's ρ rank correlations between the gold standard and model rankings (Gulordava and Baroni, 2011). We trained SCAN models setting the number of senses to K = 8. We also trained SCAN-NOT models with identical parameters. We report results averaged over five independent parameter estimates. Finally, as in Gulordava and Baroni (2011), we compare against a frequency baseline which ranks words by their log relative frequency in the reference and focus corpus.

system              corpus   Spearman's ρ
Gulordava (2011)    Google   0.386
SCAN                DATE     0.377
SCAN-NOT            DATE     0.255
frequency baseline  DATE     0.325

Table 3: Spearman's ρ rank correlations between system novelty rankings and the human-produced ratings. All correlations are statistically significant (p < 0.02). Results for SCAN and SCAN-NOT are averages over five trained models.

Results The results of this evaluation are shown in Table 3. As can be seen, SCAN outperforms SCAN-NOT and the frequency baseline. For reference, we also report the correlation coefficient obtained in Gulordava and Baroni (2011), but emphasize that the scores are not directly comparable due to differences in training data: Gulordava and Baroni (2011) use the Google bigrams corpus (which is much larger than DATE). Table 2 (bottom) displays examples of words which achieved the highest novelty scores in this evaluation, and their associated novel senses.

8 Experiment 4: Task-based Evaluation

In the previous sections we demonstrated how SCAN captures meaning change between two periods. In this section, we assess our model on an extrinsic task which relies on meaning representations spanning several time slices. We quantitatively evaluate our model on the SemEval-2015 benchmark datasets released as part of the Diachronic Text Evaluation exercise (Popescu and Strapparava 2015; DTE). In the following we first present the DTE subtasks, and then move on to describe our training data, parameter settings, and systems used for comparison with our model.

SemEval DTE Tasks Diachronic text evaluation is an umbrella term used by the SemEval-2015 organizers for three subtasks aiming to assess the performance of computational methods used to identify when a piece of text was written. A similar problem is tackled in Chambers (2012), who labels documents with time stamps whilst focusing on explicit time expressions and their discriminatory power.
Temporal intervals are consecutive and constructed such that the correct interval is centered around the actual year of origin. For both tasks tem- poral intervals are created at three levels of granular- ity (fine, medium, and coarse). Subtask 1 involves snippets which contain an ex- plicit cue for time of origin. The presence of a temporal cue was determined by the organizers by checking the entities’ informativeness in external re- sources. Consider the example below: (8) President de Gaulle favors an independent European nuclear striking force [...] The mentions of French president de Gaulle and nu- clear warfare suggest that the snippet was written after the mid-1950s and indeed it was published in 1962. A hypothetical system would then have to de- cide amongst the following classes: {1700–1702, 1703–1705, ..., 1961–1963, ..., 2012–2014} {1699–1706, 1707–1713, ..., 1959–1965, ..., 2008–2014} {1696–1708, 1709–1721, ..., 1956–1968, ..., 2008–2020} The first set of classes correspond to fine-grained in- tervals of 2-years, the second set to medium-grained intervals of 6-years and the third set to coarse- grained intervals of 12-years. For the snippet in example (8) classes 1961–1963, 1959–1965, and 1956–1968 are the correct ones. Subtask 2 involves temporal classification of snip- pets which lack explicit temporal cues, but contain implicit ones, e.g., as indicated by lexical choice or spelling. The snippet in example (9) was published in 1891 and the spelling of to-day, which was com- mon up to the early 20th century, is an implicit cue: (9) The local wheat market was not quite so strong to-day as yesterday. Analogously to subtask 1, systems must select the right temporal interval from a set of contiguous time intervals of differing granularity. For this task, which is admittedly harder, levels of temporal gran- ularity are coarser corresponding to 6-year, 12-year and 20-year intervals. Participating SemEval Systems We compared our model against three other systems which par- ticipated in the SemEval task.8 AMBRA (Zampieri et al., 2015) adopts a learning-to-rank modeling ap- proach and uses several stylistic, grammatical, and lexical features. IXA (Salaberri et al., 2015) uses a combination of approaches to determine the pe- riod of time in which a piece of news was writ- ten. This involves searching for specific mentions of time within the text, searching for named enti- ties present in the text and then establishing their reference time by linking these to Wikipedia, using Google n-grams, and linguistic features indicative of language change. Finally, UCD (Szymanski and Lynch, 2015) employs SVMs for classification us- ing a variety of informative features (e.g., POS-tag n-grams, syntactic phrases), which were optimized for the task through automatic feature selection. Models and Parameters We trained our model for individual words and obtained representations of their meaning for different points in time. Our set of target words consisted of all nouns which oc- curred in the development datasets for DTE sub- tasks 1 and 2 as well as all verbs which occurred at least twice in this dataset. After removing in- frequent words we were left with 883 words (out of 1,116) which we used in this evaluation. Target words were not optimized with respect to the test data in any way; it is thus reasonable to expect bet- ter performance with an adjusted set of words. We set the model time interval to ∆T = 5 years and the number of senses per word to K = 8. 
Supervised Classification We also apply our model in a supervised setting, i.e., by extracting features for classifier prediction. Specifically, we trained a multiclass SVM (Chang and Lin, 2011) on the training data provided by the SemEval organizers (for DTE tasks 1 and 2). For each observed target word within each snippet, we added as a feature its most likely sense k given t, the true time of origin:

argmax_k p^(c)(k|t).  (11)

We also trained a multiclass SVM which uses character n-gram (n ∈ {1, 2, 3}) features in addition to the model features. Szymanski and Lynch (2015) identified character n-grams as the most predictive feature for temporal text classification using SVMs. Their system (UCD) achieved the best published scores in DTE subtask 2. Following their approach, we included all n-grams observed more than 20 times in the DTE training data.

Results We employed two evaluation measures proposed by the DTE organizers: precision p, i.e., the percentage of times a system has predicted the correct time period, and accuracy acc, which is more lenient and penalizes system predictions proportionally to their distance from the true interval. We computed the p and acc scores for our models using the evaluation script provided by the SemEval organizers. Table 4 summarizes our results for DTE subtasks 1 and 2. We compare SCAN against a baseline which selects a time interval at random,[9] averaged over five runs.

                 Task 1                                Task 2
                 2 yr        6 yr        12 yr         6 yr        12 yr       20 yr
                 acc   p     acc   p     acc   p       acc   p     acc   p     acc   p
Baseline         .097  .010  .214  .017  .383  .046    .199  .025  .343  .047  .499  .057
SCAN-NOT         .265  .086  .435  .139  .609  .169    .259  .041  .403  .056  .567  .098
SCAN             .353  .049  .569  .112  .748  .206    .376  .053  .572  .091  .719  .135
IXA              .187  .020  .375  .041  .557  .090    .261  .037  .428  .067  .622  .098
AMBRA            .167  .037  .367  .071  .554  .074    .605  .143  .767  .143  .868  .292
UCD              –     –     –     –     –     –       .759  .463  .846  .472  .910  .542
SVM SCAN         .192  .034  .417  .097  .545  .127    .573  .331  .667  .368  .790  .428
SVM SCAN+ngram   .222  .030  .467  .079  .627  .142    .747  .481  .821  .500  .897  .569

Table 4: Results on Diachronic Text Evaluation Tasks 1 and 2 for a random baseline, our SCAN model, its stripped-down version without iGMRFs (SCAN-NOT), the SemEval submissions (IXA, AMBRA and UCD), SVMs trained with SCAN features (SVM SCAN), and SVMs with additional character n-gram features (SVM SCAN+ngram). Results are shown for three levels of granularity, a strict precision measure p, and a distance-discounting measure acc.

[9] We recomputed the baseline scores for subtasks 1 and 2 due to inconsistencies in the results provided by the DTE organizers.
We also show results for the stripped-down version of our model without the iGMRFs (SCAN-NOT) and for the systems which participated in SemEval.

For subtask 1, the two versions of SCAN outperform all SemEval systems across the board. SCAN-NOT occasionally outperforms SCAN in the strict precision metric; however, the full SCAN model consistently achieves better accuracy scores, which are more representative since they factor in the proximity of the prediction to the true value. In subtask 2, the UCD and SVM SCAN+ngram systems perform comparably. They both use SVMs for the classification task; however, our own model employs a less expressive feature set based on SCAN and character n-grams, and does not take advantage of feature selection, which would presumably enhance performance. With the exception of AMBRA, all other participating systems used external resources (such as Wikipedia and Google n-grams); it is thus fair to assume that they had access to at least as much training data as our SCAN model. Consequently, the gap in performance cannot solely be attributed to a difference in the size of the training data.

We also observe that IXA and SCAN, given identical granularity, perform better on subtask 1, while AMBRA and our own SVM-based systems exhibit the opposite trend. The IXA system uses a combination of knowledge sources in order to determine when a piece of news was written, including explicit mentions of temporal expressions within the text, named entities, and linked information to those named entities from Wikipedia. AMBRA on the other hand exploits more shallow stylistic, grammatical and lexical features within the learning-to-rank paradigm. An interesting direction for future work would be to investigate which features are most appropriate for different DTE tasks. Overall, it is encouraging to see that the generic temporal word representations inferred by SCAN lead to competitively performing models on both temporal classification tasks without any explicit tuning.

9 Conclusion

In this paper we introduced SCAN, a dynamic Bayesian model of diachronic meaning change. Our model learns a coherent set of co-dependent, time-specific senses for individual words and their prevalence. Evaluation of the model's output showed that the learnt representations reflect (a) different senses of ambiguous words, (b) different kinds of meaning change (such as new senses being established), and (c) connotational changes within senses. SCAN departs from previous work in that it models temporal dynamics explicitly. We demonstrated that this feature yields more general semantic representations, as indicated by predictive log-likelihood and a variety of extrinsic evaluations. We also experimentally evaluated SCAN on novel sense detection and the SemEval DTE task, where it performed on par with the best published results, without any extensive feature engineering or task-specific tuning.

We conclude by discussing limitations of our model and directions for future work. In our experiments we fix the number of senses K for all words across all time periods. Although this approach did not harm performance (even in the case of SemEval where we handled more than 800 target concepts), it is at odds with the fact that words vary in their degree of ambiguity, and that word senses continuously appear and disappear.
A non-parametric version of our model would infer an appropriate number of senses from the data, individually for each time period. Also note that in our experiments we used context as a bag of words. It would be interesting to explore more systematically how different kinds of contexts (e.g., named entities, multiword expressions, verbs vs. nouns) influence the representations the model learns. Furthermore, while SCAN captures the temporal dynamics of word senses, it cannot do so for words themselves. Put differently, the model cannot identify whether a new word is used which did not exist before, or that a word ceased to exist after a specific point in time. A model-internal way of detecting word (dis)appearance would be desirable, especially since new terms are continuously being introduced thanks to popular culture and various new media sources.

In the future, we would like to apply our model to different text genres and levels of temporal granularity. For example, we could work with Twitter data, an increasingly popular source for opinion tracking, and use our model to identify short-term changes in word meanings or connotations.

Acknowledgments

We are grateful to the anonymous reviewers whose feedback helped to substantially improve the present paper. We thank Charles Sutton and Iain Murray for helpful discussions, and acknowledge the support of EPSRC through project grant EP/I037415/1.

References

Aitchison, Jean. 2001. Language Change: Progress or Decay? Cambridge Approaches to Linguistics. Cambridge University Press.

Biemann, Chris. 2006. Chinese Whispers - an Efficient Graph Clustering Algorithm and its Application to Natural Language Processing Problems. In Proceedings of TextGraphs: the 1st Workshop on Graph Based Methods for Natural Language Processing. New York City, NY, USA, pages 73–80.

Bird, Steven, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media, Inc., 1st edition.

Blei, David M. and John D. Lafferty. 2006a. Correlated Topic Models. In Advances in Neural Information Processing Systems 18, Vancouver, BC, Canada, pages 147–154.

Blei, David M. and John D. Lafferty. 2006b. Dynamic Topic Models. In Proceedings of the 23rd International Conference on Machine Learning. Pittsburgh, PA, USA, pages 113–120.

Brody, Samuel and Mirella Lapata. 2009. Bayesian Word Sense Induction. In Proceedings of the 12th Conference of the European Chapter of the ACL. Athens, Greece, pages 103–111.

Chambers, Nathanael. 2012. Labeling Documents with Timestamps: Learning from their Time Expressions. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. Jeju Island, Korea, pages 98–106.

Chang, Chih-Chung and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Chen, Jianfei, Jun Zhu, Zi Wang, Xun Zheng, and Bo Zhang. 2013. Scalable Inference for Logistic-Normal Topic Models. In Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, pages 2445–2453.

Cook, Paul, Jey Han Lau, Diana McCarthy, and Timothy Baldwin. 2014. Novel Word-sense Identification. In Proceedings of the 25th International Conference on Computational Linguistics: Technical Papers. Dublin, Ireland, pages 1624–1635.

Cook, Paul and Suzanne Stevenson. 2010. Automatically Identifying Changes in the Semantic Orientation of Words.
In Proceedings of the Seventh International Conference on Language Resources and Evaluation. Valletta, Malta, pages 28–34.

Davies, Mark. 2010. The Corpus of Historical American English: 400 million words, 1810–2009. Available online at http://corpus.byu.edu/coha/.

Diller, Hans-Jürgen, Hendrik de Smet, and Jukka Tyrkkö. 2011. A European database of descriptors of English electronic texts. The European English Messenger 19(2):29–35.

Fellbaum, Christiane. 1998. WordNet: An Electronic Lexical Database. Bradford Books.

Groenewald, Pieter C. N. and Lucky Mokgatlhe. 2005. Bayesian Computation for Logistic Regression. Computational Statistics & Data Analysis 48(4):857–868.

Gulordava, Kristina and Marco Baroni. 2011. A Distributional Similarity Approach to the Detection of Semantic Change in the Google Books Ngram Corpus. In Proceedings of the Workshop on GEometrical Models of Natural Language Semantics. Edinburgh, Scotland, pages 67–71.

Kilgarriff, Adam. 2009. Simple maths for keywords. In Proceedings of the Corpus Linguistics Conference.

Kim, Yoon, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. 2014. Temporal Analysis of Language through Neural Language Models. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science. Baltimore, MD, USA, pages 61–65.

Kulkarni, Vivek, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2015. Statistically Significant Detection of Linguistic Change. In Proceedings of the 24th International Conference on World Wide Web. Geneva, Switzerland, pages 625–635.

Lau, Jey Han, Paul Cook, Diana McCarthy, Spandana Gella, and Timothy Baldwin. 2014. Learning Word Sense Distributions, Detecting Unattested Senses and Identifying Novel Senses using Topic Models. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, MD, USA, pages 259–270.

Lau, Jey Han, Paul Cook, Diana McCarthy, David Newman, and Timothy Baldwin. 2012. Word Sense Induction for Novel Sense Detection. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Avignon, France, pages 591–601.

McMahon, April M.S. 1994. Understanding Language Change. Cambridge University Press.

Mihalcea, Rada and Vivi Nastase. 2012. Word Epoch Disambiguation: Finding How Words Change over Time. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. Jeju Island, Korea, pages 259–263.

Mimno, David, Hanna Wallach, and Andrew McCallum. 2008. Gibbs Sampling for Logistic Normal Topic Models with Graph-Based Priors. In NIPS Workshop on Analyzing Graphs. Vancouver, Canada.

Mitra, Sunny, Ritwik Mitra, Suman Kalyan Maity, Martin Riedl, Chris Biemann, Pawan Goyal, and Animesh Mukherjee. 2015. An automatic approach to identify word sense changes in text media across timescales. Natural Language Engineering 21:773–798.

Mitra, Sunny, Ritwik Mitra, Martin Riedl, Chris Biemann, Animesh Mukherjee, and Pawan Goyal. 2014. That's sick dude!: Automatic identification of word sense change across different timescales. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, MD, USA, pages 1020–1029.

Ó Séaghdha, Diarmuid. 2010. Latent Variable Models of Selectional Preference.
In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala, Sweden, pages 435–444.

Popescu, Octavian and Carlo Strapparava. 2013. Behind the Times: Detecting Epoch Changes using Large Corpora. In Proceedings of the Sixth International Joint Conference on Natural Language Processing. Nagoya, Japan, pages 347–355.

Popescu, Octavian and Carlo Strapparava. 2015. SemEval 2015, Task 7: Diachronic Text Evaluation. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Denver, CO, USA, pages 869–877.

Ritter, Alan, Mausam, and Oren Etzioni. 2010. A Latent Dirichlet Allocation Method for Selectional Preferences. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala, Sweden, pages 424–434.

Rue, Håvard and Leonhard Held. 2005. Gaussian Markov Random Fields: Theory and Applications. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. CRC Press.

Sagi, Eyal, Stefan Kaufmann, and Brady Clark. 2009. Semantic Density Analysis: Comparing Word Meaning across Time and Phonetic Space. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics. Athens, Greece, pages 104–111.

Salaberri, Haritz, Iker Salaberri, Olatz Arregi, and Beñat Zapirain. 2015. IXAGroupEHUDiac: A Multiple Approach System towards the Diachronic Evaluation of Texts. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Denver, CO, USA, pages 840–845.

Stevenson, Angus, editor. 2010. The Oxford English Dictionary. Oxford University Press, third edition.

Szymanski, Terrence and Gerard Lynch. 2015. UCD: Diachronic Text Classification with Character, Word, and Syntactic N-grams. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Denver, CO, USA, pages 879–883.

Tahmasebi, Nina, Thomas Risse, and Stefan Dietze. 2011. Towards automatic language evolution tracking: A study on word sense tracking. In Proceedings of the Joint Workshop on Knowledge Evolution and Ontology Dynamics (EvoDyn 2011). Bonn, Germany.

Wijaya, Derry Tanti and Reyyan Yeniterzi. 2011. Understanding Semantic Change of Words over Centuries. In Proceedings of the 2011 International Workshop on DETecting and Exploiting Cultural diversiTy on the Social Web. Glasgow, Scotland, UK, pages 35–40.

Zampieri, Marcos, Alina Maria Ciobanu, Vlad Niculae, and Liviu P. Dinu. 2015. AMBRA: A Ranking Approach to Temporal Text Classification. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Denver, CO, USA, pages 851–855.