A Bayesian Model of Diachronic Meaning Change

Lea Frermann and Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB
l.frermann@ed.ac.uk, mlap@inf.ed.ac.uk

Abstract

Word meanings change over time and an automated procedure for extracting this information from text would be useful for historical exploratory studies, information retrieval or question answering. We present a dynamic Bayesian model of diachronic meaning change, which infers temporal word representations as a set of senses and their prevalence. Unlike previous work, we explicitly model language change as a smooth, gradual process. We experimentally show that this modeling decision is beneficial: our model performs competitively on meaning change detection tasks whilst inducing discernible word senses and their development over time. Application of our model to the SemEval-2015 temporal classification benchmark datasets further reveals that it performs on par with highly optimized task-specific systems.

1 Introduction

Language is a dynamic system, constantly evolving and adapting to the needs of its users and their environment (Aitchison, 2001). Words in all languages naturally exhibit a range of senses whose distribution or prevalence varies according to the genre and register of the discourse as well as its historical context. As an example, consider the word cute which according to the Oxford English Dictionary (OED, Stevenson 2010) first appeared in the early 18th century and originally meant clever or keen-witted.[1] By the late 19th century cute was used in the same sense as cunning. Today it mostly refers to objects or people perceived as attractive, pretty or sweet. Another example is the word mouse which initially was only used in the rodent sense. The OED dates the computer pointing device sense of mouse to 1965. The latter sense has become particularly dominant in recent decades due to the ever-increasing use of computer technology.

[1] Throughout this paper we denote words in true type, their senses in italics, and sense-specific context words as {lists}.
The arrival of large-scale collections of historic texts (Davies, 2010) and of online libraries such as the Internet Archive and Google Books has greatly facilitated computational investigations of language change. The ability to automatically detect how the meaning of words evolves over time is potentially of significant value to lexicographic and linguistic research but also to real-world applications. Time-specific knowledge would presumably render word meaning representations more accurate, and benefit several downstream tasks where semantic information is crucial. Examples include information retrieval and question answering, where time-related information could increase the precision of query disambiguation and document retrieval (e.g., by returning documents with newly created senses or filtering out documents with obsolete senses).

In this paper we present a dynamic Bayesian model of diachronic meaning change. Word meaning is modeled as a set of senses, which are tracked over a sequence of contiguous time intervals. We infer temporal meaning representations, consisting of a word's senses (as a probability distribution over words) and their relative prevalence. Our model is thus able to detect that mouse had one sense until the mid-20th century (characterized by words such as {cheese, tail, rat}) and subsequently acquired a second sense relating to computer device. Moreover, it infers subtle changes within a single sense. For instance, in the 1970s the words {cable, ball, mousepad} were typical for the computer device sense, whereas nowadays the terms {optical, laser, usb} are more typical. Contrary to previous work (Mitra et al., 2014; Mihalcea and Nastase, 2012; Gulordava and Baroni, 2011) where temporal representations are learnt in isolation, our model assumes that adjacent representations are co-dependent, thus capturing the fundamentally smooth and gradual nature of meaning change (McMahon, 1994). This also serves as a form of smoothing: temporally neighboring representations influence each other if the available data is sparse.

Experimental evaluation shows that our model (a) induces temporal representations which reflect word senses and their development over time, (b) is able to detect meaning change between two time periods, and (c) is expressive enough to obtain useful features for identifying the time interval in which a piece of text was written. Overall, our results indicate that an explicit model of temporal dynamics is advantageous for tracking meaning change. Comparisons across evaluations and against a variety of related systems show that despite not being designed with any particular task in mind, our model performs competitively across the board.

2 Related Work

Most work on diachronic language change has focused on detecting whether and to what extent a word's meaning changed (e.g., between two epochs) without identifying word senses and how these vary over time.
A variety of methods have been applied to the task, ranging from statistical tests for detecting significant changes in the distribution of terms between two time periods (Popescu and Strapparava, 2013; Cook and Stevenson, 2010), to distributional similarity models trained on time slices (Gulordava and Baroni, 2011; Sagi et al., 2009), and neural language models (Kim et al., 2014; Kulkarni et al., 2015). Other work (Mihalcea and Nastase, 2012) takes a supervised learning approach and predicts the time period to which a word belongs given its surrounding context.

Bayesian models have been previously developed for various tasks in lexical semantics (Brody and Lapata, 2009; Ó Séaghdha, 2010; Ritter et al., 2010) and word meaning change detection is no exception. Using techniques from non-parametric topic modeling, Lau et al. (2012) induce word senses (aka topics) for a given target word over two time periods. Novel senses are then detected based on the discrepancy between sense distributions in the two periods. Follow-up work (Cook et al., 2014; Lau et al., 2014) further explores methods for how best to measure this sense discrepancy. Rather than inferring word senses, Wijaya and Yeniterzi (2011) use a Topics-over-Time model and k-means clustering to identify the periods during which selected words move from one topic to another.

A non-Bayesian approach is put forward in Mitra et al. (2014, 2015) who adopt a graph-based framework for representing word meaning (see Tahmasebi et al. (2011) for a similar earlier proposal). In this model words correspond to nodes in a semantic network and edges are drawn between words sharing contextual features (extracted from a dependency parser). A graph is constructed for each time interval, and nodes are clustered into senses with Chinese Whispers (Biemann, 2006), a randomized graph clustering algorithm. By comparing the induced senses for each time slice and observing inter-cluster differences, their method can detect whether senses emerge or disappear.

Our work draws ideas from dynamic topic modeling (Blei and Lafferty, 2006b) where the evolution of topics is modeled via (smooth) changes in their associated distributions over the vocabulary. Although the dynamic component of our model is closely related to previous work in this area (Mimno et al., 2008), our model is specifically constructed for capturing sense rather than topic change. Our approach is conceptually similar to Lau et al. (2012). We also learn a joint sense representation for multiple time slices. However, in our case the number of time slices is not restricted to two and we explicitly model temporal dynamics. Like Mitra et al. (2014, 2015), we model how senses change over time. In our model, temporal representations are not independent, but influenced by their temporal neighbors, encouraging smooth change over time. We therefore induce a global and consistent set of temporal representations for each word. Our model is knowledge-lean (it does not make use of a parser) and language independent (all that is needed is a time-stamped corpus and tools for basic pre-processing). Contrary to Mitra et al. (2014, 2015), we do not treat the inference of semantic representations for words and for their senses as two separate processes.

Evaluation of models which detect meaning change is fraught with difficulties.
There is no standard set of words which have undergone meaning change, nor a benchmark corpus which represents a variety of time intervals and genres while being thematically consistent. Previous work has generally focused on a few hand-selected words, and models were evaluated qualitatively by inspecting their output, or by the extent to which they can detect meaning changes between two time periods. For example, Cook et al. (2014) manually identify 13 target words which undergo meaning change in a focus corpus with respect to a reference corpus (both news text). They then assess how their models fare at learning sense differences for these targets compared to distractors which did not undergo meaning change. They also underline the importance of using thematically comparable reference and focus corpora to avoid spurious differences in word representations.

In this work we evaluate our model's ability to detect and quantify meaning change across several time intervals (not just two). Instead of relying on a few hand-selected target words, we use larger sets sampled from our learning corpus or found to undergo meaning change in a judgment elicitation study (Gulordava and Baroni, 2011). In addition, we adopt the evaluation paradigm of Mitra et al. (2014) and validate our findings against WordNet. Finally, we apply our model to the recently established SemEval-2015 diachronic text evaluation subtasks (Popescu and Strapparava, 2015). In order to present a consistent set of experiments, we use our own corpus throughout, which covers a wider range of time intervals, is compiled from a variety of genres and sources, and is thus thematically coherent (see Section 4 for details). Wherever possible, we compare against prior art, with the caveat that the use of a different underlying corpus unavoidably influences the obtained semantic representations.

3 A Bayesian Model of Sense Change

In this section we introduce SCAN, our dynamic Bayesian model of Sense ChANge. SCAN captures how a word's senses evolve over time (e.g., whether new senses emerge), whether some senses become more or less prevalent, as well as phenomena pertaining to individual senses such as meaning extension, shift, or modification. We assume that time is discrete, divided into contiguous intervals. Given a word, our model infers its senses for each time interval and their probability. It captures the gradual nature of meaning change explicitly, through dependencies between temporally adjacent meaning representations. Senses themselves are expressed as a probability distribution over words, which can also change over time.

3.1 Model Description

We create a SCAN model for each target word c. The input to the model is a corpus of short text snippets, each consisting of a mention of the target word c and its local context w (in our experiments this is a symmetric context window of ±5 words). Each snippet is annotated with its year of origin. The model is parametrized with regard to the number of senses k ∈ [1...K] of the target word c, and the length of time intervals ∆T, which may be finely or coarsely defined (e.g., spanning a year or a decade). We conflate all documents originating from the same time interval t ∈ [1...T] and infer a temporal representation of the target word per interval. A temporal meaning representation for time t is (a) a K-dimensional multinomial distribution over word senses φt and (b) a V-dimensional distribution over the vocabulary ψt,k for each word sense k.
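For concreteness, the following sketch (ours, not part of the paper's implementation; all function and variable names are hypothetical) shows how such a snippet corpus might be assembled from a collection of (year, token-list) documents:

from collections import defaultdict

def build_snippets(documents, target, window=5, t0=1700, delta_t=20):
    """Collect context snippets for one target word, binned into
    contiguous time intervals of delta_t years (a sketch; `documents`
    is assumed to be an iterable of (year, tokens) pairs)."""
    snippets = defaultdict(list)  # interval index t -> list of context word lists
    for year, tokens in documents:
        t = (year - t0) // delta_t  # conflate documents by time interval
        for i, tok in enumerate(tokens):
            if tok == target:
                # symmetric context window of +/- `window` words
                context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                snippets[t].append(context)
    return snippets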
In addition, our model infers a precision parameter κφ, which controls the extent to which sense prevalence changes for word c over time (see Section 3.2 for details on how we model temporal dynamics).

We place individual logistic normal priors (Blei and Lafferty, 2006a) on our multinomial sense distributions φ and sense-word distributions ψk. A draw from the logistic normal distribution consists of (a) a draw of an n-dimensional random vector x from the multivariate normal distribution parametrized by an n-dimensional mean vector µ and an n × n variance-covariance matrix Σ, x ∼ N(x|µ, Σ); and (b) a mapping of the drawn parameters to the simplex through the logistic transformation φ_n = exp(x_n) / ∑_{n′} exp(x_{n′}), which ensures a draw of valid multinomial parameters. The normal distributions are parametrized to encourage smooth change in multinomial parameters over time (see Section 3.2 for details), and the extent of change is controlled through a precision parameter κ. We learn the value of κφ during inference, which allows us to model the extent of temporal change in sense prevalence individually for each target word. We draw κφ from a conjugate Gamma prior. We do not infer the sense-word precision parameter κψ on all ψk. Instead, we fix it at a high value, triggering little variation of word distributions within senses. This leads to senses being thematically coherent over time.

We now describe the generative story of our model, which is depicted in Figure 1 (right), alongside its plate diagram representation (left):

Draw κφ ∼ Gamma(a, b)
for time interval t = 1..T do
    Draw sense distribution φt | φ−t, κφ ∼ N(½(φt−1 + φt+1), κφ)
    for sense k = 1..K do
        Draw word distribution ψt,k | ψ−t, κψ ∼ N(½(ψt−1,k + ψt+1,k), κψ)
    for document d = 1..D do
        Draw sense zd ∼ Mult(φt)
        for context position i = 1..I do
            Draw word wd,i ∼ Mult(ψt,zd)

[Figure 1: Left: plate diagram for the dynamic sense model for three time steps {t−1, t, t+1}. Constant parameters are shown as dashed nodes, latent variables as clear nodes, and observed variables as gray nodes. Right: the corresponding generative story.]

First, we draw the sense precision parameter κφ from a Gamma prior. For each time interval t we draw (a) a multinomial distribution over senses φt from a logistic normal prior; and (b) a multinomial distribution over the vocabulary ψt,k for each sense k, from another logistic normal prior. Next, we generate time-specific text snippets. For each snippet d, we first observe the time interval t, and draw a sense zd from Mult(φt). Finally, we generate I context words wd,i independently from Mult(ψt,zd).

3.2 Background on iGMRFs

Let φ = {φ1 ... φT} denote a T-dimensional random vector, where each φt might for example correspond to a sense probability at time t. We define a prior which encourages smooth change of parameters at neighboring times, in terms of a first-order random walk on the line (graphically shown in Figure 2, and in the chains of φ and ψ in Figure 1, left). Specifically, we define this prior as an intrinsic Gaussian Markov Random Field (iGMRF; Rue and Held 2005), which allows us to model the change of adjacent parameters as drawn from a normal distribution, e.g.:

∆φt ∼ N(0, κ^{−1}).  (1)

The iGMRF is defined with respect to the graph in Figure 2; it is sparsely connected, with only first-order dependencies, which allows for efficient inference.
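To make the smoothness assumption concrete, here is a small simulation sketch (ours, purely illustrative: inference in Section 3.3 conditions on data rather than sampling forward like this). It draws a first-order random walk as in Equation (1) and maps each step to the simplex through the logistic transformation:

import numpy as np

def simulate_smooth_senses(T=16, K=8, kappa=4.0, seed=0):
    """Simulate K sense weights over T time steps under a first-order
    random walk and map each step to valid multinomial parameters."""
    rng = np.random.default_rng(seed)
    x = np.zeros((T, K))
    for t in range(1, T):
        # adjacent parameters differ by Gaussian noise with precision kappa
        x[t] = x[t - 1] + rng.normal(0.0, kappa ** -0.5, size=K)
    # phi_n = exp(x_n) / sum_n' exp(x_n'): the logistic transformation
    phi = np.exp(x) / np.exp(x).sum(axis=1, keepdims=True)
    return phi  # each row phi[t] stays close to its neighbor phi[t-1]

A larger kappa yields tightly coupled, slowly drifting distributions; a small kappa permits abrupt change between adjacent time steps.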
A second feature, which makes iGMRFs popular as priors in Bayesian modeling, is the fact that they can be defined purely in terms of the local changes between dependent (i.e., adjacent) variables, without the need to specify an overall mean of the model. The full conditionals explicitly capture these intuitions:

φt | φ−t, κ ∼ N(½(φt−1 + φt+1), 1/(2κ)),  (2)

for 1 < t < T, where φ−t is the vector φ except element φt, and κ is a precision parameter. The value of parameter φt is distributed normally, centered around the mean of the values of its neighbors, without reference to a global mean. The precision parameter κ controls the extent of variation: how tightly coupled are the neighboring parameters? Or, in our case: how tightly coupled are temporally adjacent meaning representations of a word c? We estimate the precision parameter κφ during inference. This allows us to flexibly capture sense variation over time individually for each target word.

[Figure 2: A linear chain iGMRF over φ1, ..., φT.]

For a detailed introduction to (i)GMRFs we refer the interested reader to Rue and Held (2005). For an application of iGMRFs to topic models see Mimno et al. (2008).

3.3 Inference

We use a blocked Gibbs sampler for approximate inference. The logistic normal prior is not conjugate to the multinomial distribution. This means that the straightforward parameter updates known for sampling standard Dirichlet-multinomial topic models do not apply. However, sampling-based methods for logistic normal topic models have been proposed in the literature (Mimno et al., 2008; Chen et al., 2013). At each iteration, we sample: (a) document-sense assignments, (b) multinomial parameters from the logistic normal prior, and (c) the sense precision parameter from a Gamma prior. Our blocked sampler first iterates over all input text snippets d with context w, and re-samples their sense assignments under the current model parameters {φ}^T and {ψ}^{K×T}:

p(zd | w, t, φ, ψ) ∝ p(zd | t) p(w | t, zd) = φ^t_{zd} ∏_{w∈w} ψ^{t,zd}_w.  (3)

Next, we re-sample parameters {φ}^T and {ψ}^{K×T} from the logistic normal prior, given the current sense assignments. We use the auxiliary variable method proposed in Mimno et al. (2008) (see also Groenewald and Mokgatlhe (2005)). Intuitively, each individual parameter (e.g., sense k's prevalence at time t, φ^t_k) is 'shifted' within a weighted region which is bounded by the number of times sense k was observed at time t. The weights of the region are determined by the prior, in our case the normal distributions defined by the iGMRF, which ensure an influence of the temporal neighbors φ^{t−1}_k and φ^{t+1}_k on the new parameter value φ^t_k, and smooth temporal variation as desired. The same procedure applies to each word parameter ψ^{t,k}_w under each {time, sense} pair (see Mimno et al. 2008 for a more detailed description of the sampler). Finally, we periodically re-sample the sense precision parameter κφ from its conjugate Gamma prior.
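The snippet-level step of the blocked sampler is straightforward to implement; the sketch below (ours; the logistic normal parameter updates via auxiliary variables are omitted) re-samples a single sense assignment according to Equation (3):

import numpy as np

def resample_sense(context_ids, t, log_phi, log_psi, rng):
    """One Gibbs step for a snippet from interval t, following eq. (3):
    p(z_d | w, t) is proportional to phi^t_z * prod_w psi^{t,z}_w.
    log_phi: (T, K) log sense distributions; log_psi: (T, K, V) log
    sense-word distributions; context_ids: vocabulary indices of the
    snippet's context words."""
    log_p = log_phi[t] + log_psi[t][:, context_ids].sum(axis=1)  # (K,)
    p = np.exp(log_p - log_p.max())  # normalize stably in log space
    return rng.choice(len(p), p=p / p.sum())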
4 The DATE Corpus

Before presenting our evaluation we describe the corpus used as a basis for the experiments performed in this work. We applied our model to a DiAchronic TExt corpus (DATE) which collates documents spanning the years 1700–2010 from three sources: (a) the COHA corpus[2] (Davies, 2010), a large collection of texts from various genres covering the years 1810–2010; (b) the training data provided by the DTE task[3] organizers (see Section 8); and (c) the portion of the CLMET3.0[4] corpus (Diller et al., 2011) corresponding to the period 1710–1810 (which is not covered by the COHA corpus and thus underrepresented in our training data). CLMET3.0 contains texts representative of a range of genres including narrative fiction, drama, and letters, and was collected from various online archives. Table 1 provides details on the size of our corpus.

Corpus     years covered   #words
COHA       1810–2009       142,587,656
DTE        1700–2010       124,771
CLMET3.0   1710–1810       4,531,505

Table 1: Size and coverage of our three training corpora (after pre-processing).

Documents were clustered by their year of publication as indicated in the original corpora. In the CLMET3.0 corpus, occasionally a range of years would be provided. In this case we used the final year of the range. We tokenized, lemmatized, and part-of-speech tagged DATE using the NLTK (Bird et al., 2009). We removed stopwords and function words. After preprocessing, we extracted target-word-specific input corpora for our models. These consisted of mentions of a target c and its surrounding context, a symmetric window of ±5 words.

[2] http://corpus.byu.edu/coha/
[3] http://alt.qcri.org/semeval2015/task7/index.php?id=data-and-tools
[4] http://www.kuleuven.be/~u0044428/clmet3_0.htm

5 Experiment 1: Temporal Dynamics

As discussed earlier, our model departs from previous approaches (e.g., Mitra et al. 2014) in that it learns globally consistent temporal representations for each word. In order to assess whether temporal dependencies are indeed beneficial, we implemented a stripped-down version of our model (SCAN-NOT) which does not have any temporal dependencies between individual time steps (i.e., without the chain iGMRF priors). Word meaning is still represented as senses and sense prevalence is modeled as a distribution over senses for each time interval. However, time intervals are now independent. Inference works as described in Section 3.3, without having to learn the κ precision parameters.

Models and Parameters We compared the two models in terms of their predictive power. We split the DATE corpus into a training period {d1 ... dt} of time slices 1 through t and computed the likelihood p(dt+1 | φt, ψt) of the data at test time slice t+1 under the parameters inferred for the previous time slice (a sketch of this computation is given below). The time slice size was set to ∆T = 20 years. We set the number of senses to K = 8 and the word precision parameter to κψ = 10, a high value which enforces individual senses to remain thematically consistent across time. We set the initial sense precision parameter κφ = 4, and the Gamma parameters a = 7 and b = 3. These parameters were optimized once on the development data used for the task-based evaluation discussed in Section 8. Unless otherwise specified all experiments use these values. No parameters were tuned on the test set for any task. In all experiments we ran the Gibbs sampler for 1,000 iterations, and resampled κφ every 50 iterations, starting from iteration 150. We used the final state of the sampler throughout.
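The predictive likelihood just described marginalizes over senses; a minimal sketch of its computation (ours; phi_t and psi_t are the (K,) and (K, V) parameter arrays inferred for the last training slice) is:

import numpy as np

def predictive_loglik(test_snippets, phi_t, psi_t):
    """Log likelihood of held-out snippets from slice t+1 under the
    parameters of slice t: p(d) = sum_k phi_k * prod_w psi_{k,w}."""
    total = 0.0
    for context_ids in test_snippets:
        # (K,): log phi_k + sum over context words of log psi_{k,w}
        log_joint = np.log(phi_t) + np.log(psi_t[:, context_ids]).sum(axis=1)
        total += np.logaddexp.reduce(log_joint)  # log-sum-exp over senses
    return total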
We randomly selected 50 mid-frequency target concepts from a larger set of target concepts described in Section 8. Predictive log-likelihood scores were averaged across concepts, and were calculated as the average under 10 parameter samples {φt, ψt} from the trained models.

[Figure 3: Predictive log likelihood of SCAN and a version without temporal dependencies (SCAN-NOT) across various test time periods (1920–39, 1940–59, 1960–79, 1980–99).]

Results Figure 3 displays predictive log-likelihood scores for four test time intervals. SCAN outperforms its stripped-down version throughout (higher is better). Since the representations learnt by SCAN are influenced (or smoothed) by neighboring representations, they overfit specific time intervals less, which leads to better predictive performance. Figure 4 further shows how SCAN models meaning change for the words band, power, transport and bank. The sense distributions over time are shown as a sequence of stacked histograms; senses themselves are color-coded (and enumerated) below, in the same order as in the histograms. Each sense k is illustrated as the 10 words w assigned the highest posterior probability, marginalizing over the time-specific representations, p(w|k) = ∑_t ψ^{t,k}_w. Words representative of prevalent senses are highlighted in bold face.
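The word lists shown in Figures 4 and 5 can be read off the inferred parameters directly. A minimal sketch of this aggregation (ours, assuming ψ is stored as a (T, K, V) array and vocab maps indices to word strings):

import numpy as np

def top_words_per_sense(psi, vocab, n=10):
    """Gloss each sense by its n most probable words, marginalizing
    over the time-specific representations: p(w|k) = sum_t psi^{t,k}_w."""
    scores = psi.sum(axis=0)  # (K, V): aggregate each sense over time
    return [[vocab[i] for i in np.argsort(-scores[k])[:n]]
            for k in range(scores.shape[0])]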
[Figure 4: Tracking meaning change for the words band, power, transport and bank over 20-year time intervals between 1700 and 2010. Each bar shows the proportion of each sense (color-coded) and is labeled with the start year of the respective time interval. Senses are shown as the 10 most probable words, and particularly representative words are highlighted for illustration.]

Figure 4 (top left) demonstrates that the model is able to capture various senses of the word band, such as strip used for binding (yellow bars/number 3 in the figure) or musical band (grey/1, orange/7). Our model predicts an increase in prevalence over the modeled time period for both senses. This is corroborated by the OED which provides the majority of references for the binding strip sense for the 20th century and dates the musical band sense to 1812. In addition a social band sense (violet/6, darkgreen/8; in the sense of bonding) emerges, which is present across time slices. The sense colored brown/2 refers to the British Band, a group of native Americans involved in the Black Hawk War in 1832, and the model indeed indicates a prevalence of this sense around this time (see bars 1800–1840 in the figure).

For the word power (Figure 4, top right),
three senses emerge: the institutional power sense (colors gray/1, brown/2, pink/5, orange/7 in the figure), mental power (yellow/3, lightgreen/4, darkgreen/8), and power as supply of energy (violet/6). The latter is an example of a "sense birth" (Mitra et al., 2014): the sense was hardly present before the mid-19th century. This is corroborated by the OED which dates the sense to 1889, whereas the OED contains references to the remaining senses for the whole modeled time period, as predicted by our model.

Similar trends of meaning change emerge for transport (Figure 4, bottom left). The bottom right plot shows the sense development for the word bank. Although the well-known senses river bank (brown/2, lightgreen/4) and monetary institution (rest) emerge clearly, the overall sense pattern appears comparatively stable across intervals, indicating that the meaning of the word has not changed much over time.

Besides tracking sense prevalence over time, our model can also detect changes within individual senses. Because we are interested in tracking semantically stable senses, we fixed the precision parameter κψ to a high value, to discourage too much variance within each sense. Figure 5 illustrates how the energy sense of the word power (violet/6 in Figure 4) has changed over time. Characteristic terms for a given sense are highlighted in bold face. For example, the term "water" is initially prevalent, while the term "steam" rises in prevalence towards the middle of the modeled period, and is superseded by the terms "plant" and "nuclear" towards the end.

[Figure 5: Sense-internal temporal dynamics for the energy sense of the word power (violet/6 in Figure 4). Columns show the ten most highly associated words for each time interval for the period between 1700 and 2010 (ordered by decreasing probability). We highlight how four terms characteristic of the sense develop over time (see {water, steam, plant, nuclear} in the figure).]

6 Experiment 2: Novel Sense Detection

In this section and the next we explicitly evaluate the temporal representations (i.e., probability distributions) induced by our model, and discuss its performance in the context of previous work. Large-scale evaluation of meaning change is notoriously difficult, and many evaluations are based on limited hand-annotated gold standard data sets.
Mitra et al. (2015), however, bypass this issue by evaluating the output of their system against WordNet (Fellbaum, 1998). Here, we consider their automatic evaluation of sense births, i.e., the emergence of novel senses. We assume that novel senses are detected at a focus time t2 whilst being compared to a reference time t1. WordNet is used to confirm that the proposed novel sense is indeed distinct from all other induced senses for a given word.

Method Mitra et al.'s (2015) evaluation method presupposes a system which is able to detect senses for a set of target words and identify which ones are novel. Our model does not automatically yield novelty scores for the induced senses. However, Cook et al. (2014) propose several ways to perform this task post hoc. We use their relevance score, which is based on the intuition that keywords (or collocations) which characterize the difference of a focus corpus from a reference corpus are indicative of word sense novelty.

We identify keywords for a focus corpus with respect to a reference corpus using Kilgarriff's (2009) method, which is based on smoothed relative frequencies.[5] The novelty of an induced sense s can then be defined in terms of the aggregate keyword probabilities given that sense (and the focus time of interest):

rel(s) = ∑_{w∈W} p(w|s, t2),  (4)

where W is a keyword list and t2 the focus time. Cook et al. (2014) suggest a straightforward extrapolation from sense novelty to word novelty:

rel(c) = max_s rel(s),  (5)

where rel(c) is the highest novelty score assigned to any of the target word's senses. A high rel(c) score suggests that a word has undergone meaning change.

[5] We set the smoothing parameter to n = 10, and like Cook et al. (2014) retrieve the top 1000 keywords.
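In code, the two scores reduce to a few lines. The sketch below is ours and assumes the keyword extraction step has already produced a keyword list W:

def relevance(sense_word_probs, keywords):
    """Sense novelty, Equation (4): total focus-time probability mass
    a sense assigns to the corpus-derived keywords. `sense_word_probs`
    maps each sense s to a dict holding p(w | s, t2)."""
    return {s: sum(p.get(w, 0.0) for w in keywords)
            for s, p in sense_word_probs.items()}

def word_novelty(sense_word_probs, keywords):
    """Word novelty, Equation (5): the highest relevance of any sense."""
    return max(relevance(sense_word_probs, keywords).values())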
We obtained candidate terms and their associated novel senses from the DATE corpus, using the relevance metric described above. The novel senses from the focus period and all senses induced for the reference period, except for the one corresponding to the novel sense, were passed on to Mitra et al.'s (2015) WordNet-based evaluator, which proceeds as follows. Firstly, each induced sense s is mapped to the WordNet synset u with the maximum overlap:

synset(s) = argmax_u overlap(s, u).  (6)

Next, a predicted novel sense n is deemed truly novel if its mapped synset is distinct from any synset mapped to a different induced sense:

∀s′: synset(s′) ≠ synset(n).  (7)

Finally, overall precision is calculated as the fraction of sense births confirmed by WordNet over all birth candidates proposed by the model. Like Mitra et al. (2015) we only report results on target words for which all induced senses could be successfully mapped to a synset.

Models and Parameters We obtained the broad set of target words used for the task-based evaluation (in Section 8) and trained models on the DATE corpus. We set the number of senses to K = 4, following Mitra et al. (2015) who note that the WordNet mapper works best for words with a small number of senses, and the time intervals to ∆T = 20 as in Experiment 1. We identified the 200 words[6] with the highest novelty score (Equation (5)) as sense birth candidates. We compared the performance of the full SCAN model against SCAN-NOT, which learns senses independently for time intervals. We trained both models on the same data with identical parameters. For SCAN-NOT, we must post hoc identify corresponding senses across time intervals. We used the Jensen-Shannon divergence between the reference- and focus-time specific word distributions, JS(p(w|s, t1) || p(w|s, t2)), and assigned each focus-time sense to the reference-time sense with the smallest divergence.

[6] This threshold was tuned on one reference-focus time pair.
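The sense alignment for SCAN-NOT can be sketched as follows (our code; psi_t1 and psi_t2 are assumed to be (K, V) arrays of sense-word probabilities at reference and focus time):

import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two word distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0  # 0 * log 0 is taken to be 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def align_senses(psi_t1, psi_t2):
    """Match each focus-time sense to the reference-time sense with
    the smallest JS divergence."""
    return [int(np.argmin([js_divergence(q, p) for p in psi_t1]))
            for q in psi_t2]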
Results Figure 6 shows the performance of our models on the task of sense birth detection. SCAN performs better than SCAN-NOT, underscoring the importance of jointly modeling senses across time slices and of incorporating temporal dynamics. Our accuracy scores are in the same ballpark as Mitra et al. (2014, 2015). Note, however, that the scores are not directly comparable due to differences in training corpora, focus and reference times, and candidate words. Mitra et al. (2015) use the larger Google syntactic n-gram corpus, as well as richer linguistic information in terms of syntactic dependencies. We show that our model, which does not rely on syntactic annotations, performs competitively even when trained on smaller data. Table 2 (top) displays examples of words assigned the highest novelty scores for the reference period 1900–1919 and focus period 1980–1999.

[Figure 6: Precision results for the SCAN and SCAN-NOT models on the WordNet-based novel sense detection (Experiment 2). Results are shown for a selection of reference times (t1) and focus times (t2).]

t1=1900–1919, t2=1980–1999
union          soviet united american union european war civil military people liberty
dos            system window disk pc operate program run computer de dos
entertainment  television industry program time business people world president entertainment company
station        radio station television local program network space tv broadcast air

t1=1960–1969, t2=1990–1999
environmental  supra note law protection id agency impact policy factor federal
users          computer window information software system wireless drive web building available
virtual        reality virtual computer center experience week community separation increase
disk           hard disk drive program computer file store ram business embolden

Table 2: Example target terms (left) with novel senses (right) as identified by SCAN in focus corpus t2 (when compared against reference corpus t1). Top: terms used in the novel sense detection study (Experiment 2). Bottom: terms from the Gulordava and Baroni (2011) gold standard (Experiment 3).

7 Experiment 3: Word Meaning Change

In this experiment we evaluate whether model-induced temporal word representations capture perceived word novelty. Specifically, we adopt the evaluation framework (and dataset) introduced in Gulordava and Baroni (2011)[7] and discussed below.

Method Gulordava and Baroni (2011) do not model word senses directly; instead they obtain distributional representations of words from the Google Books (bigram) data for two time slices, namely the 1960s (reference corpus) and the 1990s (focus corpus). To detect change in meaning, they measure the cosine similarity between the vector representations of a target word in the reference and focus corpus. It is assumed that low similarity indicates significant meaning change. To evaluate the output of their system, they created a test set of 100 target words (nouns, verbs, and adjectives), and asked five annotators to rate each word with respect to its degree of meaning change between the 1960s and the 1990s. The annotators used a 4-point ordinal scale (0: no change, 1: almost no change, 2: somewhat changed, 3: changed significantly). Words were subsequently ranked according to the mean rating given by the annotators. Inter-annotator agreement on the novel sense detection task was 0.51 (pairwise Pearson correlation) and can be regarded as an upper bound on model performance.

[7] We thank Kristina Gulordava for sharing their evaluation data set of target words and human judgments.

Models and Parameters We trained models for all words in Gulordava and Baroni's (2011) gold standard. We used the DATE subcorpus covering the years 1960 through 1999, partitioned by decade (∆T = 10). The first and last time intervals were defined as reference and focus time, respectively (t1=1960–1969, t2=1990–1999). As in Experiment 2, a novelty score was assigned to each target word (using Equation (5)). We computed Spearman's ρ rank correlations between the gold standard and model rankings (Gulordava and Baroni, 2011). We trained SCAN models setting the number of senses to K = 8. We also trained SCAN-NOT models with identical parameters. We report results averaged over five independent parameter estimates. Finally, as in Gulordava and Baroni (2011), we compare against a frequency baseline which ranks words by their log relative frequency in the reference and focus corpus.

system              corpus   Spearman's ρ
Gulordava (2011)    Google   0.386
SCAN                DATE     0.377
SCAN-NOT            DATE     0.255
frequency baseline  DATE     0.325

Table 3: Spearman's ρ rank correlations between system novelty rankings and the human-produced ratings. All correlations are statistically significant (p < 0.02). Results for SCAN and SCAN-NOT are averages over five trained models.

Results The results of this evaluation are shown in Table 3. As can be seen, SCAN outperforms SCAN-NOT and the frequency baseline. For reference, we also report the correlation coefficient obtained in Gulordava and Baroni (2011), but emphasize that the scores are not directly comparable due to differences in training data: Gulordava and Baroni (2011) use the Google bigrams corpus (which is much larger than DATE). Table 2 (bottom) displays examples of words which achieved the highest novelty scores in this evaluation, and their associated novel senses.

8 Experiment 4: Task-based Evaluation

In the previous sections we demonstrated how SCAN captures meaning change between two periods. In this section, we assess our model on an extrinsic task which relies on meaning representations spanning several time slices. We quantitatively evaluate our model on the SemEval-2015 benchmark datasets released as part of the Diachronic Text Evaluation exercise (Popescu and Strapparava 2015; DTE). In the following we first present the DTE subtasks, and then move on to describe our training data, parameter settings, and systems used for comparison with our model.

SemEval DTE Tasks Diachronic text evaluation is an umbrella term used by the SemEval-2015 organizers for three subtasks aiming to assess the performance of computational methods used to identify when a piece of text was written. A similar problem is tackled in Chambers (2012), who labels documents with time stamps whilst focusing on explicit time expressions and their discriminatory power.
Temporal intervals are consecutive and constructed such that the correct interval is centered around the actual year of origin. For both tasks tem- poral intervals are created at three levels of granular- ity (fine, medium, and coarse). Subtask 1 involves snippets which contain an ex- plicit cue for time of origin. The presence of a temporal cue was determined by the organizers by checking the entities’ informativeness in external re- sources. Consider the example below: (8) President de Gaulle favors an independent European nuclear striking force [...] The mentions of French president de Gaulle and nu- clear warfare suggest that the snippet was written after the mid-1950s and indeed it was published in 1962. A hypothetical system would then have to de- cide amongst the following classes: {1700–1702, 1703–1705, ..., 1961–1963, ..., 2012–2014} {1699–1706, 1707–1713, ..., 1959–1965, ..., 2008–2014} {1696–1708, 1709–1721, ..., 1956–1968, ..., 2008–2020} The first set of classes correspond to fine-grained in- tervals of 2-years, the second set to medium-grained intervals of 6-years and the third set to coarse- grained intervals of 12-years. For the snippet in example (8) classes 1961–1963, 1959–1965, and 1956–1968 are the correct ones. Subtask 2 involves temporal classification of snip- pets which lack explicit temporal cues, but contain implicit ones, e.g., as indicated by lexical choice or spelling. The snippet in example (9) was published in 1891 and the spelling of to-day, which was com- mon up to the early 20th century, is an implicit cue: (9) The local wheat market was not quite so strong to-day as yesterday. Analogously to subtask 1, systems must select the right temporal interval from a set of contiguous time intervals of differing granularity. For this task, which is admittedly harder, levels of temporal gran- ularity are coarser corresponding to 6-year, 12-year and 20-year intervals. Participating SemEval Systems We compared our model against three other systems which par- ticipated in the SemEval task.8 AMBRA (Zampieri et al., 2015) adopts a learning-to-rank modeling ap- proach and uses several stylistic, grammatical, and lexical features. IXA (Salaberri et al., 2015) uses a combination of approaches to determine the pe- riod of time in which a piece of news was writ- ten. This involves searching for specific mentions of time within the text, searching for named enti- ties present in the text and then establishing their reference time by linking these to Wikipedia, using Google n-grams, and linguistic features indicative of language change. Finally, UCD (Szymanski and Lynch, 2015) employs SVMs for classification us- ing a variety of informative features (e.g., POS-tag n-grams, syntactic phrases), which were optimized for the task through automatic feature selection. Models and Parameters We trained our model for individual words and obtained representations of their meaning for different points in time. Our set of target words consisted of all nouns which oc- curred in the development datasets for DTE sub- tasks 1 and 2 as well as all verbs which occurred at least twice in this dataset. After removing in- frequent words we were left with 883 words (out of 1,116) which we used in this evaluation. Target words were not optimized with respect to the test data in any way; it is thus reasonable to expect bet- ter performance with an adjusted set of words. We set the model time interval to ∆T = 5 years and the number of senses per word to K = 8. 
Supervised Classification We also apply our model in a supervised setting, i.e., by extracting features for classifier prediction. Specifically, we trained a multiclass SVM (Chang and Lin, 2011) on the training data provided by the SemEval organizers (for DTE tasks 1 and 2). For each observed target word within each snippet, we added as a feature its most likely sense k given t, the true time of origin:

argmax_k p^(c)(k|t).  (11)

We also trained a multiclass SVM which uses character n-gram (n ∈ {1, 2, 3}) features in addition to the model features. Szymanski and Lynch (2015) identified character n-grams as the most predictive feature for temporal text classification using SVMs. Their system (UCD) achieved the best published scores in DTE subtask 2. Following their approach, we included all n-grams observed more than 20 times in the DTE training data.

Results We employed two evaluation measures proposed by the DTE organizers: precision p, i.e., the percentage of times a system has predicted the correct time period, and accuracy acc, which is more lenient and penalizes system predictions proportionally to their distance from the true interval. We computed the p and acc scores for our models using the evaluation script provided by the SemEval organizers. Table 4 summarizes our results for DTE subtasks 1 and 2. We compare SCAN against a baseline which selects a time interval at random,[9] averaged over five runs.

                 Task 1                                Task 2
                 2 yr        6 yr        12 yr         6 yr        12 yr       20 yr
                 acc   p     acc   p     acc   p       acc   p     acc   p     acc   p
Baseline         .097  .010  .214  .017  .383  .046    .199  .025  .343  .047  .499  .057
SCAN-NOT         .265  .086  .435  .139  .609  .169    .259  .041  .403  .056  .567  .098
SCAN             .353  .049  .569  .112  .748  .206    .376  .053  .572  .091  .719  .135
IXA              .187  .020  .375  .041  .557  .090    .261  .037  .428  .067  .622  .098
AMBRA            .167  .037  .367  .071  .554  .074    .605  .143  .767  .143  .868  .292
UCD              –     –     –     –     –     –       .759  .463  .846  .472  .910  .542
SVM SCAN         .192  .034  .417  .097  .545  .127    .573  .331  .667  .368  .790  .428
SVM SCAN+ngram   .222  .030  .467  .079  .627  .142    .747  .481  .821  .500  .897  .569

Table 4: Results on Diachronic Text Evaluation Tasks 1 and 2 for a random baseline, our SCAN model, its stripped-down version without iGMRFs (SCAN-NOT), the SemEval submissions (IXA, AMBRA and UCD), SVMs trained with SCAN features (SVM SCAN), and SVMs with additional character n-gram features (SVM SCAN+ngram). Results are shown for three levels of granularity, a strict precision measure p, and a distance-discounting measure acc.

[9] We recomputed the baseline scores for subtasks 1 and 2 due to inconsistencies in the results provided by the DTE organizers.
We also show results for the stripped-down version of our model without the iGMRFs (SCAN-NOT) and for the systems which participated in SemEval.

For subtask 1, the two versions of SCAN outperform all SemEval systems across the board. SCAN-NOT occasionally outperforms SCAN in the strict precision metric; however, the full SCAN model consistently achieves better accuracy scores, which are more representative since they factor in the proximity of the prediction to the true value. In subtask 2, the UCD and SVM SCAN+ngram systems perform comparably. They both use SVMs for the classification task; however, our own model employs a less expressive feature set based on SCAN and character n-grams, and does not take advantage of feature selection, which would presumably enhance performance. With the exception of AMBRA, all other participating systems used external resources (such as Wikipedia and Google n-grams); it is thus fair to assume that they had access to at least as much training data as our SCAN model. Consequently, the gap in performance cannot solely be attributed to a difference in the size of the training data.

We also observe that IXA and SCAN, given identical granularity, perform better on subtask 1, while AMBRA and our own SVM-based systems exhibit the opposite trend. The IXA system uses a combination of knowledge sources in order to determine when a piece of news was written, including explicit mentions of temporal expressions within the text, named entities, and linked information to those named entities from Wikipedia. AMBRA on the other hand exploits more shallow stylistic, grammatical and lexical features within the learning-to-rank paradigm. An interesting direction for future work would be to investigate which features are most appropriate for different DTE tasks. Overall, it is encouraging to see that the generic temporal word representations inferred by SCAN lead to competitively performing models on both temporal classification tasks without any explicit tuning.

9 Conclusion

In this paper we introduced SCAN, a dynamic Bayesian model of diachronic meaning change. Our model learns a coherent set of co-dependent, time-specific senses for individual words and their prevalence. Evaluation of the model's output showed that the learnt representations reflect (a) different senses of ambiguous words, (b) different kinds of meaning change (such as new senses being established), and (c) connotational changes within senses. SCAN departs from previous work in that it models temporal dynamics explicitly. We demonstrated that this feature yields more general semantic representations, as indicated by predictive log-likelihood and a variety of extrinsic evaluations. We also experimentally evaluated SCAN on novel sense detection and the SemEval DTE task, where it performed on par with the best published results, without any extensive feature engineering or task-specific tuning.

We conclude by discussing limitations of our model and directions for future work. In our experiments we fix the number of senses K for all words across all time periods. Although this approach did not harm performance (even in the case of SemEval where we handled more than 800 target concepts), it is at odds with the fact that words vary in their degree of ambiguity, and that word senses continuously appear and disappear.
A non-parametric version of our model would infer an appropriate number of senses from the data, individually for each time period. Also note that in our experiments we used context as a bag of words. It would be interesting to explore more systematically how different kinds of contexts (e.g., named entities, multiword expressions, verbs vs. nouns) influence the representations the model learns. Furthermore, while SCAN captures the temporal dynamics of word senses, it cannot do so for words themselves. Put differently, the model cannot identify whether a new word is used which did not exist before, or that a word ceased to exist after a specific point in time. A model-internal way of detecting word (dis)appearance would be desirable, especially since new terms are continuously being introduced thanks to popular culture and various new media sources.

In the future, we would like to apply our model to different text genres and levels of temporal granularity. For example, we could work with Twitter data, an increasingly popular source for opinion tracking, and use our model to identify short-term changes in word meanings or connotations.

Acknowledgments

We are grateful to the anonymous reviewers whose feedback helped to substantially improve the present paper. We thank Charles Sutton and Iain Murray for helpful discussions, and acknowledge the support of EPSRC through project grant EP/I037415/1.

References

Aitchison, Jean. 2001. Language Change: Progress or Decay? Cambridge Approaches to Linguistics. Cambridge University Press.

Biemann, Chris. 2006. Chinese Whispers - an Efficient Graph Clustering Algorithm and its Application to Natural Language Processing Problems. In Proceedings of TextGraphs: the 1st Workshop on Graph Based Methods for Natural Language Processing. New York City, NY, USA, pages 73–80.

Bird, Steven, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media, Inc., 1st edition.

Blei, David M. and John D. Lafferty. 2006a. Correlated Topic Models. In Advances in Neural Information Processing Systems 18, Vancouver, BC, Canada, pages 147–154.

Blei, David M. and John D. Lafferty. 2006b. Dynamic Topic Models. In Proceedings of the 23rd International Conference on Machine Learning. Pittsburgh, PA, USA, pages 113–120.

Brody, Samuel and Mirella Lapata. 2009. Bayesian Word Sense Induction. In Proceedings of the 12th Conference of the European Chapter of the ACL. Athens, Greece, pages 103–111.

Chambers, Nathanael. 2012. Labeling Documents with Timestamps: Learning from their Time Expressions. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. Jeju Island, Korea, pages 98–106.

Chang, Chih-Chung and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Chen, Jianfei, Jun Zhu, Zi Wang, Xun Zheng, and Bo Zhang. 2013. Scalable Inference for Logistic-Normal Topic Models. In Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, pages 2445–2453.

Cook, Paul, Jey Han Lau, Diana McCarthy, and Timothy Baldwin. 2014. Novel Word-sense Identification. In Proceedings of the 25th International Conference on Computational Linguistics: Technical Papers. Dublin, Ireland, pages 1624–1635.

Cook, Paul and Suzanne Stevenson. 2010. Automatically Identifying Changes in the Semantic Orientation of Words.
In Proceedings of the Seventh International Conference on Language Resources and Evaluation. Valletta, Malta, pages 28–34.

Davies, Mark. 2010. The Corpus of Historical American English: 400 million words, 1810–2009. Available online at http://corpus.byu.edu/coha/.

Diller, Hans-Jürgen, Hendrik de Smet, and Jukka Tyrkkö. 2011. A European database of descriptors of English electronic texts. The European English Messenger 19(2):29–35.

Fellbaum, Christiane. 1998. WordNet: An Electronic Lexical Database. Bradford Books.

Groenewald, Pieter C. N. and Lucky Mokgatlhe. 2005. Bayesian Computation for Logistic Regression. Computational Statistics & Data Analysis 48(4):857–868.

Gulordava, Kristina and Marco Baroni. 2011. A Distributional Similarity Approach to the Detection of Semantic Change in the Google Books Ngram Corpus. In Proceedings of the Workshop on GEometrical Models of Natural Language Semantics. Edinburgh, Scotland, pages 67–71.

Kilgarriff, Adam. 2009. Simple maths for keywords. In Proceedings of the Corpus Linguistics Conference.

Kim, Yoon, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. 2014. Temporal Analysis of Language through Neural Language Models. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science. Baltimore, MD, USA, pages 61–65.

Kulkarni, Vivek, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2015. Statistically Significant Detection of Linguistic Change. In Proceedings of the 24th International Conference on World Wide Web. Geneva, Switzerland, pages 625–635.

Lau, Jey Han, Paul Cook, Diana McCarthy, Spandana Gella, and Timothy Baldwin. 2014. Learning Word Sense Distributions, Detecting Unattested Senses and Identifying Novel Senses using Topic Models. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, MD, USA, pages 259–270.

Lau, Jey Han, Paul Cook, Diana McCarthy, David Newman, and Timothy Baldwin. 2012. Word Sense Induction for Novel Sense Detection. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Avignon, France, pages 591–601.

McMahon, April M.S. 1994. Understanding Language Change. Cambridge University Press.

Mihalcea, Rada and Vivi Nastase. 2012. Word Epoch Disambiguation: Finding How Words Change over Time. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. Jeju Island, Korea, pages 259–263.

Mimno, David, Hanna Wallach, and Andrew McCallum. 2008. Gibbs Sampling for Logistic Normal Topic Models with Graph-Based Priors. In NIPS Workshop on Analyzing Graphs. Vancouver, Canada.

Mitra, Sunny, Ritwik Mitra, Suman Kalyan Maity, Martin Riedl, Chris Biemann, Pawan Goyal, and Animesh Mukherjee. 2015. An automatic approach to identify word sense changes in text media across timescales. Natural Language Engineering 21:773–798.

Mitra, Sunny, Ritwik Mitra, Martin Riedl, Chris Biemann, Animesh Mukherjee, and Pawan Goyal. 2014. That's sick dude!: Automatic identification of word sense change across different timescales. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, MD, USA, pages 1020–1029.

Ó Séaghdha, Diarmuid. 2010. Latent Variable Models of Selectional Preference.
In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala, Sweden, pages 435–444.

Popescu, Octavian and Carlo Strapparava. 2013. Behind the Times: Detecting Epoch Changes using Large Corpora. In Proceedings of the Sixth International Joint Conference on Natural Language Processing. Nagoya, Japan, pages 347–355.

Popescu, Octavian and Carlo Strapparava. 2015. SemEval 2015, Task 7: Diachronic Text Evaluation. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Denver, CO, USA, pages 869–877.

Ritter, Alan, Mausam, and Oren Etzioni. 2010. A Latent Dirichlet Allocation Method for Selectional Preferences. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala, Sweden, pages 424–434.

Rue, Håvard and Leonhard Held. 2005. Gaussian Markov Random Fields: Theory and Applications. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. CRC Press.

Sagi, Eyal, Stefan Kaufmann, and Brady Clark. 2009. Semantic Density Analysis: Comparing Word Meaning across Time and Phonetic Space. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics. Athens, Greece, pages 104–111.

Salaberri, Haritz, Iker Salaberri, Olatz Arregi, and Beñat Zapirain. 2015. IXAGroupEHUDiac: A Multiple Approach System towards the Diachronic Evaluation of Texts. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Denver, CO, USA, pages 840–845.

Stevenson, Angus, editor. 2010. The Oxford English Dictionary. Oxford University Press, third edition.

Szymanski, Terrence and Gerard Lynch. 2015. UCD: Diachronic Text Classification with Character, Word, and Syntactic N-grams. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Denver, CO, USA, pages 879–883.

Tahmasebi, Nina, Thomas Risse, and Stefan Dietze. 2011. Towards automatic language evolution tracking: A study on word sense tracking. In Proceedings of the Joint Workshop on Knowledge Evolution and Ontology Dynamics (EvoDyn 2011). Bonn, Germany.

Wijaya, Derry Tanti and Reyyan Yeniterzi. 2011. Understanding Semantic Change of Words over Centuries. In Proceedings of the 2011 International Workshop on DETecting and Exploiting Cultural diversiTy on the Social Web. Glasgow, Scotland, UK, pages 35–40.

Zampieri, Marcos, Alina Maria Ciobanu, Vlad Niculae, and Liviu P. Dinu. 2015. AMBRA: A Ranking Approach to Temporal Text Classification. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Denver, CO, USA, pages 851–855.