Encoding Prior Knowledge with Eigenword Embeddings

Dominique Osborne
Department of Mathematics and Statistics, University of Strathclyde, Glasgow, G1 1XH, UK
dominique.osborne.13@uni.strath.ac.uk

Shashi Narayan and Shay B. Cohen
School of Informatics, University of Edinburgh, Edinburgh, EH8 9LE, UK
{snaraya2,scohen}@inf.ed.ac.uk

Abstract

Canonical correlation analysis (CCA) is a method for reducing the dimension of data represented using two views. It has been previously used to derive word embeddings, where one view indicates a word, and the other view indicates its context. We describe a way to incorporate prior knowledge into CCA, give a theoretical justification for it, and test it by deriving word embeddings and evaluating them on a myriad of datasets.

1 Introduction

In recent years there has been an immense interest in representing words as low-dimensional continuous real vectors, namely word embeddings. Word embeddings aim to capture lexico-semantic information such that regularities in the vocabulary are topologically represented in a Euclidean space. Such word embeddings have achieved state-of-the-art performance on many natural language processing (NLP) tasks, e.g., syntactic parsing (Socher et al., 2013), word or phrase similarity (Mikolov et al., 2013b), dependency parsing (Bansal et al., 2014), unsupervised learning (Parikh et al., 2014) and others. Since the discovery that word embeddings are useful as features for various NLP tasks, research on word embeddings has taken on a life of its own, with a vibrant community searching for better word representations in a variety of problems and datasets.

These word embeddings are often induced from large raw text, capturing distributional co-occurrence information via neural networks (Bengio et al., 2003; Mikolov et al., 2013b; Mikolov et al., 2013c) or spectral methods (Deerwester et al., 1990; Dhillon et al., 2015). While these general-purpose word embeddings have achieved significant improvements on various NLP tasks, it has been discovered that further tuning of these continuous word representations for specific tasks improves their performance by a larger margin. For example, in dependency parsing, word embeddings could be tailored to capture similarity in terms of context within syntactic parses (Bansal et al., 2014), or they could be refined using semantic lexicons such as WordNet (Miller, 1995), FrameNet (Baker et al., 1998) and the Paraphrase Database (Ganitkevitch et al., 2013) to improve various similarity tasks (Yu and Dredze, 2014; Faruqui et al., 2015; Rothe and Schütze, 2015). This paper proposes a method to encode prior semantic knowledge in spectral word embeddings (Dhillon et al., 2015).

Spectral learning algorithms are of great interest for their speed, scalability, theoretical guarantees and performance in various NLP applications. These algorithms are no strangers to word embeddings either. In latent semantic analysis (LSA; Deerwester et al., 1990; Landauer et al., 1998), word embeddings are learned by performing SVD on the word-by-document matrix. Recently, Dhillon et al. (2015) have proposed to use canonical correlation analysis (CCA) as a method to learn low-dimensional real vectors, called Eigenwords. Unlike LSA-based methods, CCA-based methods are scale-invariant and can capture multiview information such as the left and right contexts of the words. As a result, the eigenword embeddings of Dhillon et al.
(2015) that were learned using the simple linear methods give accuracies comparable to or better than state of the art when compared with highly non-linear deep learning based approaches (Collobert and Weston, 2008; Mnih and Hinton, 2007; Mikolov et al., 2013b; Mikolov et al., 2013c).

The main contribution of this paper is a technique to incorporate prior knowledge into the derivation of canonical correlation analysis. In contrast to previous work where prior knowledge is introduced in the off-the-shelf embeddings as a post-processing step (Faruqui et al., 2015; Rothe and Schütze, 2015), our approach introduces prior knowledge in the CCA derivation itself. In this way it preserves the theoretical properties of spectral learning algorithms for learning word embeddings. The prior knowledge is based on lexical resources such as WordNet, FrameNet and the Paraphrase Database.

Our derivation of CCA to incorporate prior knowledge is not limited to eigenwords and can be used with CCA for other problems. It follows a similar idea to the one proposed by Koren and Carmel (2003) for improving the visualization of principal vectors with principal component analysis (PCA). Our derivation represents the solution to CCA as that of an optimization problem which maximizes the distance between the two view projections of training examples, while weighting these distances using the external source of prior knowledge. As such, our approach applies to other uses of CCA in the NLP literature, such as the one of Jagarlamudi and Daumé (2012), who used CCA for transliteration, or the one of Silberer et al. (2013), who used CCA for semantically representing visual attributes.

2 Background and Notation

For an integer n, we denote by [n] the set of integers {1, …, n}. We assume the existence of a vocabulary of words, usually taken from a corpus. This set of words is denoted by H = {h_1, …, h_{|H|}}. For a square matrix A, we denote by diag(A) a diagonal matrix B which has the same dimensions as A such that B_ii = A_ii for all i. For a vector v ∈ R^d, we denote its ℓ2 norm by ‖v‖, i.e. ‖v‖ = √(∑_{i=1}^d v_i^2). We also denote by v_j or [v]_j the jth coordinate of v. For a pair of vectors u and v, we denote their dot product by ⟨u, v⟩.

We define a word embedding as a function f from H to R^m for some (relatively small) m. For example, in our experiments we vary m between 50 and 300. The word embedding function maps the word to some real-vector representation, with the intention to capture regularities in the vocabulary that are topologically represented in the corresponding Euclidean space. For example, all vocabulary words that correspond to city names could be grouped together in that space.

Research on the derivation of word embeddings that capture various regularities has greatly accelerated in recent years. Various methods used for this purpose range from low-rank approximations of co-occurrence statistics (Deerwester et al., 1990; Dhillon et al., 2015) to neural networks jointly learning a language model (Bengio et al., 2003; Mikolov et al., 2013a) or models for other NLP tasks (Collobert and Weston, 2008).
3 Canonical Correlation Analysis for Deriving Word Embeddings

One recent approach to derive word embeddings, developed by Dhillon et al. (2015), is through the use of canonical correlation analysis, resulting in so-called "eigenwords." CCA is a technique for multiview dimensionality reduction. It assumes the existence of two views for a set of data, similarly to co-training (Yarowsky, 1995; Blum and Mitchell, 1998), and then projects the data in the two views in a way that maximizes the correlation between the projected views.

Dhillon et al. (2015) used CCA to derive word embeddings through the following procedure. They first break each document in a corpus of documents into n sequences of words of a fixed length 2k + 1, where k is a window size. For example, if k = 2, the short document "Harry Potter has been a best-seller" would be broken into "Harry Potter has been a" and "Potter has been a best-seller." In each such sequence, the middle word is identified as a pivot. This leads to the construction of the following training set from a set of documents:

    {(w_1^{(i)}, …, w_k^{(i)}, w^{(i)}, w_{k+1}^{(i)}, …, w_{2k}^{(i)}) | i ∈ [n]}.

With abuse of notation, this is a multiset, as certain words are expected to appear in certain contexts multiple times. Each w^{(i)} is a pivot word, and the rest of the elements are words in the sequence called "the context words."

With this training set in mind, the two views for CCA are defined as follows. We define the first view through a sparse "context matrix" C ∈ R^{n×2k|H|} such that each row in the matrix is a vector consisting of 2k one-hot vectors, each of length |H|. Each such one-hot vector corresponds to a word that fired in a specific index in the context. In addition, we also define a second view through a matrix W ∈ R^{n×|H|} such that W_ij = 1 if w^{(i)} = h_j. We present both views of the training set in Figure 1.

[Figure 1: The word and context views represented as matrices W and C. Each row in W is a vector of length |H|, corresponding to a one-hot vector for the word in the example indexed by the row. Each row in C is a vector of length 2k|H|, divided into sub-vectors, each of length |H|. Each such sub-vector is a one-hot vector for one of the 2k context words in the example indexed by the row.]

Note that now the matrix M = W^⊤C is in R^{|H|×(2k|H|)} such that each element M_ij gives the count of times that h_i appeared with the corresponding context word and context index encoded by j. Similarly, we define matrices D_1 = diag(W^⊤W) and D_2 = diag(C^⊤C). Finally, to get the word embeddings, we perform singular value decomposition (SVD) on the matrix D_1^{-1/2} M D_2^{-1/2}. Note that in its original form, CCA requires use of W^⊤W and C^⊤C in their full form, and not just the corresponding diagonal matrices D_1 and D_2; however, in practice, inverting these matrices can be quite intensive computationally and can lead to memory issues. As such, we approximate CCA by using the diagonal matrices D_1 and D_2.

From the SVD step, we get two projections U ∈ R^{|H|×m} and V ∈ R^{2k|H|×m} such that

    D_1^{-1/2} M D_2^{-1/2} ≈ U Σ V^⊤,

where Σ ∈ R^{m×m} is a diagonal matrix with Σ_ii > 0 being the ith largest singular value of D_1^{-1/2} M D_2^{-1/2}. In order to get the final word embeddings, we calculate D_1^{-1/2} U ∈ R^{|H|×m}. Each row in this matrix corresponds to an m-dimensional vector for the corresponding word in the vocabulary. This means that f(h_i) for h_i ∈ H is the ith row of the matrix D_1^{-1/2} U.
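To make the SVD step above concrete, the sketch below (ours, not the SWELL implementation of Dhillon et al. (2015); the function name and the use of numpy/scipy are assumptions) computes approximate eigenword embeddings from a pre-computed word-context count matrix M.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags
from scipy.sparse.linalg import svds

def eigenword_embeddings(M, m=300):
    """Approximate CCA ("eigenword") embeddings from a sparse word-context
    count matrix M of shape (|H|, 2k|H|), following the recipe in Section 3."""
    M = csr_matrix(M, dtype=np.float64)
    # diag(W^T W) is proportional to the row sums of M (each pivot contributes
    # 2k context counts) and diag(C^T C) equals the column sums of M, so using
    # the sums only rescales the decomposition by a global constant.
    d1 = np.maximum(np.asarray(M.sum(axis=1)).ravel(), 1e-12)
    d2 = np.maximum(np.asarray(M.sum(axis=0)).ravel(), 1e-12)
    D1_inv_sqrt = diags(1.0 / np.sqrt(d1))
    D2_inv_sqrt = diags(1.0 / np.sqrt(d2))
    # Thin SVD of D1^{-1/2} M D2^{-1/2}, keeping the m largest singular values.
    U, S, Vt = svds(D1_inv_sqrt @ M @ D2_inv_sqrt, k=m)
    # The embedding of word h_i is the i-th row of D1^{-1/2} U;
    # V could analogously be used for context embeddings.
    return D1_inv_sqrt @ U
```

Dhillon et al. (2015) additionally use techniques such as randomized SVD and the square-root transform mentioned in §5.1, so this should be read as a sketch of the linear algebra rather than a reproduction of their system.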
The projection V can be used to get "context embeddings." See more about this in Dhillon et al. (2015).

This use of CCA to derive word embeddings follows the usual distributional hypothesis (Harris, 1957) that most word embedding techniques rely on. In the case of CCA, this hypothesis is translated into action in the following way. CCA finds projections for the contexts and for the pivot words which are most correlated. This means that if a word co-occurs in a specific context many times (either directly, or transitively through similarity to other words), then this context is expected to be projected to a point "close" to the point to which the word is projected. As such, if two words occur in a specific context many times, these two words are expected to be projected to points which are close to each other.

For the next section, we denote X = WD_1^{-1/2} and Y = CD_2^{-1/2}. To refer to the dimensions of X and Y generically, we denote d = |H| and d′ = 2k|H|. In addition, we refer to the column vectors of U and V as u_1, …, u_m and v_1, …, v_m.

Mathematical Intuition Behind CCA The procedure that CCA follows finds a projection of the two views in a shared space, such that the correlation between the two views is maximized at each coordinate, and there is minimal redundancy between the coordinates of each view. This means that CCA solves the following sequence of optimization problems for j ∈ [m], where a_j ∈ R^{1×d} and b_j ∈ R^{1×d′}:

    arg max_{a_j, b_j}  corr(a_j W^⊤, b_j C^⊤)
    such that  corr(a_j W^⊤, a_k W^⊤) = 0,  k < j
               corr(b_j C^⊤, b_k C^⊤) = 0,  k < j

where corr is a function that accepts two vectors and returns the Pearson correlation between the pairwise elements of the two vectors. The approximate solution to this optimization problem (when using diagonal D_1 and D_2) is â_i^⊤ = D_1^{-1/2} u_i and b̂_i^⊤ = D_2^{-1/2} v_i for i ∈ [m].

CCA also has a probabilistic interpretation as a maximum likelihood solution of a latent variable model for two normal random vectors, each drawn based on a third latent Gaussian vector (Bach and Jordan, 2005).

The way we describe CCA for deriving word embeddings is related to Latent Semantic Indexing (LSI), which performs singular value decomposition on the matrix M directly, without doing any kind of variance normalization. Dhillon et al. (2015) describe some differences between LSI and CCA. The extra normalization step decreases the importance of frequent words when doing SVD.

4 Incorporating Prior Knowledge into Canonical Correlation Analysis

In this section, we detail the technique we use to incorporate prior knowledge into the derivation of canonical correlation analysis. The main motivation behind our approach is to improve the optimization of correlation between the two views by weighting them using the external source of prior knowledge. The prior knowledge is based on lexical resources such as WordNet, FrameNet and the Paraphrase Database. Our approach follows a similar idea to the one proposed by Koren and Carmel (2003) for improving the visualization of principal vectors with principal component analysis (PCA). It is also related to Laplacian manifold regularization (Belkin et al., 2006).

An important notion in our derivation is that of a Laplacian matrix. The Laplacian of an undirected weighted graph is an n × n matrix, where n is the number of nodes in the graph. It equals D − A, where A is the adjacency matrix of the graph (so that A_ij is the weight of the edge (i, j) in the graph, if it exists, and 0 otherwise) and D is a diagonal matrix such that D_ii = ∑_j A_ij. The Laplacian is always a symmetric square matrix such that the sum over rows (or columns) is 0. It is also positive semi-definite.
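As a small, self-contained illustration (ours, not part of the original paper), the Laplacian can be built directly from an adjacency matrix, and the two properties just mentioned, zero row sums and positive semi-definiteness, are easy to verify numerically:

```python
import numpy as np

def graph_laplacian(A):
    """Laplacian L = D - A of an undirected weighted graph, given its
    symmetric adjacency matrix A (A[i, j] is the weight of edge (i, j))."""
    return np.diag(A.sum(axis=1)) - A

# A toy path graph over four nodes: 0-1, 1-2, 2-3, with unit weights.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = graph_laplacian(A)
assert np.allclose(L.sum(axis=1), 0)            # rows (and columns) sum to 0
assert np.all(np.linalg.eigvalsh(L) >= -1e-10)  # positive semi-definite
```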
We propose a generalization of CCA, in which we introduce a Laplacian matrix into the derivation of CCA itself, as shown in Figure 2. We encode prior knowledge about the distances between the projections of the two views into the Laplacian. The Laplacian allows us to improve the optimization of the correlation between the two views by weighting them using the external source of prior knowledge.

[Figure 2: Introducing prior knowledge in CCA. W ∈ R^{n×d} and C ∈ R^{n×d′} denote the word and context views, respectively. L ∈ R^{n×n} is a Laplacian matrix encoded with the prior knowledge about the distances between the projections of W and C.]

4.1 Generalization of CCA

We present three lemmas (proofs are given in Appendix A), followed by our main proposition. These three lemmas are useful to prove our final proposition. The main proposition shows that CCA maximizes the distance between the two view projections for any pair of examples i and j, i ≠ j, while minimizing the two view projection distance for the two views of an example i. The two views we discuss here in practice are the view of the word through a one-hot representation, and the view which represents the context words for a specific word token. The distance between two view projections is defined in Eq. 2.

Lemma 1. Let X and Y be two matrices of size n × d and n × d′, respectively, for example, as defined in §3. Assume that ∑_{i=1}^n X_ij = 0 for j ∈ [d] and ∑_{i=1}^n Y_ij = 0 for j ∈ [d′]. Let L be an n × n Laplacian matrix such that

    L_ij = n − 1  if i = j,  and  L_ij = −1  if i ≠ j.    (1)

Then X^⊤LY equals X^⊤Y up to a multiplication by a positive constant.

Lemma 2. Let A ∈ R^{d×d′}. Then the rank-m thin SVD of A can be found by solving the following optimization problem:

    max_{u_1,…,u_m, v_1,…,v_m}  ∑_{i=1}^m u_i^⊤ A v_i
    such that  ‖u_i‖ = ‖v_i‖ = 1  for i ∈ [m],
               ⟨u_i, u_j⟩ = ⟨v_i, v_j⟩ = 0  for i ≠ j,

where u_i ∈ R^{d×1} denote the left singular vectors, and v_i ∈ R^{d′×1} denote the right singular vectors.

The last utility lemma we describe shows that interjecting the Laplacian between the two views can be expressed as a weighted sum of the distances between the projections of the two views (these distances are given in Eq. 2), where the weights come from the Laplacian.

Lemma 3. Let u_1, …, u_m and v_1, …, v_m be two sets of vectors of length d and d′ respectively. Let L ∈ R^{n×n} be a Laplacian, X ∈ R^{n×d} and Y ∈ R^{n×d′}. Then:

    ∑_{k=1}^m (Xu_k)^⊤ L (Yv_k) = ∑_{i,j} −L_ij (d_ij^m)^2,

where

    d_ij^m = √( (1/2) ∑_{k=1}^m ([Xu_k]_i − [Yv_k]_j)^2 ).    (2)
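The identity in Lemma 3 is easy to check numerically; the following sketch (ours) compares both sides of the equation for random data and the Laplacian of a random weighted graph.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_prime, m = 6, 4, 5, 3
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, d_prime))
U = rng.normal(size=(d, m))        # columns play the role of u_1, ..., u_m
V = rng.normal(size=(d_prime, m))  # columns play the role of v_1, ..., v_m

# Laplacian L = D - A of a random weighted graph on n nodes.
A = rng.uniform(size=(n, n)); A = (A + A.T) / 2; np.fill_diagonal(A, 0)
L = np.diag(A.sum(axis=1)) - A

XU, YV = X @ U, Y @ V
lhs = sum(XU[:, k] @ L @ YV[:, k] for k in range(m))
# (d_ij^m)^2 as in Eq. 2, computed for all pairs (i, j) at once.
d_sq = 0.5 * ((XU[:, None, :] - YV[None, :, :]) ** 2).sum(axis=-1)
rhs = (-L * d_sq).sum()
assert np.isclose(lhs, rhs)
```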
The following proposition is our main result for this section.

Proposition 4. The matrices U ∈ R^{d×m} and V ∈ R^{d′×m} that CCA computes are the m-dimensional projections that maximize

    ∑_{i,j} (d_ij^m)^2 − n ∑_{i=1}^n (d_ii^m)^2,    (3)

where d_ij^m is defined as in Eq. 2 for u_1, …, u_m being the columns of U and v_1, …, v_m being the columns of V.

Proof. According to Lemma 3, the objective in Eq. 3 equals ∑_{k=1}^m (Xu_k)^⊤ L (Yv_k), where L is defined as in Eq. 1. Therefore, maximizing Eq. 3 corresponds to maximization of ∑_{k=1}^m (Xu_k)^⊤ L (Yv_k) under the constraints that the U and V matrices have orthonormal vectors. Using Lemma 2, it can be shown that the solution to this maximization is found by doing singular value decomposition on X^⊤LY. According to Lemma 1, this corresponds to finding U and V by doing singular value decomposition on X^⊤Y, because a multiplicative constant does not change the value of the right/left singular vectors.

The above proposition shows that CCA tries to find projections of both views such that the distances between the two views for pairs of examples with indices i ≠ j are maximized (first term in Eq. 3), while minimizing the distance between the projections of the two views for a specific example (second term in Eq. 3). Therefore, CCA tries to project a context and a word in that context to points that are close to each other in a shared space, while maximizing the distance between a context and a word which do not often co-occur together.

As long as L is a Laplacian, Proposition 4 is still true, only with the maximization of the objective

    ∑_{i,j} −L_ij (d_ij^m)^2,    (4)

where L_ij ≤ 0 for i ≠ j and L_ii ≥ 0. This result lends itself to a generalization of CCA, in which we use predefined weights for the Laplacian that encode some prior knowledge about the distances that the projections of the two views should satisfy.

If the weight −L_ij is large for a specific (i, j), then we will try harder to maximize the distance between one view of example i and the other view of example j (i.e. we will try to project the word w^{(i)} and the context of example j into distant points in the space). This means that in the current formulation, −L_ij plays the role of a dissimilarity indicator between pairs of words. The more dissimilar words are, the larger the weight, and the more distant the projections are for the contexts and the words.

4.2 From CCA with Dissimilarities to CCA with Similarities

It is often more convenient to work with similarity measures between pairs of words. To do that, we can retain the same formulation as before with the Laplacian, where −L_ij now denotes a measure of similarity. Now, instead of maximizing the objective in Eq. 4, we are required to minimize it. It can be shown that such a mirror formulation can be done with an algorithm similar to CCA, leading to a proposition in the style of Proposition 4. To solve this minimization formulation, we just need to choose the singular vectors associated with the smallest m singular values (instead of the largest).

Once we change the CCA algorithm with the Laplacian to choose these projections, we can define L, for example, based on a similarity graph.
The graph is an undirected graph that has |H| nodes, one for each word in the vocabulary, and there is an edge between a pair of words whenever the two words are similar to each other based on some external source of information, such as WordNet (for example, if they are synonyms). We then define the Laplacian L such that L_ij = −1 if i and j are adjacent in the graph (and i ≠ j), L_ii is the degree of the node i, and L_ij = 0 in all other cases.

By using this variant of CCA, we strive to minimize the distance of the two views between words which are adjacent in the graph (or, continuing the example above, to maximize the distance between words which are not synonyms). In addition, the more adjacent nodes a word has (or the more synonyms it has), the less important it is to minimize the distance between the two views of that given word.

4.3 Final Algorithm

In order to use an arbitrary Laplacian matrix with CCA, we require that the data is centered, i.e. that the average over all examples of each of the coordinates of the word and context vectors is 0. However, such a prerequisite would make the matrices C and W dense (with many non-zero values) and hard to maintain in memory, and would also make singular value decomposition inefficient. As such, we do not center the data, to keep it sparse, and instead use a matrix L which is not strictly a Laplacian, but that behaves better in practice.[1]

Given the graph mentioned in §4, which is extracted from an external source of information, we use L such that L_ij = α, for an α ∈ (0, 1) which is treated as a smoothing factor for the graph (see below for the choices of α), if i and j are not adjacent in the graph, L_ij = 0 if i ≠ j are adjacent, and finally L_ii = 1 for all i ∈ [n]. Therefore, this matrix is symmetric, and the only constraint it does not satisfy is that of rows and columns summing to 0.

Scanning the documents and calculating the statistic matrix with the Laplacian is computationally infeasible with a large number of tokens given as input: it is quadratic in that number. As such, we make another modification to the algorithm, and calculate a "local" Laplacian. The modification requires an integer N as input (we use N = 12), and then it makes updates to pairs of word tokens only if they are within an N-sized window of each other.

The final algorithm we use is described in Figure 3. The algorithm works by directly computing the co-occurrence matrix M (instead of maintaining W and C). It does so by increasing by 1 any cells corresponding to word-context co-occurrences in the documents, and by α any cells corresponding to words and contexts that are connected in the graph.

Figure 3: The CCA-like algorithm that returns word embeddings with prior knowledge encoded based on a similarity graph.

  Inputs: A set of examples {(w_1^{(i)}, …, w_k^{(i)}, w^{(i)}, w_{k+1}^{(i)}, …, w_{2k}^{(i)}) | i ∈ [n]}, an integer m, an α ∈ (0, 1], an undirected graph G over H, an integer N.
  Data structures: A matrix M of size |H| × (2k|H|) (cross-covariance matrix), a matrix U corresponding to the word embeddings.
  Algorithm:
  (Cross-covariance estimation) For all i, j ∈ [n] such that |i − j| ≤ N:
  • If i = j, increase M_rs by 1 for r denoting the index of word w^{(i)} and for all s denoting the context indices of the words w_1^{(i)}, …, w_k^{(i)} and w_{k+1}^{(i)}, …, w_{2k}^{(i)}.
  • If i ≠ j and word w^{(i)} is connected to word w^{(j)} in G, increase M_rs by α for r denoting the index of word w^{(i)} and for all s denoting the context indices of the words w_1^{(j)}, …, w_k^{(j)} and w_{k+1}^{(j)}, …, w_{2k}^{(j)}.
  • Calculate D_1 and D_2 as specified in §3.
  (Singular value decomposition step)
  • Perform singular value decomposition on D_1^{-1/2} M D_2^{-1/2} to get a matrix U ∈ R^{|H|×m}.
  (Word embedding projection)
  • For each word h_i for i ∈ [|H|], return the word embedding that corresponds to the ith row of U.

[1] We note that other decompositions, such as PCA, also require centering of the data, but in the case of a sparse data matrix, this step is not performed.
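For illustration, a compact (and unoptimized) sketch of the procedure in Figure 3 might look as follows; it is ours, not the modified SWELL code, and it omits details such as tokenization, the vocabulary cut-off of §5.1 and the square-root transform. Here `graph` is a set of unordered word-id pairs taken from the prior-knowledge resource.

```python
import numpy as np
from scipy.sparse import lil_matrix, diags
from scipy.sparse.linalg import svds

def cca_prior_embeddings(docs, vocab_size, graph, k=2, alpha=0.5, N=12, m=300):
    """CCA-like embeddings with prior knowledge (cf. Figure 3).

    docs  : list of documents, each a list of word ids in [0, vocab_size)
    graph : set of frozenset({w, w'}) pairs of word ids that are similar
            according to the external resource (WordNet, PPDB, FrameNet)."""
    H = vocab_size
    M = lil_matrix((H, 2 * k * H))

    def context_columns(doc, pos):
        # Column indices in M for the 2k context slots around position `pos`.
        cols = []
        for offset, q in enumerate(range(pos - k, pos + k + 1)):
            if q == pos or not 0 <= q < len(doc):
                continue
            slot = offset if offset < k else offset - 1   # slot in [0, 2k)
            cols.append(slot * H + doc[q])
        return cols

    for doc in docs:
        for i, w in enumerate(doc):
            for s in context_columns(doc, i):
                M[w, s] += 1.0          # i = j: ordinary co-occurrence counts
            # i != j: pivots within an N-sized window whose words are connected
            # in the prior-knowledge graph share their contexts with weight alpha.
            for j in range(max(0, i - N), min(len(doc), i + N + 1)):
                if j != i and frozenset((w, doc[j])) in graph:
                    for s in context_columns(doc, j):
                        M[w, s] += alpha

    M = M.tocsr()
    # As in the earlier sketch, row/column sums stand in for diag(W^T W), diag(C^T C).
    d1 = np.maximum(np.asarray(M.sum(axis=1)).ravel(), 1e-12)
    d2 = np.maximum(np.asarray(M.sum(axis=0)).ravel(), 1e-12)
    U, _, _ = svds(diags(1 / np.sqrt(d1)) @ M @ diags(1 / np.sqrt(d2)), k=m)
    return diags(1 / np.sqrt(d1)) @ U
```

Note that m must be smaller than the vocabulary size for the truncated SVD to be defined.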
5 Experiments

In this section we describe our experiments.

5.1 Experimental Setup

Training Data We used three datasets, WIKI1, WIKI2 and WIKI5, all based on the first 1, 2 and 5 billion words from Wikipedia, respectively.[2] Each dataset is broken into chunks of length 13 (window size of 6), each corresponding to a document. The above Laplacian L is calculated within each document separately. This means that −L_ij is 1 only if i and j denote two words that appear in the same document. This is done to make the calculations computationally feasible. We calculate word embeddings for the 200K most frequent words.

Prior Knowledge Resources We consider three sources of prior knowledge: WordNet (Miller, 1995), the Paraphrase Database of Ganitkevitch et al. (2013), abbreviated as PPDB,[3] and FrameNet (Baker et al., 1998). Since FrameNet and WordNet index words in their base form, we use WordNet's stemmer to identify the base form for the text in our corpora whenever we calculate the Laplacian graph. For WordNet, we have an edge in the graph if one word is a synonym, hypernym or hyponym of the other. For PPDB, we have an edge if one word is a paraphrase of the other, according to the database. For FrameNet, we connect two words in the graph if they appear in the same frame.

System Implementation We modified the implementation of the SWELL Java package[4] of Dhillon et al. (2015). Specifically, we needed to modify the loop that iterates over words in each document to a nested loop that iterates over pairs of words, in order to compute a sum of the form ∑_{ij} X_ri L_ij Y_js.[5] Dhillon et al. (2015) use window size k = 2, which we retain in our experiments.[6]

[2] We downloaded the data from https://dumps.wikimedia.org/, and preprocessed it using the tool available at http://mattmahoney.net/dc/textdata.html.
[3] We use the XL subset of the PPDB.
[4] https://github.com/paramveerdhillon/swell
[5] Our implementation and the word embeddings that we calculated are available at http://cohort.inf.ed.ac.uk/cohort/eigen/.
[6] We also use the square-root transformation mentioned in Dhillon et al. (2015), which controls the variance in the counts accumulated from the corpus. See a justification for this transform in Stratos et al. (2015).

                          Word similarity average     Geographic analogies        NP bracketing
                          NPK   WN    PD    FN        NPK   WN    PD    FN        NPK   WN    PD    FN
Retrofitting (blocks A / B / C)
  Glove                   59.7  63.1  64.6  57.5      94.8  75.3  80.4  94.8      78.1  79.5  79.4  78.7
  Skip-Gram               64.1  65.5  68.6  62.3      87.3  72.3  70.5  87.7      79.9  80.4  81.5  80.5
  Global Context          44.4  50.0  50.4  47.3       7.3   4.5  18.2   7.3      79.4  79.1  80.5  80.2
  Multilingual            62.3  66.9  68.2  62.8      70.7  46.2  53.7  72.7      81.9  81.8  82.7  82.0
  Eigen (CCA)             59.5  62.2  63.6  61.4      89.9  79.2  73.5  89.9      81.3  81.7  81.2  80.7
CCAPrior (blocks D / E / F)
  α = 0.1                 -     59.1  59.6  59.5      -     88.9  88.7  89.9      -     81.0  82.4  81.0
  α = 0.2                 -     59.9  60.6  60.0      -     89.1  91.3  90.1      -     81.0  81.3  80.7
  α = 0.5                 -     59.9  59.7  59.6      -     86.9  89.3  89.3      -     81.8  81.4  80.9
  α = 0.7                 -     60.7  59.3  59.5      -     86.9  89.3  92.9      -     80.3  81.2  80.8
  α = 0.9                 -     60.6  59.6  58.9      -     89.1  93.2  92.5      -     81.3  80.7  81.0
CCAPrior+RF (blocks G / H / I)
  α = 0.1                 -     61.9  63.6  61.5      -     76.0  71.9  89.9      -     81.4  81.7  81.2
  α = 0.2                 -     62.6  64.9  61.6      -     78.0  69.3  90.1      -     81.7  81.1  80.6
  α = 0.5                 -     62.7  63.7  61.4      -     74.9  67.3  92.9      -     81.9  81.4  80.0
  α = 0.7                 -     63.3  63.0  61.0      -     77.4  65.6  90.3      -     81.0  80.8  80.4
  α = 0.9                 -     62.0  63.3  60.4      -     77.3  66.2  92.5      -     81.0  80.7  80.4

Table 1: Results for the word similarity datasets, geographic analogies and NP bracketing. The first upper blocks (A–C) present the results with retrofitting. NPK stands for no prior knowledge (no retrofitting is used), WN for WordNet, PD for PPDB and FN for FrameNet. Glove, Skip-Gram, Global Context, Multilingual and Eigen are the word embeddings of Pennington et al. (2014), Mikolov et al.
(2013b), Huang et al. (2012), Faruqui and Dyer (2014) and Dhillon et al. (2015), respectively. The second, middle blocks (D–F) show the results of our eigenword embeddings encoded with prior knowledge using our method. Each row in the block corresponds to a specific use of an α value (smoothing factor), as described in Figure 3. In the lower blocks (G–I) we take the word embeddings from the second block, and retrofit them using the method of Faruqui et al. (2015). Best results in each block are in bold.

5.2 Baselines

Off-the-shelf Word Embeddings We compare our word embeddings with existing state-of-the-art word embeddings, such as Glove (Pennington et al., 2014), Skip-Gram (Mikolov et al., 2013b), Global Context (Huang et al., 2012) and Multilingual (Faruqui and Dyer, 2014). We also compare our word embeddings with the Eigen word embeddings of Dhillon et al. (2015) without any prior knowledge.

Retrofitting for Prior Knowledge We compare our approach of incorporating prior knowledge into the derivation of CCA against previous work where prior knowledge is introduced into the off-the-shelf embeddings as a post-processing step (Faruqui et al., 2015; Rothe and Schütze, 2015). In this paper, we focus on the retrofitting approach of Faruqui et al. (2015). Retrofitting works by optimizing an objective function which has two terms: one that tries to keep the distance between the word vectors close to the original distances, and the other which enforces the vectors of words which are adjacent in the prior knowledge graph to be close to each other in the new embedding space. We use the retrofitting package[7] to compare our results in different settings against the results of retrofitting of Faruqui et al. (2015).

[7] https://github.com/mfaruqui/retrofitting

5.3 Evaluation Benchmarks

We evaluated the quality of our eigenword embeddings on three different tasks: word similarity, geographic analogies and NP bracketing.

Word Similarity For the word similarity task we experimented with 11 different, widely used benchmarks. The WS-353-ALL dataset (Finkelstein et al., 2002) consists of 353 pairs of English words with their human similarity ratings. Later, Agirre et al. (2009) re-annotated WS-353-ALL for similarity (WS-353-SIM) and relatedness (WS-353-REL), with specific distinctions between them. The SimLex-999 dataset (Hill et al., 2015) was built to measure how well models capture similarity, rather than relatedness or association. The MEN-TR-3000 dataset (Bruni et al., 2014) consists of 3000 word pairs sampled from words that occur at least 700 times in a large web corpus. The MTurk-287 (Radinsky et al., 2011) and MTurk-771 (Halawi et al., 2012) datasets were scored by Amazon Mechanical Turk workers for relatedness of English word pairs. The YP-130 (Yang and Powers, 2005) and Verb-143 (Baker et al., 2014) datasets were developed for verb similarity predictions. The last two datasets, MC-30 (Miller and Charles, 1991) and RG-65 (Rubenstein and Goodenough, 1965), consist of 30 and 65 noun pairs, respectively.

For each dataset, we calculate the cosine similarity between the vectors of word pairs and measure Spearman's rank correlation coefficient between the scores produced by the embeddings and the human ratings.
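For reference, the evaluation loop for a single word-similarity benchmark amounts to a few lines; this is a sketch under the assumption that embeddings are stored in a word-to-vector dictionary (the helper name is ours).

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(embeddings, pairs):
    """embeddings: dict mapping word -> np.ndarray
    pairs: list of (word1, word2, human_score) tuples from one benchmark.
    Returns Spearman's rho between cosine similarities and human ratings,
    skipping pairs with out-of-vocabulary words."""
    system, gold = [], []
    for w1, w2, score in pairs:
        if w1 in embeddings and w2 in embeddings:
            v1, v2 = embeddings[w1], embeddings[w2]
            system.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
            gold.append(score)
    rho, _pvalue = spearmanr(system, gold)
    return rho
```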
We report the average of the correlations on all 11 datasets. Each word similarity task in the above list represents a different aspect of word similarity, and as such, averaging the results points to the quality of the word embeddings on several tasks. We later analyze specific datasets.

Geographic Analogies Mikolov et al. (2013c) created a test set of analogous word pairs such as a:b c:d, raising the analogy question of the form "a is to b as c is to ___," where d is unknown. We report results on a subset of this dataset which focuses on finding capitals of common countries, e.g., Greece is to Athens as Iraq is to ___. This dataset consists of 506 word pairs. For given word pairs a:b c:d where d is unknown, we use the vector offset method (Mikolov et al., 2013b), i.e., we compute a vector v = v_b − v_a + v_c, where v_a, v_b and v_c are vector representations of the words a, b and c respectively; we then return the word d with the greatest cosine similarity to v.

NP Bracketing Here the goal is to identify the correct bracketing of a three-word noun phrase (Lazaridou et al., 2013). For example, the bracketing of annual (price growth) is "right," while the bracketing of (entry level) machine is "left." Similarly to Faruqui and Dyer (2015), we concatenate the word vectors of the three words, and use this vector for binary classification into left or right.

Since most of the datasets that we evaluate on in this paper are not standardly separated into development and test sets, we report all results we calculated (with respect to hyperparameter differences) and do not select just a subset of the results.
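The vector offset evaluation for the geographic analogies described above can be sketched as follows (ours; excluding the three query words from the candidate set is standard practice, though the paper does not spell this detail out).

```python
import numpy as np

def solve_analogy(embeddings, a, b, c):
    """Return the word d with the greatest cosine similarity to v_b - v_a + v_c,
    answering "a is to b as c is to d" (vector offset method, Mikolov et al., 2013b)."""
    v = embeddings[b] - embeddings[a] + embeddings[c]
    v = v / np.linalg.norm(v)
    best_word, best_sim = None, -np.inf
    for word, u in embeddings.items():
        if word in (a, b, c):
            continue
        sim = (u @ v) / np.linalg.norm(u)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word
```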
5.4 Evaluation

Preliminary Experiments In our first set of experiments, we vary the dimension of the word embedding vectors, trying m ∈ {50, 100, 200, 300}. Our experiments showed that the results consistently improve when the dimension increases, for all the different datasets. For example, for m = 50 and WIKI1, we get an average of 46.4 on the word similarity tasks, 50.1 for m = 100, 53.4 for m = 200 and 54.2 for m = 300. The more data are available, the more likely a larger dimension is to improve the quality of the word embeddings. Indeed, for WIKI5, we get an average of 49.4, 54.9, 57.0 and 59.5 for each of the dimensions. The improvements with respect to the dimension are consistent across all of our results, so we fix m at 300.

We also noticed a consistent improvement in accuracy when using more data from Wikipedia. For example, for m = 300, using WIKI1 gives an average of 54.1, while using WIKI2 gives an average of 54.9 and, finally, using WIKI5 gives an average of 59.5. We fix the dataset we use to be WIKI5.

Results Table 1 describes the results from our first set of experiments. (Note that the table is divided into 9 distinct blocks, labeled A through I.) In general, adding prior knowledge to eigenword embeddings does improve the quality of word vectors for the word similarity, geographic analogies and NP bracketing tasks on several occasions (blocks D–F compared to the last row in blocks A–C). For example, our eigenword vectors encoded with prior knowledge (CCAPrior) consistently perform better than the eigenword vectors that do not have any prior knowledge for the word similarity task (59.5, Eigen in the first block under the NPK column, versus block D). The only exceptions are for α = 0.1 with WordNet (59.1), for α = 0.7 with PPDB (59.3) and for α = 0.9 with FrameNet (58.9), where α denotes the smoothing factor.

In several cases, running the retrofitting algorithm of Faruqui et al. (2015) on top of our word embeddings helps further, as if "adding prior knowledge twice is better than once." Results for these word embeddings (CCAPrior+RF) are shown in Table 1. Adding retrofitting to our encoding of prior knowledge often performs better for the word similarity and NP bracketing tasks (block D versus G and block F versus I). Interestingly, CCAPrior+RF embeddings also often perform better than the eigenword vectors (Eigen) of Dhillon et al. (2015) retrofitted using the method of Faruqui et al. (2015). For example, in the word similarity task, eigenwords retrofitted with WordNet get an accuracy of 62.2, whereas encoding prior knowledge using both CCA and retrofitting gets a maximum accuracy of 63.3. We see the same pattern for PPDB, with 63.6 for "Eigen" and 64.9 for "CCAPrior+RF". We hypothesize that the reason for these changes is that the two methods for encoding prior knowledge maximize different objective functions.

The performance with FrameNet is weaker, in some cases leading to worse performance (e.g., with the Glove and Skip-Gram vectors). We believe that FrameNet does not perform as well as the other lexicons because it groups words based on very abstract concepts; often words with seemingly distantly related meanings (e.g., push and growth) can evoke the same frame. This also supports the findings of Faruqui et al. (2015), who noticed that the use of FrameNet as a prior knowledge resource for improving the quality of word embeddings is not as helpful as other resources such as WordNet and PPDB.

We note that CCA works especially well for the geographic analogies dataset. The quality of eigenword embeddings (and the other embeddings) degrades when we encode prior knowledge using the method of Faruqui et al. (2015); our method improves the quality of eigenword embeddings.

Global Picture of the Results When comparing retrofitting to CCA with prior knowledge, there is a noticeable difference. Retrofitting performs well or badly depending on the dataset, while the results with CCA are more stable. We attribute this to the difference between how our algorithm and retrofitting work. Retrofitting makes direct use of the source of prior knowledge, by adding a regularization term that enforces words which are similar according to the prior knowledge to be closer in the embedding space. Our algorithm, on the other hand, makes a more indirect use of the source of prior knowledge, by changing the co-occurrence matrix on which we do singular value decomposition.

Specifically, we believe that our algorithm is more stable in cases in which words for the task at hand are unknown with respect to the source of prior knowledge. This is demonstrated with the geographical analogies task: in that case, retrofitting lowers the results in most cases. The city and country names do not appear in the sources of prior knowledge we used.

Further Analysis We further inspected the results on the word similarity tasks for the RG-65 and WS-353-ALL datasets. Our goal was to find cases in which either CCA embeddings by themselves outperform other types of embeddings, or encoding prior knowledge into CCA the way we describe significantly improves the results.

For the WS-353-ALL dataset, the eigenword embeddings get a correlation of 69.6. The next best performing word embeddings are the multilingual word embeddings (68.0) and skip-gram (58.3). Interestingly enough, the multilingual word embeddings also use CCA to project words into a low-dimensional space using a linear transformation, suggesting that linear projections are a good fit for the WS-353-ALL dataset.
The dataset itself includes pairs of common words with a corresponding similarity score. The words that appear in the dataset are actually expected to occur in similar contexts, a property that CCA directly encodes when deriving word embeddings.

The best performance on the RG-65 dataset is with the Glove word embeddings (76.6). CCA embeddings give an accuracy of 69.7 on that dataset. However, with this dataset, we observe a significant improvement when encoding prior knowledge using our method. For example, using WordNet with this dataset improves the results by 4.2 points (73.9). Using the method of Faruqui et al. (2015) (with WordNet) on top of our CCA word embeddings improves the results even further, by 8.7 points (78.4).

The Role of Prior Knowledge We also designed an experiment to test whether using distributional information is necessary for having well-performing word embeddings, or whether it is sufficient to rely on the prior knowledge resource. In order to test this, we created a sparse matrix that corresponds to the similarity graph extracted from the external resource. We then follow up with singular value decomposition on that graph, and get embeddings of size 300. Table 2 gives the results when using these embeddings.

Resource    WordSim    NP Bracketing
WordNet     35.9       73.6
PPDB        37.5       77.9
FrameNet    19.9       74.5

Table 2: Results on the word similarity dataset (average over 11 datasets) and NP bracketing. The word embeddings are derived by using SVD on the similarity graph extracted from the prior knowledge source (WordNet, PPDB and FrameNet).

We see that the results are consistently lower than the results that appear in Table 1, implying that the use of prior knowledge comes hand in hand with the use of distributional information. When using the retrofitting method of Faruqui et al. on top of these word embeddings, the results barely improved.
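The control experiment above can be sketched as follows (ours): build the adjacency matrix of the prior-knowledge graph over the vocabulary and take a rank-300 SVD of it, with no corpus counts involved.

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import svds

def graph_only_embeddings(vocab_index, edges, m=300):
    """Embeddings of size m from the prior-knowledge graph alone.

    vocab_index: dict mapping word -> row index
    edges: iterable of (word1, word2) pairs from WordNet / PPDB / FrameNet."""
    H = len(vocab_index)
    A = lil_matrix((H, H))
    for w1, w2 in edges:
        if w1 in vocab_index and w2 in vocab_index:
            i, j = vocab_index[w1], vocab_index[w2]
            A[i, j] = A[j, i] = 1.0
    U, S, Vt = svds(A.tocsr(), k=m)
    # The paper does not specify whether the singular values are folded in;
    # returning the left singular vectors is one simple choice.
    return U
```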
6 Related Work

Our ideas in this paper for encoding prior knowledge in eigenword embeddings relate to three main threads in the existing literature.

One of the threads focuses on modifying the objective of word vector training algorithms. Yu and Dredze (2014), Xu et al. (2014), Fried and Duh (2015) and Bian et al. (2014) augment the training objective in the neural language models of Mikolov et al. (2013a) to encourage semantically related word vectors to come closer to each other. Wang et al. (2014) propose a method for jointly embedding entities (from FreeBase, a large community-curated knowledge base) and words (from Wikipedia) into the same continuous vector space. Chen and de Melo (2015) propose a similar joint model to improve the word embeddings, but rather than using structured knowledge sources, their model focuses on discovering stronger semantic connections in specific contexts in a text corpus.

Another research thread relies on post-processing steps to encode prior knowledge from semantic lexicons in off-the-shelf word embeddings. The main intuition behind this trend is to update word vectors by running belief propagation on a graph extracted from the relation information in semantic lexicons. The retrofitting approach of Faruqui et al. (2015) uses such techniques to obtain higher quality semantic vectors using WordNet, FrameNet, and the Paraphrase Database. They report on how retrofitting helps improve the performance of various off-the-shelf word vectors, such as Glove, Skip-Gram, Global Context, and Multilingual, on various word similarity tasks. Rothe and Schütze (2015) also describe how standard word vectors can be extended to various data types in semantic lexicons, e.g., synsets and lexemes in WordNet.

Most of the standard word vector training algorithms use co-occurrence within window-based contexts to measure relatedness among words. Several studies question the limitations of defining relatedness in this way and investigate whether the word co-occurrence matrix can be constructed to encode prior knowledge directly, to improve the quality of word vectors. Wang et al. (2015) investigate the notion of relatedness in embedding models by incorporating syntactic and lexicographic knowledge. In spectral learning, Yih et al. (2012) augment the word co-occurrence matrix on which LSA operates with relational information such that synonyms will tend to have positive cosine similarity, and antonyms will tend to have negative similarities. Their vector space representation successfully projects synonyms and antonyms on opposite sides in the projected space. Chang et al. (2013) further generalize this approach to encode multiple relations (and not just opposing relations, such as synonyms and antonyms) using multi-relational LSA.

In spectral learning, most of the studies on incorporating prior knowledge in word vectors focus on LSA-based word embeddings (Yih et al., 2012; Chang et al., 2013; Turney and Littman, 2005; Turney, 2006; Turney and Pantel, 2010).

From the technical perspective, our work is also related to that of Jagarlamudi et al. (2011), who showed how to generalize CCA so that it uses locality preserving projections (He and Niyogi, 2004). They also assume the existence of a weight matrix in a multi-view setting that describes the distances between pairs of points in the two views.

More generally, CCA is an important component of spectral learning algorithms in the unsupervised setting and with latent variables (Cohen et al., 2014; Narayan and Cohen, 2016; Stratos et al., 2016). Our method for incorporating prior knowledge into CCA could potentially be transferred to these algorithms.

7 Conclusion

We described a method for incorporating prior knowledge into CCA. Our method requires a relatively simple change to the original canonical correlation analysis, where extra counts are added to the matrix on which singular value decomposition is performed. We used our method to derive word embeddings in the style of eigenwords, and tested them on a set of datasets. Our results demonstrate several advantages of encoding prior knowledge into eigenword embeddings.

Acknowledgements

The authors would like to thank Paramveer Dhillon for his help with running the SWELL package. The authors would also like to thank Manaal Faruqui and Sujay Kumar Jauhar for their help and technical assistance with the retrofitting package and the word embedding evaluation suite. Thanks also to Ankur Parikh for early discussions on this project. This work was completed while the first author was an intern at the University of Edinburgh, as part of the Equate Scotland program. This research was supported by an EPSRC grant (EP/L02411X/1) and an EU H2020 grant (688139/H2020-ICT-2015; SUMMA).

Appendix A: Proofs

Proof of Lemma 1. The proof is similar to the one that appears in Koren and Carmel (2003) for their Lemma 3.1. The only difference is the use of two views. Note that

    [X^⊤LY]_ij = ∑_{k,k′} X_ki L_{kk′} Y_{k′j}.
As such,

    [X^⊤LY]_ij = ∑_{k,k′} (n δ_{kk′} − 1) X_ki Y_{k′j}
               = ∑_{k=1}^n n X_ki Y_kj − (∑_{k=1}^n X_ki)(∑_{k′=1}^n Y_{k′j})
               = n [X^⊤Y]_ij,

where δ_{kk′} = 1 iff k = k′ and 0 otherwise, and the last equality relies on the assumption of the data being centered, which makes both bracketed sums equal to 0.

Proof of Lemma 2. Without loss of generality, assume d ≤ d′. Let u′_1, …, u′_d be the left singular vectors of A and v′_1, …, v′_{d′} be the right ones, and σ_1, …, σ_d be the singular values. Therefore A = ∑_{j=1}^d σ_j u′_j (v′_j)^⊤. In addition, the objective equals (after substituting A):

    ∑_{i=1}^m ∑_{j=1}^d σ_j ⟨u_i, u′_j⟩ ⟨v_i, v′_j⟩ = ∑_{j=1}^d σ_j (∑_{i=1}^m ⟨u_i, u′_j⟩ ⟨v_i, v′_j⟩).    (5)

Note that by the Cauchy-Schwarz inequality:

    ∑_{j=1}^d ∑_{i=1}^m ⟨u_i, u′_j⟩ ⟨v_i, v′_j⟩ = ∑_{i=1}^m ∑_{j=1}^d ⟨u_i, u′_j⟩ ⟨v_i, v′_j⟩
        ≤ ∑_{i=1}^m √(∑_{j=1}^d |⟨u_i, u′_j⟩|^2) √(∑_{j=1}^d |⟨v_i, v′_j⟩|^2)
        ≤ m.

In addition, note that if we choose u_i = u′_i and v_i = v′_i, then the inequality above becomes an equality, and in addition, the objective in Eq. 5 will equal the sum of the m largest singular values, ∑_{j=1}^m σ_j. As such, this assignment to u_i and v_i maximizes the objective.

Proof of Lemma 3. First, by definition of matrix multiplication,

    ∑_{k=1}^m (Xu_k)^⊤ L (Yv_k) = ∑_{i,j} L_ij (∑_{k=1}^m [Xu_k]_i [Yv_k]_j).    (6)

Also,

    (d_ij^m)^2 = (1/2) ∑_{k=1}^m ([Xu_k]_i^2 − 2[Xu_k]_i [Yv_k]_j + [Yv_k]_j^2).

Therefore,

    2 ∑_{i,j} −L_ij (d_ij^m)^2 = ∑_{i,j} −L_ij (∑_{k=1}^m −2[Xu_k]_i [Yv_k]_j) + ∑_{i,j} −L_ij (∑_{k=1}^m [Xu_k]_i^2 + [Yv_k]_j^2)
                               = 2 ∑_{i,j} L_ij (∑_{k=1}^m [Xu_k]_i [Yv_k]_j),    (7)

where the second sum in the first line disappears because of the definition of the Laplacian: its rows and columns sum to 0. The comparison of Eq. 6 to Eq. 7 gives us the necessary result.

References

Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and wordnet-based approaches. In Proceedings of HLT-NAACL.

Francis Bach and Michael Jordan. 2005. A probabilistic interpretation of canonical correlation analysis. Technical Report 688, Department of Statistics, University of California, Berkeley.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of ACL.

Simon Baker, Roi Reichart, and Anna Korhonen. 2014. An unsupervised model for instance level subcategorization acquisition. In Proceedings of EMNLP.

Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word representations for dependency parsing. In Proceedings of ACL.

Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. 2006. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

Jiang Bian, Bin Gao, and Tie-Yan Liu. 2014. Knowledge-powered deep learning for word embedding. In Machine Learning and Knowledge Discovery in Databases, volume 8724 of Lecture Notes in Computer Science, pages 132–148.

Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of COLT.

Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1–47.

Kai-Wei Chang, Wen-tau Yih, and Christopher Meek. 2013. Multi-relational latent semantic analysis. In Proceedings of EMNLP.
Jiaqiang Chen and Gerard de Melo. 2015. Semantic information extraction for improved word embeddings. In Proceedings of NAACL Workshop on Vector Space Modeling for NLP.

Shay B. Cohen, K. Stratos, Michael Collins, Dean P. Foster, and Lyle Ungar. 2014. Spectral learning of latent-variable PCFGs: Algorithms and sample complexity. Journal of Machine Learning Research.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of ICML.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

Paramveer S. Dhillon, Dean P. Foster, and Lyle H. Ungar. 2015. Eigenwords: Spectral word embeddings. Journal of Machine Learning Research, 16:3035–3078.

Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of EACL.

Manaal Faruqui and Chris Dyer. 2015. Non-distributional word vector representations. In Proceedings of ACL.

Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL.

Lev Finkelstein, Gabrilovich Evgenly, Matias Yossi, Rivlin Ehud, Solan Zach, Wolfman Gadi, and Ruppin Eytan. 2002. Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1):116–131.

Daniel Fried and Kevin Duh. 2015. Incorporating both distributional and relational semantics in word representations. In Proceedings of ICLR.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In Proceedings of NAACL.

Guy Halawi, Gideon Dror, Evgeniy Gabrilovich, and Yehuda Koren. 2012. Large-scale learning of word relatedness with constraints. In Proceedings of ACM SIGKDD.

Zellig S. Harris. 1957. Co-occurrence and transformation in linguistic structure. Language, 33(3):283–340.

Xiaofei He and Partha Niyogi. 2004. Locality preserving projections. In Proceedings of NIPS.

Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.

Eric H Huang, Richard Socher, Christopher D Manning, and Andrew Y Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of ACL.

Jagadeesh Jagarlamudi and Hal Daumé. 2012. Regularized interlingual projections: Evaluation on multilingual transliteration. In Proceedings of EMNLP-CoNLL.

Jagadeesh Jagarlamudi, Raghavendra Udupa, and Hal Daumé. 2011. Generalization of CCA via spectral embedding. In Proceedings of the Snowbird Learning Workshop of AISTATS.

Yehuda Koren and Liran Carmel. 2003. Visualization of labeled data using linear transformations. In Proceedings of IEEE Conference on Information Visualization.

Thomas K. Landauer, Peter W. Foltz, and Darrell Laham. 1998. An introduction to latent semantic analysis. Discourse Processes, 25:259–284.

Angeliki Lazaridou, Eva Maria Vecchi, and Marco Baroni. 2013. Fish transporters and miracle homes: How compositional distributional semantics can help NP parsing. In Proceedings of EMNLP.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Proceedings of ICLR Workshop.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013c. Linguistic regularities in continuous space word representations. In Proceedings of NAACL-HLT.

George A. Miller and Walter G. Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28.

George A Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Andriy Mnih and Geoffrey Hinton. 2007. Three new graphical models for statistical language modelling. In Proceedings of ICML.

Shashi Narayan and Shay B. Cohen. 2016. Optimizing spectral learning for parsing. In Proceedings of ACL.

Ankur P. Parikh, Shay B. Cohen, and Eric Xing. 2014. Spectral unsupervised parsing with additive tree metrics. In Proceedings of ACL.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of EMNLP.

Kira Radinsky, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. 2011. A word at a time: Computing word relatedness using temporal semantic analysis. In Proceedings of ACM WWW.

Sascha Rothe and Hinrich Schütze. 2015. AutoExtend: Extending word embeddings to embeddings for synsets and lexemes. In Proceedings of ACL-IJCNLP.

Herbert Rubenstein and John B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633.

Carina Silberer, Vittorio Ferrari, and Mirella Lapata. 2013. Models of semantic representation with visual attributes. In Proceedings of ACL.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing with compositional vector grammars. In Proceedings of ACL.

Karl Stratos, Michael Collins, and Daniel Hsu. 2015. Model-based word embeddings from decompositions of count matrices. In Proceedings of ACL.

Karl Stratos, Michael Collins, and Daniel Hsu. 2016. Unsupervised part-of-speech tagging with anchor hidden Markov models. Transactions of the Association for Computational Linguistics, 4:245–257.

Peter D. Turney and Michael L. Littman. 2005. Corpus-based learning of analogies and semantic relations. Machine Learning, 60(1-3):251–278.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188.

Peter D. Turney. 2006. Similarity of semantic relations. Computational Linguistics, 32(3):379–416.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph and text jointly embedding. In Proceedings of EMNLP.

Tong Wang, Abdelrahman Mohamed, and Graeme Hirst. 2015. Learning lexical embeddings with syntactic and lexicographic knowledge. In Proceedings of ACL-IJCNLP.

Chang Xu, Yalong Bai, Jiang Bian, Bin Gao, Gang Wang, Xiaoguang Liu, and Tie-Yan Liu. 2014. RC-NET: A general framework for incorporating knowledge into word representations. In Proceedings of the ACM CIKM.

Dongqiang Yang and David MW Powers. 2005. Measuring semantic similarity in the taxonomy of WordNet. In Proceedings of the Australasian Conference on Computer Science.

David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of ACL.

Wen-tau Yih, Geoffrey Zweig, and John Platt. 2012. Polarity inducing latent semantic analysis. In Proceedings of EMNLP-CoNLL.

Mo Yu and Mark Dredze. 2014. Improving lexical embeddings with semantic knowledge. In Proceedings of ACL.