Multi-Modal Models for Concrete and Abstract Concept Meaning Felix Hill Computer Laboratory University of Cambridge fh295@cam.ac.uk Roi Reichart Technion - IIT Haifa, Israel roiri@ie.technion.ac.il Anna Korhonen Computer Laboratory University of Cambridge alk23@cam.ac.uk Abstract Multi-modal models that learn semantic rep- resentations from both linguistic and percep- tual input outperform language-only models on a range of evaluations, and better reflect human concept acquisition. Most perceptual input to such models corresponds to concrete noun concepts and the superiority of the multi- modal approach has only been established when evaluating on such concepts. We there- fore investigate which concepts can be effec- tively learned by multi-modal models. We show that concreteness determines both which linguistic features are most informative and the impact of perceptual input in such mod- els. We then introduce ridge regression as a means of propagating perceptual informa- tion from concrete nouns to more abstract con- cepts that is more robust than previous ap- proaches. Finally, we present weighted gram matrix combination, a means of combining representations from distinct modalities that outperforms alternatives when both modalities are sufficiently rich. 1 Introduction What information is needed to learn the meaning of a word? Children learning words are exposed to a diverse mix of information sources. These include clues in the language itself, such as nearby words or speaker intention, but also what the child perceives about the world around it when the word is heard. Learning the meaning of words requires not only a sensitivity to both linguistic and perceptual input, but also the ability to process and combine informa- tion from these modalities in a productive way. Many computational semantic models represent words as real-valued vectors, encoding their rela- tive frequency of occurrence in particular forms and contexts in linguistic corpora (Sahlgren, 2006; Tur- ney et al., 2010). Motivated both by parallels with human language acquisition and by evidence that many word meanings are grounded in the percep- tual system (Barsalou et al., 2003), recent research has explored the integration into text-based models of input that approximates the visual or other sen- sory modalities (Silberer and Lapata, 2012; Bruni et al., 2014). Such models can learn higher-quality semantic representations than conventional corpus- only models, as evidenced by a range of evaluations. However, the majority of perceptual input for the models in these studies corresponds directly to con- crete noun concepts, such as chocolate or cheese- burger, and the superiority of the multi-modal over the corpus-only approach has only been established when evaluations include such concepts (Leong and Mihalcea, 2011; Bruni et al., 2012; Roller and Schulte im Walde, 2013; Silberer and Lapata, 2012). It is thus unclear if the multi-modal approach is ef- fective for more abstract words, such as guilt or obe- sity. Indeed, since empirical evidence indicates dif- ferences in the representational frameworks of both concrete and abstract concepts (Paivio, 1991; Hill et al., 2013), and verb and noun concepts (Markman and Wisniewski, 1997), perceptual information may not fulfill the same role in the representation of the various concept types. This potential challenge to the multi-modal approach is of particular practical importance since concrete nouns constitute only a small proportion of the open-class, meaning-bearing words in everyday language (Section 2). In light of these considerations, this paper ad- dresses three questions: (1) Which information sources (modalities) are important for acquiring concepts of different types? (2) Can perceptual in- put be propagated effectively from concrete to more abstract words? (3) What is the best way to combine information from the different sources? We construct models that acquire semantic repre- sentations for four sets of concepts: concrete nouns, abstract nouns, concrete verbs and abstract verbs. The linguistic input to the models comes from the recently released Google Syntactic N-Grams Corpus (Goldberg and Orwant, 2013), from which a selec- tion of linguistic features are extracted. Perceptual input is approximated by data from the McRae et al. (2005) norms, which encode perceptual proper- ties of concrete nouns, and the ESPGame dataset (Von Ahn and Dabbish, 2004), which contains man- ually generated descriptions of 100,000 images. To address (1) we extract representations for each concept type from combinations of information sources. We first focus on different classes of lin- guistic features, before extending our models to the multi-modal context. While linguistic information overall effectively reflects the meaning of all con- cept types, we show that features encoding syntac- tic patterns are only valuable for the acquisition of abstract concepts. On the other hand, perceptual in- formation, whether directly encoded or propagated through the model, plays a more important role in the representation of concrete concepts. In addressing (2), we propose ridge regression (Myers, 1990) as a means of propagating features from concrete nouns to more abstract concepts. The regularization term in ridge regression encourages solutions that generalize well across concept types. We show that ridge regression effectively propagates perceptual information to abstract nouns and con- crete verbs, and is overall preferable to both lin- ear regression and the method of Johns and Jones (2012) applied to a similar task by Silberer and La- pata (2012). However, for all propagation methods, the impact of integrating perceptual information de- pends on the concreteness of the target concepts. In- deed, for abstract verbs, the most abstract concept type in our evaluations, perceptual input actually de- grades representation quality. This highlights the need to consider the concreteness of the target do- main when constructing multi-modal models. To address (3), we present various means of com- bining information from different modalities. We propose weighted gram matrix combination, a tech- nique in which representations of distinct modalities are mapped to a space of common dimension where coordinates reflect proximity to other concepts. This transformation, which has been shown to enhance semantic representations in the context of verb- clustering (Reichart and Korhonen, 2013), reduces representation sparsity and facilitates a product- based combination that results in greater inter-modal dependency. Weighted gram matrix combination outperforms alternatives such as concatenation and Canonical Correlation Analysis (CCA) (Hardoon et al., 2004) when combining representations from two similarly rich information sources. In Section 3, we present experiments with linguis- tic features designed to address question (1). These analyses are extended to multi-modal models in Sec- tion 4, where we also address (2) and (3). We first discuss the relevance of concreteness and part-of- speech (lexical function) to concept representation. 2 Concreteness and Word Meaning A large and growing body of psychological evidence indicates differences between abstract and concrete concepts.1 It has been shown that concrete words are more easily learned, remembered and processed than abstract words (Paivio, 1991; Schwanenflugel and Shoben, 1983), while neuroimaging studies demonstrate differences in brain activity when sub- jects are presented with stimuli corresponding to the two concept types (Binder et al., 2005). The abstract/concrete distinction is important to computational semantics for various reasons. While many models construct representations of concrete words (Andrews et al., 2009; Landauer and Dumais, 1997), abstract words are in fact far more common in everyday language. For instance, based on an analy- sis of those noun concepts in the University of South Florida dataset (USF) and their occurrence in the British National Corpus (BNC) (Leech et al., 1994), 72% of noun tokens in corpora are rated by human 1Here concreteness is understood intuitively, as per the psy- chological literature (Rosen, 2001; Gallese and Lakoff, 2005). ● ●●● ●● mood praise beam clam sardine penguin look stab rule enjoy leave swingbelieve Nouns Verbs 0 2 4 6 Average Concreteness Rating Figure 1: Boxplot of concreteness distributions for noun and verb concepts in the USF data, with selected example concepts. The bold vertical line is the mean, boxes extend from the first to the third quartile, and dots represent outliers. judges as more abstract than the noun war, a concept that many would already consider quite abstract.2 The recent interest in multi-modal semantics fur- ther motivates a principled modelling approach to lexical concreteness. Many multi-modal models im- plicitly distinguish concrete and abstract concepts since their perceptual input corresponds only to con- crete words (Bruni et al., 2012; Silberer and Lapata, 2012; Roller and Schulte im Walde, 2013). How- ever, given that many abstract concepts express re- lations or modifications of concrete concepts (Gen- tner and Markman, 1997), it is reasonable to expect that perceptual information about concrete concepts could also enhance the quality of more abstract rep- resentations in an appropriately constructed model. Moreover, concreteness is closely related to more functional lexical distinctions, such as those be- tween adjectives, nouns and verbs. An analysis of the USF dataset, which includes concreteness ratings for over 4,000 words collected from thou- sands of participants, indicates that on average verbs (mean concreteness, 3.64) are considered more ab- stract than nouns (mean concreteness, 4.91), an ef- fect illustrated in Figure 1. This connection be- tween lexical function and concreteness suggests that a sensitivity to concreteness could improve models that already make principled distinctions be- tween words based on their part-of-speech (POS) (Im Walde, 2006; Baroni and Zamparelli, 2010). Although the focus of this paper is on multi- modal models, few conventional semantic mod- els make principled distinctions between concepts based on function or concreteness. Before turning to the multi-modal case, we thus investigate whether 2This sample covers 15.2% of all noun tokens in the BNC. these distinctions are pertinent to text-only models. 3 Concreteness and Linguistic Features It has long been known that aspects of word meaning can be inferred from nearby words in corpora. Ap- proaches that exploit this fact are often called dis- tributional models (Sahlgren, 2006; Turney et al., 2010). We take a distributional approach to learn- ing linguistic representations. The advantage of us- ing distributional methods to learn representations from corpora versus approaches that rely on knowl- edge bases (Pedersen et al., 2004; Leong and Mi- halcea, 2011) is that they are more scalable, easily applicable across languages and plausibly reflect the process of human word learning (Landauer and Du- mais, 1997; Griffiths et al., 2007). We group dis- tributional features into three classes to test which forms of linguistic information are most pertinent to the abstract/concrete and verb/noun distinctions. All features are extracted from The Google Syntactic N-grams Corpus. The dataset contains counted dependency-tree fragments for over 10bn words of the English Google Books Corpus. 3.1 Feature Classes Lexical Features Our lexical features are the co- occurrence counts of a concept word with each of the other 2,529 concepts in the USF data. Co- occurrences are counted in a 5-word window, and, as elsewhere (Erk and Padó, 2008), weighted by point- wise mutual information (PMI) to control for the un- derlying frequency of both concept and word. POS-tag Features Many words function as more than one POS, and this variation can be indicative of meaning (Manning, 2011). For example, deverbal Context Example indirect object gave it to the man Noun direct object gave the pie to him Concepts subject the man grinned in PP was in his mouth adject. modifier the portly man infinitive clause to eat is human transitive he bit the steak Verb intransitive he salivated Concepts distransitive put jam on the toast phrasal verb he gobbled it up infinitival comp. he wants to snooze clausal comp. I bet he won’t diet Table 1: Grammatical features for noun/verb concepts nouns, such as shiver or walk, often refer to pro- cesses rather than entities. To capture such effects, we count the frequency of occurrence with the POS categories ajdective, adverb, noun and verb. Grammatical Features Grammatical role is a strong predictor of semantics (Gildea and Jurafsky, 2002). For instance, the subject of transitive verbs is more likely to refer to an animate entity than a noun chosen at random. Syntactic context also pre- dicts verb semantics (Kipper et al., 2008). We thus count the frequency of nouns in a range of (non- lexicalized) syntactic contexts, and of verbs in one of the six most common subcategorization-frame classes as defined in Van de Cruys et al. (2012). These contexts are detailed in Table 1. 3.2 Evaluation Sets We create evaluation sets of abstract and con- crete concepts, and introduce a complementary di- chotomy between nouns and verbs, the two POS cat- egories most fundamental to propositional meaning. To construct these sets, we extract nouns and verbs from word pairs in the USF data based on their ma- jority POS-tag in the lemmatized BNC (Leech et al., 1994), excluding any word not assigned to either of the POS categories in more than 70% of instances. From the resulting 2175 nouns and 354 verbs, the abstract-concrete distinction is drawn by ordering words according to concreteness and sampling at random from the first and fourth quartiles. Any con- crete nouns not occurring in the McRae et al. (2005) Property Norm dataset were also excluded. Concept Type Words Pairs Examples concrete nouns 303 1280 yacht, cup abstract nouns 100 295 fear, respect all nouns 403 1716 cup, respect concrete verbs 50 66 kiss, launch abstract verbs 50 127 differ, obey all verbs 100 221 kiss, differ Table 2: Evaluation sets used throughout. All nouns and all verbs are the union of abstract and concrete subsets and mixed abstract-concrete or concrete-abstract pairs. For each list of concepts L = concrete nouns, concrete verbs, abstract nouns, abstract verbs, to- gether with lists all nouns and all verbs, a corre- sponding set of pairs {(w1,w2) ∈ USF : w1,w2 ∈ L} is defined for evaluation. These details are sum- marized in Table 2. Evaluation lists, sets of pairs and USF scores are downloadable from our website. 3.3 Evaluation Methodology All models are evaluated by measuring correlations with the free-association scores in the USF dataset (Nelson et al., 2004). This dataset contains the free- association strength of over 150,000 word pairs.3 These data reflect the cognitive proximity of con- cepts and have been widely used in NLP as a gold- standard for computational models (Andrews et al., 2009; Feng and Lapata, 2010; Silberer and Lapata, 2012; Roller and Schulte im Walde, 2013). For evaluation pairs (c1,c2) we calculate the co- sine similarity between our learned feature represen- tations for c1 and c2, a standard measure of the prox- imity of two vectors (Turney et al., 2010), and follow previous studies (Leong and Mihalcea, 2011; Huang et al., 2012) in using Spearman’s ρ as a measure of correlation between these values and our gold- standard.4 All representations in this section are combined by concatenation, since the present focus is not on combination methods.5 3Free-association strength is measured by presenting sub- jects with a cue word and asking them to produce the first word they can think of that is associated with that cue word. 4We consider Spearman’s ρ, a non-parametric ranking cor- relation, to be more appropriate than Pearson’s r for free asso- ciation data, which is naturally skewed and non-continuous. 5When combining multiple representations we normalize Feature Type All Nouns Conc. Nouns Abs. Nouns All Verbs Conc. Verbs Abs. Verbs (1) Lexical 0.168* 0.199* 0.248* 0.173* 0.268* 0.109 (2) POS-tag 0.059* 0.012 0.119* 0.052 -0.074 0.123 (3) Grammatical 0.078* 0.027 0.121* 0.009 -0.017 0.114 (1)+(2)+(3) 0.182 * 0.181* 0.247* 0.172* 0.267* 0.108 Table 3: Spearman correlation ρ of cosine similarity between vector representations derived from three feature classes with USF scores. * indicates statistically significant correlations (p < 0.05 ). 3.4 Results The performance of each feature class on the eval- uation sets is detailed in Table 3. When all linguis- tic features are included, performance is somewhat better on noun concepts (ρ = 0.182) than verbs (ρ = 0.172). However, while correlations are sig- nificant on concrete (ρ = 0.181) and abstract nouns (ρ = 0.247) and concrete verbs, the effect is not significant on abstract verbs (although it is on verbs overall). The highest correlations for the linguistic features together are on abstract nouns (ρ = 0.247) and concrete verbs (ρ = 0.267). Referring back to the continuum in Figure 1, it is possible that there is an optimum concreteness level, exhibited by ab- stract nouns and concrete verbs, at which conceptual meaning is best captured by linguistic models. The results indicate that the three feature classes convey distinct information. It is perhaps unsur- prising that lexical features produce the best perfor- mance in the majority of cases; the value of lexical co-occurrence statistics in conveying word meaning is expressed in the well known distributional hy- pothesis (Harris, 1954). More interestingly, on ab- stract concepts the contribution of POS-tag (nouns, ρ = 0.119; verbs, ρ = 0.123 ) and grammatical features (nouns, ρ = 0.121; verbs, ρ = 0.114) is no- tably higher than on the corresponding concrete con- cepts. The importance of such features to modelling free-association between abstract concepts suggests that they may convey information about how con- cepts are (subjectively) organized and interrelated in the minds of language users, independent of their realisation in the physical world. Indeed, since ab- stract representations rely to a lesser extent than con- crete representations on perceptual input (Section 4), it is perhaps unsurprising that more of their meaning is reflected in subtle linguistic patterns. The results in this section demonstrate that differ- each representation, then concatenate and then renormalize. ent information is required to learn representations for abstract and concrete concepts and for noun and verb concepts. In the next section, we investigate how perceptual information fits into this equation. 4 Acquiring Multi-Modal Representations As noted in Section 2, there is experimental evi- dence that perceptual information plays a distinct role in the representation of different concept types. We explore whether this finding extends to com- putational models by integrating such information into our corpus-based approaches. We focus on two aspects of the integration process. Propaga- tion: Can models infer useful information about ab- stract nouns and verbs from perceptual information corresponding to concrete nouns? And combina- tion: How can linguistic and (propagated or actual) perceptual information be integrated into a single, multi-modal representation? We begin by introduc- ing the two sources of perceptual information. 4.1 Perceptual Information Sources The McRae Dataset The McRae et al. (2005) Property Norms dataset is commonly used as a per- ceptual information source in cognitively-motivated semantic models (Kelly et al., 2010; Roller and Schulte im Walde, 2013). The dataset contains prop- erties of over 500 concrete noun concepts produced by 30 human annotators. The proportion of sub- jects producing each property gives a measure of the strength of that property for a given concept. We en- code this data in vectors with coordinates for each of the 2,526 properties in the dataset. A concept rep- resentation contains (real-valued) feature strengths in places corresponding to the features of that con- cept and zeros elsewhere. Having defined the con- crete noun evaluation set as the 303 concepts found in both the USF and McRae datasets, this informa- tion is available for all concrete nouns. The ESP-Game Dataset To complement the cognitively-driven McRae data with a more explic- itly visual information source, we also extract infor- mation from the ESP-Game dataset (Von Ahn and Dabbish, 2004) of 100,000 photographs, each an- notated with a list of entities depicted in that im- age. This input enables connections to be made be- tween concepts that co-occur in scenes, and thus might be experienced together by language learn- ers at a given time. Because we want our models to reflect human concept learning in inferring con- ceptual knowledge from comparatively unstructured data, we use the ESP-Game dataset in preference to resources such as ImageNet (Deng et al., 2009), in which the conceptual hierarchy is directly encoded by expert annotators. An additional motivation is that ESP-Game was produced by crowdsourcing a simple task with untrained annotators, and thus rep- resents a more scalable class of data source. We represent the ESP-Game data in 100,000 di- mensional vectors, with co-ordinates corresponding to each image in the dataset. A concept representa- tion contains a 1 in any place that corresponds to an image in which the concept appears, and a 0 other- wise. Although it is possible to portray actions and processes in static images, and several of the ESP- Game images are annotated with verb concepts, for a cleaner analysis of the information propagation pro- cess we only include ESP input in our models for the concrete nouns in the evaluation set. The data encoding outlined above results in per- ceptual representations of dimension ≈ 100, 000, for which, on average, fewer than 0.5% of entries are non-zero 6. In contrast, in our full linguistic repre- sentations of nouns (dimension ≈ 4, 000) and verbs (dimension ≈ 8, 000) (Section 3), an average of 24% of entries are non-zero. One of the challenges for the propagation and combination methods described in the following subsections is therefore to manage the differences in dimension and sparsity between lin- guistic and perceptual representations. 4.2 Information Propagation Johns and Jones Silberer and Lapata (2012) ap- ply a method designed by Johns and Jones (2012) to 6The ESP-Game and McRae representations are of approxi- mately equal sparsity. infer quasi-perceptual representations for a concept in the case that actual perceptual information is not available. Translating their approach to the present context, for verbs and abstract nouns we infer quasi- perceptual representations based on the perceptual features of concrete nouns that are nearby in the se- mantic space defined by the linguistic features. In the first step of their two-step method, for each abstract noun or verb k, a quasi-perceptual represen- tation is computed as an average of the perceptual representations of the concrete nouns, weighted by the proximity between these nouns and k kp = ∑ c∈C̄ S(kl,cl)λ ·cp where C̄ is the set of concrete nouns, cp and kp are the perceptual representations for c and k respec- tively, and cl and kl the linguistic representations. The exponent parameter λ reflects the learning rate. Following Johns and Jones (2012), we define the proximity function S between noun concepts to be cosine similarity. However, because our verb and noun representations are of different dimension, we take verb-noun proximity to be the PMI between the two words in the corpus, with co-occurrences counted within a 5-word window. In step two, the initial quasi-perceptual represen- tations are inferred for a second time, but with the weighted average calculated over the perceptual or initial quasi-perceptual representations of all other words, not just concrete nouns. As with Johns and Jones (2012), we set the learning rate parameter λ to be 3 in the first step and 13 in the second. Ridge Regression As an alternative propagation method we propose ridge regression (Myers, 1990). Ridge regression is a variant of least squares re- gression in which a regularization term is added to the training objective to favor solutions with cer- tain properties. Here we apply it to learn parame- ters for linear maps from linguistic representations of concrete nouns to features in their perceptual rep- resentations. For concepts with perceptual represen- tations of dimension np, we learn np linear functions fi : Rnl → R that map the linguistic representations (of dimension nl) to a particular perceptual feature i. These functions are then applied together to map the linguistic representations of abstract nouns and verbs to full quasi-perceptual representations.7 As our model is trained on concrete nouns but applied to other concept types, we do not wish the mapping to reflect the training data too faithfully. To mitigate against this we define our regulariza- tion term as the Euclidian l2 norm of the inferred parameter vector. This term ensures that the regres- sion favors lower coefficients and a smoother solu- tion function, which should provide better general- ization performance than simple linear regression. The objective for learning the fi is then to minimize ‖aX −Yi‖22 + ‖a‖ 2 2 where a is the vector of regression coefficients, X is a matrix of linguistic representations and Yi a vector of perceptual feature i for the set of concrete nouns. We now investigate ways in which the (quasi-) perceptual representations acquired via these meth- ods can be combined with linguistic representations. 4.3 Information Combination Canonical Correlation Analysis Canonical cor- relation analysis (CCA) (Hardoon et al., 2004) is an established statistical method for exploring re- lationships between two sets of random variables. The method determines a linear transformation of the space spanned by each of the sets of variables, such that the correlations between the sets of trans- formed variables is maximized. Silberer and Lapata (2012) apply CCA in the present context of information fusion, with one set of random variables corresponding to perceptual features and another corresponding to linguistic fea- tures. Applied in this way, CCA provides a mecha- nism for reducing the dimensionality of the linguis- tic and perceptual representations such that the im- portant interactions between them are preserved.8 The transformed linguistic and perceptual vectors are then concatenated. We follow Silberer and Lap- ata by applying a kernalized variant of CCA.9 7Because the POS-tag and grammatical features are differ- ent for nouns and for verbs, we exclude them from our linguistic representations when implementing ridge regression. 8Dimensionality reduction is desirable in the present context because of the sparsity of our perceptual representations. 9The KernelCCA package in Python: http://pythonhosted.org/apgl/KernelCCA.html Weighted Gram Matrix Combination The method we propose as an alternative means of fusing linguistic and extra-linguistic information is weighted gram matrix combination, which derives from an information combination technique applied to verb clustering by Reichart and Korhonen (2013). For a set of concepts C = {c1, . . . ,cn} with representations {r1, . . . ,rn}, the method involves creating an n×n weighted gram matrix L in which Lij = S(ri,rj) ·φ(ri) ·φ(rj). Here, S is again a similarity function (we use cosine similarity), and φ(r) is the quality score of r. The quality scoring function φ can be any map- ping Rn → R that reflects the importance of a con- cept relative to other concepts in C. In the present context, we follow Reichart and Korhonen (2013) in defining a quality score φ as the average cosine sim- ilarity of a concept with all other concepts in C φ(rj) = 1 n n∑ i=1 S(ri,rj). For cj ∈ C, the matrix L then encodes a scalar pro- jection of rj onto the other members ri≤n, weighted by their quality. Each word representation in the set is thus mapped into a new space of dimension n de- termined by the concepts in C. Converting concept representations to weighted gram matrix form has several advantages in the present context. First, both when evaluating and applying semantic representations, we generally re- quire models to determine relations between con- cepts relative to others. We might, for instance, re- quire close associates of a given word, a selection of potential synonyms, or the two most similar search queries in a given set. This relative nature of seman- tics is reflected by projecting representations into a space defined by the set of concepts themselves, rather than low-level features. It is also captured by the quality weighting, which lends primacy to con- cept dimensions that are central to the space. Second, mapping representations of different di- mension into vector spaces of equal dimension re- sults in dense representations of equal dimension for each modality. This naturally lends equal weighting or status to each modality and resolves any issues −0.1 0.0 0.1 0.2 Lexical POS Grammatical Feature Class P e rf o rm a n ce C h a n g e Concrete Nouns Lexical POS Grammatical Abstract Nouns (JJ) Lexical POS Grammatical Concrete Verbs (JJ) Lexical POS Grammatical Abstract Verbs (JJ) Perceptual Information Source McRae ESP McRae & ESP Figure 2: Additive change in Spearman’s ρ when representations acquired from particular classes of linguistic features are combined with (actual or inferred) perceptual representations. Perceptual representations are derived from either the McRae Dataset, the ESP-Game Dataset or both (concatenated). For concepts other than concrete nouns, perceptual information is propagated using the Johns and Jones (JJ) method, and combined with simple concatenation. of representations sparsity. In addition, the dimen- sion equality in particular enables a wider range of mathematical operations for combining information sources. Here, we follow Reichart and Korhonen (2013) in taking the product of the linguistic and per- ceptual weighted gram matrices L and P , produc- ing a new matrix containing fused representations for each concept M = LPPL. By taking the composite product LPPL rather than LP or PL, M is symmetric and no ad hoc status is conferred to one modality over the other. 4.4 Results The experiments in this section were designed to ad- dress the three questions specified in Section 1: (1) Which information sources are important for acquir- ing word concepts of different types? (2) Can per- ceptual information be propagated from concrete to abstract concepts? (3) What is the best way to com- bine the information from the different sources? Question (1) To build on insights from Section 3, we first examined how perceptual input interacts with the three classes of linguistic features defined there. Figure 2 shows the additive difference in cor- relation between (i) models in which perceptual and particular linguistic features are concatenated and (ii) models based on just the linguistic features. For concrete nouns and concrete verbs, (actual or inferred) perceptual information was beneficial in almost all cases. The largest improvement for both concept types was over grammatical features, achieved by including only the McRae data. This signals from this perceptual input and the grammat- ical features clearly reflect complementary aspects of the meaning of these concepts. We hypothe- size that grammatical features (and POS features, which also perform strongly in this combination) confer information to concrete representations about the function and mutual interaction of concepts (the most ‘relational’ aspects of their meaning (Gentner, 1978)) which complements the more intrinsic prop- erties conferred by perceptual features. For abstract concepts, it is perhaps unsurpris- ing that the overall contribution of perceptual in- formation was smaller. Indeed, combining linguis- tic and perceptual information actually harmed per- formance on abstract verbs in all cases. For these concepts, the inferred perceptual features seem to obscure or contradict some of the information con- veyed in the linguistic representations. While the McRae data was clearly the most valu- able source of perceptual input for concrete nouns and concrete verbs, for abstract nouns the combi- nation of ESP-Game and McRae data was most in- formative. Both inspection of the data and cogni- tive theories (Rosch et al., 1976) suggest that enti- ties identified in scenes, as in the ESP-Game dataset, generally correspond to a particular (basic) level of Model All Nouns Conc. Nouns† Abs. Nouns All Verbs Conc. Verbs Abs. Verbs Linguistic 0.175 — 0.335 0.169 — 0.317 0.233 — 0.344 0.148 — 0.178 0.204 — 0.191 0.094 — 0.330 (JJ)+Concat 0.116 — 0.375 0.258 — 0.442 0.190 — 0.267 0.129 — 0.162 0.301 — 0.062 0.019 — 0.280 (JJ)+CCA 0.082 — 0.021 0.001 — 0.067 0.085 — -0.018 0.027— 0.213 0.079 — 0.276 0.095 — 0.200 (JJ)+WGM 0.098 — 0.213 0.397 — 0.523 0.238 — 0.329 0.059 — 0.169 0.253 — 0.064 -0.080 — 0.254 RR+Concat 0.232 — 0.432 0.248 — 0.343 0.013 — 0.212 0.046 — 0.484 0.023 — 0.133 RR+CCA 0.033 — -0.045 0.044 — -0.023 0.001 — -0.006 0.018 — 0.344 0.018 — 0.085 RR+WGM 0.094 — 0.069 0.232 — 0.327 0.159 — 0.131 0.244 — 0.194 0.075 — 0.283 LR+ 0.216 — 0.402 0.216 — 0.282 0.004 — 0.051 -0.051 — 0.139 -0.008 — 0.197 Table 4: Performance of different methods of information propagation (JJ = Johns and Jones, RR = ridge regression, LR = linear regression) and combination (Concat = concatenation, CCA = canonical correlation analysis, WGM = weighted gram matrix multiplication) across evaluation sets. Values are Spearman’s ρ correlation with USF scores (left hand side of columns) and WordNet path similarity (right hand side). For the LR baseline we only report the highest score across the three combination types. †No propagation takes place for concrete nouns; this column reflects the performance of combination methods only. the conceptual hierarchy. The ESP-Game data re- flects relations between these basic-level concepts in the world, whereas the McRae data typically de- scribes their (intrinsic) properties. Together, these sources seem to combine information on the proper- ties of, and relations between, concepts in a way that particularly facilitates the learning of abstract nouns. Question (2) The performance of different meth- ods of information propagation and combination is presented in Table 4. The underlying linguistic rep- resentations in this case contained all three distribu- tional feature classes. For more robust conclusions, in addition to the USF gold-standard we also mea- sured the correlation between model output and the WordNet path similarity of words in our evaluation pairs. The path similarity between words w1 and w2 is the shortest distance between synsets of w1 and w2 in the WordNet taxonomy (Fellbaum, 1999), which correlates significantly with human judgements of concept similarity (Pedersen et al., 2004).10 The correlations with the USF data (left hand col- umn, Table 4) of our linguistic-only models (ρ = 0.094 − 0.233) and best performing multi-modal models (on both concrete nouns, ρ = 0.397, and more abstract concepts, ρ = 0.095 − 0.301) were higher than the best comparable models described elsewhere (Feng and Lapata, 2010; Silberer and La- pata, 2012; Silberer et al., 2013).11 This confirms 10Other widely-used evaluation gold-standards, such as WordSim 353 and the MEN dataset, do not contain a sufficient number of abstract concepts for the current purpose. 11Feng and Lapata (2010) report ρ = .08 for language-only both that the underlying linguistic space is of high quality and that the ESP and McRae perceptual in- put is similarly or more informative than the input applied in previous work. Consistent with previous studies, adding percep- tual input improved the quality of concrete noun representations as measured against both USF and path similarity gold-standards. Further, effective in- formation propagation was indeed possible for both abstract nouns (USF evaluation) and concrete verbs (both evaluations). Interestingly, however, this was not the case for abstract verbs, for which no mix of propagation and combination methods produced an improvement on the linguistic-only model on either evaluation set. Indeed, as shown in Figure 2, no type of perceptual input generated an improvement in ab- stract verb representations, regardless of the under- lying class of linguistic features. This result underlines the link between concrete- ness, cognition and perception proposed in the psy- chological literature. More practically, it shows that concreteness can determine if propagation of per- ceptual input will be effective and, if so, the potential degree of improvement over text-only models. Turning to means of propagation, both the Johns and Jones method and ridge regression outper- formed the linear regression baseline on the major- ity of concept types in our evaluation. Across the five sets and ten evaluations on which propagation and .12 for multi-modal models evaluated on USF over concrete and abstract concepts. Silberer and Lapata (2012) report ρ = .14 (language-only) and .35 (multi-modal) over concrete nouns. takes place (All Nouns, Abstract Nouns, All Verbs, Abstract Verbs and Concrete Verbs), ridge regression performed more robustly, achieving the best perfor- mance on six evaluation sets compared to two for the Johns and Jones method.12 Question (3) Weighted gram matrix multiplica- tion (ρ = 0.397 on USF and ρ = 0.523 on path sim- ilarity) outperformed both simple vector concatena- tion (ρ = 0.258 and ρ = 0.442) and CCA (ρ = 0.001 and ρ = 0.067) on concrete nouns. In the case of both abstract nouns and concrete verbs, how- ever, the most effective means of combining quasi- perceptual information with linguistic representa- tions was concatenation (abstract nouns, ρ = 0.248 and ρ = 0.343, concrete verbs, ρ = 0.301 and ρ = 0.484). One evident drawback of multiplica- tive methods such as weighted gram matrix combi- nation is the greater inter-dependence of the infor- mation sources; a weak signal from one modality can undermine the contribution of the other modal- ity. We hypothesize that this underlines the compar- atively poor performance of the method on verbs and abstract nouns, as the perceptual input for concrete nouns is clearly a richer information source than the propagated features of more abstract concepts. 5 Conclusion Motivated by the inherent difference between ab- stract and concrete concepts and the observation that abstract words occur more frequently in language, in this paper we have addressed the question of whether multi-modal models can enhance semantic representations of both concept types. In Section 3, we demonstrated that different infor- mation sources are important for acquiring concrete and abstract noun and verb concepts. Within the lin- guistic modality, while lexical features are informa- tive for all concept types, syntactic features are only significantly informative for abstract concepts. In contrast, in Section 4 we observed that per- ceptual input is a more valuable information source for concrete concepts than abstract concepts. Nev- ertheless, perceptual input can be effectively prop- agated from concrete nouns to enhance representa- tions of both abstract nouns and concrete verbs. In- 12For these comparisons, the optimal combination method is selected in each case. deed, conceptual concreteness appears to determine the degree to which perceptual input is beneficial, since representations of abstract verbs, the most ab- stract concepts in our experiments, were actually de- graded by this additional information. One impor- tant contribution of this work is therefore an insight into when multi-modal models should or should not aim to combine and/or propagate perceptual input to ensure that optimal representations are learned. In this respect, our conclusions align with the findings of Kiela and Hill (2014), who take an explicitly vi- sual approach to resolving the same question. Various methods for propagating and combining perceptual information with linguistic input were presented. We proposed ridge regression for in- ferring perceptual representations for abstract con- cepts, which proved more robust than alternatives across the range of concept types. This approach is particularly simple to implement, since it is based on an established statistical prodedure. In addition, we introduced weighted gram matrix combination for combining representations from distinct modalities of differing sparsity and dimension. This method produces the highest quality composite representa- tions for concrete nouns, where both modalities rep- resent high quality information sources. Overall, our results demonstrate that the potential practical benefits of multi-modal models extend be- yond concrete domains into a significant proportion of the lexical concepts found in language. In fu- ture work we aim to extend our experiments to con- cept types such as adjectives and adverbs, and to de- velop models that further improve the propagation and combination of extra-linguistic input. Moreover, while we cannot draw definitive con- clusions about human language processing, the ef- fectiveness of the methods presented in this paper offer tentative support for the idea that even ab- stract concepts are grounded in the perceptual sys- tem (Barsalou et al., 2003). As such, it may be that, even in the more abstract cases of human communi- cation, we find ways to see what people mean pre- cisely by finding ways to see what they mean. Acknowledgements We thank The Royal Society and St John’s College for their support. References Mark Andrews, Gabriella Vigliocco, and David Vinson. 2009. Integrating experiential and distributional data to learn semantic representations. Psychological Re- view, 116(3):463. Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empiri- cal Methods in Natural Language Processing, pages 1183–1193. Association for Computational Linguis- tics. Lawrence W Barsalou, W Kyle Simmons, Aron K Bar- bey, and Christine D Wilson. 2003. Grounding conceptual knowledge in modality-specific systems. Trends in cognitive sciences, 7(2):84–91. Jeffrey R Binder, Chris F Westbury, Kristen A McKier- nan, Edward T Possing, and David A Medler. 2005. Distinct brain systems for processing concrete and ab- stract concepts. Journal of Cognitive Neuroscience, 17(6):905–917. Elia Bruni, Gemma Boleda, Marco Baroni, and Nam- Khanh Tran. 2012. Distributional semantics in tech- nicolor. In Proceedings of the 50th Annual Meet- ing of the Association for Computational Linguistics: Long Papers-Volume 1, pages 136–145. Association for Computational Linguistics. Elia Bruni, Nam Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Arti- ficial Intelligence Research, 49:1–47. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hier- archical image database. In Computer Vision and Pat- tern Recognition, 2009. CVPR 2009, pages 248–255. IEEE. Katrin Erk and Sebastian Padó. 2008. A structured vec- tor space model for word meaning in context. In Pro- ceedings of the Conference on Empirical Methods in Natural Language Processing, pages 897–906. Asso- ciation for Computational Linguistics. Christiane Fellbaum. 1999. WordNet. Wiley Online Li- brary. Yansong Feng and Mirella Lapata. 2010. Visual infor- mation in semantic representation. In Human Lan- guage Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 91–99. Association for Computational Linguistics. Vittorio Gallese and George Lakoff. 2005. The brain’s concepts: The role of the sensory-motor system in con- ceptual knowledge. Cognitive neuropsychology, 22(3- 4):455–479. Dedre Gentner and Arthur B Markman. 1997. Structure mapping in analogy and similarity. American psychol- ogist, 52(1):45. Dedre Gentner. 1978. On relational meaning: The ac- quisition of verb meaning. Child development, pages 988–998. Daniel Gildea and Daniel Jurafsky. 2002. Automatic la- beling of semantic roles. Computational linguistics, 28(3):245–288. Yoav Goldberg and Jon Orwant. 2013. A dataset of syntactic-ngrams over time from a very large corpus of english books. In Second Joint Conference on Lexical and Computational Semantics, Association for Com- putational Linguistics, pages 241–247. Association for Computational Linguistics. Thomas L Griffiths, Mark Steyvers, and Joshua B Tenen- baum. 2007. Topics in semantic representation. Psy- chological review, 114(2):211. David R Hardoon, Sandor Szedmak, and John Shawe- Taylor. 2004. Canonical correlation analysis: An overview with application to learning methods. Neu- ral Computation, 16(12):2639–2664. Zellig Harris. 1954. Distributional structure. Word, 10(23):146–162. Felix Hill, Anna Korhonen, and Christian Bentz. 2013. A quantitative empirical analysis of the ab- stract/concrete distinction. Cognitive Science. Eric H Huang, Richard Socher, Christopher D Manning, and Andrew Y Ng. 2012. Improving word representa- tions via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Asso- ciation for Computational Linguistics: Long Papers- Volume 1, pages 873–882. Association for Computa- tional Linguistics. Sabine Schulte Im Walde. 2006. Experiments on the automatic induction of german semantic verb classes. Computational Linguistics, 32(2):159–194. Brendan T Johns and Michael N Jones. 2012. Perceptual inference through global lexical similarity. Topics in Cognitive Science, 4(1):103–120. Colin Kelly, Barry Devereux, and Anna Korhonen. 2010. Acquiring human-like feature-based conceptual repre- sentations from corpora. In Proceedings of the NAACL HLT 2010 First Workshop on Computational Neurolin- guistics, pages 61–69. Association for Computational Linguistics. Douwe Kiela and Felix Hill. 2014. Improving multi- modal representations using image dispersion: Why less is sometimes more. In Proceedings of ACL 2014, Baltimore. Association for Computational Linguistics. Karin Kipper, Anna Korhonen, Neville Ryant, and Martha Palmer. 2008. A large-scale classification of english verbs. Language Resources and Evaluation, 42(1):21–40. Thomas K Landauer and Susan T Dumais. 1997. A so- lution to plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological review, 104(2):211. Geoffrey Leech, Roger Garside, and Michael Bryant. 1994. Claws4: the tagging of the British National Cor- pus. In Proceedings of the 15th conference on Compu- tational linguistics-Volume 1, pages 622–628. Associ- ation for Computational Linguistics. Chee Wee Leong and Rada Mihalcea. 2011. Going be- yond text: A hybrid image-text approach for measur- ing word relatedness. In IJCNLP, pages 1403–1407. Christopher D Manning. 2011. Part-of-speech tagging from 97% to 100%: is it time for some linguistics? In Computational Linguistics and Intelligent Text Pro- cessing, pages 171–189. Springer. Arthur B Markman and Edward J Wisniewski. 1997. Similar and different: The differentiation of basic- level categories. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23(1). Ken McRae, George S Cree, Mark S Seidenberg, and Chris McNorgan. 2005. Semantic feature production norms for a large set of living and non-living things. Behavior Research Methods, 37(4):547–559. Raymond H Myers. 1990. Classical and modern regres- sion with applications, volume 2. Duxbury Press Bel- mont, CA. Douglas L Nelson, Cathy L McEvoy, and Thomas A Schreiber. 2004. The University of South Florida free association, rhyme, and word fragment norms. Be- havior Research Methods, Instruments, & Computers, 36(3):402–407. Allan Paivio. 1991. Dual coding theory: Retrospect and current status. Canadian Journal of Psychology/Revue Canadienne de Psychologie, 45(3):255. Ted Pedersen, Siddharth Patwardhan, and Jason Miche- lizzi. 2004. Wordnet:: Similarity: measuring the relat- edness of concepts. In Demonstration Papers at HLT- NAACL 2004, pages 38–41. Association for Computa- tional Linguistics. Roi Reichart and Anna Korhonen. 2013. Improved lexical acquisition through dpp-based verb clustering. In Proceedings of the Conference of the Association for Computational Linguistics (ACL). Association for Computational Linguistics. Stephen Roller and Sabine Schulte im Walde. 2013. A multimodal LDA model integrating textual, cognitive and visual modalities. In Proceedings of the 2013 Conference on Empirical Methods in Natural Lan- guage Processing, pages 1146–1157, Seattle, Wash- ington, USA, October. Association for Computational Linguistics. Eleanor Rosch, Carolyn B Mervis, Wayne D Gray, David M Johnson, and Penny Boyes-Braem. 1976. Basic objects in natural categories. Cognitive Psychol- ogy, 8(3):382–439. Gideon Rosen. 2001. Nominalism, naturalism, epis- temic relativism. Noûs, 35(s15):69–91. Magnus Sahlgren. 2006. The Word-Space Model: Us- ing distributional analysis to represent syntagmatic and paradigmatic relations between words in high- dimensional vector spaces. Ph.D. thesis, Stockholm. Paula J Schwanenflugel and Edward J Shoben. 1983. Differential context effects in the comprehension of abstract and concrete verbal materials. Journal of Ex- perimental Psychology: Learning, Memory, and Cog- nition, 9(1):82. Carina Silberer and Mirella Lapata. 2012. Grounded models of semantic representation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1423–1433. Asso- ciation for Computational Linguistics. Carina Silberer, Vittorio Ferrari, and Mirella Lapata. 2013. Models of semantic representation with visual attributes. In Proceedings of the 51th Annual Meet- ing of the Association for Computational Linguistics, Sofia, Bulgaria. Peter D Turney, Patrick Pantel, et al. 2010. From fre- quency to meaning: Vector space models of semantics. Journal of artificial intelligence research, 37(1):141– 188. Tim Van de Cruys, Laura Rimell, Thierry Poibeau, Anna Korhonen, et al. 2012. Multiway tensor factorization for unsupervised lexical acquisition. COLING 2012: Technical Papers, pages 2703–2720. Luis Von Ahn and Laura Dabbish. 2004. Labeling im- ages with a computer game. In Proceedings of the SIGCHI conference on Human Factors in Computing Systems, pages 319–326. ACM.