Visually Grounded and Textual Semantic Models Differentially Decode Brain Activity Associated with Concrete and Abstract Nouns

Andrew J. Anderson, Brain & Cognitive Sciences, University of Rochester, aander41@ur.rochester.edu
Douwe Kiela, Computer Laboratory, University of Cambridge, dk427@cam.ac.uk
Stephen Clark, Computer Laboratory, University of Cambridge, sc609@cam.ac.uk
Massimo Poesio, School of Computer Science and Electronic Engineering, University of Essex, poesio@essex.ac.uk

Abstract

Important advances have recently been made using computational semantic models to decode brain activity patterns associated with concepts; however, this work has almost exclusively focused on concrete nouns. How well these models extend to decoding abstract nouns is largely unknown. We address this question by applying state-of-the-art computational models to decode functional Magnetic Resonance Imaging (fMRI) activity patterns, elicited by participants reading and imagining a diverse set of both concrete and abstract nouns. One of the models we use is linguistic, exploiting the recent word2vec skip-gram approach trained on Wikipedia. The second is visually grounded, using deep convolutional neural networks trained on Google Images. Dual coding theory considers concrete concepts to be encoded in the brain both linguistically and visually, and abstract concepts only linguistically. Splitting the fMRI data according to human concreteness ratings, we indeed observe that both models significantly decode the most concrete nouns; however, accuracy is significantly greater using the text-based models for the most abstract nouns. More generally, this confirms that current computational models are sufficiently advanced to assist in investigating the representational structure of abstract concepts in the brain.

1 Introduction

Since the work of Mitchell et al. (2008), there has been increasing interest in using computational semantic models to interpret neural activity patterns scanned as participants engage in conceptual tasks. This research has almost exclusively focused on brain activity elicited as participants comprehend concrete nouns as experimental stimuli. Different modelling approaches — predominantly distributional semantic models (Mitchell et al., 2008; Devereux et al., 2010; Murphy et al., 2012; Pereira et al., 2013; Carlson et al., 2014) and semantic models based on human behavioural estimation of conceptual features (Palatucci et al., 2009; Sudre et al., 2012; Chang et al., 2010; Bruffaerts et al., 2013; Fernandino et al., 2015) — have elucidated how different brain regions contribute to the semantic representation of concrete nouns; however, how these results extend to non-concrete nouns is unknown.

In computational modelling there has been increasing importance attributed to grounding semantic models in sensory modalities, e.g., Bruni et al. (2014), Kiela and Bottou (2014). Andrews et al. (2009) demonstrated that multi-modal models formed by combining text-based distributional information with behaviourally generated conceptual properties (as a surrogate for perceptual experience) provide a better proxy for human-like intelligence. However, both the text-based and behaviourally-based components of their model were ultimately derived from linguistic information. Since then, in analyses of brain data, Anderson et al. (2013) have applied multi-modal models incorporating features that are truly grounded in natural image statistics to further support this claim.
In addition, Anderson et al. (2015) have demonstrated that visually grounded models describe brain activity associated with internally induced visual features of objects as the objects' names are read and comprehended.

Having both image- and text-based models of semantic representation, and neural activity patterns associated with concrete and abstract nouns, enables a natural test of dual coding theory (Paivio, 1971). Dual coding posits that concrete concepts are represented in the brain in terms of both a visual and a linguistic code, whereas abstract concepts are represented only by a linguistic code. Whereas previous work has demonstrated that image- and text-based semantic models contribute to explaining neural activity patterns associated with concrete nouns, it remains unclear whether either text- or image-based semantic models can decode neural activity patterns associated with abstract words.

We extend previous work by applying image- and text-based computational semantic models to decode an fMRI data set spanning a diverse set of nouns of varying concreteness. The 70 stimulus words for the fMRI experiment (listed in Table 1) are semantically structured according to taxonomic categories and domains embedded in WordNet (Fellbaum, 1998) and its extensions. Participants read each noun and were instructed to imagine a situation that they personally associated with it. In this sense, the data solicited targeted deep thought patterns (deeper than might be anticipated for the rapid semantic processing required in conversations and many real-time interactions with the world). In the analysis we split the fMRI data set into the most concrete and most abstract words based on behavioural concreteness ratings. Our key contribution is in demonstrating a decoding advantage for text-based semantic models over the image-based model when decoding the more abstract nouns. In line with the previous results of Anderson et al. (2013) and Anderson et al. (2015), both visual and textual models decode the more concrete nouns.

The image- and text-based computational models we use have recently been developed using neural networks (Mikolov et al., 2013; Jia et al., 2014). The image-based model is built using a deep convolutional neural network approach, similar in nature to those recently used to study neural representations of visual stimuli (see Kriegeskorte (2015)); to the authors' knowledge, this is the first application of such models to study word-elicited neural activation. For decoding we use a recently introduced algorithm (Anderson et al., 2016) that abstracts the decoding task to representational similarity space, and we achieve decoding accuracies on par with those conventionally achieved in discriminating concrete nouns (and higher if we combine data to exploit group-level regularities).

Because the fMRI experiments were performed in Italian on native Italian speakers, and because text corpora of approximately comparable content were available in English and Italian (English and Italian Wikipedia), we were able to compare how well English and Italian text-based semantic models can decode neural activity patterns.
Whilst Italian Wikipedia could reasonably be expected to be advantaged by supporting culturally appropriate nuances of semantic structure, it is disadvantaged by being considerably smaller than English Wikipedia. Taking inspiration from previous work exploiting cross-lingual resources (Richman and Schone, 2008; Shi et al., 2010; Darwish, 2013), we combined Italian and English text-based models in our decoding analyses in an attempt to leverage the benefits of both. Although combined-language and English models tended to yield marginally better decoding accuracies, there were no significant differences between the different language models. Whilst we expect semantic structure on a grand scale to broadly straddle language boundaries for most concrete and abstract concepts (albeit with cultural specificities), this is proof of principle that cross-linguistic commonalities are reflected in neural activity patterns measurable with current technology.

2 Brain Data

We reanalyze the fMRI data originally collected by Anderson et al. (2014), who investigated the relevance of different taxonomic categories and domains embedded in WordNet to the organization of conceptual knowledge in the brain.

2.1 Word stimuli

Anderson et al. (2014) systematically selected a list of 70 words intended to be representative of a broad range of abstract and concrete nouns. These were organised according to the domains of law and music, cross-classified with seven taxonomic categories. They began by identifying low-concreteness words in the norms of Barca et al. (2002). They then linked these to WordNet to identify the taxonomic category of the dominant sense of each word. Six taxonomic categories that were heavily populated with abstract words, as well as one unambiguously concrete category, were chosen.

                     LAW                                 MUSIC
Ur-abstracts         giustizia      justice              musica         music
                     liberta'       liberty              blues          blues
                     legge          law                  jazz           jazz
                     corruzione     corruption           canto          singing
                     refurtiva      loot                 punk           punk
Attribute            giurisdizione  jurisdiction         sonorita'      sonority
                     cittadinanza   citizenship          ritmo          rhythm
                     impunita'      impunity             melodia†       melody
                     legalita'      legality             tonalita'      tonality
                     illegalita'    illegality           intonazione    pitch
Communication        divieto        prohibition          canzone        song
                     verdetto       verdict              pentagramma    stave
                     ordinanza      decree               ballata        ballad
                     addebito       accusation           ritornello     refrain
                     ingiunzione    injunction           sinfonia       symphony
Event/action         arresto        arrest               concerto       concert
                     processo       trial                recital        recital
                     reato          crime                assolo         solo
                     furto          theft                festival       festival
                     assoluzione    acquittal            spettacolo     show
Person/Social-role   giudice        judge                musicista      musician
                     ladro          thief                cantante       singer
                     imputato       defendant            compositore    composer
                     testimone      witness              chitarrista    guitarist
                     avvocato       lawyer               tenore         tenor
Location             tribunale      court/tribunal       palco          stage
                     carcere        prison               auditorium     auditorium
                     questura†      police-station       discoteca      disco
                     penitenziario  penitentiary         conservatorio  conservatory
                     patibolo       gallows              teatro         theatre
Object/Tool          manette        handcuffs            violino        violin
                     toga           robe                 tamburo        drum
                     manganello     truncheon            tromba         trumpet
                     cappio         noose                metronomo      metronome
                     grimaldello†   skeleton-key         radio          radio

Table 1: Italian stimulus words and English translations, divided into law and music domains (columns) and taxonomic categories (groups of 5 rows). Daggers mark words for which we did not have semantic model coverage (see Section 5).
All categories supported ample coverage of the law and music domains (determined according to WordNet Domains (Bentivogli et al., 2004)). Five law words and five music words were selected from each taxonomic category. The taxonomic categories and example stimulus words (translated into English) are as follows:

Ur-abstract: Anderson et al.'s term for concepts that are classified as abstract in WordNet but do not belong to a clear subcategory, e.g., law or music.
Attribute: A construct whereby objects or individuals can be distinguished, e.g., legality, tonality.
Communication: Something that is communicated by, to or between groups, e.g., accusation, symphony.
Event/action: Something that happens at a given place and time, e.g., crime, festival.
Person/Social-role: Individual, someone, somebody, mortal, e.g., judge, musician.
Location: Points or extents in space, e.g., court, theatre.
Object/Tool: A class of unambiguously concrete nouns, e.g., handcuffs, violin.

The full list of stimuli is in Table 1. We split the stimulus nouns into the 35 most concrete and 35 most abstract words according to the behavioural concreteness ratings from Anderson et al. (2014).

2.2 fMRI Experiment

Participants. Nine right-handed native Italian speakers aged between 19 and 38 years (3 women) were recruited to take part in the study. Two were scanned after Anderson et al. (2014) to match the number of participants analysed by Mitchell et al. (2008); scanning had previously been halted at 7 instead of the planned 9 participants for a period due to equipment failure. All had normal or corrected-to-normal vision.

The 70 stimulus words were presented as written words, in 5 runs (all runs were collected in one participant visit), with the order of presentations randomised across runs. In each run, a randomly selected word was presented every 10 seconds, and remained on screen for 3 seconds. On reading a stimulus word, participants thought of a situation that they individually associated with the noun. This process is similar to previous concrete noun tasks, e.g., Mitchell et al. (2008), where participants were instructed to think of the properties of the noun. However, because people encounter difficulties eliciting properties of non-concrete concepts, compared to thinking of situations in which concepts played a role (Wiemer-Hastings and Xu, 2005), the experimental paradigm was adapted to imagining situations.

fMRI acquisition and preprocessing. Anderson et al. (2014) recorded fMRI images on a 4T Bruker MedSpec MRI scanner. They used an Echo Planar Imaging (EPI) pulse sequence with a 1000 msec repetition time, an echo time of 33 msec, and a 26° flip angle. A 64×64 acquisition matrix was used, and 17 slices were imaged with a between-slice gap of 1 mm. Voxels had dimensions of 3 mm × 3 mm × 5 mm.

fMRI data were corrected for head motion, unwarped, and spatially normalized to the Montreal Neurological Institute and Hospital (MNI) template. Only voxels estimated to be grey matter were included in the subsequent analysis. For each participant, for each scanning run (where a run is a complete presentation of 70 words), voxel activity was corrected by removing the linear trend and transformed to z scores (within each run). Each stimulus word was represented as a single volume by taking the voxel-wise mean of the 4 sec of data offset by 4 sec from the stimulus onset (to account for hemodynamic response).

Voxel selection. The 500 most stable grey matter voxels per participant were selected for analysis. This was undertaken within the leave-2-word-out decoding procedure detailed later in Section 4, using the same method as Mitchell et al. (2008): Pearson's correlation of each voxel's activity between matched word lists was computed for all scanning run pairs (10 unique run pairs giving 10 correlation coefficients), over the 68 of the 70 words that were not the two test words to be decoded. The mean coefficient was used as the stability measure, and the voxels with the 500 largest stability measures were selected.
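As a rough illustration, the following is a minimal NumPy sketch of this stability-based selection, assuming the five runs are stacked in a (runs × words × voxels) array; the function and variable names here are ours, not from the original analysis code.

import itertools
import numpy as np

def select_stable_voxels(runs, test_word_idx, n_voxels=500):
    """Select the voxels whose activity profiles over the 68 training
    words correlate best, on average, across all pairs of scanning runs
    (the stability criterion of Mitchell et al., 2008).

    runs: array of shape (n_runs, n_words, n_voxels) of z-scored data.
    test_word_idx: the two held-out test words, excluded from selection.
    """
    train = np.delete(runs, test_word_idx, axis=1)    # 68 of the 70 words
    n_runs, _, n_vox = train.shape
    run_pairs = list(itertools.combinations(range(n_runs), 2))  # 10 pairs
    stability = np.zeros(n_vox)
    for v in range(n_vox):
        rs = [np.corrcoef(train[a, :, v], train[b, :, v])[0, 1]
              for a, b in run_pairs]
        stability[v] = np.mean(rs)                    # mean over run pairs
    return np.argsort(stability)[-n_voxels:]          # 500 most stable voxels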
3 Semantic Models

3.1 Image-based semantic models

Following previous work in multi-modal semantics (Bergsma and Van Durme, 2011; Kiela et al., 2014), we obtain a total of 20 images for each of the stimulus words from Google Images (www.google.com/imghp). Images from Google have been shown to yield representations that are competitive in quality with alternative resources (Bergsma and Van Durme, 2011; Fergus et al., 2005). Image representations are obtained by extracting the pre-softmax layer from a forward pass in a convolutional neural network (CNN) that has been trained on the ImageNet classification task using Caffe (Jia et al., 2014). This approach is similar to, e.g., Kriegeskorte (2015), except that we only use the pre-softmax layer, which has been found to work particularly well in semantic tasks (Razavian et al., 2014; Kiela and Bottou, 2014). Such CNN-derived image representations have been found to be of higher quality than the traditional bag-of-visual-words models (Sivic and Zisserman, 2003) previously used in multi-modal semantics (Bruni et al., 2014; Kiela and Bottou, 2014). We aggregate the images associated with a stimulus word into an overall visually grounded representation by taking the mean of the individual image representations, as sketched at the end of this subsection.

Image search for abstract nouns. The validity and success of the following analyses depend on having built the image-based models from a set of images that are indeed relevant to the abstract words. The Google Image searches we used to build the image-based models largely returned a selection of images systematically associated with our most abstract nouns. For instance, 'corruption' returns suited figures covertly exchanging money; 'law', 'justice', 'music' and 'tonality' return pictures of gavels, weighing scales, musical notes and circles of fifths, respectively. For 'jurisdiction', the image search returns maps and law-related objects. However, there were also misleading cases such as 'pitch', where the image search, whilst returning potentially useful pictures of sinusoidal graphs, was heavily contaminated by images of football pitches. This problem is not exclusive to images, and the current text-based models are also not immune to the multiple senses of polysemous words.
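To make the pipeline concrete, here is a minimal sketch of the feature-extraction and averaging step, using a pretrained torchvision AlexNet as a stand-in for the paper's Caffe reference network; the exact truncation point and all names are illustrative assumptions, not the original implementation.

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Stand-in for the Caffe network: a pretrained AlexNet truncated before
# its final classification layer, approximating the "pre-softmax" features.
net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
net.classifier = torch.nn.Sequential(*list(net.classifier.children())[:-1])
net.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def word_image_vector(image_paths):
    """Aggregate a word's ~20 images into one grounded representation
    by averaging the per-image CNN feature vectors."""
    feats = []
    with torch.no_grad():
        for path in image_paths:
            img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
            feats.append(net(img).squeeze(0))   # one feature vector per image
    return torch.stack(feats).mean(dim=0)       # mean over the word's images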
3.2 Text-based semantic models

For linguistic input, we use the continuous vector representations from the skip-gram model of Mikolov et al. (2013). Specifically, we obtained 300-dimensional word embeddings by training a skip-gram model using negative sampling on recent Italian and English Wikipedia dumps (using Gensim with preprocessing from word2vec's demo script). For English, representations were built for the English translations of the 70 stimuli provided by Anderson et al. (2014). The English model was trained for 1 iteration, whereas the Italian model was trained for 5, since the Italian Wikipedia dump was considerably smaller (5.2 billion words for English vs 1.3 billion for Italian).

Following previous work exploiting cross-lingual textual resources (Richman and Schone, 2008; Shi et al., 2010; Darwish, 2013), we also applied the Italian and English text-based models in combination. Model combination was achieved at the analysis stage, by fusing the decoding outputs of the Italian and English models as described in Section 4.1.

4 Representational similarity-based decoding of brain activity

We decoded word-level fMRI representations using the semantic models, following the procedure introduced by Anderson et al. (2016). The process of matching models to words is abstracted to representational similarity space: for both models and brain data, words are semantically re-represented by their similarities to other words, by correlating all word pairs within the native model or brain space using Pearson's correlation (see Figure 1). The result is two square matrices of word-pair correlations: one for the fMRI data, another for the model. In similarity space, each word is a vector of correlations with all other words, thereby allowing model and brain words (similarity vectors) to be directly matched to each other.

Figure 1: Representing brain and semantic model vectors in similarity space.

In decoding, models were matched to fMRI data as follows (see Figure 2). Two test words were chosen. The 500 voxels estimated to have the most stable signal were selected using the strategy described in Section 2.2, with voxel selection based on the fMRI data of the other 68 of the 70 words. Selection on 68 rather than all 70 words was to allay any concern that voxel selection could have systematically biased the fMRI correlation structure (calculated next) to look like that of the semantic model, and consequently biased decoding performance. However, as similarity-based decoding does not optimise a mapping between fMRI data and semantic model, it is not prone to modelling and then decoding fMRI noise, as in classic cases of double dipping (Kriegeskorte et al., 2009). Indeed, as we report later in this section, there were no significant differences in decoding accuracy between voxel selection on 68 versus all 70 words.

A single representation of each word was built by taking the voxel-wise mean of all five presentations of the word for the 500 selected voxels. An fMRI similarity matrix for all 70 words was then calculated. Similarity vectors for the two test words were drawn from both the model and fMRI similarity matrices. Entries corresponding to the two test words were removed from both the model and fMRI similarity vectors, because these values could reveal the correct answer to decoding. The two model similarity vectors were then compared to the two fMRI similarity vectors by correlation, resulting in four correlation values. These correlation values were transformed using Fisher's r to z (arctanh). If the sum of z-transformed correlations for the correctly matched pair exceeded the sum for the incongruent pair, decoding was scored a success, otherwise a failure. This process was then repeated for all word pairs, with the mean accuracy over all test iterations giving the final measure of success.
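The following is a minimal NumPy sketch of this leave-2-out procedure, assuming the brain and model similarity matrices have already been computed over the same word list; the function and variable names are illustrative.

import itertools
import numpy as np

def fisher_r(x, y):
    """Pearson correlation followed by Fisher's r-to-z transform (arctanh)."""
    return np.arctanh(np.corrcoef(x, y)[0, 1])

def similarity_decoding_accuracy(S_brain, S_model):
    """Leave-2-out similarity decoding (after Anderson et al., 2016).
    S_brain, S_model: (n_words, n_words) Pearson similarity matrices."""
    n = S_brain.shape[0]
    pairs = list(itertools.combinations(range(n), 2))
    hits = 0
    for i, j in pairs:
        keep = [k for k in range(n) if k not in (i, j)]  # hide the answers
        congruent = (fisher_r(S_brain[i, keep], S_model[i, keep]) +
                     fisher_r(S_brain[j, keep], S_model[j, keep]))
        incongruent = (fisher_r(S_brain[i, keep], S_model[j, keep]) +
                       fisher_r(S_brain[j, keep], S_model[i, keep]))
        hits += congruent > incongruent   # correct labelling wins the pair
    return hits / len(pairs)              # chance level is 0.5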
Fisher's r to z transform (arctanh) is typically used to test for differences between correlation coefficients. It transforms the correlation coefficient r (which ranges between -1 and 1) to a value z = arctanh(r) = 0.5 ln((1+r)/(1-r)), which amplifies values at the tails of the correlation scale. This makes the sampling distribution of z approximately normal, with approximately constant variance across values of the population correlation coefficient. In the similarity-decoding method used here, z is evaluated in decoding because it is a more principled metric to compare and combine (as later undertaken in Section 4.1).

However, under most circumstances r to z is not critical to the procedure. z noticeably differs from r only when correlations exceed .5, and r to z changes decoding behaviour only in select circumstances. Specifically, r to z can influence how word labels are assigned to similarity vectors, by upweighting high-value correlation coefficients at the final stage of decoding. A hypothetical scenario illustrating this point is as follows. Let Pearson(X,Y) denote Pearson's correlation of vectors X and Y, let brainA be a brain similarity vector "A" with an unknown word label, and let model1 be a semantic model similarity vector for the known word label "1". In the final stage of analysis, there are two decoding alternatives: (i) Pearson(brainA,model2)=.9 and Pearson(brainB,model1)=.9, which when summed gives 1.8; and (ii) Pearson(brainA,model1)=.89 and Pearson(brainB,model2)=.91, which also sum to 1.8, so that on raw correlations (i) and (ii) are tied. Applying the r to z transform breaks the tie in favour of (ii), because arctanh(.9)+arctanh(.9)=2.94, whereas arctanh(.89)+arctanh(.91)=2.95.

Figure 2: Similarity-decoding algorithm (adapted from Anderson et al. 2016).

Statistical significance of decoding accuracy was determined by permutation testing (a sketch of this test follows at the end of this section). Decoding was repeated multiple times using the following procedure: creating a vector of word-label indices and randomly shuffling these indices; applying the vector of shuffled indices to reorder both rows and columns of only one of the similarity matrices (whilst keeping the original correct row/column labels, so that word labels now mismatch the matrix contents); and repeating the entire pair-matching decoding procedure described above. If word labels are randomly assigned to similarity vectors, we expect a chance-level decoding accuracy of 50%. Repetition of this process (here 10,000 repeats) supplies a null distribution of decoding accuracies achieved by chance. The p-value of the observed decoding accuracy is calculated as the proportion of chance accuracies that are greater than or equal to it.

For permutation testing only, voxel selection was undertaken a single time per participant, on all 70 words (rather than on 68 of the 70 words in each leave-2-out decoding iteration). This was to reduce computation time that would otherwise have been prohibitive, and is very unlikely to have yielded any discernible difference in outcome. Unlike decoding strategies that involve fitting a classification/encoding model to fMRI data (and are prone to fitting and subsequently decoding fMRI noise), similarity-based decoding does not learn a mapping between semantic model and fMRI data, and is robust to "double dipping" giving spurious decoding accuracies (see Kriegeskorte et al. (2009) for problems associated with double dipping).
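A sketch of this permutation scheme, reusing the decoding function sketched above; the default repeat count follows the paper's 10,000, while the seeding is illustrative.

import numpy as np

def permutation_pvalue(S_brain, S_model, n_perm=10_000, seed=0):
    """Permutation test for similarity decoding: shuffle the word labels
    of one similarity matrix (reordering its rows and columns together)
    and rebuild the null distribution of decoding accuracies."""
    rng = np.random.default_rng(seed)
    observed = similarity_decoding_accuracy(S_brain, S_model)
    n = S_model.shape[0]
    null_accs = np.empty(n_perm)
    for p in range(n_perm):
        idx = rng.permutation(n)                 # shuffled word labels
        shuffled = S_model[np.ix_(idx, idx)]     # reorder rows and columns
        null_accs[p] = similarity_decoding_accuracy(S_brain, shuffled)
    # Proportion of chance accuracies at least as large as the observed one.
    return (null_accs >= observed).mean()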
As an empirical demonstration, we reran all 21 of our actual (non-permuted) model-based decoding analyses, which are reported later in Section 5.2, whilst selecting voxels from all 70 words (as opposed to leave-2-out voxel selection on 68 of 70 words). Specifically, decoding analyses were repeated for all 7 model combinations, tested first on all words, then on the most concrete words only, and finally on the most abstract words only. Mean decoding accuracies for the 9 participants obtained with and without leave-2-out voxel selection were compared using paired t-tests. There were no significant differences across all 21 tests. The most different (non-significant) individual result was t=1.87, p=.09 (2-tailed), and in this case leave-2-out voxel selection gave the higher accuracy.

4.1 Model combination by ensemble averaging

To test whether the three different semantic models (image-based, Italian and English text-based) carried complementary information, we combined the models in evaluation, thus allowing us to test whether accuracies achieved using model combinations were higher than those achieved with the isolated models. To combine the different models, we used an ensemble averaging strategy and ran the similarity-based decoding analyses described above in parallel with each of the three semantic models. At each leave-2-out test iteration, this gave three arctanh-transformed 2×2 correlation matrices (one for each semantic model) that were used to evaluate decoding. Model combination was achieved by summing the respective arctanh-transformed correlation matrices. Evaluation of the resulting 2×2 summation matrix proceeded as previously, by first summing the two congruent values on the main diagonal of the matrix, then summing the two incongruent values on the counter-diagonal. If the congruent sum was greater than the incongruent sum, decoding was a success, otherwise a failure.
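A minimal sketch of this fusion step for a single leave-2-out iteration; the input layout (rows indexing the two brain similarity vectors, columns the two model vectors) is an assumption for illustration.

import numpy as np

def ensemble_decode_pair(z_matrices):
    """Fuse models for one leave-2-out iteration by summing their
    arctanh-transformed 2x2 correlation matrices, then compare the
    congruent (main diagonal) and incongruent (counter-diagonal) sums.

    z_matrices: list of 2x2 arrays, one per semantic model.
    """
    fused = np.sum(z_matrices, axis=0)
    congruent = fused[0, 0] + fused[1, 1]      # correctly matched pair
    incongruent = fused[0, 1] + fused[1, 0]    # mislabelled pair
    return congruent > incongruent             # True = decoding success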
5 Results

We split the stimulus nouns into the 35 most concrete and 35 most abstract words according to the behavioural concreteness ratings from Anderson et al. (2014), and ran analyses on all words combined and on these two subsets. Due to limitations in the word coverage of the semantic models, 'melody' was missing from the abstract words, and 'skeleton-key' and 'police-station' were missing from the most concrete words (hence 67 of the 70 words were analysed).

5.1 Hypotheses

Dual coding theory (Paivio, 1971) leads to the following hypotheses: (1) the text-based models will decode the more abstract nouns' neural activity patterns with higher accuracy than the image-based model; (2) both image- and text-based models will decode the more concrete nouns' neural activity.

We also compared the decoding accuracy for the most concrete nouns achieved using the combined image- and text-based models to the unimodal models in isolation. Whilst previous analyses have observed advantages for multimodal models in describing concrete-noun fMRI data, it is not clear whether this effect will carry over to our noun data set. One reason is that many of our most concrete words are "less concrete" than those of previous studies: according to Brysbaert et al. (2014)'s concreteness norms (where words were rated on a scale from 1 to 5), the mean ± SD rating of the 60 concrete nouns analysed by Mitchell et al. (2008) (and subsequently by Anderson et al. (2015)) is 4.87±.12, whereas that of the "most concrete" nouns analysed in the current article is significantly smaller at 4.42±.44 (independent samples t-test: t = 7.4, p < .0001, 2-tail). A second reason is that the experimental task required participants to imagine a situation associated with the noun, rather than think of object properties. This analysis was therefore of a more exploratory nature.

5.2 Decoding Analysis

Decoding analyses were run using the image-based model and the Italian and English text-based models in isolation, and also all combinations of these models, as described in Section 4. Results are in Figure 3. In this section we use the abbreviations Img for the image-based model, and TXit and TXen for the Italian and English text-based models, respectively.

In all tests, chance-level decoding accuracy (the expected accuracy if word labelling is random) is 50%. Mean±SE accuracies across all participants are displayed in the leftmost column of plots for all 7 model combinations. Individual-level results are displayed for only three model combinations to avoid cluttering the graphs (Img only, the combined TXit&TXen, and the combined Img&TXit&TXen). To simplify the following discussion of results, we mainly focus on these three models. The choice to focus on TXit&TXen, rather than the Italian model alone, was made following the rationale that the language combination would leverage the cultural nuances of semantic structure found in the Italian text corpora jointly with the more extensive coverage of the larger English Wikipedia. Although TXit&TXen and TXen tended to produce higher decoding accuracies, there were no significant differences between either TXit or TXen tested in isolation, or any model combination incorporating them. Mean results are displayed for all model combinations in Figure 3, and key results are tabulated in Table 2.

Figure 3: Results of the decoding analysis from Section 5.2. See also Table 2. p=.05 lines were empirically estimated as described in Section 4 and apply to decoding an individual's fMRI data (not multiple individuals).

5.3 An advantage for the textual model on abstract nouns

With respect to hypothesis 1 (an advantage for the text-based models in decoding abstract neural activity patterns), the key difference to observe in Figure 3 is the drop in relative decoding accuracy between the image-based model and the text-based models when decoding the most abstract nouns. The nine participants' mean decoding accuracies for the most abstract nouns were compared between the Img, TXit, TXen and TXit&TXen models using repeated measures ANOVA. Combinations of image- and text-based models (e.g., Img&TXen) were not directly relevant to this analysis (because they integrate visual and textual data) and were consequently excluded. Bartlett's test was used to verify that there was no evidence against homogeneity of variances prior to analysis (χ2 = 1.77, p = .62). The ANOVA indicated a statistically significant difference between models: F(3,24) = 5.06, p < .01. Post hoc comparisons conducted using the Tukey Honest Significant Difference (HSD) test revealed that decoding accuracies achieved using TXen and the TXit&TXen model were significantly different from (and larger than) Img (both p < .05).
There were no other significant differences (including between Img and TXit). One possible reason for the weaker performance of TXit than TXen is that Italian Wikipedia is a less rich source of information, due to being smaller in size than English Wikipedia (despite presumably containing semantic information that is more relevant to Italian culture).

                 All words combined    Most concrete         Most abstract
Img              67±3%, 7/9 (<.001)    70±3%, 7/9 (<.001)    58±4%, 2/9 (.07)
TXit&TXen        76±5%, 7/9 (<.001)    76±6%, 7/9 (<.001)    68±5%, 6/9 (<.001)
Img&TXit&TXen    77±5%, 8/9 (<.001)    77±5%, 8/9 (<.001)    68±5%, 5/9 (<.001)

Table 2: Key decoding accuracies from Section 5.2 (see also Figure 3). Each cell shows the mean±SE decoding accuracy, the number (n) of participants decoded at a level significantly above chance (p<.05), and, in round brackets, the cumulative binomial probability of achieving ≥ n significant results at p=.05.

5.4 Both image- and text-based models decode the more concrete nouns

That both image- and text-based models significantly decoded the most concrete nouns is consistent with hypothesis 2. To test for differences between image- and text-based models, mean decoding accuracies for the nine participants on the most concrete nouns were compared for the Img, TXit, TXen and TXit&TXen models using repeated measures ANOVA. Combinations of image- and text-based models (e.g., Img&TXen) were not directly relevant to this analysis (because they integrate visual and textual data) and so were excluded. Bartlett's test was used to verify homogeneity of variances prior to analysis (χ2 = 2.86, p = .41). The ANOVA detected no statistically significant differences between the models: F(3,24) = 1.56, p = .22. Therefore, when decoding the most concrete nouns there was no significant difference in accuracy between the image-based model and any text-based model.

5.5 No overall advantage for multimodal models on the more concrete nouns

The third, exploratory, test compared the accuracy of the multimodal combination of image- and text-based models to the unimodal models when decoding the more concrete neural activity patterns. For the most concrete words, the highest scoring combination across all models was Img&TXen (mean±SE = 77±4%). Whilst this proved to be significantly greater than Img (t = 3.13, p ≤ .02, df = 8, 2-tail), it was not significantly greater than TXen (t = .81, p = .44, df = 8, 2-tail). Turning to the analogous case for the Italian models, Img&TXit (mean±SE = 75±4%) was not significantly greater than Img (t = 1.74, p = .12, df = 8, 2-tail) or TXit (t = 1.09, p = .31, df = 8, 2-tail). Therefore, although multimodal combinations returned higher accuracies than either the image- or text-based models in isolation (for concrete words), decoding accuracy was not significantly higher than with either the image- or the text-based models alone.

Previous work decoding neural activity associated with concrete nouns has found image-based models to supply complementary information to text-based models (Anderson et al., 2015). We suggest three reasons that image-based models may have been disadvantaged in the current study compared to these past analyses. Firstly, Anderson et al.
focused on fMRI data elicited by unambiguously concrete nouns, whereas the experimental nouns analysed in the current article were mostly intended to be 'less than concrete' (of the seven taxonomic categories investigated, only 'objects/tools' was designed to be unambiguously concrete). Secondly, Anderson et al. used more images to build the noun representations (on average 350 images per noun, compared to the 20 used here), and the nouns in the ImageNet images were segmented according to bounding boxes. Consequently, their input may have been less noisy than Google Images (which we used because of its wider coverage). Finally, the experimental task of the previous analyses required participants to actively think about the properties of objects, whereas the current data set was elicited as participants imagined situations associated with nouns (and hence may have invoked neural representations with more contextual elements).

The lack of a significant increase in decoding accuracy achieved by pairing image- and text-based models allows us to infer that the text-based model contained many aspects of the visual semantic structure found in the image-based model. Of course we expect modal structure in text-based models commensurate with what people are inclined to report in writing; e.g., it is easy to convey in text that both bananas and lemons are yellow and curvy, and that lightbulbs and pears have similar shapes. Therefore we would anticipate correspondences in semantic similarities between image- and text-based models, and for these correspondences to extend to match neural similarities, e.g., as induced by participants viewing pictures of objects (Carlson et al., 2014).

5.6 Group-level decoding analysis

The similarity-based decoding approach we have applied enables group-level neural representations to be built simply by taking the mean similarity matrix over participants, as sketched below. Values in the correlation matrix were r to z (arctanh) transformed prior to averaging, and the averaged values were then back-transformed to the original range using tanh. This was because averaging z-transformed values (and back-transforming) tends to yield less biased estimates of the population value than averaging the raw coefficients (Silver and Dunlap, 1987). However, in the current analysis, results obtained with and without the z-transformation were virtually identical.

Building group-level representations by averaging correlation matrices side-steps potential problems surrounding the obvious alternative method of averaging data in fMRI space, where anatomical/functional differences between different people's brains may result in relatively similar activity patterns being spatially mismatched in the standardised fMRI space. The motivation behind building group-level neural representations is that we might expect these to better match the computational semantic models than individual-level data, because the models are also built at group level, created from the photographs and text of many individuals. However, building group-level neural representations will only be beneficial if there exist group-level commonalities in representational similarity (in which case combining data will reduce noise), as opposed to individual semantic representational schemes.
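A sketch of this group-averaging step, assuming each participant's word-by-word Pearson similarity matrix is in hand; masking the unit diagonal, where arctanh is undefined, is an implementation detail added here for illustration.

import numpy as np

def group_similarity_matrix(subject_sims):
    """Group-level similarity matrix: Fisher-z average of the participants'
    word-pair correlation matrices, back-transformed with tanh
    (cf. Silver and Dunlap, 1987)."""
    S = np.stack(subject_sims)           # (n_subjects, n_words, n_words)
    n = S.shape[1]
    off_diag = ~np.eye(n, dtype=bool)    # arctanh(1) on the diagonal is undefined
    G = np.eye(n)                        # keep unit self-similarities
    G[off_diag] = np.tanh(np.arctanh(S[:, off_diag]).mean(axis=0))
    return G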
Accuracies achieved using the models to decode the group-level neural similarity matrices are displayed in the final column of the bar charts at the right of Figure 3. Specifically, decoding accuracies were: for all words combined, Img=84.8%, TXit&TXen=96.9% and Img&TXit&TXen=97.3%; for the most concrete words, Img=87.5%, TXit&TXen=95.8% and Img&TXit&TXen=95.8%; and for the most abstract words, Img=70.2%, TXit&TXen=85.2% and Img&TXit&TXen=84.8%.

To statistically test whether group-level decoding accuracies surpassed those of the individual-level results, we compared the set of individual-level mean accuracies to the corresponding group-level mean accuracy using one sample t-tests (a sketch follows below). In all tests (see Table 3) the individual-level accuracies were significantly lower than the group-level accuracy (corrected for multiple comparisons using the false discovery rate (Benjamini and Hochberg, 1995)). This is indicative of group-level regularities in semantic similarity for both concrete and abstract nouns, and also for their combination.

A qualitative observation is that the differences between group- and individual-level accuracy appear to be greater for concrete nouns. This could be consistent with participants having a more subjective semantic representation of abstract nouns; however, we did not attempt to statistically test this claim, because a meaningful comparison would require the concrete and abstract words to be controlled to be at least equally discriminable at the individual level, and this does not appear to be the case with this dataset.

                 All words combined   Most concrete   Most abstract
Img              -5.6 (.004)          -5.2 (.004)     -3.0 (.02)
TXit&TXen        -4.2 (.007)          -3.6 (.010)     -3.4 (.01)
Img&TXit&TXen    -4.4 (.007)          -3.9 (.008)     -3.4 (.01)

Table 3: Results of one sample t-tests comparing the set of individual-level mean decoding accuracies to the group-level accuracy (see Section 5.6). All tests were 2-tailed with df=8. The first number in each cell is the t-statistic; the second number, in round brackets, is the p-value (corrected according to the false discovery rate).
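A sketch of the comparison reported in Table 3, with scipy and statsmodels standing in for whatever statistics software was actually used; the Benjamini-Hochberg procedure is assumed as the false discovery rate correction, matching the citation in the text.

from scipy.stats import ttest_1samp
from statsmodels.stats.multitest import multipletests

def compare_to_group(individual_accs, group_accs):
    """One-sample t-tests of the 9 individual-level mean accuracies
    against each group-level accuracy, FDR-corrected across conditions.

    individual_accs: dict mapping condition name -> list of 9 accuracies.
    group_accs: dict mapping the same condition names -> group accuracy.
    """
    names = list(individual_accs)
    results = [ttest_1samp(individual_accs[c], group_accs[c]) for c in names]
    pvals = [res.pvalue for res in results]
    _, pvals_fdr, _, _ = multipletests(pvals, method="fdr_bh")
    # Return (t-statistic, corrected p-value) per condition, as in Table 3.
    return {c: (res.statistic, p) for c, res, p in zip(names, results, pvals_fdr)}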
6 Conclusion

This article has demonstrated that neural activity patterns elicited as participants imagine situations associated with abstract nouns can be decoded using text-based computational semantic models, thus demonstrating that computational semantic models can contribute to interpreting the semantic structure of neural activity patterns associated with abstract nouns. Furthermore, by comparing how well visually grounded and textual semantic models decode brain activity associated with concrete or abstract nouns, we have observed a selective advantage for textual over visual models in decoding the more abstract nouns. This provides initial model-based brain decoding evidence that is broadly in line with the predictions of dual coding theory (Paivio, 1971).

However, the results should be interpreted in light of the following two factors. First, the dataset analysed covered a small sample of 67 words, and it is reasonable to conjecture that some of these words are also encoded in modalities other than vision and language. For example, musical words may be encoded in acoustic and motor features (see also Fernandino et al. (2015)). Future work will be necessary to verify that the findings generalise more broadly to words from domains beyond law and music. In work in progress, the authors are undertaking more focused analyses of the current dataset, using textual, visual and newly developed audio semantic models (Kiela and Clark, 2015) to tease apart the linguistic, visual and acoustic contributions to semantic representation and how these vary throughout different regions of the brain.

A second limitation of the current approach, as pointed out by a reviewer, is that the Google image search algorithm (the workings of which are unknown to the authors) may not perform as well for abstract words as it does for concrete words. Consequently, the visual model may have been handicapped compared to the textual model when decoding neural representations associated with the more abstract words. We have no current measure of the degree of this effect, but it may be possible to alleviate it in future work by having participants manually select images that they associate with abstract stimulus words, and using computational representations derived from these images in the analysis.

A secondary result is that we have exploited representational similarity space to build group-level neural representations, which better match our inherently group-level computational semantic models. In so doing, we have exposed group-level commonalities in neural representation for both concrete and abstract words. Such group-level representations may prove both a useful test-bed for evaluating computational semantic models and a potentially useful information source to incorporate into computational models (see Fyshe et al. (2014) for related work).

Finally, we have demonstrated that English and Italian text-based models are roughly interchangeable in our neural decoding task. That the English text-based model tended to return marginally higher results on our Italian brain data than the Italian model provides a cautionary note for future studies wishing to use semantic models from different languages to identify culturally specific aspects of neural semantic representation, e.g., as a follow-up to Zinszer et al. (2016). However, we also note that the English Wikipedia corpus was larger than the corresponding Italian corpus.

Acknowledgments

We thank three anonymous reviewers for their insightful comments and suggestions, Brian Murphy for his involvement in the configuration, collection and preprocessing of the original dataset, and Marco Baroni and Elia Bruni for early conversations on some of the ideas presented. Stephen Clark is supported by ERC Starting Grant DisCoTex (306920).

References

A. J. Anderson, E. Bruni, U. Bordignon, M. Poesio, and M. Baroni. 2013. Of words, eyes and brains: Correlating image-based distributional semantic models with neural representations of concepts. In Proceedings of EMNLP, pages 1960–1970, Seattle, WA.

A. J. Anderson, B. Murphy, and M. Poesio. 2014. Discriminating taxonomic categories and domains in mental simulations of concepts of varying concreteness. J. Cognitive Neuroscience, 26(3):658–681.

A. J. Anderson, E. Bruni, A. Lopopolo, M. Poesio, and M. Baroni. 2015. Reading visually embodied meaning from the brain: Visually grounded computational models decode visual-object mental imagery induced by written text. NeuroImage, 120:309–322.

A. J. Anderson, B. D. Zinszer, and R. D. S. Raizada. 2016. Representational similarity encoding for fMRI: Pattern-based synthesis to predict brain activity using stimulus-model-similarities. NeuroImage, 128:44–53.

M. Andrews, G. Vigliocco, and D. Vinson. 2009. Integrating experiential and distributional data to learn semantic representations. Psychological Review, 116(3):463–498.

L. Barca, C. Burani, and L. S. Arduino. 2002. Word naming times and psycholinguistic norms for Italian nouns. Behavior Research Methods, Instruments, & Computers, 34:424–434.

Y. Benjamini and Y. Hochberg. 1995.
Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), 57(1):289–300.

L. Bentivogli, P. Forner, B. Magnini, and E. Pianta. 2004. Revising the WordNet Domains hierarchy: Semantics, coverage, and balancing. In Proceedings of the Workshop on Multilingual Linguistic Resources, pages 101–108, Geneva, Switzerland.

S. Bergsma and B. Van Durme. 2011. Learning bilingual lexicons using the visual similarity of labeled web images. In IJCAI, pages 1764–1769.

R. Bruffaerts, P. Dupont, R. Peeters, S. De Deyne, G. Storms, and R. Vandenberghe. 2013. Similarity of fMRI activity patterns in left perirhinal cortex reflects similarity between words. J. Neuroscience, 33(47):18597–18607.

E. Bruni, N. K. Tran, and M. Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1–47.

M. Brysbaert, A. B. Warriner, and V. Kuperman. 2014. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3):904–911.

T. A. Carlson, R. A. Simmons, and N. Kriegeskorte. 2014. The emergence of semantic meaning in the ventral temporal pathway. J. Cognitive Neuroscience, 26(1):120–131.

K. M. Chang, T. M. Mitchell, and M. A. Just. 2010. Quantitative modeling of the neural representations of objects: How semantic feature norms can account for fMRI activation. NeuroImage: Special Issue on Multivariate Decoding and Brain Reading, 56:716–727.

K. Darwish. 2013. Named entity recognition using cross-lingual resources: Arabic as an example. In Proc. ACL, pages 1558–1567.

B. Devereux, C. Kelly, and A. Korhonen. 2010. Using fMRI activation to conceptual stimuli to evaluate methods for extracting conceptual representations from corpora. In Proceedings of the NAACL HLT First Workshop on Computational Neurolinguistics, pages 70–78, Los Angeles, USA.

C. Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. 2005. Learning object categories from Google's image search. In ICCV, pages 1816–1823.

L. Fernandino, C. J. Humphries, M. S. Seidenberg, W. L. Gross, L. L. Conant, and J. R. Binder. 2015. Prediction of brain activation patterns associated with individual lexical concepts based on five sensory-motor attributes. Neuropsychologia. doi:10.1016/j.neuropsychologia.2015.04.009.

A. Fyshe, P. P. Talukdar, B. Murphy, and T. M. Mitchell. 2014. Interpretable semantic vectors from a joint model of brain- and text-based meaning. In Proceedings of ACL, pages 489–499, Baltimore, MD.

Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, pages 675–678.

D. Kiela and L. Bottou. 2014. Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In Proceedings of EMNLP, pages 36–45, Doha, Qatar.

D. Kiela and S. Clark. 2015. Multi- and cross-modal semantics beyond vision: Grounding in auditory perception. In Proceedings of EMNLP 2015, pages 2461–2470, Lisbon, Portugal.

D. Kiela, F. Hill, A. Korhonen, and S. Clark. 2014. Improving multi-modal representations using image dispersion: Why less is sometimes more. In Proceedings of ACL 2014.

N. Kriegeskorte, W. K. Simmons, P. S. F. Bellgowan, and C. I. Baker. 2009.
Circular analysis in systems neuroscience: The dangers of double dipping. Nature Neuroscience, 12:535–540.

N. Kriegeskorte. 2015. Deep neural networks: A new framework for modeling biological vision and brain information processing. Annual Review of Vision Science, 1:417–446.

T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of ICLR, Scottsdale, Arizona, USA.

T. M. Mitchell, S. V. Shinkareva, A. Carlson, K.-M. Chang, V. L. Malave, R. A. Mason, and M. A. Just. 2008. Predicting human brain activity associated with the meaning of nouns. Science, 320:1191–1195.

B. Murphy, P. Talukdar, and T. Mitchell. 2012. Selecting corpus-semantic models for neurolinguistic decoding. In Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM), pages 114–123, Montreal, Canada.

A. Paivio. 1971. Imagery and Verbal Processes. Holt, Rinehart, and Winston, New York.

M. Palatucci, D. Pomerleau, G. Hinton, and T. Mitchell. 2009. Zero-shot learning with semantic output codes. Neural Information Processing Systems, 22:1410–1418.

F. Pereira, M. Botvinick, and G. Detre. 2013. Using Wikipedia to learn semantic feature representations of concrete concepts in neuroimaging experiments. Artif. Intell., 194:240–252.

A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. 2014. CNN features off-the-shelf: An astounding baseline for recognition. In IEEE Conference on Computer Vision and Pattern Recognition Workshops 2014, pages 512–519.

A. E. Richman and P. Schone. 2008. Mining wiki resources for multilingual named entity recognition. In Proc. ACL.

L. Shi, R. Mihalcea, and M. Tian. 2010. Cross-language text classification by model translation and semi-supervised learning. In Proc. EMNLP.

N. C. Silver and W. P. Dunlap. 1987. Averaging correlation coefficients: Should Fisher's z transformation be used? J. Applied Psychology, 72(1):146–148.

J. Sivic and A. Zisserman. 2003. Video Google: A text retrieval approach to object matching in videos. In ICCV, pages 1470–1477.

G. Sudre, D. Pomerleau, M. Palatucci, L. Wehbe, A. Fyshe, R. Salmelin, and T. Mitchell. 2012. Tracking neural coding of perceptual and semantic features of concrete nouns. NeuroImage, 62:451–463.

K. Wiemer-Hastings and X. Xu. 2005. Content differences for abstract and concrete concepts. Cognitive Science, 29:719–736.

B. D. Zinszer, A. J. Anderson, O. Kang, T. Wheatley, and R. D. S. Raizada. 2016. Semantic structural alignment of neural representational spaces enables translation between English and Chinese words. J. Cognitive Neuroscience, 28(11):1749–1759.