Abstract
This work focuses on the enrichment of existing Portuguese word embeddings with visual information in the form of visual embeddings. This information was extracted from images portraying given vocabulary terms and imagined visual embeddings learned for terms with no image data. These enriched embeddings were tested against their text-only counterparts in common NLP tasks. The results show an increase in performance for several tasks, which indicates that visual information fusion for word embeddings can be useful for word embedding based NLP tasks.
Financially supported by the Brazilian National Council for Scientific and Technological Development (CNPq) and by the Portuguese Foundation for Science and Technology (FCT) under the projects CEECIND/01997/2017, UIDB/00057/2020.
1 Introduction
Language modelling technologies have been dominated by semantic embedding models ever since Mikolov et al. (2013a, 2013b) [11, 12] popularized Word Embeddings, a concept which revolutionized the field of Natural Language Processing (NLP). The architecture presented by the authors, Word2Vec, has been used as a basis for many works across the spectrum of NLP tasks, as attested by the nearly 45,000 citations of the two papers combined (as recorded by Google Scholar). Its popularity stems mainly from the fact that training the architecture requires only raw text, with no human-made annotation (the main obstacle in training machine learning models).
Many architectures based on the original intuition behind Word2Vec have become popular since 2013. The most prevalent, besides the original Word2Vec, are fastText [7] and GloVe [14]. An evolution upon the concept, taking into account the current context of a word, not just an amalgamation of all contexts with which it was trained, was introduced by Peters et al. (2018) [15], with their ELMO architecture, and popularized by Devlin et al.’s (2019) [4] BERT architecture.
All of the mentioned embedding architectures have at least one model trained on Portuguese language corpora. The Núcleo Interinstitucional de Linguística Computacional (NILC), from the Universidade de São Paulo (USP), for example, has several Word2Vec, fastText and GloVe models for the Portuguese language available within their Word Embedding repository (Footnote 1). The Allen Institute for AI maintains an ELMO model repository which includes a Portuguese language model (Footnote 2). BERTimbau [19], a Portuguese language BERT model, was recently developed and added to the Hugging Face library (Footnote 3). These models, and others, have been used to advance the state-of-the-art in several Portuguese language NLP tasks [6, 9, 17].
Beyond these efforts to further enhance the usage of text in the training of word embedding models, be it Portuguese language text or otherwise, an effort to enrich these embeddings with other modes of information also arose. The most studied modes of information used to enhance Word Embeddings are the visual mode (composed of images and video) and the aural mode (composed of sounds, spoken language, music, etc.). These efforts spurred the creation of multimodal embedding fusion architectures, used to join embeddings of disparate modes into a single embedding representing all fused knowledge. An example of this is the concatenation-based architecture of Bruni et al. (2014) [1], which arrived at promising results by proposing that embeddings of different modes be concatenated into a single, higher-dimensional space for use in NLP tasks.
The goal of this work is to study the usage of visual data to enrich textual data within word embeddings for NLP tasks in the Portuguese language. The main hypothesis is that fusing textual information with visual information will enhance results for text-only tasks. To test it, experiments in four NLP tasks were performed using text-only and multimodal embedding models. These tasks were: Word Relatedness, Analogy Prediction, Semantic Similarity in Short Sentences, and Named Entity Recognition.
The following sections are arranged in the following manner: Sect. 2 delves into related work within the literature; Sect. 3 explains the methodology used in creating the multimodal embeddings; Sect. 4 explains the testing methodology and presents the results for these tests; Sect. 5 presents the conclusions and future work that can be done to expand upon this topic.
2 Related Work
The literature reveals two main ways in which multimodal embeddings are constructed: individually and simultaneously. That is, either learning is performed individually (an embedding is learned for each modality, and then these are fused) [3], or simultaneously (all modalities are learned at the same time in the same space). Henceforth, the former method will be referred to as Post-Learning Fusion, while the latter method will be referred to as Simultaneous Learning.
Post-learning fusion is divided into two further methods: early fusion and late fusion. Early fusion is performed at the representation level, and three methods of early fusion were found in the literature: feature concatenation, auto-encoder fusion, and cross-modal mapping. Feature concatenation joins each pair of single-modality embedding vectors (that is, a textual feature vector representing a concept is concatenated with a visual feature vector representing that same concept) into a single, longer, multimodal feature vector [8]. Auto-encoder fusion feeds pre-trained single-modality embeddings into an auto-encoder, generating a single feature vector which can then be extracted from the auto-encoder's last hidden layer [18]. Cross-modal mapping learns from a certain amount of pre-mapped multimodal inputs and predicts representations for those that do not have examples in both modalities [3]. Late fusion is performed at the level of prediction scores, through an averaging of single-modality predictions [8].
Lazaridou et al. (2015) [10] introduced the first instance of a simultaneous-learning semantic embedding model found during this review, based on Mikolov et al.'s (2013) [12] skip-gram architecture. They extended Mikolov et al.'s models to present relevant visual feature vectors alongside textual data during training for a subset of target words. This model has been shown to propagate visual information even to representations of words which were not trained with visual features.
As for evaluation methodologies, most of the literature consisted of using multimodality to improve the performance of downstream tasks. As such, the evaluation of the embeddings was extrinsic. That is, the evaluation metric was whether or not its addition to the systems performing the downstream task affected their performance.
Lazaridou et al. (2015) [10] were the only ones to perform intrinsic tests, using general semantic benchmarks such as concept relatedness (also known as semantic relatedness) [1] or semantic similarity. These are usually used to evaluate word embeddings, but multimodal embeddings were shown by Lazaridou et al. (2015) to outperform word embeddings on these tasks.
3 Resources and Methodology
Several choices had to be made prior to developing this work’s methodology. It was decided that these first tests would use static word embeddings based on the Word2Vec and FastText architectures, and use a translated version of the image embeddings released by Collell et al. (2017) [3]. These readily available resources made post-learning fusion methods an obvious choice for developing our own fusion architectures. Since the tasks we intended to use these models for were strictly textual, late post-learning fusion was not an option, as there would be no visual data to input into the system during tests, which left early post-learning fusion techniques.
The resources and architectures mentioned in the above paragraph are elaborated upon in the following subsections. This section also explains the multimodal model development methodology.
3.1 Unimodal Embeddings
The text embeddings used for this work were NILC's word embeddings (Footnote 4) [9] and the BBP corpus word embeddings (Footnote 5) [17]. Three versions of NILC's embeddings were used: the 100-feature word2vec version and the 100- and 300-feature fastText versions. These three were deemed adequate for studying the effect of different parameters when adding multimodality to textual models. Only the 300-feature fastText version of BBP was used, as it was the only one readily available for download. This BBP model was chosen as a means to study how different text embedding training corpora within the same domain affect multimodal fusion.
The visual embedding, henceforth referred to as the ImageNet embedding, is derived from Collell et al.'s (2017) [3] work, as they made their original visual embeddings, created using ImageNet, freely available (Footnote 6). The individual embeddings were paired with English language terms from the English WordNet, however, and so needed to be translated before use with Portuguese language textual embeddings. To translate the English terms, OpenWordNet-PT [13], an open Brazilian WordNet available online (Footnote 7), was used. Since the codes used to refer to each term in both WordNets are the same, and Collell et al. (2017) also shared the WordNet code for each term, about 5000 of the 18000 original text-visual embedding pairs were successfully translated into Brazilian Portuguese unigrams. This resulted in what we believe to be the first visual embedding dataset paired with Brazilian Portuguese terms, made available through our GitHub page (Footnote 8).
3.2 Dealing with the Information Gap
The great imbalance between visual and textual embeddings becomes clear when comparing the roughly 5000 terms of the ImageNet embedding to the textual embedding vocabularies shown in Table 1. To ameliorate this problem, the “imagined embeddings” architecture described in Collell et al. (2017) [3] was used. As exemplified in Fig. 1, textual embedding-visual embedding pairs are created for the terms present in the visual embedding vocabulary and used to train a feed-forward neural network: the textual embedding \(\overrightarrow{l_x}\) is input into the network, which is expected to output the visual embedding \(\overrightarrow{v_x}\), where \(w_x\) is the term being learned. Once this textual-visual mapping, f, is learned by the network, it can be extrapolated to terms without visual counterparts, creating “Imagined” visual embeddings for the entire vocabulary of the textual embedding being translated.
Fig. 1. Example of the architecture used by Collell et al. (2017). The imagined representations are the outputs of a text-to-vision mapping, f. Image created by Collell et al. (2017) [3]
Our imagined embedding networks used the following parameters: 0.25 dropout; 0.1 learning rate; SGD optimization; MSE loss; 1 hidden layer with 200 nodes; and a TanH activation function. As in Collell et al. (2017), we chose certain epoch thresholds to test: 25, 50, 100, and 150 epochs for each model. The translated ImageNet embeddings were used to train these imagined embeddings for each text model.
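A minimal NumPy sketch of this text-to-vision mapping network, following the hyperparameters above (one 200-node TanH hidden layer, MSE loss, plain SGD); this is an illustration rather than the authors' actual implementation, and the dropout step is omitted for brevity:

```python
import numpy as np

def train_imagined_mapping(X_text, Y_vis, hidden=200, lr=0.1, epochs=100, seed=0):
    """Learn f: text embedding -> visual embedding with one tanh hidden layer,
    MSE loss and per-sample SGD (the paper's dropout is omitted here)."""
    rng = np.random.default_rng(seed)
    d_in, d_out = X_text.shape[1], Y_vis.shape[1]
    W1 = rng.normal(0, 0.1, (d_in, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.1, (hidden, d_out)); b2 = np.zeros(d_out)
    for _ in range(epochs):
        for i in rng.permutation(len(X_text)):   # one SGD pass per epoch
            x, y = X_text[i], Y_vis[i]
            h = np.tanh(x @ W1 + b1)             # hidden activation
            y_hat = h @ W2 + b2                  # predicted visual vector
            err = y_hat - y                      # gradient of squared error
            dh = (err @ W2.T) * (1 - h ** 2)     # back-propagate through tanh
            W2 -= lr * np.outer(h, err); b2 -= lr * err
            W1 -= lr * np.outer(x, dh); b1 -= lr * dh

    def imagine(x):
        """Predict an 'imagined' visual embedding for any text embedding."""
        return np.tanh(x @ W1 + b1) @ W2 + b2
    return imagine
```

Once trained on the roughly 5000 covered terms, `imagine` can be applied to every word in the textual vocabulary.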
Notably, Collell et al. (2017) report that while these imagined embeddings are valuable additions to common embeddings, substituting the textual embeddings completely with them yields worse results. Additionally, in a follow-up paper, Collell et al. (2018) [2] highlighted several problems with this architecture, such as the fact that the imagined embeddings do not fully mimic the behaviour of proper visual embeddings. It remains, however, that when combined with the original textual embeddings, these “imagined embeddings” do positively affect results in intrinsic tasks such as Word Relatedness.
3.3 Fusion Techniques
Two fusion techniques were tested in this work: Concatenation and Auto-encoding. Concatenation was performed as detailed in Collell et al. (2017), but for the Portuguese language, while Auto-encoding was inspired by the work of Silberer and Lapata (2014) [18] and adapted to work with our resources.
Furthermore, in order to perform this kind of fusion, it is helpful to ensure that all fused embeddings are on the same scale, so that none can overly influence the result simply because it is presented on a larger scale than another. To this end, the embeddings were standardized: every feature was scaled to have a mean of 0 and a standard deviation of 1.
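This standardization step can be sketched as follows (a minimal NumPy illustration; the function name is ours):

```python
import numpy as np

def standardize(E, eps=1e-8):
    """Column-wise z-scoring of an embedding matrix (rows = words,
    columns = features): each feature gets mean 0 and standard deviation 1,
    so no modality dominates the fusion purely by scale."""
    return (E - E.mean(axis=0)) / (E.std(axis=0) + eps)
```

Applied independently to the textual and visual matrices before either fusion technique.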
Concatenation Fusion. This technique is simple: concatenate one mode’s embeddings to the end of another mode’s embeddings. This effectively packages all necessary data into a single vector space by expanding the dimensionality of that space.
This fusion technique's greatest weakness is that, should an embedding in one mode not have a counterpart in another mode (as often happens with text-image multimodality, e.g. a textual embedding with no counterpart visual embedding), it is not possible to create the multimodal embedding. This problem is solved with the use of Imagined Embeddings, which, though not perfectly representative of actual image embeddings, allow for the concatenation of the entire vocabulary.
As such, the development of this embedding required the prediction of an imagined visual embedding for each word in the vocabulary, which was then concatenated with its originating word embedding. This resulted in multimodal embeddings with a larger feature pool to draw from. Figure 2 presents the architecture of the concatenated fusion used for every word embedding in this work.
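A sketch of the concatenation step, assuming a dictionary of text vectors, a partial dictionary of real visual vectors, and an `imagine` mapping as described in Sect. 3.2 (names are illustrative, not the authors' code):

```python
import numpy as np

def concat_fuse(text_vecs, visual_vecs, imagine):
    """Build a multimodal embedding for every word in the text vocabulary by
    concatenation, falling back on the imagined-embedding mapping `imagine`
    whenever a word has no real visual vector."""
    fused = {}
    for word, t in text_vecs.items():
        v = visual_vecs[word] if word in visual_vecs else imagine(t)
        fused[word] = np.concatenate([t, v])
    return fused
```

The result is a single vector space whose dimensionality is the sum of the two modalities' dimensionalities.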
Auto-Encoding Fusion. This technique uses a neural network trained to reproduce its input as its output. Once trained, a hidden layer with fewer features than the original input is extracted to serve as an embedded version of the input. This serves both to shorten the final embedding and to fuse several embeddings together. In theory, this fusion keeps the most important features and merges the less important ones, making them more impactful.
This architecture has been used in the literature to lessen the impact of the gap between textual and visual information [18]. In that instance, whenever there was no visual pair for the textual embedding, a zeroed vector was appended to the textual embedding for the purposes of auto-encoding. The architecture presented below is slightly different, as it offers a new possibility: using imagined embeddings to fill the knowledge gap and offer complete feature vectors for auto-encoding. Figure 3 presents the architecture of the Auto-encoded fusion used for every multimodal word embedding in this work.
The hidden layers are divided into two encoding layers and two decoding layers. The first encoding layer takes the concatenated textual-visual feature vector as input and outputs a vector the size of the textual feature vector plus half the visual feature vector. The second layer takes that output as input and outputs a vector the size of the textual feature vector. The output of this second layer is extracted and used as the Auto-encoded textual-visual embedding. The decoder is used only during training, and its two hidden layers mirror the encoder's in reverse order. The auto-encoding networks used the following parameters: 0.001 learning rate; ADAM optimization; MSE loss; four hidden layers, as explained above; and ReLU activation functions between layers with a TanH function at the final output.
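The layer sizes described above can be sketched as an untrained NumPy network (an illustration of the shapes only; the real model is trained with ADAM on the reconstruction loss):

```python
import numpy as np

def make_fusion_autoencoder(d_text, d_vis, seed=0):
    """Untrained sketch of the fusion auto-encoder's shape: encoder
    (d_text + d_vis) -> (d_text + d_vis // 2) -> d_text, with a mirrored
    decoder used only during training; ReLU between layers, TanH output."""
    sizes = [d_text + d_vis, d_text + d_vis // 2, d_text]
    rng = np.random.default_rng(seed)
    enc_w = [rng.normal(0, 0.1, (a, b)) for a, b in zip(sizes, sizes[1:])]
    rev = sizes[::-1]
    dec_w = [rng.normal(0, 0.1, (a, b)) for a, b in zip(rev, rev[1:])]
    relu = lambda z: np.maximum(z, 0.0)

    def encode(x):                      # x: concatenated text+visual vector
        h = relu(x @ enc_w[0])
        return relu(h @ enc_w[1])       # fused embedding, size d_text

    def decode(z):                      # reconstruction; target = the input x
        h = relu(z @ dec_w[0])
        return np.tanh(h @ dec_w[1])

    return encode, decode
```

After training, only `encode` is kept, and its output replaces the concatenated vector as the multimodal embedding.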
3.4 Multimodal Embeddings
Several different textual-visual multimodal embeddings were created using the unimodal embeddings and multimodal fusion techniques explained above. Four text models were used: a 300-dimensional fastText model trained on the BBP corpus (BBPFT300); a 300-dimensional fastText model trained on the NILC corpus (NILCFT300); a 100-dimensional fastText model trained on the NILC corpus (NILCFT100); and a 100-dimensional word2vec model trained on the NILC corpus (NILCW2V100). The Imagined Embeddings were trained from the 5000 available word-image pairs to four epoch thresholds: 25, 50, 100, and 150. Finally, each of the four text models was fused with all Imagined Embedding epoch threshold models, yielding a total of 16 models, four per text model. Since two fusion architectures were used, this total doubles to a final roster of 32 tested models: 16 with the Concatenation fusion architecture and 16 with the Auto-encoding fusion architecture.
Note that training to several fixed epoch thresholds, rather than tuning each model individually, was due to the time and computational resources that would be required to train the best model for each individual task. As such, the best performing model out of each group can be taken to best represent the capabilities of the multimodal embedding fusion in question.
4 Tests and Results
Four tests were performed on the 32 models created for this work: Word Relatedness, Analogy Prediction, Semantic Similarity in Short Sentences, and Named Entity Recognition. These tests and their results are presented in that order within this section. Note that all four training epoch thresholds were tested for each fusion architecture of each model, but only the results for the best performing epoch threshold of each model are displayed for comparison in the tables below.
4.1 Word Relatedness
Word Relatedness is the intrinsic task of scoring how closely related two terms are. These tasks are usually scored via Spearman correlation, which assigns a real number between −1 and 1: the score approaches −1 the more inversely correlated the predictions are with the annotation, 0 the more unrelated they are, and 1 the more directly correlated they are. The more representative of human understanding of the terms an embedding is, the closer its Spearman score comes to 1.
Custom code was written for this task. It uses the Gensim Python library to extract the cosine similarity between each word pair as a relatedness measurement, and compares these to the annotated relatedness scores using Spearman correlation. The code can be accessed on this project's GitHub page (Footnote 9).
The test set is a collection of 3000 word pairs annotated with a relatedness score from 0 (least related) to 50 (most related). The objective of the semantic models is to score each word pair so as to rank them from most to least related. The closer the model's ranking is to the annotated ranking, the higher its Spearman correlation. Table 2 presents some examples from this test set.
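The scoring pipeline can be sketched without Gensim as follows (a simplified illustration; the rank-based Spearman here ignores ties, unlike a production implementation):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the rank vectors
    (no tie handling, which is enough for illustration)."""
    rank = lambda x: np.argsort(np.argsort(x)).astype(float)
    return float(np.corrcoef(rank(np.asarray(a)), rank(np.asarray(b)))[0, 1])

def relatedness_score(vecs, annotated_pairs):
    """annotated_pairs: (word1, word2, human_score) triples. Returns the
    Spearman correlation between model cosine similarities and human scores."""
    model = [cosine(vecs[w1], vecs[w2]) for w1, w2, _ in annotated_pairs]
    human = [score for _, _, score in annotated_pairs]
    return spearman(model, human)
```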
The test corpus used for word relatedness testing, MEN [1], was translated from English to Portuguese with the help of DeepL Translate (Footnote 10). The machine translations were checked individually to ensure some degree of uniformity, but the corpus should nonetheless be considered silver-standard. Table 3 shows the results of this test, with the best overall model and the best architecture for each model marked in bold.
This table shows that multimodal fusion enhances model performance for this task by an average of 3 percentage points, and that the best overall architecture is the Auto-encoded architecture. Between the multimodal models, however, the average difference is 0.5 percentage points in favor of the Auto-encoded architecture, so neither fusion architecture has an overwhelming advantage over the other in this task.
4.2 Analogy Prediction
Hartmann et al. (2017) [9] published an analogy prediction test set, divided into Brazilian Portuguese and European Portuguese halves, alongside the initial publication of their NILC word embeddings. Each test item provides a related word pair and a single word, from which the model must predict the word that completes an analogous pair.
The code used to run these tests was made available alongside the test set itself, and can be found on the associated paper's GitHub page (Footnote 11). It measures accuracy as the number of correct predictions achieved by the model over the total number of predictions.
This test focuses on two kinds of analogies: semantic and syntactic. Each is divided into a Brazilian Portuguese set and a European Portuguese set. To reiterate, the objective of this task is to accurately predict the second word of a pair when given an example pair and the first word of the prediction pair (e.g. Berlin/Germany, Prediction: Paris/?). The accuracy of the model is measured as a percentage, from 0 (completely inaccurate) to 100 (completely accurate). Table 5 shows the results of this test, with the best overall model and the best architecture for each model marked in bold, and Table 4 presents some examples from this test set.
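The standard vector-offset method for this task can be sketched as follows (an illustration with toy vectors, not the test suite's actual code):

```python
import numpy as np

def predict_analogy(vecs, a, b, c):
    """Vector-offset analogy (a is to b as c is to ?): return the vocabulary
    word whose embedding is closest, by cosine similarity, to
    vecs[b] - vecs[a] + vecs[c], excluding the three query words."""
    target = vecs[b] - vecs[a] + vecs[c]
    best, best_sim = None, -2.0
    for word, v in vecs.items():
        if word in (a, b, c):
            continue                       # never return a query word
        sim = float(target @ v / (np.linalg.norm(target) * np.linalg.norm(v)))
        if sim > best_sim:
            best, best_sim = word, sim
    return best
```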
The results, presented in Table 5, show that the two language-specific test sets were corroborative, with similar results being achieved in all test set pairs. On average, there was a 0.01 percentage point difference between the text-only and the best multimodal results, which leads to the conclusion that Analogy Prediction was mostly unaffected by the addition of visual information through the multimodal fusion methods tested herein. That said, the Auto-encoding architecture tended to underperform the Concatenation architecture in all cases but those in which the NILCFT100 semantic model was used.
These results were not unexpected, as the image data used for the visual embeddings mostly portrays concrete objects, while the Analogy Prediction tests focus on abstract relations such as parentage, countries, currency, and word forms. It is promising, however, that the previously mentioned best overall model achieved the best multimodal results relative to its text-only counterpart. With further testing, it might be ascertained whether the better the original text embedding, the more effective the imagined visual embedding fusion.
4.3 Semantic Similarity in Short Sentences
Semantic similarity requires that a model give a numerical value to how semantically similar two sentences are, with the lower similarity extreme being that the sentences are completely different, and the higher similarity extreme being that the sentences are paraphrases. The ASSIN [5] sentence similarity corpus was used for this task in this work.
The code used for the tests is the same as used by Hartmann et al. (2017) [9], available on the publication's GitHub page (Footnote 12). The architecture uses a linear regression algorithm trained on two features: the cosine similarity between the TF-IDF vectors of the two sentences, and the cosine similarity between the sums of each sentence's word embeddings.
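The embedding-based feature can be sketched as follows (a simplified illustration of the summed-embedding cosine; tokenization and the TF-IDF feature are omitted, and the function names are ours):

```python
import numpy as np

def sentence_vector(tokens, vecs, dim):
    """Sum of the word embeddings of a tokenized sentence; out-of-vocabulary
    tokens are skipped, and an all-OOV sentence yields a zero vector."""
    total = np.zeros(dim)
    for t in tokens:
        if t in vecs:
            total += vecs[t]
    return total

def embedding_similarity_feature(tokens1, tokens2, vecs, dim):
    """One of the two regression features: cosine similarity between the
    summed embeddings of the two sentences (0.0 if either sum is zero)."""
    u = sentence_vector(tokens1, vecs, dim)
    v = sentence_vector(tokens2, vecs, dim)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0
```

This feature and the TF-IDF cosine are then fed to the linear regressor that predicts the 1-5 similarity score.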
The ASSIN Semantic Similarity dataset is divided into two tracks, European Portuguese and Brazilian Portuguese. The objective of the task is to predict a number between 1 (unrelated sentences) and 5 (paraphrasing sentences) representing the similarity between two short sentences. The task was evaluated using Pearson's Correlation and Mean Squared Error (MSE), as in the original ASSIN task. Table 6 shows the results of this test, with the best overall model and the best architecture for each model marked in bold. Figure 4 presents some examples from this test set.
The results found with these Semantic Similarity tests are very similar to those found during the Word Relatedness tests presented previously in this section. These tests found that multimodal models outperform text-only models by an average of 2 percentage points, and that the Auto-encoding architecture is generally superior to the Concatenation architecture within these test conditions.
4.4 Named Entity Recognition
Named Entity Recognition (NER) requires that, given a set of classes for named entities, a model recognize and classify these entities within raw text, usually by use of tags. Word embeddings can be used to parse the text input into the model, using the feature vectors in its tagging predictions. The HAREM [16] NER corpus was used to test the models.
The code used for the NER tests was developed by Santos et al. (2019) [17], and is available on the paper's GitHub page (Footnote 13). It uses an LSTM-CRF neural network architecture to train a sequence tagger with the Flair Toolkit, performing a NER task based on the supplied training and test corpora.
The HAREM test set is composed of two tracks: the Selective track and the Total track. The Selective track is the smaller of the two, including only the five most populated named entity categories within the test set. The Total track is the larger, including all ten named entity categories present in the First HAREM test set. Table 7 shows the results of this test, with the best overall model and the best architecture for each model highlighted in bold. Figure 5 presents some examples from this test set.
On average, the multimodal models outperform text-only models by 0.25 percentage points in the Selective track, and are outperformed by the text-only models by 0.08 percentage points in the Total track. These results show that the addition of visual information through multimodal fusion has little effect on the results for our models in this test set.
5 Conclusion and Future Work
This work presented the results of a study into the usefulness of visual data when used in conjunction with textual data for NLP tasks in a general news domain. It involved the development of word embedding models which were then put through a test battery for multimodal Word Embedding models which included the following tasks: Word Relatedness, Sentence Similarity, Analogy Prediction and Named Entity Recognition. These results revealed some aspects of textual-visual multimodal fusion for Word Embeddings within NLP tasks for the Portuguese language, a field in which it is most common to study purely textual Word Embedding models.
It took inspiration from the works of Bruni et al. (2014) [1] and their concatenation-based multimodal fusion architecture; Silberer and Lapata (2014) [18] and their auto-encoding multimodal fusion architecture; and Collell et al. (2017) [3] and their Imagined Embeddings cross-modal mapping neural network for visual vocabulary expansion. It took a different tack from previous work by exploring the use of this technology beyond the English language, using resources for the Portuguese language in previously unexplored combinations.
Testing revealed that in tasks which require broader semantic meaning judgements, such as word relatedness and semantic similarity, multimodal fusion with visual information enhances results. For the specific test sets within these tasks explored in this work, the average increase in correlation with human scoring was 2 percentage points. For the tasks of analogy prediction and named entity recognition, however, the fusion had little to no impact on the final results. This might be explained by the fact that these annotations in particular make use of knowledge that is not present within the visual modality, and thus were not enhanced by its addition.
As future work, we have planned testing multimodal fusion techniques on Contextual Embeddings such as BERT and ELMO. We also plan to expand testing into inherently multimodal tasks such as text-image pairing and cross-modal retrieval, both for the Portuguese Language.
References
Bruni, E., Tran, N., Baroni, M.: Multimodal distributional semantics. J. Artif. Intell. Res. 49, 1–47 (2014)
Collell, G., Moens, M.: Do neural network cross-modal mappings really bridge modalities? In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pp. 462–468 (2018)
Collell, G., Zhang, T., Moens, M.: Imagined visual representations as multimodal embeddings. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence, pp. 4378–4384 (2017)
Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 17th Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technologies, pp. 4171–4186 (2019)
Fonseca, E.R., Santos, L.B., Criscuolo, M., Aluísio, S.M.: Visão geral da avaliação de similaridade semântica e inferência textual. Linguamática 8, 3–13 (2016)
Gomes, D.S.M., et al.: Portuguese word embeddings for the oil and gas industry: development and evaluation. Comput. Ind. 124, 1–44 (2021)
Grave, E., Mikolov, T., Joulin, A., Bojanowski, P.: Bag of tricks for efficient text classification. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pp. 427–431 (2017)
Habibian, A., Mensink, T., Snoek, C.G.M.: Video2vec embeddings recognize events when examples are scarce. IEEE Trans. Pattern Anal. Mach. Intell. 39, 2089–2103 (2017)
Hartmann, N., Fonseca, E.R., Shulby, C., Treviso, M.V., Rodrigues, J.S., Aluísio, S.M.: Portuguese word embeddings: evaluating on word analogies and natural language tasks. In: Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology, pp. 122–131 (2017)
Lazaridou, A., Pham, N.T., Baroni, M.: Combining language and vision with a multimodal skip-gram model. In: Proceedings of the 13th Conference of the North American Chapter of the Association of Computational Linguistics on Human Language Technologies, pp. 153–163 (2015)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: Proceedings of the 1st International Conference on Learning Representations, p. 12 (2013)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of the 27th Annual Conference on Neural Information Processing Systems, pp. 3111–3119 (2013)
Paiva, V., Rademaker, A., Melo, G.: OpenWordNet-PT: an open Brazilian WordNet for reasoning. In: Proceedings of the 24th International Conference on Computational Linguistics, pp. 353–360 (2012)
Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 19th Conference on Empirical Methods in Natural Language Processing, pp. 1532–1543 (2014)
Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. In: Proceedings of the 16th Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technologies, pp. 2227–2237 (2018)
Santos, D., Cardoso, N.: A golden resource for named entity recognition in portuguese. In: Proceeding of the 7th International Conference on the Computational Processing of Portuguese, pp. 69–79 (2007)
Santos, J., Consoli, B.S., Santos, C.N., Terra, J., Collovini, S., Vieira, R.: Assessing the impact of contextual embeddings for portuguese named entity recognition. In: Proceedings of the 8th Brazilian Conference on Intelligent Systems, pp. 437–442 (2019)
Silberer, C., Lapata, M.: Learning grounded meaning representations with autoencoders. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 721–732 (2014)
Souza, F., Nogueira, R., Lotufo, R.: Bertimbau: pretrained BERT models for Brazilian Portuguese. In: Proceedings of the 9th Brazilian Conference on Intelligent Systems, pp. 403–417 (2020)
© 2021 Springer Nature Switzerland AG
Consoli, B.S., Vieira, R. (2021). Enriching Portuguese Word Embeddings with Visual Information. In: Britto, A., Valdivia Delgado, K. (eds) Intelligent Systems. BRACIS 2021. Lecture Notes in Computer Science(), vol 13074. Springer, Cham. https://doi.org/10.1007/978-3-030-91699-2_30