Abstract
This paper explores methods to disambiguate Part-of-Speech (PoS) tags of closed class words in Brazilian Portuguese corpora annotated according to the Universal Dependencies annotation model. We evaluate disambiguation methods from different paradigms, namely a Markov-based method, a widely adopted parsing tool, and a BERT-based language modeling method. We compare their performance with two baselines and observe a significant gain of more than 10% over the baselines for all proposed methods. We also show that, while the BERT-based model outperforms the others, reaching up to 98% accuracy in predicting the correct PoS tag, combining the three methods in an Ensemble yields more stable results, as indicated by the smaller variance across our numerical experiments.
1 Introduction
Developing large annotated corpora is a challenging task in the area of Natural Language Processing (NLP). A common strategy is to rely on automatic analysis tools to pre-annotate the data and then to manually review the annotation, as direct human annotation requires large amounts of time, strict discipline, and a homogeneous system of annotation rules that must be followed by annotators. In the case of treebanks (i.e., databases of sentences and their corresponding syntactic trees), this must be carried out at least for both Part-of-Speech (PoS) tagging and syntactic parsing. However, as PoS annotation serves as input for syntactic annotation, the better the PoS annotation, the lower the risk of tagging errors leading to parsing errors.
In the pre-annotation of PoS, the hardest part is assigning tags to words from closed classes, as many of them may belong to different classes, that is, they are ambiguous. These words are among the most frequent in any corpus, regardless of genre and domain. As they constitute anchors for syntactic annotation decisions, it is important to annotate them accurately, thus offering safe annotation points to support the effort of producing large annotated treebanks.
Of particular interest to this paper is the initiative to perform PoS tagging for Portuguese language according to the widely adopted Universal Dependencies (UD) model [18, 20], which predicts the existence of “universal” PoS tags and syntactic relations that may be applied to all languages in the world. There are currently nearly 200 treebanks in over 100 languages affiliated to UD.
In Portuguese, we highlight the previous work of Rademaker et al. [22], which produced a UD-annotated version of the Bosque treebank [1], and the recent efforts of Silva et al. [25], who explored PoS tagging for tweets, Lopes et al. [17], who showed that there are non-ambiguous words and sequences of words that may significantly improve tagging accuracy, and Souza et al. [3], who created a UD corpus for domain-specific documents. Nevertheless, in practically all languages and domains there is a large number of challenging words that are ambiguous and require more effort to be properly annotated.
We focus here on investigating and proposing methods for solving the problem of PoS tag disambiguation of words of a particular case in Portuguese: the closed class words. Such word classes include prepositions, conjunctions, determiners, numbers, pronouns and primitive adverbs (non-derived adverbs). The motivation for such choice relies on the fact that closed class words may be very ambiguous regarding PoS tagging and, once solved, may serve as solid evidence to help disambiguating words of other word classes, improving tagging accuracy overall. For instance, in PortiLexicon-UD [16], which is a UD-based lexicon for Portuguese, the word “que” (in English, it usually corresponds to “what”, “that” or “which”) may have up to 7 different word classes and represents a challenge for annotation, as discussed in the work of Duran et al. [6]. To exemplify the challenges, we reproduce some examples from this work: the word “que” is:
- a pronoun in the sentence “As coisas com que sonhamos são pistas para o autoconhecimento”,
- a subordinating conjunction in “É obvio que quero votar”,
- a determiner in “Que maravilha”, and
- an adverb in “Que horrível foi esse acidente”.
The work of Silva et al. [25] evidences the practical difficulties of PoS tagging: some of these classes, such as pronouns and subordinating conjunctions, achieve some of the lowest accuracy values (below 90%) in the reported experiments.
In our paper we explore three different methods to disambiguate the words of the selected classes:
- a Markovian model, similar to the one developed by Assunção et al. [2], to disambiguate each ambiguous word considering the PoS tag sequences of a training set;
- the tagger/parser UDPipe 2.0 [28], pretrained for Portuguese, which is used to tag the whole sentences, from which we observe the decisions made for ambiguous words;
- a contextual word embedding solution that individually disambiguates the PoS tag of ambiguous words, similar to the approach of Vandenbussche et al. [30], which uses BERT (Bidirectional Encoder Representations from Transformers) [4]; here we employ a language model pretrained specifically for Brazilian Portuguese, BERTimbau [26].
We perform a detailed error analysis in comparison with two baselines and also explore an Ensemble solution based on a voting scheme over the three presented models’ outcomes. Our results show accuracy values ranging from 90% to 98% for the three proposed methods, as well as for the Ensemble. While the BERTimbau-based approach delivers the highest accuracy values in the best cases, the Ensemble solution presents more stable results, since its accuracy across all experimented cases has a smaller standard deviation than that of each of the three methods applied individually. The achieved accuracy is relevant and the gains in terms of annotation effort are considerable, as we automatically resolve nearly 25% of all tokens of large Portuguese corpora.
This paper is organized as follows. The next section presents the main related work. The problem definition and the baselines for this study are presented in Sect. 3. Section 4 presents the three evaluated methods: the Markovian model, the UDPipe 2.0 application, and our contextual word embedding BERTimbau-based approach. Section 5 describes our experiments over an example corpus and discusses the achieved effectiveness. Finally, Sect. 6 summarizes our contributions and suggests future work.
2 Related Work
The literature offers related work using varied techniques to deal with PoS ambiguities. We comment on the main ones in what follows.
Ehsani et al. [7] perform the disambiguation of PoS tags for Turkish using Conditional Random Fields (CRF) [14]. The authors point out the processing burden associated with using CRF for the task, but they deliver PoS tag accuracies around 97%. They also mention that this high PoS tag accuracy is very beneficial to support reliable lemma and morphological annotations.
A different approach is the work of Assunção et al. [2], which performs PoS tag disambiguation based on Markovian models [13] and a partially annotated corpus. In this work, the disambiguation follows a basic probability computation over PoS tag sequences and therefore delivers faster computations than more sophisticated approaches.
The work of Hoya Quecedo et al. [10] presents an effort to disambiguate PoS tags and lemmas for morphologically rich languages, which is the case of Portuguese, using a Neural Network model based on BiLSTM (Bidirectional Long Short-Term Memory) [8]. Hoya Quecedo et al. build word embeddings to feed the bidirectional model and, thus, estimate the probability of the PoS tag and lemma options of each word. These experiments deliver a precision varying from 80% to 85% for corpora in Finnish, Russian, and Spanish.
Another interesting work, published by Muñoz-Valero et al. [19], also employs an LSTM (Long Short-Term Memory) RNN (Recurrent Neural Network) [5] to disambiguate words in the American National Corpus [11]. Similarly, the works of Shen et al. [24] and Zalmout and Habash [32] aim at the disambiguation of morphological features, both using LSTMs and RNNs.
3 Problem Definition and Baseline Approaches
The problem we aim to solve is how to disambiguate closed class words that can receive different PoS tags, for example, the word “por”, which in Portuguese can be either an ADP (“by” in English) or a VERB (“to put” in English). Our proposed methods work with a training set of fully annotated sentences in the Universal Dependencies (UD) CoNLL-U format [20] in order to disambiguate the PoS tag annotation of all ambiguous words of a test set.
The set of UD PoS tags is formed by:
- ADJ - adjectives, as “bonito” (“beautiful” in English);
- ADP - adpositions, as “de” (“of” in English);
- ADV - adverbs, as “não” (“no” in English);
- AUX - auxiliary verbs, as “foi” (“was” in English);
- CCONJ - coordinating conjunctions, as “e” (“and” in English);
- DET - determiners, as “cujo” (“whose” in English);
- INTJ - interjections, as “tchau” (“goodbye” in English);
- NOUN - nouns, as “vida” (“life” in English);
- NUM - numerals, as “cinco” (“five” in English);
- PART - particles, which are not employed in Portuguese;
- PRON - pronouns, as “ele” (“he” in English);
- PROPN - proper nouns, as “Brasil” (“Brazil” in English);
- PUNCT - punctuation marks, as “?”;
- SCONJ - subordinating conjunctions, as “porque” (“because” in English);
- SYM - symbols, as “$”;
- VERB - verbs, as “jogamos” (“(we) play” in English);
- X - others, as foreign words.
The closed classes considered in this paper are ADP, CCONJ, DET, NUM, PRON and SCONJ, plus subsets of the classes ADV (see Note 1), AUX (see Note 2), and ADJ (see Note 3). Table 1 shows the considered closed classes, the total number of words (#tot.), the number of ambiguous words (#amb.), and some examples. Considering that many ambiguous tokens belong to several classes, the ambiguous tokens amount to 368 different words overall. A full list of the considered closed class words is available at https://sites.google.com/icmc.usp.br/poetisa/publications.
The first baseline (Baseline 1) takes into account the number of occurrences of each ambiguous word in the training set and uses the most common PoS tag for all occurrences in the test set. For example, if the word “até”, which can be an ADP (“until” in English) or an ADV (“even” in English), is found 189 times as ADP and 77 times as ADV, all occurrences of “até” will be tagged as ADP. However, to prevent decisions based on too sparse data in the training set, we only make a prediction when there are at least 3 occurrences of a word with the most frequent PoS tag. Additionally, to prevent decisions based on too close differences, the second most frequent PoS tag must occur less than half as often as the most frequent one.
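Baseline 1 can be sketched as follows; the function and variable names are ours, but the thresholds (at least 3 occurrences of the top tag, and a runner-up less than half as frequent) follow the description above:

```python
from collections import Counter, defaultdict

def build_baseline1(training_tokens, min_count=3):
    """Learn a most-frequent-tag table for Baseline 1.

    training_tokens: iterable of (word, pos_tag) pairs from the training set.
    A word gets a prediction only if its top tag occurs at least `min_count`
    times and the runner-up tag occurs less than half as often.
    """
    counts = defaultdict(Counter)
    for word, tag in training_tokens:
        counts[word.lower()][tag] += 1

    table = {}
    for word, tag_counts in counts.items():
        ranked = tag_counts.most_common(2)
        best_tag, best_n = ranked[0]
        second_n = ranked[1][1] if len(ranked) > 1 else 0
        if best_n >= min_count and second_n < best_n / 2:
            table[word] = best_tag
    return table

# The "até" example from the text: 189 ADP vs. 77 ADV occurrences.
train = [("até", "ADP")] * 189 + [("até", "ADV")] * 77
table = build_baseline1(train)
# 77 < 189/2, so all test occurrences of "até" would be tagged ADP.
```

A word with counts such as 5 vs. 3 would receive no prediction, since the runner-up is not less than half as frequent as the top tag.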
The second baseline is an implementation following the definitions of Hoya Quecedo et al. [10], which uses a Neural Network model based on BiLSTM, as described before. We employ this baseline using the same hyperparameters defined in [10] and refer to it as Baseline 2.
4 Prediction Methods
In this section we present the three methods implemented in this paper’s experiments. The first is a traditional Markovian modeling approach inspired by the work of Assunção et al. [2], presented in Subsect. 4.1. The second is the application of the UDPipe 2.0 tool [28], presented in Subsect. 4.2. The third is an approach based on the BERTimbau pretrained neural language model [26] to build a contextual word embedding model, presented in Subsect. 4.3.
4.1 Prediction Through Markovian Models
Among the Markovian approaches to stochastically predict language information, we employ the one based on Assunção et al. [2]. This approach consists of building a Markov chain model from the sentences of the training set, describing the probabilities of PoS tag sequences.
To illustrate the method, consider the sentence “Com a rescisão, as provas apresentadas pelos delatores ainda podem ser usadas.”, annotated as described in Fig. 1. The figure also shows the corresponding Markov chain representing the sequences of PoS tags. For example, the two words tagged as ADP are both followed by DET, therefore the probability of transition from ADP to DET is 100%. Similarly, the three occurrences of NOUN are followed by PUNCT, VERB, and ADV tags, therefore each of these arcs in the chain has a transition probability of 33.3%. Additionally, one extra node representing the sentence boundary (\(\star \)) links to the sentence start (before ADP) and the sentence end (after PUNCT).
This Markov chain creation process is repeated for all sentences of the training set. Once all sentences have been used to build the Markov chain, each word with ambiguous PoS tags is disambiguated considering the PoS tags preceding and succeeding it.
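The training and disambiguation steps above can be sketched as follows; this is a simplified illustration under our own naming, using a first-order chain with a `*` boundary node as in the description above:

```python
from collections import Counter, defaultdict

BOUNDARY = "*"  # sentence-boundary node of the chain

def train_transitions(tag_sequences):
    """Count PoS-tag transitions over all training sentences."""
    trans = defaultdict(Counter)
    for tags in tag_sequences:
        seq = [BOUNDARY] + list(tags) + [BOUNDARY]
        for prev, nxt in zip(seq, seq[1:]):
            trans[prev][nxt] += 1
    return trans

def prob(trans, prev, nxt):
    """Transition probability estimated from the counts."""
    total = sum(trans[prev].values())
    return trans[prev][nxt] / total if total else 0.0

def disambiguate(trans, prev_tag, next_tag, candidate_tags):
    """Choose the candidate tag most compatible with both neighbours."""
    return max(candidate_tags,
               key=lambda t: prob(trans, prev_tag, t) * prob(trans, t, next_tag))

# Toy training data: ADP is always followed by DET, never by ADV.
trans = train_transitions([["ADP", "DET", "NOUN"], ["ADV", "VERB"]])
```

For instance, `disambiguate(trans, "ADP", "NOUN", ["DET", "ADV"])` returns `"DET"`, since the ADV candidate has zero probability of following ADP in this toy chain.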
4.2 Prediction Through UDPipe 2.0
The second method is the use of the tagger/parser UDPipe 2.0 [28] with one of the available Portuguese training models, BOSQUE-PT [29] in our experiment, composed of 9,364 sentences and 210,957 tokens. UDPipe 2.0 uses an RNN based on BiLSTM to perform a full UD annotation learned from a training set.
Our approach in this paper is to feed the whole sentences of the testing set to UDPipe 2.0 and pinpoint the ambiguous words to observe the disambiguation accuracy for the target closed class words.
4.3 Prediction Through BERTimbau-Based Model
A more recent approach, which has achieved state-of-the-art results in many NLP tasks [9, 15], is based on the usage of pretrained neural Language Models (LMs), such as BERT [4]. An LM is capable of providing contextualized word embeddings, which can later be used in a downstream task. Specifically, we use the language model BERTimbau, a BERT model pretrained for Brazilian Portuguese [26], in its smallest version, BERTimbau-base, which, like BERT-base, has 12 layers.
All sentences are previously tokenized using the same BERTimbau model, as it requires words to be split into subtokens. For words consisting of more than one subtoken, the embedding used during classification is that of the first subtoken.
For the PoS disambiguation, our model has two inputs: the sentence duly tokenized and the position of the token to disambiguate. BERTimbau is used to retrieve a sentence vector (which represents the whole context) and also the word embedding for the token to disambiguate (to include specific information about the token), which are later concatenated and passed through a linear layer in order to create a combined representation.
Finally, this vector is fed into another linear layer with Softmax for classification into PoS tags, with a dropout layer between the combined representation and the final classification layer to tackle overfitting. Figure 2 depicts the general process of the BERTimbau-based method.
For the fine-tuning of the model, the Cross Entropy loss function was optimized using Adam [12] with learning rate \(\alpha = 10^{-5}\) for 2 epochs. The dropout rate was 0.2, with batch size 8. The whole model was implemented using HuggingFace’s Transformers library [31] alongside PyTorch [21] and is available as supplemental material (see Note 4).
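As an illustration of the classification head described above, the following numpy sketch mimics the forward pass with random stand-ins for the BERTimbau sentence and token vectors; the dimensions (hidden size 768, 17 UD tags) match BERT-base and the UD tag set, but the weights here are random placeholders, and dropout is omitted since it is only active during training:

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, N_TAGS = 768, 17  # BERT-base hidden size; 17 UD PoS tags

# Random stand-ins for BERTimbau outputs: the sentence vector and the
# embedding of the first subtoken of the word to disambiguate.
sentence_vec = rng.normal(size=HIDDEN)
token_vec = rng.normal(size=HIDDEN)

# Hypothetical weights; in the real model they are learned during fine-tuning.
W_combine = rng.normal(scale=0.02, size=(HIDDEN, 2 * HIDDEN))
W_classify = rng.normal(scale=0.02, size=(N_TAGS, HIDDEN))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Concatenate context and token representations, pass through a linear
# layer to obtain the combined representation, then classify into PoS tags.
combined = W_combine @ np.concatenate([sentence_vec, token_vec])
probs = softmax(W_classify @ combined)
predicted_tag_id = int(probs.argmax())
```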
5 Experiments
The experiments were performed over an annotated corpus of Brazilian Portuguese with 8,420 sentences (168,397 tokens) extracted from the first five thousand news articles of the Folha Kaggle data bank [23]. This corpus was manually annotated with UD tags by a group of 10 annotators, with redundancy, supervised by a chief linguist. Among its 168,397 tokens, it contains 44,208 occurrences of 368 distinct ambiguous words.
To perform the experiments and compare all methods, we randomly split the corpus into three parts: train, a training data set with approximately 80% of the tokens of the full corpus; dev, a development data set with approximately 10% of the tokens; and test, a testing data set with approximately 10% of the tokens. To prevent bias in our experiments, we created 10 such random splits as folds, performing cross-validation to establish the accuracy of each method. Baseline 1 and the Markov-based model use the train+dev data sets for training and the test data set for testing. The UDPipe 2.0 method uses the test data set for testing and ignores the train and dev sets. Baseline 2 and the BERTimbau-based model employ the train, dev, and test sets of each split individually.
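A sentence-level approximation of one such 80/10/10 split can be sketched as follows; note that the paper splits by approximate token counts, so splitting by sentences, as here, is only an approximation:

```python
import random

def split_corpus(sentences, seed=0):
    """Shuffle sentences and split them ~80/10/10 into train/dev/test."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_dev = int(0.1 * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])

# Each distinct seed yields one fold of the cross-validation.
train, dev, test = split_corpus(list(range(100)))
```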
Table 2 presents the accuracy and F1 obtained for each fold, while Table 3 summarizes the results obtained by each method in terms of the average (see Note 5), standard deviation, minimum, and maximum over the ten folds. In these tables, the methods are abbreviated as Markov, for the Markov-based method (Subsect. 4.1), UDPipe, for the UDPipe 2.0 application method (Subsect. 4.2), and BERT, for the BERTimbau-based method (Subsect. 4.3). Finally, these tables include the accuracy and F1 results of an Ensemble method, obtained by applying the Markov-based, UDPipe 2.0, and BERTimbau-based methods and deciding on the PoS tag by a simple voting strategy among them.
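A minimal sketch of the voting strategy is shown below; the text does not specify how three-way disagreements are resolved, so falling back to the BERTimbau prediction is our assumption:

```python
from collections import Counter

def ensemble_vote(markov_tag, udpipe_tag, bert_tag):
    """Majority vote among the three methods' predictions."""
    votes = Counter([markov_tag, udpipe_tag, bert_tag])
    tag, count = votes.most_common(1)[0]
    # Three-way disagreement: fall back to the BERTimbau prediction
    # (our assumption; the text only mentions "a simple voting strategy").
    return tag if count >= 2 else bert_tag
```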
The first observation from the results in Table 3 is that the three methods are significantly more accurate than the baselines. Also noticeable is the high accuracy achieved by the BERTimbau-based approach, which alone provides better accuracy than the Ensemble method. The second observation is the stability of the Ensemble method, which delivers a smaller standard deviation for both accuracy and F1. The accuracy achieved by the Ensemble method is slightly lower than that of the BERTimbau-based one, but since its variance is smaller, the Ensemble method is probably more reliable for real annotation settings, where the gold standard is not known in advance.
To illustrate the difficulties of the phenomenon, we analyzed the number of mistaken decisions (wrong PoS tags) made by each of the three methods, stated in the first three rows of Table 4. We also present, in the middle rows of this table, the mistakes shared by the methods, two by two and all together. The last rows state the total mistakes per method and for an Ensemble of the three methods.
Figure 3 shows a Venn diagram of the PoS tag errors made by the three methods for Fold 4 (the fold whose number of errors is nearest to the average over all folds, as highlighted in Table 4). For this fold, there were 319 PoS tag errors made by the Markov-based method alone, 187 by the UDPipe 2.0 method alone, and 9 by the BERTimbau-based method alone. Similarly, 50 PoS tag errors were shared by the Markov-based and UDPipe 2.0 methods, 18 by the Markov-based and BERTimbau-based methods, and 16 by the UDPipe 2.0 and BERTimbau-based methods. Finally, 29 errors were made by all three methods. This leads to 416 errors for the Markov-based method (319+50+18+29), 282 for UDPipe 2.0 (187+50+16+29), and 72 for the BERTimbau-based method (9+18+16+29). Since the Ensemble method always decides wrongly when all three methods are wrong (29 errors) or when any two methods are wrong (50+18+16 errors), the Ensemble method errs on 113 PoS tags (29+50+18+16).
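The arithmetic over the disjoint Venn regions can be verified with a short script (region counts taken from Figure 3 as reported above):

```python
# Disjoint error regions for Fold 4, as reported for Figure 3.
only = {"markov": 319, "udpipe": 187, "bert": 9}
pairs = {("markov", "udpipe"): 50, ("markov", "bert"): 18, ("udpipe", "bert"): 16}
all_three = 29

def total_errors(method):
    """Total errors of one method: its own region plus all shared regions."""
    shared = sum(n for pair, n in pairs.items() if method in pair)
    return only[method] + shared + all_three

# A majority-vote ensemble errs exactly where at least two methods err.
ensemble_errors = sum(pairs.values()) + all_three

assert total_errors("markov") == 416
assert total_errors("udpipe") == 282
assert total_errors("bert") == 72
assert ensemble_errors == 113
```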
Analyzing the Fold 4 testing set, we observe that it is composed of 842 sentences and a total of 16,938 tokens. Of these, 4,354 tokens are ambiguous words from closed classes, and only 72 remained incorrect after applying the BERTimbau-based method, i.e., the method produced 4,282 correct tokens (25.21% of the total tokens), which correspond to 98.35% of the ambiguous tokens.
To illustrate the errors of the methods, we use two words from the 29 shared mistakes in the testing set for Fold 4. The token “segundo” (“second” in English) at the end of the sentence “Flamengo e São Paulo viraram 12 e 11 pontos de distância, mas não estavam em segundo.” is tagged as ADJ, but the methods delivered a NOUN prediction, because they had difficulty handling the noun ellipsis: the full form would be “segundo lugar” (“second place” in English). The first token “a” (“her” in English) in the sentence “Ela recebeu uma mensagem certo dia chamando a a ir a Cariri...” is tagged as PRON. The three methods delivered an ADP prediction because the double token sequence “a a” is the usual tokenized form of crasis (“à”, “to the” in English), a very frequent form in Portuguese tagged as ADP DET, respectively. In the training set for Fold 4, crasis occurs 276 times, while other occurrences of a double “a” never appear; consequently, the three methods were unable to learn the tags PRON ADP for the sequence, assigning ADP DET instead. One can see that both cases are difficult ones, which somehow explains why the methods failed.
6 Conclusion
We conducted experiments with three proposed methods that proved competitive and delivered an accuracy gain over the baselines, reaching average accuracy values ranging from 90.76% to 97.84%. The F1 average values were also high, ranging from 94.89% to 99.11%. It is important to mention that, since we experimented using 10-fold cross-validation, our results are statistically sound, as the standard deviation of the results across folds was always low (see Note 6).
It is particularly noticeable that the BERTimbau-based approach brings a relevant correction power. Recalling that it correctly resolved 25.21% of the total tokens of the Fold 4 testing set, we observe that, for a large corpus such as Folha-Kaggle [23], with around 84 million tokens, we can automatically deliver more than 20 million correctly tagged tokens for ambiguous words of closed classes.
It is also noticeable that combining the three presented methods in an Ensemble has a small impact on the average accuracy with respect to the BERTimbau-based method (97.23% versus 97.84%), but the Ensemble solution offers more stable results, since the standard deviation over the 10 folds drops from 0.0034 for the BERTimbau-based solution to 0.0024 for the Ensemble.
As future work, we intend to perform a deeper analysis of the Ensemble approach, perhaps adding other methods casting votes alongside the three already developed. It is also possible to envision a similar technique to disambiguate other fields of the UD model, such as lemmas and morphological features, given the promising results achieved by our methods.
Notes
- 1. Similarly to English with the ending “-ly”, in Portuguese it is possible to turn adjectives into adverbs by adding “-mente” at the end. We disregard those derived adverbs, as only the primitive adverbs form a closed class.
- 2. The verbs “ser” and “estar” (“to be” in English) are always annotated as AUX, whether as true auxiliary verbs or as copulas. The verbs “ir”, “haver”, and “ter” (“to go”, “to exist”, and “to have” in English) are sometimes annotated as VERB and sometimes as AUX (as “going to” and “have” + past participle in English).
- 3. While adjectives are not a closed class, the adjectives that are ordinal numbers are considered to belong to a closed subset of the class ADJ.
- 4.
- 5. The values stated as average are the macro average of the values of each fold, but since the folds have about the same size, the micro and macro averages are practically the same (less than 0.01% difference).
- 6. For reproducibility purposes, all data (including fold splits) and implementation of all methods are available at https://sites.google.com/icmc.usp.br/poetisa/publications.
References
Afonso, S., Bick, E., Haber, R., Santos, D.: Floresta sintá(c)tica: A treebank for Portuguese. In: Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02). ELRA, Las Palmas, Canary Islands - Spain (May 2002), http://www.lrec-conf.org/proceedings/lrec2002/pdf/1.pdf
Assunção, J., Fernandes, P., Lopes, L.: Language independent pos-tagging using automatically generated markov chains. In: Proceedings of the 31st International Conference on Software Engineering & Knowledge Engineering, pp. 1–5. Lisbon, Portugal (2019). https://doi.org/10.18293/SEKE2019-097
De Souza, E., Freitas, C.: Polishing the gold: how much revision do we need in treebanks? In: Proceedings of the Universal Dependencies Brazilian Festival, pp. 1–11 (2022). https://aclanthology.org/2022.udfestbr-1.2.pdf
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding (2018). https://doi.org/10.48550/ARXIV.1810.04805, https://arxiv.org/abs/1810.04805
DiPietro, R., Hager, G.D.: Chapter 21 - deep learning: RNNs and LSTM. In: Zhou, S.K., Rueckert, D., Fichtinger, G. (eds.) Handbook of Medical Image Computing and Computer Assisted Intervention, pp. 503–519. The Elsevier and MICCAI Society Book Series, Academic Press (2020). https://doi.org/10.1016/B978-0-12-816176-0.00026-0
Duran, M., Oliveira, H., Scandarolli, C.: Que simples que nada: a anotação da palavra que em córpus de UD. In: Proceedings of the Universal Dependencies Brazilian Festival, pp. 1–11 (2022). https://aclanthology.org/2022.udfestbr-1.3
Ehsani, R., Alper, M.E., Eryiğit, G., Adali, E.: Disambiguating main POS tags for Turkish. In: Proceedings of the 24th Conference on Computational Linguistics and Speech Processing (ROCLING 2012), pp. 202–213. The Association for Computational Linguistics and Chinese Language Processing (ACLCLP), Chung-Li, Taiwan (2012). https://aclanthology.org/O12-1021
Gers, F.A., Schmidhuber, J.A., Cummins, F.A.: Learning to forget: continual prediction with LSTM. Neural Comput. 12(10), 2451–2471 (2000). https://doi.org/10.1162/089976600300015015
Hoang, M., Bihorac, O.A., Rouces, J.: Aspect-based sentiment analysis using BERT. In: Proceedings of the 22nd Nordic Conference on Computational Linguistics, pp. 187–196. Linköping University Electronic Press, Turku, Finland (2019). https://aclanthology.org/W19-6120
Hoya Quecedo, J.M., Maximilian, K., Yangarber, R.: Neural disambiguation of lemma and part of speech in morphologically rich languages. In: Proceedings of the 12th Language Resources and Evaluation Conference, pp. 3573–3582. European Language Resources Association, Marseille, France (2020). https://aclanthology.org/2020.lrec-1.439
Ide, N., Suderman, K.: Integrating linguistic resources: The American national corpus model. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06). ELRA, Genoa, Italy (2006). http://www.lrec-conf.org/proceedings/lrec2006/pdf/560_pdf.pdf
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Bengio, Y., LeCun, Y. (eds.) Proceedings of the 3rd International Conference on Learning Representations (2015). http://arxiv.org/abs/1412.6980
Kupiec, J.: Robust part-of-speech tagging using a hidden markov model. Comput. Speech Lang. 6(3), 225–242 (1992). https://www.sciencedirect.com/science/article/pii/088523089290019Z
Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning, pp. 282–289. ICML ’01, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2001). https://dl.acm.org/doi/10.5555/645530.655813
Liu, Y., Lapata, M.: Text summarization with pretrained encoders. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3730–3740. Association for Computational Linguistics, Hong Kong, China (2019). https://doi.org/10.18653/v1/D19-1387
Lopes, L., Duran, M., Fernandes, P., Pardo, T.: Portilexicon-ud: a Portuguese lexical resource according to universal dependencies model. In: Proceedings of the Language Resources and Evaluation Conference, pp. 6635–6643. European Language Resources Association, Marseille, France (2022). https://aclanthology.org/2022.lrec-1.715
Lopes, L., Duran, M.S., Pardo, T.A.S.: Universal dependencies-based pos tagging refinement through linguistic resources. In: Proceedings of the 10th Brazilian Conference on Intelligent System. BRACIS’21 (2021). https://link.springer.com/chapter/10.1007/978-3-030-91699-2_41
de Marneffe, M.C., Manning, C.D., Nivre, J., Zeman, D.: Universal Dependencies. Comput. Linguist. 47(2), 255–308 (2021). https://doi.org/10.1162/coli_a_00402, https://aclanthology.org/2021.cl-2.11
Muñoz-Valero, D., Rodriguez-Benitez, L., Jimenez-Linares, L., Moreno-Garcia, J.: Using recurrent neural networks for part-of-speech tagging and subject and predicate classification in a sentence. Int. J. Comput. Intell. Syst. 13, 706–716 (2020). https://doi.org/10.2991/ijcis.d.200527.005
Nivre, J., et al.: Universal Dependencies v1: A multilingual treebank collection. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pp. 1659–1666. ELRA, Portorož, Slovenia (2016). https://aclanthology.org/L16-1262
Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc. (2019). http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
Rademaker, A., Chalub, F., Real, L., Freitas, C., Bick, E., De Paiva, V.: Universal dependencies for Portuguese. In: Proceedings of the Fourth International Conference on Dependency Linguistics (Depling), pp. 197–206 (2017)
Santana, M.: Kaggle - news of the brazilian newspaper. https://www.kaggle.com/marlesson/news-of-the-site-folhauol, accessed: 2021-06-14
Shen, Q., Clothiaux, D., Tagtow, E., Littell, P., Dyer, C.: The role of context in neural morphological disambiguation. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 181–191. Osaka, Japan (2016). https://aclanthology.org/C16-1018
Silva, E., Pardo, T., Roman, N., Fellipo, A.: Universal dependencies for tweets in Brazilian Portuguese: tokenization and part of speech tagging. In: Anais do XVIII Encontro Nacional de Inteligência Artificial e Computacional, pp. 434–445. SBC, Porto Alegre, RS, Brasil (2021). https://doi.org/10.5753/eniac.2021.18273, https://sol.sbc.org.br/index.php/eniac/article/view/18273
Souza, F., Nogueira, R., Lotufo, R.: BERTimbau: pretrained BERT models for Brazilian Portuguese. In: 9th Brazilian Conference on Intelligent Systems, BRACIS, Rio Grande do Sul, Brazil, October 20–23 (2020), https://link.springer.com/chapter/10.1007/978-3-030-61377-8_28
Straka, M.: UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 197–207 (2018). https://aclanthology.org/K18-2020
Straka, M., Straková, J.: Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 88–99. Association for Computational Linguistics, Vancouver, Canada (2017), https://aclanthology.org/K17-3009
Universal Dependencies: UD Portuguese Bosque - UD version 2. https://universaldependencies.org/treebanks/pt_bosque/index.html. Accessed 14 Jun 2021
Vandenbussche, P.Y., Scerri, T., Jr., R.D.: Word sense disambiguation with transformer models. In: Proceedings of the 6th Workshop on Semantic Deep Learning (SemDeep-6), pp. 7–12. Association for Computational Linguistics, Online (2021) https://aclanthology.org/2021.semdeep-1.2
Wolf, T., et al.: Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45. Association for Computational Linguistics, Online (2020). https://www.aclweb.org/anthology/2020.emnlp-demos.6
Zalmout, N., Habash, N.: Don’t throw those morphological analyzers away just yet: Neural morphological disambiguation for Arabic. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 704–713. Association for Computational Linguistics, Copenhagen, Denmark (2017). https://aclanthology.org/D17-1073
Acknowledgements
This work was carried out at the Center for Artificial Intelligence (C4AI-USP), with support by the São Paulo Research Foundation (FAPESP grant number 2019/07665-4) and by the IBM Corporation. The project was also supported by the Ministry of Science, Technology and Innovation, with resources of Law N. 8.248, of October 23, 1991, within the scope of PPI-SOFTEX, coordinated by Softex and published as Residence in TIC 13, DOU 01245.010222/2022-44.
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Lopes, L., Fernandes, P., Inacio, M.L., Duran, M.S., Pardo, T.A.S. (2023). Disambiguation of Universal Dependencies Part-of-Speech Tags of Closed Class Words in Portuguese. In: Naldi, M.C., Bianchi, R.A.C. (eds) Intelligent Systems. BRACIS 2023. Lecture Notes in Computer Science, vol 14197. Springer, Cham. https://doi.org/10.1007/978-3-031-45392-2_16