Abstract
This paper presents a technique that employs linguistic resources to refine PoS tagging under the Universal Dependencies (UD) model. The technique is based on the development and use of lists of non-ambiguous single tokens and non-ambiguous co-occurring tokens in Portuguese (regardless of whether they constitute multiword expressions or not). These lists are used to automatically correct the tags for such tokens after tagging. The technique is applied to the output of two well-known state-of-the-art systems, UDPipe and UDify, and the results on a real data set show a significant improvement in annotation accuracy. Overall, we improve tagging accuracy by up to 1.4%, and, in terms of the number of fully correctly tagged sentences, our technique improves on the corresponding original system by up to 13.9 percentage points.
1 Introduction
The importance of the Part of Speech (PoS) tagging task is beyond discussion, as other Natural Language Processing (NLP) tasks often rely on the outcome of PoS tagging. Furthermore, current NLP applications are designed to process large amounts of text, making the use of taggers a necessity. For well-resourced languages, PoS tagging is usually performed with consistently high accuracy [12]. The reason for such high accuracy lies mostly in the existence of large annotated corpora and powerful techniques, usually based on neural network implementations [5].
For poorly resourced languages, or even specific dialects, PoS tagging may become challenging because of the lack of an appropriate training model [8]. This can be mitigated by combining reduced or slightly inadequate training models with very powerful taggers/parsers based on performant solutions such as word embeddings and bidirectional RNNs [4]. One example is UDPipe, a trainable pipeline that performs sentence segmentation, tokenization, lemmatization, PoS tagging, and dependency parsing [11]. Other similarly performant solutions are based on cutting-edge technology such as BERT [9], which is implemented in UDify, a multilingual multi-task model capable of accurately predicting universal part-of-speech tags, morphological features, lemmas, and dependency trees simultaneously for several Universal Dependencies (UD) treebanks [3].
Such approaches are very effective, basically because they are flexible enough to detect the complex characteristics necessary to classify the words and grammatical constructions present in the training set. Unfortunately, these highly performant PoS taggers still make basic mistakes and misclassify simple words (such as those belonging to closed classes). This leads to errors in PoS annotation that could easily be tackled by solutions relying on linguistic resources that are not explicitly covered by current machine learning models [13].
One thing the UDPipe and UDify taggers/parsers have in common is the use of the UD formalism for encoding the output representation [6]. UD is a de facto standard for cross-linguistically comparable morphological and syntactic annotation that keeps evolving and provides a steadily growing and heavily multilingual collection of corpora [1]. UD has reshaped tagging and parsing initiatives worldwide, causing a significant movement from constituency to dependency analysis in NLP.
For Portuguese texts, the PoS tagging task using tools such as UDPipe and UDify presents some challenges because of the low availability of appropriate training sets with the right language style and a reasonable size. One of the few options is the Bosque-UD corpus, a Portuguese treebank based on the Constraint Grammar converted version of Bosque, which is part of the Floresta Sintá(c)tica treebank [7] and contains both European and Brazilian variants [14]. The Bosque-UD corpus has 9,364 sentences, totaling 210,957 tokens.
In order to contribute to UD-based PoS tagging of Brazilian Portuguese texts, we propose a linguistics-based technique to improve tagging accuracy, which we specifically test on the output of the UDPipe and UDify systems, both trained on Bosque-UD. The proposed technique relies on building and employing linguistic resources to automatically correct the PoS tags produced by these systems. Our research claim is that explicit linguistic knowledge does help improve the results of modern machine learning-based models. To support this claim, we describe the process of building our linguistic resources for Brazilian Portuguese and conduct experiments illustrating the resulting quality improvement.
Specifically, we propose the automatic correction of tags wrongfully assigned to non-ambiguous single tokens and non-ambiguous co-occurring tokens (regardless of whether they constitute MWEs - multiword expressions - or not) from closed PoS classes in Portuguese. Our linguistic resources cover three types of phenomena: single functional words (e.g., the pronoun você, “you” in English); MWEs acting as functional words (e.g., the prepositional phrase por meio de, “through”); and co-occurring functional words that are ambiguous in isolation but not together (for example, a may be a determiner (DET), a preposition (ADP), or a pronoun (PRON), but in the co-occurrence à de it may be unambiguously tagged as ADP PRON ADP). It is important to call the reader's attention to the fact that, being functional words, they are naturally non-ambiguous, independently of the domain.
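As an illustration of the correction idea for single tokens, the operation amounts to a dictionary lookup that overwrites the tagger's output. The following is a minimal sketch under assumed data structures (the token list shown is a tiny illustrative subset, not the full resource, and the tagger output is invented):

```python
# Sketch of the single-token correction: overwrite the tagger's UPOS tag
# of any token found in the non-ambiguous list. The dictionary below is a
# tiny illustrative subset of the real 269-entry resource.
NON_AMBIGUOUS_SINGLE = {
    "você": "PRON",    # "you"
    "adentro": "ADV",  # "inwards"
}

def correct_single_tokens(tagged_sentence):
    """tagged_sentence: list of (token, upos) pairs produced by the tagger."""
    return [
        (tok, NON_AMBIGUOUS_SINGLE.get(tok.lower(), upos))
        for tok, upos in tagged_sentence
    ]

# If the tagger mislabeled "Você" as NOUN, the lookup restores PRON;
# tokens outside the list keep the tagger's original tag.
corrected = correct_single_tokens([("Você", "NOUN"), ("chegou", "VERB")])
```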
We show that the gains achieved by our proposed technique are significant and that this strategy is worth following. Overall, we improve tagging accuracy by up to 1.4%, and, in terms of the number of fully correctly tagged sentences, our technique increases accuracy by 13.9 percentage points over the corresponding original system. Considering that PoS tagging is the basis for other text analysis levels, such an improvement may be really impactful for producing better NLP tools and applications.
This paper is organized as follows: the next section describes the process and results of producing our linguistic resources, composed of lists of non-ambiguous single tokens and non-ambiguous co-occurring tokens that belong to closed PoS classes and are highly frequent in written Portuguese; the third section presents the evaluation of our proposed correction technique on a data set of 2,000 sentences (43,483 tokens) from the Folha-Kaggle corpus [10] annotated by the UDPipe [11] and UDify [3] systems; finally, the last section presents our concluding remarks and future work.
2 Linguistic Analysis of Non-ambiguous Tokens in Portuguese
In this section we analyze the possibility of establishing a set of tokens that can only be annotated with one single possible PoS tag. Usually, very performant PoS taggers based on a training set know no such boundaries, as they try to capture very subtle interdependencies among the tokens and their representations. Linguistic knowledge of the target language, however, can provide such boundaries by establishing the specific tokens or expressions that can unambiguously receive only one single PoS tag.
Knowledge of these non-ambiguous tokens can be a powerful and fully automatic tool to improve the quality of a very performant PoS tagger based on a training set. In this work, we adopt two linguistic groups to improve PoS tagging:
-
Single tokens belonging to a closed PoS class (in UD): ADP (adposition), ADV (adverb; see Footnote 1), CCONJ (coordinating conjunction), DET (determiner), NUM (numeral; cardinals only), PRON (pronoun), SCONJ (subordinating conjunction), a subset of the class AUX (the verbs ser and estar), and ADJ (ordinal numbers);
-
Common co-occurring tokens that always have the same PoS tags, which include: MWEs acting as functional words; and co-occurring functional words that are ambiguous in isolation but become non-ambiguous together.
2.1 Non-ambiguous Single Tokens for Closed PoS Classes
To establish this group, we started by listing all tokens tagged with each of the targeted PoS classes (ADP, ADV, CCONJ, DET, NUM, PRON, and SCONJ) in the Bosque-UD corpus [14], and with the equivalent tags in the MacMorpho corpus [2]. The lists for each PoS class were analyzed and inappropriate tokens were removed through a manual linguistic analysis of each token individually. Similarly, some tokens were added to the lists despite not being found in the corpora mentioned above. This was the case for ordinal and cardinal numerals, additional forms of the auxiliary verbs ser and estar, and rare functional words, such as the adverb adentro (“inwards”).
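The initial listing step described above can be sketched as a pass over a CoNLL-U file, collecting every surface form observed with each target tag. This is an illustrative reconstruction, not the authors' actual tooling; the CoNLL-U column layout (FORM in column 2, UPOS in column 4) follows the UD guidelines:

```python
# Sketch of the list-extraction step: collect every surface form observed
# with each target UPOS tag in a CoNLL-U file.
from collections import defaultdict

TARGET_CLASSES = {"ADP", "ADV", "CCONJ", "DET", "NUM", "PRON", "SCONJ"}

def tokens_per_class(conllu_lines):
    lists = defaultdict(set)
    for line in conllu_lines:
        if not line or line.startswith("#"):
            continue  # skip comments and blank sentence separators
        cols = line.split("\t")
        if len(cols) < 4 or not cols[0].isdigit():
            continue  # skip multiword-token ranges (e.g. "1-2") and empty nodes
        form, upos = cols[1].lower(), cols[3]
        if upos in TARGET_CLASSES:
            lists[upos].add(form)
    return dict(lists)
```

The resulting per-class sets would then feed the manual linguistic analysis described in the text.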
Table 1 presents the overall number of listed tokens and the final number of tokens retained for each class. In this table, only the tokens for ordinal numerals (annotated as ADJ in UD) are absent, since they were not searched for in Bosque-UD and MacMorpho, as they are not distinguishable from other open-class adjectives (all these tokens were added by the linguistic analysis alone).
The main reason for a token being considered inappropriate was misannotation due to differing annotation principles. An example of such a difference can be found in the MacMorpho corpus, where several MWEs acting as functional words were tagged as a whole and not by the words that compose them. This is the case of the expression por fim, which is tagged in MacMorpho by assigning the tag ADV to por and another ADV to fim, since the expression plays the role of an adverb (it can be translated into English as “finally”). However, according to the UD guidelines, each token has to be tagged individually, without regard to MWEs. Following this principle, por is clearly tagged as an adposition (PREP in the MacMorpho tagset, ADP in the UD tagset). Therefore, the tokens por and fim were removed from the ADV list despite appearing as such in the MacMorpho corpus. This is exemplified in the sentence of Fig. 1.
The next step was to cross-analyze the lists among themselves. For example, the token se can be found in both the PRON and SCONJ lists and was therefore considered ambiguous with regard to its PoS tag. In English, se can be translated as “yourself” (PRON) or as “if” (SCONJ). Figure 2 exemplifies two sentences where the token se is employed either way.
After crossing the tokens of the closed PoS classes, each token was also individually analyzed in relation to its other possible open classes. For example, the token entre can be an ADP, but also a conjugation of the verb entrar (therefore a VERB). In English, entre can be translated as “between” (ADP) or “enter” (VERB). Figure 3 presents two sentences where the token entre is employed both ways.
After all these analyses, it was possible to identify all single tokens of the target closed PoS classes, as well as which of them are ambiguous. Tables 2 and 3 indicate the non-ambiguous tokens in bold face (269 tokens) and the ambiguous ones in italics.
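The cross-analysis step can be sketched as a set intersection: any token that appears in more than one class list is flagged as ambiguous and excluded. The lists below are tiny illustrative subsets, not the real resource:

```python
# Sketch of the cross-analysis: a token appearing in more than one
# closed-class list is ambiguous and is excluded from the correction list.
from collections import Counter

def split_ambiguous(pos_lists):
    """pos_lists: dict mapping a PoS class to its set of candidate tokens."""
    counts = Counter(tok for toks in pos_lists.values() for tok in toks)
    ambiguous = {tok for tok, n in counts.items() if n > 1}
    non_ambiguous = {
        tok: pos
        for pos, toks in pos_lists.items()
        for tok in toks
        if tok not in ambiguous
    }
    return non_ambiguous, ambiguous

# "se" occurs in both the PRON and SCONJ lists, so it is flagged ambiguous.
pos_lists = {"PRON": {"você", "se"}, "SCONJ": {"se", "caso"}, "ADP": {"com"}}
non_ambiguous, ambiguous = split_ambiguous(pos_lists)
```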
2.2 Non-ambiguous Co-occurring Tokens
The process of obtaining the group of co-occurring tokens considered by our technique was somewhat similar to the one for single tokens of closed PoS classes. We started by searching the available corpora for occurrences of a list of functional expressions commonly employed in Portuguese, such as MWEs acting as functional words, e.g., the prepositional phrase de acordo com (“according to”). This list was suggested by a linguist dedicated to annotating a corpus following the UD guidelines, and it includes highly frequent functional expressions such as desde que (“since”), annotated as ADP SCONJ, and mais ou menos (“more or less”), annotated as ADV CCONJ ADV.
The linguistic analysis filtered out those expressions that appeared with different sets of PoS tags in the corpora. In other words, expressions that need context to be disambiguated were disregarded. The goal was to determine, once more, whether a divergence was an ambiguity due to context or just a misannotation due to differing annotation principles. Table 4 shows some examples of non-ambiguous expressions and co-occurring tokens with their respective tags and numbers of occurrences in the Bosque-UD and MacMorpho corpora.
Observing the expressions in Table 4, there are simple cases such as o bastante (“enough”), where all annotations converged to the same tags (DET ADJ), as MacMorpho uses ART for article, which always maps to DET in UD. Other examples, however, required the analysis of differing annotation principles, for example, de longe (“from afar”), which is annotated in MacMorpho as an expression playing the role of an adverb; but, following the UD principle of tagging each token individually, de (“from” in this context) is undoubtedly an adposition (ADP), as agreed by the Bosque-UD annotation. Furthermore, our decision to adopt the tag ADV for the word longe (“afar”) relies on the fact that it is a non-ambiguous place adverb in Portuguese. Similar linguistically grounded decisions were made for each expression individually.
Some expressions, on the contrary, were considered ambiguous because they correctly appear with different PoS tags. Figure 4 illustrates two sentences with different tags associated with the co-occurring tokens a mais, which can be tagged as ADP ADV or DET ADV according to the context. In English, a mais can be translated as “more” (or “extra”) for ADP ADV or as “the more” (or “the most”) for DET ADV. The co-occurring tokens considered ambiguous were left out of our technique.
This ambiguity analysis was carried out for each candidate individually and, even though we are aware that the resulting list is non-exhaustive, the 110 non-ambiguous expressions and co-occurring tokens with their respective PoS tags are representative of usual and highly frequent lexical phenomena in Portuguese, given the observed corpora. The full list of expressions and co-occurring tokens used in our technique is presented in Table 5, where each expression is followed by its non-ambiguous associated PoS tags.
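A correction pass for these co-occurring tokens can be sketched as an n-gram scan over the tagged sentence, overwriting the whole tag sequence on a match. This is a minimal sketch under assumed data structures; the two entries shown (desde que as ADP SCONJ, mais ou menos as ADV CCONJ ADV) come from the examples above, and the tagger output in the usage example is invented:

```python
# Sketch of the co-occurring-token correction: scan the tagged sentence for
# known token sequences and overwrite the whole tag sequence at once.
# Tiny illustrative subset of the real 110-entry resource.
EXPRESSIONS = {
    ("desde", "que"): ("ADP", "SCONJ"),
    ("mais", "ou", "menos"): ("ADV", "CCONJ", "ADV"),
}

def correct_cooccurring(tagged_sentence):
    """tagged_sentence: list of (token, upos) pairs produced by the tagger."""
    tokens = [tok.lower() for tok, _ in tagged_sentence]
    tags = [upos for _, upos in tagged_sentence]
    i = 0
    while i < len(tokens):
        for expr, expr_tags in EXPRESSIONS.items():
            if tuple(tokens[i:i + len(expr)]) == expr:
                tags[i:i + len(expr)] = list(expr_tags)
                i += len(expr) - 1  # skip past the matched expression
                break
        i += 1
    return [(tok, tag) for (tok, _), tag in zip(tagged_sentence, tags)]

# Suppose the tagger labeled "desde que" as SCONJ PRON; the scan fixes both.
sent = [("Trabalha", "VERB"), ("desde", "SCONJ"), ("que", "PRON")]
corrected = correct_cooccurring(sent)
```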
3 Evaluation
The Test Data Set. Once the sets of non-ambiguous single and co-occurring tokens of closed PoS classes were defined, our proposed technique was tested on a data set composed of 2,000 sentences randomly picked from the Folha-Kaggle corpus [10]. The limit of 2,000 sentences is due to the laborious manual annotation we carried out on these sentences to serve as a gold standard for our experiments.
The Folha-Kaggle corpus is composed of news articles published in the electronic version of the Brazilian newspaper Folha de São Paulo between January 2015 and September 2017. The corpus holds 167,053 news articles, with an approximate average of 22 sentences per article and 23 tokens per sentence, thus more than 3.6 million sentences and about 84 million tokens.
The major concern when choosing the sentences for our test data set was to have sentences with at least one of the chosen non-ambiguous single or co-occurring tokens. Additionally, we restricted our choices to sentences from 7 to 40 tokens long.
The full random process consisted of passing sequentially through the sentences of the Folha-Kaggle corpus, performing the following steps until 2,000 sentences were obtained:
-
Testing whether the sentence had from 7 to 40 tokens;
-
Testing whether the sentence had at least one token or one expression among the non-ambiguous ones (Tables 2, 3, or 5);
-
Keeping the sentence in the test data set with a randomized 5% chance (so that the selected sentences come from a wide variety of texts).
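The sampling steps above can be sketched as follows. This is an illustrative reconstruction, not the authors' code, and the expression check is simplified here to single tokens only:

```python
# Sketch of the sampling procedure: pass sequentially through the corpus and
# keep each eligible sentence with a 5% chance, until the target is reached.
import random

def sample_sentences(corpus, target_tokens, target=2000, seed=42):
    """corpus: iterable of sentences, each a list of token strings."""
    rng = random.Random(seed)
    selected = []
    for sentence in corpus:
        if len(selected) >= target:
            break
        if not 7 <= len(sentence) <= 40:
            continue  # step 1: length filter
        if not any(tok.lower() in target_tokens for tok in sentence):
            continue  # step 2: must contain a non-ambiguous token
        if rng.random() < 0.05:
            selected.append(sentence)  # step 3: 5% random keep
    return selected
```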
The resulting data set is composed of 2,000 sentences totaling 43,483 tokens, therefore an average of 22 tokens per sentence. The total number of non-ambiguous single tokens of closed PoS classes (Tables 2 and 3) within this data set is 8,544, and the total number of non-ambiguous co-occurring tokens (Table 5) is 273.
The Automatic Annotation. The test data set was processed by the UDPipe [11] and UDify [3] systems, both trained on Bosque-UD [14]. The goal is to observe how each system annotated each token of the test data set and how our proposed technique could improve the tagging accuracy. It is worth noting that UDPipe and UDify are considered to be among the state-of-the-art systems for UD annotation.
The Manually Produced Gold Standard. A team of linguists annotated the test data set to provide a gold standard for evaluating the automatic annotation accuracy of each system. The annotation was conducted independently by 10 highly trained linguists, and the overall result was adjudicated by a chief linguist to ensure a correct and homogeneous annotation.
3.1 Evaluation of the Non-ambiguous Single Tokens from Closed PoS Classes Corrections
Our goal here is to improve tagging accuracy through an automatic correction of all non-ambiguous single tokens from closed PoS classes, identifying these tokens and replacing their PoS annotations (hence, this process is denoted ACS, for Automatic Correction of Single tokens). The test data set has 8,544 non-ambiguous tokens from closed PoS classes, fairly distributed among all 2,000 sentences, since every sentence has at least one such token. Comparing the systems' outputs with the gold standard, we observe a good overall accuracy for these tokens, as stated in Table 6.
UDPipe's annotation was wrong for 5.6% of these tokens, while UDify's was wrong for 2.8%. However, considering that these tokens belong to sentences, and a wrongfully annotated token may jeopardize the whole sentence annotation because of a possible ripple effect, it is concerning that the UDPipe and UDify outputs contain such errors in 22.2% and 11.0% of the sentences, respectively.
By running our automatic correction (ACS), we successfully corrected all wrong annotations mentioned in Table 6 for both systems' outputs, as all wrongfully annotated tokens (482 by UDPipe, 239 by UDify) were corrected in accordance with the gold standard. This represented an improvement of 22.2% and 11.0%, respectively, in the number of sentences correctly annotated by UDPipe and UDify with respect to the non-ambiguous single tokens of closed PoS classes.
3.2 Evaluation of the Non-ambiguous Co-occurring Tokens Corrections
Similarly to the process for the single tokens of closed PoS classes, we conducted an automatic correction of the chosen non-ambiguous functional expressions and co-occurring functional words (Table 5) (hence, this process is denoted ACC, for Automatic Correction of Co-occurring tokens). The 2,000-sentence test data set holds 273 occurrences of the chosen non-ambiguous co-occurring tokens, distributed over 263 sentences. Both UDPipe and UDify annotated several expressions incorrectly, as shown in Table 7.
The results in Table 7 show, at scale, the same need for correction as observed for single tokens. An impressive number of wrongfully annotated co-occurring tokens (39.2% for UDPipe and 32.6% for UDify) results in annotation errors in 39.9% and 33.8% of the sentences, respectively.
After applying the automatic correction (ACC) to both systems' outputs, we verified that all 107 and 89 wrongfully annotated expressions, respectively, were correctly reannotated by our technique, eliminating all system errors for the chosen non-ambiguous co-occurring tokens.
It is important to point out that the ACC corrections carried out by our technique had a low overlap with the ACS ones. For UDPipe, only 7 of the 107 co-occurring token corrections overlapped with the corrections carried out for single tokens: the token de in the expression a ponto de (twice) and the token desde in the expression desde que (five times). For UDify, only 2 of the 89 co-occurring token corrections overlapped, both for the token desde in the expression desde que.
3.3 Evaluation of All Non-ambiguous Corrections
As seen in Sects. 3.1 and 3.2, both automatic corrections brought improvements for the targeted tokens and expressions. Our experiments also indicated an overall improvement in the accuracy of the systems' annotations when compared to the gold standard. Table 8 presents the accuracy obtained starting from the original system annotation, followed by the application of our automatic correction of non-ambiguous single tokens (ACS) and of non-ambiguous co-occurring tokens (ACC). In this table, however, we report the numbers not only for the tokens belonging to the target set (single and co-occurring tokens), but for all tokens of the 2,000-sentence test data set, providing an overall picture of the situation.
Observing the results in Table 8, it is clear that both automatic corrections produced consistent improvements by correcting wrongfully annotated tokens (1.4% and 0.8% for UDPipe and UDify, respectively). The gains in terms of correctly annotated sentences are even more noticeable: the initial accuracy for fully correctly tagged sentences was 44.3% for UDPipe and 40.0% for UDify, and the combined application of ACS and ACC brought an increase of 13.9 percentage points for UDPipe (from 44.3% to 58.2% of the sentences) and 5.4 points for UDify (from 40.0% to 45.4%). Such values are very relevant, as tagging results directly influence the other automatic analyses that parsers and NLP systems perform. Therefore, an error reduction at this scale is significant.
Another interesting observation from Table 8 is that the ACS and ACC corrections bring rather complementary gains. For instance, observing the reduction of wrongfully annotated tokens for UDPipe, applying ACC to the UDPipe output corrected the tags of 118 tokens (1,935 minus 1,817), applying ACS corrected 482 tokens (1,935 minus 1,453), and the combined application of both corrected 583 tokens (1,935 minus 1,352). This shows that the intersection of the tags corrected by ACC and ACS was just 17 tokens (the difference between 118 + 482 and 583). A similar relation was found for the UDify output, and also for the gains in terms of sentences. These numbers reinforce the small overlap between the ACS and ACC gains mentioned in Sect. 3.2.
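The complementarity arithmetic for UDPipe can be verified directly from these figures by inclusion-exclusion:

```python
# Figures reported for UDPipe: wrongly tagged tokens remaining after each
# correction scheme, out of 1,935 original errors.
errors_original = 1935
errors_after_acc = 1817
errors_after_acs = 1453
errors_after_both = 1352

corrected_by_acc = errors_original - errors_after_acc    # 118 tokens
corrected_by_acs = errors_original - errors_after_acs    # 482 tokens
corrected_by_both = errors_original - errors_after_both  # 583 tokens

# Inclusion-exclusion: tokens corrected by both ACS and ACC.
overlap = corrected_by_acc + corrected_by_acs - corrected_by_both  # 17
```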
4 Concluding Remarks
We have proposed a technique for Universal Dependencies-based PoS tagging refinement through linguistic resources. Our goal was to show that such refinement could be accomplished by improving the annotation accuracy of two well-known systems (UDPipe and UDify) through tag corrections based on non-ambiguous single and co-occurring tokens. Our proposed automatic corrections led to a tagging accuracy improvement of up to 1.4%. Moreover, in terms of the number of fully correctly annotated sentences, the improvement was up to 13.9 percentage points.
We performed a verification over the entire Folha-Kaggle corpus, with its 3.6 million sentences and 84 million tokens, and found nearly 28 million tokens that would be detected by our proposed approach. Given the results of our experiments, which indicate that about 2.5 million tokens would be wrongfully tagged by UDPipe and corrected by our approach, this represents the correction of about 20% of the sentences of the whole Folha-Kaggle corpus.
Another important contribution of our work is the construction and availability of the linguistic resources: the 269 non-ambiguous single-token words of closed PoS classes and the 110 non-ambiguous functional expressions and co-occurring functional words for Portuguese. These resources may help several linguistic efforts in Portuguese, which is still an under-resourced language. For the interested reader, all the resources and the gold standard data are available at the POeTiSA project webpage (see Footnote 2).
Finally, tackling PoS tagging under the UD formalism is also an important contribution to the computational processing of Brazilian Portuguese, as the UD model has become the worldwide standard for tagging and parsing. Nowadays, the UD project covers over 100 languages, including Portuguese. This paper helps foster UD research for the language.
Future work includes refining the current lists, in particular enlarging the lists of non-ambiguous tokens and functional expressions, as well as looking for possible disambiguation rules for some tokens of closed PoS classes and including nominal and verbal MWEs.
Notes
- 1.
The ADV class is not exactly a closed one as, similarly to English with the ending -ly, in Portuguese it is possible to turn adjectives into adverbs by adding -mente at the end (for example, the adjective final can be turned into the adverb finalmente, “finally”); for the purpose of our technique, however, we ignore such adverbs.
- 2.
References
Droganova, K., Zeman, D.: Towards deep universal dependencies. In: Proceedings of the 5th International Conference on Dependency Linguistics (Depling, SyntaxFest), pp. 144–152 (2019)
Fonseca, E.R., Rosa, J.L.G., Aluísio, S.M.: Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese. J. Braz. Comput. Soc. 21(1), 1–14 (2015)
Kondratyuk, D., Straka, M.: 75 languages, 1 model: parsing universal dependencies universally. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2779–2795 (2019)
Ling, W., et al.: Finding function in form: compositional character models for open vocabulary word representation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1520–1530 (2015)
Nivre, J., Fang, C.T.: Universal dependency evaluation. In: Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW), pp. 86–95 (2017)
Nivre, J., et al.: Universal Dependencies v1: a multilingual treebank collection. In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC), pp. 1659–1666 (2016)
Rademaker, A., Chalub, F., Real, L., Freitas, C., Bick, E., De Paiva, V.: Universal dependencies for Portuguese. In: Proceedings of the 4th International Conference on Dependency Linguistics (Depling), pp. 197–206 (2017)
Rehbein, I., Hirschmann, H.: POS tagset refinement for linguistic analysis and the impact on statistical parsing. In: Henrich, V., Hinrichs, E., de Kok, D., Osenova, P., Przepiorkowski, A. (eds.) Proceedings of the 13th International Workshop on Treebanks and Linguistic Theories (TLT13), pp. 172–183. University of Tübingen (2018)
Rogers, A., Kovaleva, O., Rumshisky, A.: A primer in BERTology: what we know about how BERT works. Trans. Assoc. Computat. Linguist. 8, 842–866 (2020)
Santana, M.: Kaggle - news of the Brazilian newspaper. https://www.kaggle.com/marlesson/news-of-the-site-folhauol. Accessed 14 June 2021
Straka, M.: UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 197–207 (2018)
Straka, M., Straková, J.: Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 88–99 (2017)
Sulubacak, U.: Implementing universal dependency, morphology, and multiword expression annotation standards for Turkish language processing. Turk. J. Electr. Eng. Comput. Sci. 26(3), 1662–1672 (2018)
Universal Dependencies: UD Portuguese Bosque - UD version 2. https://universaldependencies.org/treebanks/pt_bosque/index.html. Accessed 14 June 2021
Acknowledgments
This work was carried out at the Center for Artificial Intelligence (C4AI-USP), with support of the São Paulo Research Foundation (FAPESP grant #2019/07665-4) and the IBM Corporation.
© 2021 Springer Nature Switzerland AG
Lopes, L., Duran, M.S., Pardo, T.A.S. (2021). Universal Dependencies-Based PoS Tagging Refinement Through Linguistic Resources. In: Britto, A., Valdivia Delgado, K. (eds) Intelligent Systems. BRACIS 2021. Lecture Notes in Computer Science(), vol 13074. Springer, Cham. https://doi.org/10.1007/978-3-030-91699-2_41