Abstract
In recent years, there has been considerable growth in the volume of legal proceedings in Brazil. In this context, there is a lot of potential in using recent advances in Natural Language Processing to automate tasks and analysis in the legal domain. In this article, we investigate text decoding methods for automating the writing of keyphrases, a sequence of key terms present in documents used in courts throughout Brazil. For this purpose, a text-to-text framework based on generative Transformers is used to generate keyphrases and evaluate three decoding techniques: greedy, top-K, and top-p. Since the keyphrases are designed to improve retrieval tasks, we evaluated keyphrases generated by the decoding methods in legal document retrieval. Traditional retrieval methods (TF-IDF and BM25) were used to evaluate the quality of the generated keyphrases. The results obtained (in terms of IR metrics) were statistically significant, and they indicate that greedy decoding generates high-quality keyphrases for the dockets used in this work, providing keyphrases close to the ones generated by human specialists.
1 Introduction
Based on data from the National Council of Justice (Conselho Nacional de Justiça, CNJ) [11], there were 77.3 million cases pending in the Brazilian judiciary at the end of 2021, a 10.4% increase over the previous year. The human effort required to both write and analyze these cases contributes to the slowness of the Brazilian legal system. In this context, dockets are a special type of document that summarises a legal case. They are used in courts all around Brazil and are designed to provide a summarised representation of judicial decisions. Figure 1 presents an example of a docket.
The dockets usually follow a pre-defined structure composed of two components: keyphrases and enumerated paragraphs. The keyphrases form a header at the beginning of the docket, composed of sequences of capitalised key terms that highlight the main subjects of the document. This header is created to improve the search and retrieval of jurisprudences (precedents) [7]. The enumerated paragraphs discuss the themes (or topics) present in the document.
By analysing the form and linguistic style of the keyphrases, it is possible to note similarities between writing keyphrases and two Natural Language Processing (NLP) tasks: summarization and key term extraction. However, keyphrases are not written in the fluid, natural manner of summaries. In addition, most of the terms in their text are absent from the remainder of the docket that originated the keyphrase, which makes it difficult to treat their writing as an extractive task.
Given the predictable structure and availability of dockets, it is possible to prepare input-output pairs and generate keyphrases from the enumerated paragraphs with a supervised approach. Transformers such as GPT [19] have already proven effective in various text-to-text generative NLP tasks [6] (such as translation, question answering and summarization). Moreover, the availability of pre-trained language models [3, 25, 33] creates many opportunities to automate NLP tasks.
Thus, in this work, we investigate the use of state-of-the-art generative Transformers to automate the writing of keyphrases. Specifically, we study text decoding methods to generate keyphrases that aid retrieval in the legal domain. This study is unprecedented in Brazil and can be used to automate keyphrase generation in courts around the country. In summary, the main contributions of this work are:
1. Investigation of a novel approach to generate keyphrases from Brazilian dockets, using a sequence-to-sequence Transformer;
2. Comparison of three different text decoding methods for the proposed task (greedy and sampling methods);
3. Quantitative and qualitative analysis of the generated keyphrases.
This paper is organised as follows. Section 2 presents related works. Section 3 discusses the methodology applied for keyphrase generation. Next, Sect. 4 presents and discusses the obtained results. Finally, Sect. 5 presents conclusions and future works.
2 Related Works
In this section, we present studies related to the main objectives of this proposal. The Transformer [29] is a deep neural network architecture that achieved state-of-the-art results in several NLP tasks. It is an encoder-decoder architecture originally used for translation; however, the context-aware representations generated by the model can be used for a large variety of tasks.
Following the success of the Generative Pretrained Transformer (GPT) models [19, 25, 33], decoder-only models predominate in NLP tasks that can be approached as text generation (such as question answering and summarization) [6]. In addition, recent studies showed the great potential of using such models in zero- and few-shot scenarios [2]. Other studies [20, 31] investigate text generation using the full Transformer architecture (encoder-decoder) for some NLP tasks: the T5 [20] Transformer proposes unifying a series of NLP tasks in a single text-to-text framework, and Xue et al. [31] extended the original work with multilingual support.
Although the presented Transformer approaches for text generation differ in architecture and scale (number of parameters), they all deal with common issues concerning the quality of the artificially generated text. Generated texts are often simplistic, inconsistent, or repetitive [8]. There is also the possibility of hallucination: generating contradictory or meaningless text, without foundation or evidence [10].
To mitigate these challenges (repetitive and predictable texts), several initiatives aimed at making text generation non-deterministic [4, 8]. Such proposals arise as alternatives to simpler text generation (also called greedy decoding), arguing that always choosing the most probable word (or token) is one of the main causes of repetitive text.
Another study aimed at mitigating repetitive text is the work of Su et al. [28]. Proposed in 2022, contrastive search modifies how the words (or tokens) predicted by a text generator are chosen, aiming to increase the variability of the text while maintaining its coherence. For this purpose, the authors penalise, during decoding or during unsupervised training of the language model, the softmax scores of the most likely tokens by their similarity to other tokens within the context. The weight given to the similarity is controlled by a parameter alpha.
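The degeneration penalty can be illustrated with a minimal sketch. Assuming the formulation from Su et al. [28], a candidate token's score combines its model probability with its maximum similarity to the context tokens, weighted by alpha; the probabilities and similarities below are hypothetical values, not taken from the paper:

```python
def contrastive_score(token_prob, max_similarity, alpha=0.6):
    """Contrastive-search score for a candidate token: balance model
    confidence against the token's maximum similarity to the tokens
    already in the context (the degeneration penalty)."""
    return (1 - alpha) * token_prob - alpha * max_similarity

# Two hypothetical candidates: a likely but repetitive token vs. a
# slightly less likely token that is dissimilar to the context.
repetitive = contrastive_score(token_prob=0.40, max_similarity=0.95)
novel = contrastive_score(token_prob=0.30, max_similarity=0.10)
```

With alpha = 0.6, the less probable but context-dissimilar candidate outscores the likelier, repetitive one, which is exactly the behaviour that discourages degenerate repetition.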
Finally, we present examples of studies employing Transformers to generate text in the legal domain. Keyphrases such as the ones used in this work are exclusive to Brazil and, to the best of our knowledge, this is the first in-depth study of decoding methods for Brazilian keyphrase generation. Feijo and Moreira [5] and Yoon et al. [32] applied Transformer models to summarise rulings from the Brazilian Supreme Court and Korean legal cases, respectively. Peric et al. [16] proposed the use of Transformers to generate opinions about legal cases originating in the U.S. Circuit Courts, employing an encoder-decoder architecture.
Huang et al. [9] proposed a solution to automate Legal Judgment Prediction (LJP) subtasks using the T5 text-to-text framework. Finally, Althammer et al. [1] investigated the use of summaries (generated by a Transformer) as part of an information retrieval pipeline for the legal domain in the 2021 Competition on Legal Information Extraction/Entailment (COLIEE).
3 Methodology
The methodology used in this work comprises: I) Data Collection and Preprocessing, II) Keyphrase Generator Training, III) Decoding Methods Evaluation, and IV) Qualitative Analysis. These components are discussed below.
3.1 Data Collection and Preprocessing
In 2022, the Brazilian Superior Tribunal de Justiça (STJ) - Superior Court of Justice made available the Dados Abertos platform, a public website sharing legal decisions from various courts in Brazil. The published documents cover a large variety of topics in Brazil's legal domain, such as crimes in general, commerce, taxes, etc. We collected a total of 712,161 documents from the platform in August 2022.
After the data collection, we extracted the dockets from the documents' metadata and preprocessed the text of the decisions, removing duplicated examples and URLs. After this preprocessing, 111,964 dockets remained. From the remaining examples, we extracted the keyphrases and enumerated paragraphs by identifying and extracting the capitalised sentences present in the header of the collected decisions. With the inputs (enumerated paragraphs) and expected outputs (keyphrases) extracted, the original keyphrases (written by specialists) compose the reference set used for supervised training and evaluation.
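The header extraction described above can be sketched as follows. The splitting heuristic and the sample docket are hypothetical, since the paper does not detail its exact extraction rules; here the keyphrase header is taken as the text preceding the first enumerated paragraph:

```python
import re

def split_docket(docket: str):
    """Split a docket into its capitalised keyphrase header and the
    enumerated paragraphs that follow. Hypothetical heuristic: the
    header is everything before the first enumerated paragraph
    (a line starting with '1.')."""
    match = re.search(r"(?m)^\s*1\.", docket)
    if match:
        header, body = docket[:match.start()], docket[match.start():]
    else:
        header, body = "", docket
    return header.strip(), body.strip()

# Hypothetical docket in the style described in the paper.
docket = ("RECURSO ESPECIAL. DIREITO TRIBUTARIO. PRESCRICAO.\n"
          "1. O prazo prescricional conta-se da data do pagamento.\n"
          "2. Recurso provido.")
keyphrases, paragraphs = split_docket(docket)
```

The pair (`paragraphs`, `keyphrases`) then serves as one input-output training example.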
As a final preprocessing step, we divided the corpus (111,964 examples) into training (70%), validation (10%), and test (20%) splits. From the examples of the training set, we observed that enumerated paragraphs and keyphrases have a mean of 203.26 and 55.84 space-separated tokens, respectively. We used the splits to train and evaluate a supervised deep learning text generator.
3.2 Keyphrase Generation
This section describes the methodology employed for training the keyphrase generator and generating keyphrases.
Transformers for Text Generation. Based on the collected dockets, we noted that most of the terms in the keyphrases are not directly present in the dockets. By further analysing examples from the validation set, we found that only \(\sim \)10% of the terms present in the keyphrases actually occur in the input text. Thus, we decided to approach keyphrase writing as text generation rather than text extraction. For this purpose, a sequence-to-sequence (or text-to-text) Transformer model was chosen.
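The \(\sim \)10% overlap figure can be estimated with a simple measure such as the one below. This is a sketch: the paper does not specify its exact matching procedure, so we assume case-insensitive, space-separated tokens:

```python
def keyphrase_overlap(keyphrase: str, source: str) -> float:
    """Fraction of unique keyphrase terms that also occur in the
    source text (case-insensitive, space-tokenised)."""
    kp_terms = set(keyphrase.lower().split())
    src_terms = set(source.lower().split())
    if not kp_terms:
        return 0.0
    return len(kp_terms & src_terms) / len(kp_terms)

# Hypothetical keyphrase and source text: 2 of the 3 keyphrase
# terms ("recurso", "provido") appear in the source.
frac = keyphrase_overlap("RECURSO ESPECIAL PROVIDO",
                         "o recurso foi julgado e provido")
```

A low average of this fraction over the validation set is what motivates treating the task as abstractive generation.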
We chose PTT5 [3] as our keyphrase generator. PTT5 was pretrained on a large Brazilian Portuguese corpus (brWaC [30]) with 2.7 billion tokens; the base version of the model (220M parameters) was used in our experiments. We experimented with other state-of-the-art multilingual generative Transformer models (such as mT5 [31], BLOOM [25] and OPT [33]), but the Portuguese model (PTT5) performed better. Previous works [3, 22, 27] observed that models pretrained in the task language tend to outperform multilingual models on the same tasks, and the same trend was observed in our experiments.
Training Details. We fixed the input (enumerated paragraphs) and output (keyphrases) sizes to 512 and 256 sentence-piece tokens, respectively, padding shorter sequences and truncating longer ones to the maximum length. We fine-tuned PTT5 using a fixed learning rate of \(1\times 10^{-3}\), a batch size of 256, and a maximum of 20 training epochs.
For the sequence-to-sequence training, the cross-entropy loss function was adopted, and the BLEU score [17] was used to evaluate text generation quality. We used early stopping during training, monitoring the BLEU metric on the validation set: the training process stops after two epochs without improvement in the BLEU score. For evaluation, we repeated the training process with five different seeds (1000, 2000, 3000, 4000 and 5000) and obtained a \(37.254 \pm 0.783\) BLEU score. The best-performing model achieved 38.607 BLEU on the test set. The fine-tuning was done using the HuggingFace library and a Tesla P100 GPU with 16 GB of VRAM.
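The early-stopping rule can be sketched as a small monitor over the validation BLEU history (the BLEU values below are hypothetical, not the paper's actual curves):

```python
class EarlyStopping:
    """Stop training after `patience` epochs without improvement in a
    metric that should increase (here, validation BLEU)."""

    def __init__(self, patience: int = 2):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric: float) -> bool:
        """Record one epoch's metric; return True when training
        should stop."""
        if metric > self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
history = [30.1, 33.5, 35.0, 34.8, 34.9]  # hypothetical validation BLEU
stopped_at = next(i for i, bleu in enumerate(history) if stopper.step(bleu))
```

After two consecutive epochs below the best score (35.0), the monitor signals a stop at epoch index 4.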
Figure 2 shows the train and validation losses for the best execution of the PTT5 model in addition to the validation BLEU scores. The model was trained for 17 epochs, totaling 5219 iterations.
3.3 Decoding Methods Evaluation
For the evaluation of the generated text and to compare the decoding methods, we concatenated the generated keyphrases to their original documents and used a realistic retrieval use case to extract IR metrics. We opted for an IR task so that the keyphrases (created using different decoding methods) are evaluated in their intended use: improving retrieval tasks. The details of the evaluation are presented next.
Decoding Methods. Decoding techniques guide neural text generation in order to produce meaningful and coherent text: they are used to generate human-readable text from the internal representations of language models. In this work, we evaluate three decoding methods: greedy, top-K [4] and top-p [8]. Top-K and top-p are sampling decoding methods that, during text generation, sample tokens from restricted sets. A brief description of each method follows.
- Greedy: always selects the most probable token (highest softmax score) at each generation step.
- Top-K: filters the K most probable tokens at each step and redistributes their probabilities among them before sampling.
- Top-p: limits the set of selectable tokens to the smallest set of most probable tokens whose cumulative probability reaches the threshold p. Note that the number of selectable tokens is dynamic, since the probability distribution varies at each step.
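The three strategies can be sketched over a toy next-token distribution (the tokens and probabilities are hypothetical; a real generator would apply these filters to the softmax scores at every decoding step):

```python
def greedy(probs):
    """Greedy decoding: always pick the single most probable token."""
    return max(probs, key=probs.get)

def top_k_filter(probs, k):
    """Top-K: keep the k most probable tokens and renormalise."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}

def top_p_filter(probs, p):
    """Top-p (nucleus): keep the smallest set of most probable tokens
    whose cumulative probability reaches p, then renormalise."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cum += prob
        if cum >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Hypothetical next-token distribution.
probs = {"recurso": 0.5, "agravo": 0.3, "habeas": 0.15, "embargos": 0.05}
```

Sampling then draws from the filtered, renormalised distribution instead of always taking the argmax.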
Task Formulation. We used the themes (categorical information) present in the dockets' metadata to simulate a retrieval task in which a specialist seeks to retrieve documents similar to a query document using a search engine. The themes are unique identifiers mapped to common questions of law. Thus, we adopt a binary relevance definition: given a query document Q, the documents relevant to Q are those with the same theme as Q. Note that, in the real scenario, the documents consist of dockets containing both keyphrases and enumerated paragraphs.
From the collected decisions, only 801 have themes. These documents were removed from the training set and used to prepare query - relevant document pairs for IR evaluation. The query set consists of dockets whose themes occur at least twice. From those, we prepared 482 query - relevant document pairs (pairs of same theme documents).
To prepare the final retrieval corpus, we combined the test set presented in Sect. 3.1 with the dockets that have themes, obtaining a total of 23,194 documents. We enlarged the retrieval corpus to make the retrieval task more challenging. In the worst case, the documents without a theme may introduce false negatives (documents with the same theme as the query, but considered non-relevant), hindering the IR metrics.
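Under the binary relevance definition above, the query-relevant pairs can be built by grouping dockets by theme and keeping themes that occur at least twice (the docket ids and theme ids below are hypothetical, and the exact pairing procedure is our assumption):

```python
from collections import defaultdict
from itertools import permutations

def build_pairs(dockets):
    """Build (query, relevant) docket pairs under binary relevance:
    two dockets are relevant to each other iff they share a theme.
    `dockets` maps a docket id to its theme id."""
    by_theme = defaultdict(list)
    for doc_id, theme in dockets.items():
        by_theme[theme].append(doc_id)
    pairs = []
    for docs in by_theme.values():
        if len(docs) >= 2:  # keep themes occurring at least twice
            pairs.extend(permutations(docs, 2))
    return pairs

# Hypothetical metadata: three dockets share theme 101, one has 205.
dockets = {"d1": 101, "d2": 101, "d3": 205, "d4": 101}
pairs = build_pairs(dockets)
```

Singleton themes produce no pairs, which mirrors the paper's restriction to themes occurring at least twice.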
Experimental Setup. Two different experiments were performed during the IR evaluation; they are described below.
1. Studying Sampling-based Decoding Methods: this experiment investigates the generation of multiple keyphrases from a single docket using sampling decoding. By concatenating multiple keyphrases to a single docket, we expected improvements in the IR metrics, since more text variations are added. We generated up to 10 keyphrases for each example in the search corpus, using top-K and top-p sampling, and concatenated them to the original input (enumerated paragraphs) before computing the IR metrics. We repeated the text generation five times with different random seeds (1000, 2000, 3000, 4000, and 5000) and aggregated the results for comparison. The effects of the K parameter of top-K and the p parameter of top-p sampling were also evaluated, with \(K \in \{15, 50, 100\}\) and \(p \in \{0.1, 0.5, 0.9\}\). Note that, in this experiment, we are not interested in determining the best number of repetitions, nor the best value of K or p. The goal is to investigate the effect of these parameters on the proposed IR task, although the results may indicate the best parameter ranges.
2. Decoding Methods Comparison: to compare the decoding methods, we extracted IR metrics for dockets whose keyphrases were generated using top-K and top-p sampling, using the generated keyphrases in place of the originals. For reference, we also evaluated IR metrics for documents with and without the original keyphrases (for both query and corpus documents) and using simple greedy decoding. Based on the results of the previous experiment, and to evaluate the decoding methods in similar scenarios, we used only one keyphrase generated by each method. For this experiment, we used \(K=15\) and \(p=0.9\) for the sampling-based decoding methods (based on the performance obtained in the previous experiment).
The experiments with sampling decoding methods were inspired by doc2query [14], in which, for each example in a corpus, the authors generated several queries related to the example's content using a sequence-to-sequence Transformer model with top-K sampling. The queries were then concatenated to the input documents to improve IR metrics. In both experiments with sampling methods, we used contrastive search with \(alpha=0.6\), following the original paper [28]. We chose the K and p values based on previous works using top-K and top-p sampling [8, 14].
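The doc2query-style expansion amounts to concatenating the sampled keyphrases to the docket text before indexing. The sketch below also drops exact duplicates, a refinement that is our assumption rather than a step stated in the paper, but consistent with the observation that sampling often repeats itself; all texts are hypothetical:

```python
def expand_document(doc, keyphrases):
    """doc2query-style expansion: prepend generated keyphrase
    variants to the docket text before indexing, skipping exact
    duplicates (sampling may produce identical keyphrases)."""
    seen, unique = set(), []
    for kp in keyphrases:
        if kp not in seen:
            seen.add(kp)
            unique.append(kp)
    return " ".join(unique + [doc])

expanded = expand_document(
    "acordao sobre prescricao tributaria",
    ["RECURSO ESPECIAL. PROVIMENTO.",
     "RECURSO ESPECIAL. PROVIMENTO.",   # duplicate sample, dropped
     "RECURSO ESPECIAL. PRESCRICAO."])
```

The expanded string is what the sparse IR methods index in place of the bare docket.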
Information Retrieval Methods and Metrics. To evaluate the proposed IR task, we chose two traditional methods: TF-IDF and BM25 [21]. The methods were chosen due to their popularity in search engines (such as Lucene) and competitive performance [18, 23]. Previous works [12, 13, 15] also showed that sparse representation methods (such as the chosen ones) tend to perform better in similar tasks in the legal domain.
As an additional preprocessing step for the IR methods, the documents were tokenized, and Portuguese stop-words and punctuation were removed. For TF-IDF, we used a vocabulary of 10,000 tokens (each appearing at least three times) and n-grams from 1 to 3; documents retrieved with TF-IDF were ranked by the cosine similarity between queries and documents. For BM25, the documents were ranked by the probability ranking principle, estimating the relevance of a document to the presented query. This preprocessing was done using spaCy and sklearn. For BM25, we used the implementation and default parameters from rank-bm25.
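For reference, the BM25 ranking function can be sketched in a few lines. This is a didactic version of the Okapi BM25 formula with common default-style parameters (\(k_1=1.5\), \(b=0.75\)), not the rank-bm25 implementation used in the paper; the documents and query are hypothetical:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each document against a query with the BM25 ranking
    function over space-separated tokens."""
    N = len(docs)
    doc_tokens = [d.split() for d in docs]
    avgdl = sum(len(tokens) for tokens in doc_tokens) / N
    df = Counter()  # document frequency of each term
    for tokens in doc_tokens:
        df.update(set(tokens))
    scores = []
    for tokens in doc_tokens:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(tokens) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores

docs = ["recurso especial direito tributario",
        "habeas corpus direito penal",
        "recurso especial prescricao recurso"]
scores = bm25_scores(["recurso", "especial"], docs)
```

Documents sharing no query terms score zero, and repeated query terms raise the score with diminishing returns via the saturation term.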
Finally, we evaluated the performance in the proposed IR task using two traditional IR metrics: Mean Reciprocal Rank (MRR) and Recall. The metrics were chosen for their use in similar works in the legal domain [24, 26]. We used a threshold of 10 documents (top-10 ranked documents) to compute the metrics. According to Russell-Rose et al. [24], law professionals tend to analyze, for the most part, up to 50 documents in their searches; therefore, we are evaluating an even more challenging scenario than the one described by the authors.
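MRR and Recall at a cutoff of 10 can be computed as follows (the rankings and relevance sets below are hypothetical):

```python
def mrr_at_k(rankings, k=10):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant
    document within the top-k results (0 when none is retrieved)."""
    total = 0.0
    for ranked, relevant in rankings:
        for pos, doc in enumerate(ranked[:k], start=1):
            if doc in relevant:
                total += 1.0 / pos
                break
    return total / len(rankings)

def recall_at_k(rankings, k=10):
    """Average fraction of relevant documents retrieved in the top-k."""
    total = 0.0
    for ranked, relevant in rankings:
        hits = sum(1 for doc in ranked[:k] if doc in relevant)
        total += hits / len(relevant)
    return total / len(rankings)

# Two hypothetical queries: each element pairs a ranked result list
# with the set of relevant document ids.
rankings = [(["d3", "d1", "d9"], {"d3"}),
            (["d5", "d2", "d7"], {"d2", "d8"})]
```

For the first query the relevant document is at rank 1; for the second, one of two relevant documents is found at rank 2, so both metrics average to 0.75.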
3.4 Qualitative Analysis
As a final analysis, we sampled keyphrases generated by each of the evaluated decoding methods (greedy, top-K and top-p) and performed a qualitative analysis on them, comparing the generated keyphrases to the references and discussing the similarities between them and the effect of the sampling methods.
4 Results and Discussions
This section discusses the results obtained for each experiment described in Sect. 3.3. In all experiments, our aim is not to compare the retrieval methods (TF-IDF and BM25), but to use them to evaluate the quality of the keyphrases generated with the different decoding methods.
4.1 Studying Sampling-Based Decoding Methods
Tables 1 and 2 present the IR metrics obtained by varying both the number of repetitions and the K and p parameters. The metrics are the mean of five executions (using five different seeds).
When carrying out this experiment, we expected to observe logarithmic growth as more distinct keyphrases were concatenated to the dockets (similar to doc2query [14]), since more keyphrase variations were being added. However, this was not observed in any of the evaluated metrics. On the contrary, in the worst cases the metrics decayed as new variations were added to the input texts, for both top-K and top-p decoding. The decay is more noticeable for the TF-IDF method, with reductions between 2% (top-K) and 12% (top-p) in all observed metrics. These behaviours were observed for all evaluated K and p values.
The worst performances were observed when increasing the repetitions with \(p=0.1\) (top-p experiments). The most probable explanation is that such a low p value is too restrictive, reducing the set of selectable tokens. Hence, top-p tends to generate keyphrases with low variation (more similar or even identical keyphrases), and the repetitive text hindered the performance of both IR methods evaluated.
By increasing the K and p values, we increase the variability of the generated text, since the tokens are chosen from a larger set. A positive effect on the metrics was also expected, due to the possibility of adding more discriminative terms to the generated keyphrases, which benefits the evaluated sparse methods. However, we observed performance deterioration and, in the best case, similar metrics when varying the K and p values. We suspect that, even with the increase in variability, the generated keyphrases remained similar to each other, resulting in the addition of repetitive text to the dockets.
The conclusion from these results is that there is no evidence that using more keyphrases (generated with sampling decoding) benefits the evaluated task. Likewise, there is no benefit in using K values above 15 or p values below 0.9. We discuss the results of the sampling methods further in Sect. 4.3.
4.2 Decoding Methods Comparison
Table 3 presents the results comparing the decoding methods. For both TF-IDF and BM25, a single keyphrase was generated with each decoding method (greedy, top-K, and top-p). We adopted \(K=15\) and \(p=0.9\) for top-K and top-p decoding, respectively, based on the results of the previous experiment. Table 3 also presents, for comparison, the results obtained when performing the proposed retrieval task with and without the original (reference) keyphrases.
We observe that the keyphrases are, indeed, beneficial to retrieval tasks by comparing the metrics obtained by using documents with and without the keyphrases. For both metrics, we observed statistically significant differences. Since both sparse methods (TF-IDF and BM25) benefit from the existence of discriminative terms in the documents, these results were already expected. Note that the metrics obtained using the original keyphrases act as an upper bound to our experiments.
Considering TF-IDF retrieval, we observed an increase in all metrics when using the generated texts (compared to not using any), and the differences are statistically significant in all metrics (see Table 3a). For BM25, we observe similar results (see Table 3b); however, no significant difference was observed for the R@10 metric. Note that using generated keyphrases may introduce false positives (falsely similar documents) and false negatives (falsely dissimilar documents) into the search corpus, caused by noisy keyphrases. The IR metrics obtained by BM25 suggest that the method was sensitive to these noisy examples.
Comparing the decoding methods, we note small increments for the sampling methods relative to greedy decoding. However, considering a paired t-test with a 5% significance threshold, there is no significant difference between the decoding methods: there is not enough evidence to reject the null hypothesis (that the metrics have the same mean) in the comparisons between the three decoding approaches. Therefore, there is no evidence justifying the choice of sampling decoding methods over the simpler greedy approach for the proposed task.
4.3 Qualitative Analysis
Greedy Decoding. Examples of keyphrases generated by PTT5 using greedy decoding are presented in Fig. 3. We note that, with BLEU scores close to 40%, and despite being generated by a model trained on a modest training set (fewer than 100K examples), the generated keyphrases present no spelling or lexical errors. They captured the writing style of the reference keyphrases and are very similar to keyphrases written by humans.
A comparison between the number of tokens in the original keyphrases and in those generated with greedy decoding is shown in Fig. 4. Although the two histograms present similar distributions, the generated keyphrases have a higher concentration of examples below 60 tokens. The average number of space-separated tokens in the generated keyphrases is lower than in the references (42.34 versus 48.28).
Therefore, we identified that the keyphrases generated with greedy decoding tend to have a smaller length (in tokens) than the originals. We also observed the same pattern for the keyphrases generated using sampling decoding.
Sampling Decoding. Figure 5 shows keyphrase examples generated using top-K and top-p decoding, with \(K=15\) and \(p=0.9\) based on the results of the previous analysis. From the examples, it is possible to note that the main effect of sampling is the generation of paraphrases of the original keyphrase; we also observe reordering of the phrases within the keyphrases. Hence, the generated keyphrases tend to be similar to each other. These behaviours are explained by the way Transformer-based language models work: during text generation, they tend to generate tokens that appear in similar contexts.
By using sampling-based methods, we observed an increase in text variability. However, the possibility of the model generating text unrelated to the input also increases, which may have harmed the IR methods studied. In addition, when concatenating multiple variations of keyphrases similar to each other, we add many repeated terms to the documents, which may negatively affect the sparse IR methods evaluated.
In addition to the justifications presented, the amount of training data may also have affected the sampling methods. Although the results for greedy generation were better, the lack of variability in the training examples (due to the modest size of the training set) may have harmed decoding with top-K and top-p sampling.
5 Conclusion and Future Works
In this paper, we successfully trained a sequence-to-sequence Transformer to generate keyphrases and investigated three different text decoding methods. The results showed that keyphrases bring significant improvements to IR metrics when used in combination with the dockets. This result was observed for all the keyphrases evaluated: the references and the generated ones (using greedy, top-K, and top-p decoding). Although we evaluated different parameters and concatenated multiple variations of keyphrases generated with sampling decoding (top-K and top-p), the simpler greedy decoding performed similarly to these methods. We presented and discussed possible justifications for this behaviour, and the results suggest that greedy decoding is sufficient for keyphrase generation on legal dockets.
As future work, we intend to experiment with pre-training language models on legal documents in order to improve keyphrase generation. Furthermore, we aim to improve the quality of the training data by collecting more dockets from more sources around Brazil. Finally, this work can also inspire other efforts to automate text writing in the legal domain.
References
Althammer, S., Askari, A., Verberne, S., Hanbury, A.: Dossier@ coliee 2021: leveraging dense retrieval and summarization-based re-ranking for case law retrieval. arXiv preprint arXiv:2108.03937 (2021)
Brown, T., et al.: Language models are few-shot learners. Adv. Neural. Inf. Process. Syst. 33, 1877–1901 (2020)
Carmo, D., Piau, M., Campiotti, I., Nogueira, R., Lotufo, R.: PTT5: pretraining and validating the T5 model on Brazilian Portuguese data. arXiv preprint arXiv:2008.09144 (2020)
Fan, A., Lewis, M., Dauphin, Y.N.: Hierarchical neural story generation. CoRR abs/1805.04833 (2018). http://arxiv.org/abs/1805.04833
Feijo, D., Moreira, V.: Summarizing legal rulings: comparative experiments. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pp. 313–322 (2019)
Floridi, L., Chiriatti, M.: GPT-3: its nature, scope, limits, and consequences. Mind. Mach. 30(4), 681–694 (2020)
Guimarães, J.A.C., Santos, J.C.G.: A ementa jurisprudencial como resumo informativo em um domínio especializado: aspectos estruturais. Braz. J. Inf. Sci. 10(3), 32–43 (2016)
Holtzman, A., Buys, J., Forbes, M., Choi, Y.: The curious case of neural text degeneration. CoRR abs/1904.09751 (2019). http://arxiv.org/abs/1904.09751
Huang, Y., Shen, X., Li, C., Ge, J., Luo, B.: Dependency learning for legal judgment prediction with a unified text-to-text transformer. arXiv preprint arXiv:2112.06370 (2021)
Ji, Z., et al.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12), 1–38 (2022)
Conselho Nacional de Justiça (CNJ): Justiça em números (2023). https://www.cnj.jus.br/pesquisas-judiciarias/justica-em-numeros/. Accessed 08 May 2023
Lima, J.P., Costa, J.A., Araújo, D.C.: Comparison of feature extraction methods for Brazilian legal documents clustering. In: 2021 IEEE Latin American Conference on Computational Intelligence (LA-CCI), pp. 1–5. IEEE (2021)
Mandal, A., Ghosh, K., Ghosh, S., Mandal, S.: Unsupervised approaches for measuring textual similarity between legal court case reports. Artif. Intell. Law 29(3), 417–451 (2021). https://doi.org/10.1007/s10506-020-09280-2
Nogueira, R., Yang, W., Lin, J., Cho, K.: Document expansion by query prediction. arXiv preprint arXiv:1904.08375 (2019)
Pedroso, D.D.S.C., Ladeira, M., de Paulo Faleiros, T.: Does semantic search performs better than lexical search in the task of assisting legal opinion writing? In: 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 680–685. IEEE (2019)
Peric, L., Mijic, S., Stammbach, D., Ash, E.: Legal language modeling with transformers. In: Proceedings of the Fourth Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2020) held online in conjunction with 33rd International Conference on Legal Knowledge and Information Systems (JURIX 2020) 9 December 2020, vol. 2764. CEUR-WS (2020)
Post, M.: A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771 (2018)
Pradeep, R., et al.: H2oloo at TREC 2020: when all you got is a hammer... deep learning, health misinformation, and precision medicine. Corpus 5(d3), d2 (2020)
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)
Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(140), 1–67 (2020)
Robertson, S.E., Walker, S.: Okapi/Keenbow at TREC-8. In: TREC, vol. 8, pp. 151–162. Citeseer (1999)
Rosa, G.M., Bonifacio, L.H., de Souza, L.R., Lotufo, R., Nogueira, R.: A cost-benefit analysis of cross-lingual transfer methods. arXiv preprint arXiv:2105.06813 (2021)
Rosa, G.M., Rodrigues, R.C., Lotufo, R., Nogueira, R.: Yes, BM25 is a strong baseline for legal case retrieval. arXiv preprint arXiv:2105.05686 (2021)
Russell-Rose, T., Chamberlain, J., Azzopardi, L.: Information retrieval in the workplace: a comparison of professional search practices. Inf. Process. Manag. 54(6), 1042–1057 (2018)
Scao, T.L., et al.: BLOOM: a 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100 (2022)
Souza, E., et al.: Assessing the impact of stemming algorithms applied to Brazilian legislative documents retrieval. In: Anais do XIII Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana, pp. 227–236. SBC (2021)
Souza, F., Nogueira, R., Lotufo, R.: BERTimbau: pretrained BERT models for Brazilian Portuguese. In: Cerri, R., Prati, R.C. (eds.) BRACIS 2020. LNCS (LNAI), vol. 12319, pp. 403–417. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-61377-8_28
Su, Y., Lan, T., Wang, Y., Yogatama, D., Kong, L., Collier, N.: A contrastive framework for neural text generation. arXiv preprint arXiv:2202.06417 (2022)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Wagner Filho, J.A., Wilkens, R., Idiart, M., Villavicencio, A.: The brWaC corpus: a new open resource for Brazilian Portuguese. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (2018)
Xue, L., et al.: MT5: a massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934 (2020)
Yoon, J., Junaid, M., Ali, S., Lee, J.: Abstractive summarization of Korean legal cases using pre-trained language models. In: 2022 16th International Conference on Ubiquitous Information Management and Communication (IMCOM), pp. 1–7. IEEE (2022)
Zhang, S., et al.: OPT: open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022)
Acknowledgement
This study was supported by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brazil (CAPES) - Finance Code 001. We thank CEMEAI for granting access to the Euler cluster for the experiments. Also, this work is partially funded by Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP), grant 2022/01640-2. We would like also to thank INCT (CAPES Concessão 88887.136349/2017-00, CNPQ 465755/2014-3 and FAPESP 2014/50851-0) for the support.
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Sakiyama, K., Montanari, R., Malaquias Junior, R., Nogueira, R., Romero, R.A.F. (2023). Exploring Text Decoding Methods for Portuguese Legal Text Generation. In: Naldi, M.C., Bianchi, R.A.C. (eds) Intelligent Systems. BRACIS 2023. Lecture Notes in Computer Science(), vol 14195. Springer, Cham. https://doi.org/10.1007/978-3-031-45368-7_5