1 Introduction

Based on data from the National Council of Justice - Conselho Nacional de Justiça (CNJ) [11], there were 77.3 million cases pending in the Brazilian judiciary at the end of 2021, a 10.4% increase from the previous year. The analysis of such cases contributes to the slowness of the Brazilian legal system due to the human effort required both to write and to analyse the legal cases. In this context, dockets are a special type of document that aims to provide a summary of a legal case. They are used in courts all around Brazil and are designed to provide a summarised representation of judicial decisions. Figure 1 presents an example of a docket.

Fig. 1. Example of a docket. Keyphrases are highlighted in bold text.

The dockets usually follow a pre-defined structure composed of two components: keyphrases and enumerated paragraphs. The keyphrases consist of a header present at the beginning of the docket and are composed of sequences of capitalised key terms that highlight the key subjects present in the document. This header is created to improve the search and retrieval of jurisprudences (precedents) [7]. The enumerated paragraphs discuss the themes (or topics) present in the document.

By analysing the form and linguistic style of the keyphrases, it is possible to note similarities between the writing of keyphrases and two Natural Language Processing (NLP) tasks: summarization and key term extraction. However, keyphrases are not written in a fluid and natural manner like summaries. In addition, most of the terms present in their text do not appear in the remainder of the docket from which the keyphrase originates, which makes it difficult to treat their writing as an extractive task.

Given the predictable structure and availability of dockets, it is possible to prepare input-output pairs to generate keyphrases from the enumerated paragraphs using a supervised approach. Transformers, such as GPT [19], have already proven effective in various text-to-text generative Natural Language Processing (NLP) tasks [6] (such as translation, question answering and summarization). In addition, the availability of pre-trained language models [3, 25, 33] presents many opportunities to automate NLP tasks.

Thus, in this work, we aim to investigate the usage of state-of-the-art generative Transformers to automate the writing of keyphrases. Specifically, we seek to investigate text decoding methods in order to generate keyphrases that aid retrieval in the legal domain. This study is unprecedented in Brazil and can be used to automate keyphrase generation in courts around the country. In summary, the main contributions of this work are:

  1. Investigation of a novel approach to generate keyphrases from Brazilian dockets, using a sequence-to-sequence Transformer;

  2. Comparison of three different text decoding methods (greedy and sampling methods) for the proposed task;

  3. Quantitative and qualitative analysis of the generated keyphrases.

This paper is organised as follows. Section 2 presents related works. Section 3 discusses the methodology applied for keyphrase generation. Next, Sect. 4 presents and discusses the obtained results. Finally, Sect. 5 presents conclusions and future work.

2 Related Works

In this section, we present studies related to the main objectives of this proposal. The Transformer [29] is a deep neural network architecture that has achieved state-of-the-art results in several NLP tasks. It consists of an encoder-decoder architecture originally designed for translation. However, the context-aware representations generated by the model can be used for a large variety of tasks.

Following the success of the Generative Pretrained Transformer (GPT) models [19, 25, 33], decoder-only models predominate in NLP tasks that can be approached as text generation (such as question answering and summarization) [6]. In addition, recent studies have shown the great potential of such models in zero- and few-shot scenarios [2]. Other studies [20, 31] investigate text generation using the full Transformer architecture (encoder-decoder) for some NLP tasks. The T5 [20] Transformer proposes the unification of a series of NLP tasks in a single text-to-text framework, and Xue et al. [31] expanded the original work to add multilingual support.

Although the presented Transformer approaches for text generation differ in architecture and scale (number of parameters), they all deal with common issues concerning the quality of the artificially generated text. Generated texts are often simplistic, inconsistent, or end up being repetitive [8]. There is also the possibility of hallucination: generating texts that are contradictory, meaningless, or without foundation or evidence [10].

To mitigate these challenges (repetitive and predictable texts), several initiatives have aimed at making text generation non-deterministic [4, 8]. Such proposals arise as an alternative to simpler text generation methods (also called greedy decoding), arguing that always choosing the most probable words (or tokens) is one of the main causes of repetitive texts.

Another example of a study aimed at mitigating repetitive texts is the work of Su et al. [28]. Proposed in 2022, contrastive search modifies how the words (or tokens) predicted by a text generator are chosen, aiming to increase the variability of the text while maintaining its coherence. For this purpose, the authors suggest penalising, during decoding or unsupervised training of the language model, the softmax scores of the most likely tokens by their similarity to other tokens within the context. The importance given to the similarity is controlled by a parameter \(\alpha \).
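As an illustration, the penalised score can be sketched as follows. The token names, probabilities and two-dimensional "hidden-state" vectors below are hypothetical toy values, and the cosine similarity stands in for the similarity between Transformer hidden states used in the original paper:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_score(prob, cand_vec, context_vecs, alpha=0.6):
    """(1 - alpha) * model confidence - alpha * degeneration penalty,
    where the penalty is the max similarity to tokens already in context."""
    penalty = max(cosine(cand_vec, h) for h in context_vecs)
    return (1 - alpha) * prob - alpha * penalty

# Toy step: context hidden states plus two candidate tokens.
context = [[1.0, 0.0], [0.9, 0.1]]
candidates = {
    "repetitive": (0.6, [1.0, 0.05]),  # more likely, but similar to context
    "novel": (0.4, [0.0, 1.0]),        # less likely, but dissimilar
}
scores = {tok: contrastive_score(p, v, context)
          for tok, (p, v) in candidates.items()}
best = max(scores, key=scores.get)
```

With \(\alpha =0.6\), the penalty outweighs the raw probability: the "novel" token wins even though greedy decoding (probability alone) would pick "repetitive".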

Finally, we present examples of studies employing Transformers to generate text in the legal domain. Keyphrases such as the ones used in this work are exclusive to Brazil and, to the best of our knowledge, this is the first in-depth study of decoding methods for Brazilian keyphrase generation. Feijo and Moreira [5] and Yoon et al. [32] applied Transformer models to summarise rulings from the Brazilian Supreme Court and Korean legal cases, respectively. Peric et al. [16] proposed the use of Transformers to generate opinions about legal cases originating in the U.S. Circuit Court, by employing an encoder-decoder architecture.

Huang et al. [9] proposed a solution to automate the Legal Judgment Prediction (LJP) subtasks using the T5 text-to-text framework. Finally, Althammer et al. [1] investigated the use of summaries (generated by a Transformer) as part of an information retrieval pipeline for the legal domain in the 2021 Competition on Legal Information Extraction/Entailment (COLIEE).

3 Methodology

The methodology used in this work is composed of: I) Data Collection and Preprocessing, II) Keyphrase Generator Training, III) Decoding Methods Evaluation and IV) Qualitative Analysis. These components will be discussed below.

3.1 Data Collection and Preprocessing

In 2022, the Brazilian Superior Tribunal de Justiça (STJ) - Superior Court of Justice made available the Dados AbertosFootnote 1 platform. The platform consists of a public website sharing legal decisions from various courts in Brazil. The published documents comprise a large variety of topics in Brazil's legal domain, such as crimes in general, commerce, taxes, etc. We collected a total of 712,161 documents from the platform in August 2022.

After the data collection, we extracted the dockets from the documents' metadata and preprocessed the text of the decisions. We removed duplicate examples and URLs from the text; 111,964 dockets remained after this preprocessing. From the remaining examples, we extracted the keyphrases and enumerated paragraphs from the dockets by identifying and extracting the capitalised sentences present in the header of the collected decisions. By extracting the inputs (enumerated paragraphs) and expected outputs (keyphrases), the original keyphrases (written by specialists) compose the reference set used for supervised training and evaluation.
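A minimal sketch of this extraction step might look as follows. The heuristic (treating fully capitalised leading lines as the keyphrase header) and the toy docket are our own simplifications for illustration, not the exact rules of the preprocessing pipeline:

```python
import re

def split_docket(docket: str):
    """Split a docket into its capitalised keyphrase header and the
    enumerated paragraphs that follow (a simplified heuristic)."""
    lines = docket.strip().splitlines()
    header_parts, body_start = [], 0
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Header lines are fully capitalised sequences of key terms.
        if stripped and stripped == stripped.upper() and re.search(r"[A-ZÀ-Ü]", stripped):
            header_parts.append(stripped)
            body_start = i + 1
        else:
            break
    keyphrases = " ".join(header_parts)
    paragraphs = "\n".join(lines[body_start:]).strip()
    return keyphrases, paragraphs

# Hypothetical (abbreviated) docket in the expected layout.
docket = (
    "AGRAVO REGIMENTAL. RECURSO ESPECIAL. PRESCRIÇÃO.\n"
    "1. A primeira questão discutida no caso...\n"
    "2. Agravo regimental desprovido."
)
keyphrases, paragraphs = split_docket(docket)
```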

As a final preprocessing step, we divided the corpus (111,964 examples) into training (70%), validation (10%), and test (20%) splits. From the examples of the training set, we observed that enumerated paragraphs and keyphrases have a mean of 203.26 and 55.84 space-separated tokens, respectively. We used the splits to train and evaluate a supervised deep learning text generator.
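The 70/10/20 split can be reproduced with a simple seeded shuffle; the helper below is an illustrative sketch, not the code used in the experiments:

```python
import random

def split_corpus(examples, train=0.7, val=0.1, test=0.2, seed=1000):
    """Shuffle input-output pairs and split them into
    train/validation/test partitions by the given proportions."""
    assert abs(train + val + test - 1.0) < 1e-9
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train)
    n_val = int(n * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Hypothetical corpus of (enumerated paragraphs, keyphrases) pairs.
corpus = [(f"paragraphs {i}", f"keyphrases {i}") for i in range(1000)]
train_set, val_set, test_set = split_corpus(corpus)
```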

3.2 Keyphrase Generation

This section describes the methodology employed for training the keyphrase generator and generating keyphrases.

Transformers for Text Generation. Based on the dockets collected, we noted that most of the terms in the keyphrases are not directly present in the dockets. By further analysing examples from the validation set, we noted that only \(\sim \)10% of the terms present in the keyphrases are in fact present in the input text. Thus, we decided to approach writing keyphrases as generation rather than extraction of text. For this purpose, a sequence-to-sequence (or text-to-text) Transformer model was chosen.
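The \(\sim \)10% overlap figure can be estimated with a simple set intersection over space-separated terms; the function and the Portuguese-like toy strings below are an illustrative sketch of that measurement:

```python
def term_overlap(keyphrase: str, paragraphs: str) -> float:
    """Fraction of unique keyphrase terms that also occur in the input text."""
    key_terms = set(keyphrase.lower().split())
    input_terms = set(paragraphs.lower().split())
    if not key_terms:
        return 0.0
    return len(key_terms & input_terms) / len(key_terms)

# Hypothetical example: 2 of the 4 keyphrase terms appear in the input.
overlap = term_overlap(
    "recurso especial prescricao intercorrente",
    "trata-se de recurso interposto contra decisao que reconheceu a prescricao",
)
```

Averaging this fraction over the validation set is what motivates treating the task as generation rather than extraction.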

We chose PTT5 [3] as our keyphrase generator. PTT5 was pretrained on a large Brazilian Portuguese corpus (brWaC [30]) with 2.7 billion tokens, and the base version of the model (220M parameters) was used in our experiments. We experimented with other state-of-the-art multilingual generative Transformer models (such as mT5 [31], BLOOM [25] and OPT [33]), but the Portuguese model (PTT5) performed better. Previous works [3, 22, 27] observed that models pretrained on the task language tend to outperform multilingual models on the same tasks, and the same trend was observed in our experiments.

Fig. 2. Train and validation losses obtained for the PTT5 model (left y-axis) and validation BLEU scores (right y-axis).

Training Details. We fixed the input (enumerated paragraphs) and output (keyphrases) sizes to 512 and 256 sentence-piece tokens, respectively. We padded shorter sequences of tokens and truncated longer sequences to the maximum length. We fine-tuned PTT5 using a fixed learning rate of \(1\times 10^{-3}\), a batch size of 256 and a maximum of 20 training epochs.

For the sequence-to-sequence training, the cross-entropy loss function was adopted. The BLEU score [17] metric was considered to evaluate the text generation quality. We used early-stopping during training, monitoring the BLEU metric in the validation set. The training process is stopped after two epochs without improving the BLEU score. For evaluation, we repeated the training process with five different seeds (1000, 2000, 3000, 4000 and 5000) and obtained a \(37.254 \pm 0.783\) BLEU score. The best performing model achieved 38.607 BLEU on the test set. The fine-tuning was done using the HuggingFaceFootnote 2 library, and a Tesla P100 GPU with 16 GB VRAM.
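The early-stopping criterion (stop after two epochs without improving the validation BLEU) can be sketched as follows; the BLEU history is hypothetical:

```python
def early_stop_epoch(bleu_per_epoch, patience=2):
    """Return the 1-indexed epoch at which training stops: the first epoch
    after which `patience` consecutive epochs fail to improve the best
    validation BLEU, or the last epoch if that never happens."""
    best, stale = float("-inf"), 0
    for epoch, bleu in enumerate(bleu_per_epoch, start=1):
        if bleu > best:
            best, stale = bleu, 0  # new best: reset the patience counter
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(bleu_per_epoch)

# Hypothetical validation BLEU per epoch: improves until epoch 4, then stalls.
history = [20.1, 28.5, 33.0, 37.2, 36.9, 37.0, 36.5]
stop = early_stop_epoch(history)
```

Here training halts at epoch 6, two epochs after the best score at epoch 4.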

Figure 2 shows the train and validation losses for the best execution of the PTT5 model in addition to the validation BLEU scores. The model was trained for 17 epochs, totaling 5219 iterations.

3.3 Decoding Methods Evaluation

To evaluate the generated text and compare the decoding methods, we concatenated the generated keyphrases to their original documents and used a realistic retrieval use case to extract IR metrics. We opted for an IR task in order to evaluate the keyphrases (created using different decoding methods) in their intended use: improving retrieval tasks. The details of the evaluation are presented next.

Decoding Methods. Decoding techniques are used to guide neural text generation in order to produce meaningful and coherent text. The methods are used to generate human-readable text from the internal representations of language models. In this work, we evaluate three decoding methods: greedy, top-K [4] and top-p [8]. Top-K and top-p are sampling decoding methods that, during text generation, sample tokens from restricted candidate sets. A brief description of each method follows.

  • Greedy: always selects the most probable token (highest softmax score) at each generation step.

  • Top-K: restricts sampling to the K most probable tokens at each step, redistributing their probabilities among them before sampling.

  • Top-p: limits the set of selectable tokens to the smallest set of most probable tokens whose summed probabilities reach the established threshold p. Note that the number of tokens that can be chosen is dynamic, since the probability distributions vary at each step.
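The three methods can be sketched over a single next-token probability distribution. The token names and probabilities below are hypothetical, standing in for one softmax output of the generator:

```python
import random

def greedy(probs):
    """Pick the single most probable token."""
    return max(probs, key=probs.get)

def top_k_candidates(probs, k):
    """Keep the k most probable tokens and renormalise their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def top_p_candidates(probs, p):
    """Keep the smallest set of most probable tokens whose cumulative
    probability reaches p, then renormalise."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cum += prob
        if cum >= p:
            break
    total = sum(pr for _, pr in kept)
    return {tok: pr / total for tok, pr in kept}

def sample(candidates, rng):
    """Sample one token from the filtered, renormalised distribution."""
    toks, weights = zip(*candidates.items())
    return rng.choices(toks, weights=weights, k=1)[0]

# Hypothetical next-token distribution at one decoding step.
probs = {"recurso": 0.5, "agravo": 0.3, "habeas": 0.15, "embargos": 0.05}
rng = random.Random(1000)
```

Note how top-p is dynamic: with \(p=0.9\) three tokens survive the filter, while \(p=0.5\) leaves only one.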

Task Formulation. We used the themes (categorical information) present in the dockets' metadata to simulate a retrieval task in which a specialist seeks to retrieve documents similar to a query document using a search engine. The themes are unique identifiers mapped to common questions of law. Thus, we adopt a binary relevance definition: given a query document Q, the documents relevant to Q are those with the same theme as Q. Note that, in the real scenario, the documents consist of dockets containing both keyphrases and enumerated paragraphs.

From the collected decisions, only 801 have themes. These documents were removed from the training set and used to prepare query - relevant document pairs for IR evaluation. The query set consists of dockets whose themes occur at least twice. From those, we prepared 482 query - relevant document pairs (pairs of same theme documents).

To prepare the final retrieval corpus, we combined the test set presented in Sect. 3.1 with the dockets with themes and obtained a total of 23,194 documents. We increased the retrieval corpus to make the retrieval task more challenging. In the worst case, the documents without a theme may introduce false negatives (documents with the same theme of the query, but considered non-relevant), hindering the IR metrics.

Experimental Setup. Two different experiments were performed during the IR evaluation; they are described below.

  1. Studying Sampling-based Decoding Methods: this experiment aims to investigate the generation of multiple keyphrases from a single docket using sampling decoding. By concatenating multiple keyphrases to a single docket, we expect to see improvements in the IR metrics, since we are adding more text variations. We generated up to 10 keyphrases for each example in the search corpus, using top-K and top-p sampling, and concatenated them to the original input (enumerated paragraphs) before computing the IR metrics. We repeated the text generation five times with different random seeds (1000, 2000, 3000, 4000 and 5000) and aggregated the results for comparisons. The effects of the K parameter of top-K sampling and the p parameter of top-p sampling were also evaluated in this experiment, varying the values of both. We chose K and p from the following sets of values: \(K \in \{15, 50, 100\}\) and \(p \in \{0.1, 0.5, 0.9\}\). Note that, in this experiment, we are not interested in determining the best number of repetitions, nor the best value for K or p. The goal is to investigate the effect of these parameters on the proposed IR task, although the results may indicate the best parameter ranges.

  2. Decoding Methods Comparison: to compare the decoding methods, we extracted IR metrics for dockets using keyphrases generated with top-K and top-p sampling, using the generated keyphrases in place of the originals. For reference, we also evaluated IR metrics considering documents with and without the original keyphrases (for both query and corpus documents) and using simple greedy decoding. Based on the results of the previous experiment, we chose to use only one keyphrase generated by each method, so as to evaluate the decoding methods in comparable scenarios. For this experiment, we used the following parameters for the sampling-based decoding methods: \(K=15\) and \(p=0.9\) (based on the performances obtained in the previous experiment).

The experiments with sampling decoding methods were inspired by doc2query [14]. In that work, for each example in a corpus, the authors generated several queries related to the example's content using a sequence-to-sequence Transformer model with top-K sampling. The queries were then concatenated to the input documents in order to improve IR metrics. In both experiments with sampling methods, we used contrastive search with \(\alpha =0.6\), based on the original paper [28]. We chose the K and p values based on previous works with top-K and top-p sampling [8, 14].

Information Retrieval Methods and Metrics. To evaluate the proposed IR task, we chose two traditional methods: TF-IDF and BM25 [21]. The methods were chosen due to their popularity in search engines (such as LuceneFootnote 3) and competitive performance [18, 23]. Previous works [12, 13, 15] also discussed that sparse representation methods (such as the chosen ones) tend to perform better in similar tasks in the legal domain.

As additional preprocessing for the IR methods, the documents were tokenized, and Portuguese stop-words and punctuation were removed. For TF-IDF, we used a vocabulary of 10,000 tokens (appearing at least three times) and n-grams from 1 to 3. To sort documents during retrieval with TF-IDF, we used the cosine similarity between queries and documents. For BM25, the documents were sorted by the probability ranking principle, estimating the relevance of a document to the presented query. The additional preprocessing was done using spaCyFootnote 4 and sklearnFootnote 5. For BM25, we used the implementation and default parameters from rank-bm25Footnote 6.
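For illustration, BM25 scoring over a tokenized corpus can be sketched as below. This is a simplified re-implementation of Okapi BM25 (with the common \(k_1=1.5\), \(b=0.75\) defaults), not the rank-bm25 code itself, and the toy documents are hypothetical:

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score each tokenized document in `corpus` against `query`
    with a simplified Okapi BM25 variant."""
    N = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / N
    # Document frequency of each term.
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalisation.
            norm = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores

# Hypothetical (already tokenized) corpus and query.
corpus = [
    "agravo regimental recurso especial".split(),
    "habeas corpus prisao preventiva".split(),
    "recurso especial prescricao".split(),
]
query = "recurso especial".split()
scores = bm25_scores(query, corpus)
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
```

The shorter document containing both query terms ranks first; the document sharing no terms with the query scores zero, which is exactly why discriminative keyphrase terms help sparse methods.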

Finally, we evaluated the performance in the proposed IR task using two traditional IR metrics: Mean Reciprocal Rank (MRR) and Recall. The metrics were chosen for their use in similar works in the legal domain [24, 26]. We used a threshold of 10 documents (top-10 ranked documents) to compute the metrics. According to Russel et al. [24], law professionals tend to analyse, for the most part, up to 50 documents in their searches. Therefore, we are evaluating an even more challenging scenario than the one described by the authors.
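The two metrics at the top-10 cutoff can be sketched as below; the query and document identifiers are hypothetical:

```python
def mrr_at_k(rankings, relevant, k=10):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant
    document among the top-k results of each query."""
    total = 0.0
    for qid, ranked in rankings.items():
        for pos, doc in enumerate(ranked[:k], start=1):
            if doc in relevant[qid]:
                total += 1.0 / pos
                break
    return total / len(rankings)

def recall_at_k(rankings, relevant, k=10):
    """Average fraction of each query's relevant documents found in the top-k."""
    total = 0.0
    for qid, ranked in rankings.items():
        hits = sum(1 for doc in ranked[:k] if doc in relevant[qid])
        total += hits / len(relevant[qid])
    return total / len(rankings)

# Toy run: hypothetical ranked document ids per query.
rankings = {"q1": ["d3", "d7", "d1"], "q2": ["d2", "d5", "d9"]}
relevant = {"q1": {"d7"}, "q2": {"d2", "d9"}}
```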

3.4 Qualitative Analysis

As a final analysis, we sampled examples generated using each of the evaluated decoding methods (greedy, top-K and top-p) and performed a qualitative analysis on them. We compared the generated keyphrases to the references and discussed the similarities between them and the effects of the sampling methods.

4 Results and Discussions

This section discusses the results obtained for each experiment described in Sect. 3.3. In all experiments, we aim not to compare the retrieval methods (TF-IDF and BM25), but to use them to evaluate the quality of the keyphrases generated using different decoding methods.

4.1 Studying Sampling-Based Decoding Methods

Tables 1 and 2 present the IR metrics obtained, varying both the number of repetitions and the K and p parameters. The metrics consist of the mean of five different executions (using five different seeds).

Table 1. Top-k experiments evaluation metrics. N represents the number of samples included at the beginning of each docket.
Table 2. Top-p experiments evaluation metrics. N represents the number of samples included at the beginning of each docket.

When carrying out this experiment, the expectation was to observe a logarithmic growth as more distinct keyphrases were concatenated to the dockets (similar to doc2query [14]), since we are using more variations of keyphrases. However, this result was not observed in any of the evaluated metrics. Contrary to expectations, in the worst cases, there was a decay in the metrics as new variations were added to the input texts for both top-K and top-p decoding. The decay is more noticeable for the TF-IDF method, with reductions between 2% (top-K) and 12% (top-p) in all observed metrics. The mentioned behaviours were observed for all evaluated K and p values.

The worst performances were observed when increasing the number of repetitions for \(p=0.1\) (top-p experiments). The most probable explanation is that the low p value is too restrictive, reducing the set of selectable tokens. Hence, top-p tends to generate keyphrases with low text variation (more similar or equal keyphrases). The repetitive text thus hindered the performance of both IR methods evaluated.

By increasing the K and p values, we increase the variability of the generated text, since the tokens to be predicted are chosen from a larger set. A positive effect on the metrics was also expected, due to the possibility of adding more discriminative terms to the generated keyphrases, which is beneficial for the evaluated sparse methods. However, we observed deterioration of the performance and, in the best case, similar metrics when varying the K and p values. We suspect that, even with the increase in variability, the generated keyphrases remained similar to each other, resulting in the addition of repetitive text to the dockets.

The conclusion from these results is that there is no evidence that using more keyphrases (by using sampling decoding) is beneficial to the evaluated task. Also, there is no benefit in using K values above 15, and p values lower than 0.9. We will discuss the results of the sampling methods further in Sect. 4.3.

4.2 Decoding Methods Comparison

Table 3 presents the results comparing the decoding methods. For both TF-IDF and BM25, a single keyphrase was generated with each decoding method (greedy, top-K and top-p). We adopted \(K=15\) and \(p=0.9\) for the top-K and top-p decoding, respectively, based on the results from the previous experiment. Table 3 also presents the results obtained performing the proposed retrieval task with and without the original (reference) keyphrases, for comparison.

Table 3. IR metrics obtained for each decoding method evaluated. Superscript characters denote a pairwise statistically significant difference, according to a paired T-test (p-value \(< 0.05\)).

We observe that the keyphrases are, indeed, beneficial to retrieval tasks by comparing the metrics obtained by using documents with and without the keyphrases. For both metrics, we observed statistically significant differences. Since both sparse methods (TF-IDF and BM25) benefit from the existence of discriminative terms in the documents, these results were already expected. Note that the metrics obtained using the original keyphrases act as an upper bound to our experiments.

Considering the TF-IDF retrieval, we observed an increment in all metrics by using the generated texts (compared to not using any). The differences in all metrics are statistically significant (see Table 3a). For the BM25 method, we observe similar results (see Table 3b). However, no significant differences were observed when considering the R@10 metric. Note that by using generated keyphrases, there is the possibility of introducing false positives (false similar) and false negatives (false non-similar) in the search corpus, originated by noisy keyphrases. The IR metrics obtained by the BM25 method suggest that the method was sensitive to these noisy examples.

By comparing the decoding methods, we note small increments for the sampling methods in relation to greedy decoding. However, considering a paired T-test with a threshold of 5%, there is no significant difference between the decoding methods. Thus, there is not enough evidence to reject the null hypothesis (that the metrics have the same mean) when comparing the metrics of the three decoding approaches. Therefore, there is no evidence that justifies choosing sampling decoding methods over the simpler greedy decoding approach, considering the proposed task.

4.3 Qualitative Analysis

Fig. 3. Examples generated using greedy decoding.

Greedy Decoding. Examples of keyphrases generated by PTT5 using greedy decoding are presented in Fig. 3. We can note that, although generated by a model trained on a modest training set (fewer than 100K examples) and achieving BLEU scores close to 40, the generated keyphrases do not present spelling or lexical errors. They capture the writing style of the reference keyphrases and are very similar to the keyphrases written by humans.

Fig. 4. Histogram comparing the number of space-separated tokens between the reference keyphrases and the ones generated using greedy decoding.

A comparison between the number of tokens of the original and the generated keyphrases (using greedy decoding) is shown in Fig. 4. It is possible to observe that, although the two histograms present similar distributions, the generated keyphrases have a higher concentration of examples below 60 tokens. The generated keyphrases also average fewer space-separated tokens than the references (42.34 compared to 48.28).

Therefore, we identified that the keyphrases generated with greedy decoding tend to have a smaller length (in tokens) than the originals. We also observed the same pattern for the keyphrases generated using sampling decoding.

Sampling Decoding. Figure 5 shows examples of keyphrases generated using top-K and top-p decoding. We used \(K=15\) and \(p=0.9\), based on the results from the previous analysis. From the examples, it is possible to note that the main effect of using sampling is the generation of paraphrases of the original keyphrase. We also observe examples of reordering of the phrases present in the keyphrases. Hence, the generated keyphrases tend to be similar to each other. These behaviours are explained by the working of Transformer-based language models, since, during text generation, they tend to generate tokens that appear in similar contexts.

By using sampling-based methods, we observed an increase in text variability. However, the possibility of the model generating text unrelated to the input also increases, which may have harmed the IR methods studied. In addition, when concatenating multiple variations of keyphrases similar to each other, we added many repeated terms to the documents, which may negatively affect the sparse IR methods evaluated.

In addition to the justifications presented, the amount of training data may also have affected the sampling methods. Although the results for greedy generation were better, the lack of variability in the training examples (due to the modest size of the training set) may have harmed decoding with top-K and top-p sampling.

Fig. 5. Examples of keyphrases generated by top-K (\(K=15\)) and top-p (\(p=0.9\)) decoding.

5 Conclusion and Future Works

In this paper, we successfully trained a sequence-to-sequence Transformer to generate keyphrases and investigated three different text decoding methods. The results show that the keyphrases bring significant increments to IR metrics when used in combination with the dockets. This result was observed for all the keyphrases evaluated: the references and the generated ones (using greedy, top-K and top-p decoding). Although we evaluated different parameters and concatenated multiple variations of keyphrases generated using sampling decoding (top-K and top-p), the simpler greedy decoding performed similarly to these methods. We presented and discussed possible justifications for this behaviour, and the results suggest that greedy decoding is enough for keyphrase generation from legal dockets.

As future work, we intend to experiment with pre-training language models on legal documents in order to improve keyphrase generation. Furthermore, we aim to improve the quality of the training data by collecting dockets from more sources around Brazil. Finally, this work can also inspire other works aiming to automate text writing in the legal domain.