1 Introduction

The Brazilian judicial system has traditionally adopted a conservative stance regarding its lawsuit bureaucracy. However, in the last two decades, it has started to adopt information technology systems to improve its underlying procedures. In 2006, the Brazilian Congress passed a law (Footnote 1) regulating electronic lawsuit systems. Since then, all Brazilian courts have begun to digitize their lawsuit bureaucracy. This process has greatly accelerated the filing and intermediate procedures of lawsuits, which, for instance, allowed a 22% increase in the number of cases filed in 2019 compared to 2009 [18]. Although this effect is mostly beneficial, it presents its own challenges. One issue is that the main bottleneck of the Brazilian judicial system has shifted from the intermediate procedures performed by civil registries to judges’ offices.

Nowadays, most documents in a Brazilian lawsuit are either born digital or digitized before being included in the judicial system. Nevertheless, several tasks within a lawsuit procedure still depend on direct human intervention. In general, every time a party (judge, complainant, defendant, etc.) is asked to make a pronouncement in a case, this party may need to carefully analyse different documents. In particular, the petition (petição inicial, in Portuguese) is the first document presented by the complainant in order to file a civil lawsuit in Brazilian courts. In this document, the complainant needs to identify the involved parties, expose the facts that are relevant to the case, present the legal basis and relevant prior cases, and describe the requested remedy, among other aspects. After a petition is filed, it needs to be carefully analysed by the staff of a judge’s office in order to verify different aspects. For instance, it is necessary to check whether the petition meets the minimum requirements specified in the Brazilian Code of Civil Procedure (Código de Processo Civil) [4]. Ultimately, the petition must be fully considered by the assigned judge, which requires yet another careful analysis. Although it is virtually impossible to automate all these manual analytic steps, machine learning (ML) and natural language processing (NLP) techniques can be leveraged to assist humans in most, if not all, of them.

Rhetorical role identification (RRI) is an NLP task that consists of labeling the sentences of a document according to a given set of semantic functions (rhetorical roles). This task is useful for applications such as document summarization [11, 19,20,21, 26], semantic search [15], and document analysis [22]. Different approaches have been proposed to solve this task for court decisions [2, 20, 21]. These previous works have proposed to segment a court decision into rhetorical roles such as facts, court arguments, prior related cases, and ruling, among others. As far as we know, all previous RRI methods in the legal domain have tackled only court decisions, mostly in English. Although the benefits of analytic tools for judicial decisions are undisputed, it is also important to provide similar tools for other legal documents and languages, particularly for petitions in the Brazilian legal system. As mentioned before, a judge’s office in Brazil handles a high volume of work, and petition analysis represents a substantial portion of it. Therefore, rhetorical role identification for petitions represents an opportunity to improve the efficiency of Brazilian judges’ offices.

In this paper, we propose to segment petitions into eight rhetorical roles, mainly considering the analytic needs of judges’ offices in Brazil. We present a corpus of 70 petitions comprising more than 10 thousand sentences manually labeled with the proposed rhetorical roles. These petitions were taken from civil lawsuits filed in the court of the Brazilian state of Mato Grosso do Sul (TJMS). In order to propose relevant rhetorical roles, we considered the Brazilian Code of Civil Procedure [4], which defines the minimum requirements of a petition. The proposed rhetorical roles are: identification of parties, facts, arguments, legal basis, precedents, requests, remedy, and others. The corpus was anonymized and is publicly available. We trained and evaluated six machine learning methods, including classic and deep learning models, on this corpus. The best-performing method was an end-to-end-trained BERT-based [6] method that obtained a sentence-level micro-averaged F-score of 80.50. This method outperformed a fixed-representation BERT-based model by five points and the best classic algorithm (SVM with TFIDF representation) by almost 20 points of F-score.

As far as we know, this is the first paper to approach rhetorical role identification of petitions instead of court decisions. Moreover, this is also the first work on this task that considers Portuguese legal documents. We hope this paper will foster research on this important subject, especially in Brazil.

2 Related Work

One of the first works to deal with text segmentation based on rhetorical function was proposed by Teufel [25]. In this work, Teufel proposed a segmentation approach called argumentative zoning, which is based on the rhetorical function that an excerpt plays within a scientific article. This approach was later applied to extractive summarization [5, 12, 26].

Hachey and Grover [11], inspired by Teufel’s work, proposed a legal text summarization method based on rhetorical role segmentation. The proposed roles were: fact, proceedings, background, framing, disposal, textual, and other. They assessed the proposed method on the HOLJ dataset [10], which consists of 40 manually labeled decisions from the House of Lords, the highest court in the United Kingdom until 2009. They employed a C4.5 classifier that achieved an accuracy of \(65.4\%\).

Saravanan [19] also followed Teufel’s work to study the generation of headnotes (a type of summary) for decisions from Indian courts. He argued that the rhetorical roles proposed by Teufel were not appropriate for the legal domain. Thus, he proposed the following rhetorical roles for legal decisions: case identification, facts, arguments, chronology, argument analysis, ratio decidendi, and final decision. Later, based on these roles, Saravanan et al. [21] created a corpus of 200 decisions from the Indian Court of Justice spanning three different domains. They developed and evaluated a Conditional Random Field (CRF) classifier for sentence-level rhetorical role identification, which obtained the following F-scores: 84.9 on rent control, 81.7 on income tax, and 78.7 on sales tax.

Nejadgholi et al. [15] leveraged rhetorical role identification to perform semantic search. They defined eight rhetorical roles and manually labeled the sentences in 150 decisions from immigration and refugee cases of Canada’s Federal Court and Supreme Court. They assessed a simple fastText-based classifier that obtained an accuracy of 90%.

Walker et al. [28] proposed six rhetorical roles to support an argument mining system. They produced a corpus of 50 decisions extracted from the U.S. Board of Veterans’ Appeals comprising 6,153 sentences. They evaluated three different classification algorithms on this sentence classification corpus. The best-performing classifiers (based on logistic regression and SVM) achieved 85.7% accuracy.

Bhattacharya et al. [2] annotated 50 decisions from the Supreme Court of India using the same rhetorical roles proposed by Saravanan [19]. They trained a hierarchical BiLSTM model with a CRF layer on this corpus, obtaining an F-score of 82.0.

Yamada et al. [29] leveraged the argumentation structure of legal decisions from the Japanese justice system to improve a summarization system. They manually annotated a corpus of 209 decisions with seven rhetorical roles. Using a BiLSTM-CRF model, they obtained a macro-averaged F-score of 65.4 on this corpus.

All the papers mentioned above are based on legal decisions. Although this kind of document is highly important, other types of documents are also relevant to different applications in the legal domain. Another aspect worth mentioning is that most previous works on rhetorical role identification were performed on documents from legal systems based on the common law paradigm. The Brazilian legal system, however, is based on the civil law paradigm, which implies some key differences in the structure of the documents within a legal case. Therefore, it is worth studying rhetorical role identification within the Brazilian legal system.

A few previous works [7,8,9] have tackled rhetorical role identification in Portuguese, but, as far as we know, none of them addresses legal documents. Recently, some papers have proposed resources and systems for NLP problems involving Portuguese legal documents, mainly document classification [1] and named entity recognition [13, 24]. To the best of our knowledge, our work is thus the first on rhetorical role identification for Portuguese legal documents, and also the first to perform this task on petitions in any language. Moreover, we propose some novel rhetorical roles that are specific to petitions.

3 Corpus and Rhetorical Roles

We created a corpus of petitions from the court of the Brazilian State of Mato Grosso do Sul (TJMS), which receives hundreds of thousands of lawsuits every year. In 2017, for example, more than 340 thousand cases of different types, such as civil, criminal, and fiscal, were filed in TJMS. We initially considered civil cases that were judged between 2014 and 2018 and are not under judicial secrecy. We then took the petitions of these cases, which are usually written by a lawyer representing the complainant. The system used to file cases in TJMS accepts a petition only as a PDF file. However, there is no requirement as to how this file is created; it can even be produced by scanning a paper document. Therefore, such documents present great variation regarding encoding, content structure, background watermarks, headers, and footers, among other aspects. Since these aspects may introduce a lot of noise into the content extracted from the petitions, we selected 70 petitions, extracted their textual content using PDFBox (Footnote 2), and manually inspected it to remove noisy content. We split each petition into sentences and tokens using the NLTK library [3]. We obtained a total of 10,784 sentences comprising 249,105 tokens (20,914 unique tokens). The average length of a petition is 154 sentences, and the average length of a sentence is 23.1 tokens. To provide a better understanding of this corpus, Fig. 1 presents the histogram of sentence lengths.
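The corpus statistics above can be recomputed with a few lines of code. The following is a minimal sketch, not the pipeline used to build the corpus: it assumes each petition has already been split into sentences and tokens (with NLTK, in our case), and the function name `corpus_stats` is hypothetical. Applied to the reported totals, 10,784 / 70 ≈ 154 sentences per petition and 249,105 / 10,784 ≈ 23.1 tokens per sentence.

```python
from collections import Counter


def corpus_stats(petitions):
    """Summarize a corpus given as a list of petitions.

    Each petition is a list of sentences, and each sentence is a list of
    tokens (e.g. the output of a sentence splitter plus a word tokenizer).
    """
    sentences = [sent for petition in petitions for sent in petition]
    n_sentences = len(sentences)
    n_tokens = sum(len(sent) for sent in sentences)
    vocabulary = {tok for sent in sentences for tok in sent}
    # Histogram of sentence lengths (length in tokens -> number of sentences),
    # i.e. the data plotted in Fig. 1.
    length_histogram = Counter(len(sent) for sent in sentences)
    return {
        "sentences": n_sentences,
        "tokens": n_tokens,
        "unique_tokens": len(vocabulary),
        "avg_sentences_per_petition": n_sentences / len(petitions),
        "avg_tokens_per_sentence": n_tokens / n_sentences,
        "length_histogram": dict(sorted(length_histogram.items())),
    }
```

Note that the unique-token count depends on tokenizer choices such as case folding, so a different tokenizer will not reproduce the figure of 20,914 exactly.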

Fig. 1. Frequency of sentences per length (in tokens)

For all documents, personal information was manually identified and anonymized by random substitution. More specifically, names of people, institutions, streets, and cities were replaced by random names taken from a database containing two thousand names. Personal identification numbers and telephone numbers were replaced by random numbers.
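The substitution step can be sketched as follows. This sketch assumes that the spans containing personal information have already been identified (manually, in our case) and are passed in as lists. The helper names, the tiny name pool, and the choice of reusing one pseudonym per entity are illustrative assumptions, not a description of the actual procedure. Separators such as '.' and '-' are kept, so identification numbers retain their layout, while unlisted numbers (dates, article numbers) are left untouched.

```python
import random
import re


def _substitute(text, spans, make_replacement):
    """Replace every occurrence of each span in a single regex pass.

    Longer spans are tried first, so a name containing a shorter listed name
    is replaced as a whole, and inserted replacements are never re-scanned.
    """
    if not spans:
        return text
    replacements = {span: make_replacement(span) for span in spans}
    pattern = "|".join(
        re.escape(s) for s in sorted(replacements, key=len, reverse=True))
    return re.sub(pattern, lambda m: replacements[m.group(0)], text)


def anonymize(text, names, numbers, name_pool, rng):
    """Swap identified names for random pseudonyms and randomize identified numbers."""
    # Each identified name gets one random pseudonym, reused for all its mentions.
    text = _substitute(text, names, lambda _: rng.choice(name_pool))
    # Each digit of an identified number becomes a random digit.
    return _substitute(
        text, numbers,
        lambda num: re.sub(r"\d", lambda _: str(rng.randrange(10)), num))


rng = random.Random(0)
pool = ["MARIA SILVA", "JOÃO SOUZA"]  # hypothetical stand-in for the name database
print(anonymize("ADEMIR CANTAREIRA, CPF 179.939.452-04.",
                ["ADEMIR CANTAREIRA"], ["179.939.452-04"], pool, rng))
```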

To define the rhetorical roles used to label sentences, we took inspiration from the Brazilian Code of Civil Procedure (Código de Processo Civil) [4]. More specifically, we considered articles 319, 320, and 321 of this law, which state the minimum content requirements of a petition in the Brazilian legal system. In addition to these legal requirements, we considered two possible applications for this corpus: (i) supporting the staff of a judge’s office (including the judge) when analysing a filed petition; and (ii) petition summarization to support the generation of headnotes that need to be included in the judge’s pronouncement within the legal case.

In the following, we list the eight proposed rhetorical roles for sentences within a petition. The terms in parentheses are the Portuguese translations of the rhetorical roles. For each rhetorical role, we also present some example sentences.

  1. Identification (identificação das partes): Identification of the parties (complainant, appellant, judge, among others) involved in the case. Usually such sentences occur at the beginning of the document.

    • ADEMIR CANTAREIRA E PAULA, brasileiro, solteiro, auxiliar de produção, portador do CPF nº.

    • 179.939.452 - 04 e do RG nº.

    • 103673421 SSP/MS, residente e domiciliado sito á Rua do Carapicuibano, nº.

  2. Facts (fatos): Factual account that gave rise to the lawsuit. Usually these sentences occur right after the identification, but they can be presented interspersed with the arguments.

    • No dia 28/01/2013 o funcionário da Requerida, Sr. Jorge, realizou serviços de substituição de válvulas no apartamento do síndico do condomínio, Sr. Ailton Cesar, que aproveitando o ensejo contratou verbalmente o profissional para verificar os hidrômetros do prédio e, se necessário, consertar ou substituir dois registros de água do condomínio.

  3. Arguments (argumentos): Statements intended to persuade the judge to adopt a certain line of thought. These are the sentences that present the greatest variation in linguistic style. They are commonly interwoven with facts and legal basis.

    • O presente pleito tem como objetivo, entre tantos outros, ressarcir o condomínio autor pelos danos morais, coibindo a atitude abusiva da Requerida, em razão do constrangimento sofrido pelo síndico, Sr. Leandro Musk, o qual se sente lesado porque ficou extremamente constrangido ao ter que pagar e explicar perante os condôminos do prédio um gasto exorbitante de R$ 1.660,00 para trocar apenas dois castelos de registros.

  4. Legal basis (fundamentação legal): Description of the legal doctrine that provides the grounds for the decision-making process. It usually includes citations of laws and of texts by legal scholars. It is usually interspersed with the arguments and precedents, but we decided to create this specific label since its nature is more objective than that of arguments.

    • CC. Art. 186.

    • Aquele que, por ação ou omissão voluntária, negligência ou imprudência, violar direito e causar dano a outrem, ainda que exclusivamente moral, comete ato ilícito.

  5. Precedents (jurisprudência): Citations of headnotes of similar or relevant previous cases. Some sentences within this part are in capital letters, since this is the typical format of headnotes in the Brazilian legal system. This part also occurs interspersed with the arguments.

    • (STJ - REsp 927727 / MG, RECURSO ESPECIAL, 2007/0038830-3, Dês.

    • Rel.

    • Ministro JOSÉ DELGADO, Primeira Turma, 06/05/2008).

  6. Requests (pedidos): The main requests of the complainant, on which the judge must rule. This is usually the penultimate section of the petition.

    • Ante o exposto com supedâneo nos motivos de fato e nos fundamentos do direito pátrio, requer a Vossa Excelência se digne de.

    • a) citar a Ré, na pessoa de seu representante legal, para que, assim desejando, compareça á audiência designada e apresente defesa, sob pena de revelia e confissão (art.

  7. Remedy (valor da causa): Normally contained in one or two sentences at the end of the petition, this part provides a monetary estimate to remedy the complainant’s claims. This is usually the last section of the petition.

    • Dá-se á presente o valor de R$ 4.450,00 (quatro mil quatrocentos e cinquenta reais), para fins de alçada e fiscais.

  8. Others (outros): Other textual elements like section and page headings.

    • DOS REQUERIMENTOS.

    • DO DIREITO.

In Table 1, we present the number of sentences and the average number of tokens per sentence for each rhetorical role in our corpus. In the first column of this table, we provide a label for each rhetorical role for ease of reference.

Table 1. Basic statistics per rhetorical role

Using the doccano annotation tool [14], an expert labeled each sentence of the petitions with one of the eight rhetorical roles described above. The order of sentences within each petition was preserved in the corpus, so that future approaches can leverage this sequential structure.
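A simple in-memory layout that preserves sentence order is a list of petitions, each an ordered list of (tokens, role) pairs; the per-role statistics of Table 1 then follow directly. This is a hypothetical sketch: the released corpus may use a different file format, and `ROLES` and `role_statistics` are names introduced here for illustration.

```python
from collections import defaultdict

# The eight rhetorical roles in the order of Sect. 3, so position k holds
# role k + 1 (Facts is role 2, Arguments role 3, Legal basis role 4, ...).
ROLES = ("Identification", "Facts", "Arguments", "Legal basis",
         "Precedents", "Requests", "Remedy", "Others")


def role_statistics(petitions):
    """Per-role sentence counts and mean tokens per sentence.

    `petitions` is a list of petitions; each petition is an ordered list of
    (tokens, role) pairs, so the original sentence sequence is preserved.
    Roles that never occur are omitted from the result.
    """
    counts = defaultdict(int)
    token_totals = defaultdict(int)
    for petition in petitions:
        for tokens, role in petition:
            if role not in ROLES:
                raise ValueError(f"unknown role: {role!r}")
            counts[role] += 1
            token_totals[role] += len(tokens)
    return {role: (counts[role], token_totals[role] / counts[role])
            for role in ROLES if counts[role]}
```

Keeping whole petitions as ordered sequences, instead of a flat list of labeled sentences, costs nothing here and leaves room for the sequence-aware models mentioned in Sect. 4.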

4 Text Representation and Machine Learning Approaches

Rhetorical role identification is usually modeled as a sentence classification task. Under this modeling, there are two broad types of methods in the literature: those that treat each sentence independently, and those that treat a document as a sequence of sentences, leveraging the inter-dependencies among them. Although the latter type considers relevant information that is ignored by the former, in this work we focus only on the former type. Our motivation is that our main goal here is to investigate the feasibility of the proposed RRI task. Indeed, we show experimentally that, even when ignoring the sequential aspect of the sentences, it is possible to achieve a highly effective rhetorical role classifier (F-score above 80) for sentences within petitions from a Brazilian court. We also assess different text representations and machine learning algorithms on this task.

4.1 Baseline Methods

We considered some standard text classification approaches as baseline methods. More specifically, we combined Naive Bayes (NB) and Support Vector Machine (SVM) models with two standard text representation techniques, namely bag of words (BoW) and term frequency-inverse document frequency (TFIDF). BoW and TFIDF are among the most widely used text representation techniques. Although deep learning representations have dominated the latest advances in text processing, BoW and TFIDF can still be competitive, mainly in text classification problems. Thus, these classic representations provide strong baselines for evaluating the incremental contribution of more complex methods, such as those based on deep learning.
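For concreteness, here is a dependency-free sketch of the two representations, treating each sentence as a document. The TFIDF weighting mirrors the defaults of scikit-learn's `TfidfVectorizer` (raw counts times a smoothed idf, \(\mathrm{idf}(t) = \ln\frac{1+n}{1+\mathrm{df}(t)} + 1\), followed by L2 normalization). Whether these defaults match the exact configuration behind Table 2 is an assumption.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Sparse term-count vectors (term -> count) for tokenized documents."""
    return [Counter(doc) for doc in docs]


def tfidf(docs):
    """TF-IDF vectors with smoothed idf and L2 normalization."""
    n = len(docs)
    # Document frequency: the number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for counts in bag_of_words(docs):
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Empty documents have no weights to normalize.
        vectors.append({t: w / norm for t, w in weights.items()} if norm else weights)
    return vectors
```

In practice, the idf statistics should be estimated on the training split only and reused for the validation and test sentences, which is what scikit-learn's separation between `fit` and `transform` achieves.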

NB is a probabilistic ML model that is particularly well suited to text classification, while linear SVM is an effective model for problems with sparse input data, such as text. Both models offer competitive performance in many text classification problems.
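The NB model can likewise be sketched in a few lines. This is a minimal multinomial Naive Bayes over BoW token counts with additive (Laplace) smoothing, where `alpha=1.0` matches the default of scikit-learn's `MultinomialNB`. It is an illustrative reimplementation, not the code that produced Table 2.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Multinomial Naive Bayes on token counts with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        class_counts = Counter(labels)
        # log P(c): relative frequency of each class in the training data.
        self.log_prior = {c: math.log(class_counts[c] / len(labels))
                          for c in self.classes}
        term_counts = defaultdict(Counter)  # class -> term -> count
        for doc, label in zip(docs, labels):
            term_counts[label].update(doc)
        vocab = {t for doc in docs for t in doc}
        v = len(vocab)
        # log P(t | c) with add-alpha smoothing over the training vocabulary.
        self.log_likelihood = {}
        for c in self.classes:
            total = sum(term_counts[c].values())
            self.log_likelihood[c] = {
                t: math.log((term_counts[c][t] + self.alpha) / (total + self.alpha * v))
                for t in vocab
            }
        return self

    def predict(self, doc):
        """Most probable class; tokens unseen in training are ignored."""
        def score(c):
            ll = self.log_likelihood[c]
            return self.log_prior[c] + sum(ll[t] for t in doc if t in ll)
        return max(self.classes, key=score)
```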

4.2 Deep Learning Methods

In the last decade, deep learning models have been the most prominent approaches for NLP problems. Following the proposal of the seminal Transformer architecture [27], models based on attention mechanisms [6, 16, 17] became the standard approach for text tasks. BERT [6] is probably the most popular model among these approaches. For instance, among the more than 10 thousand NLP models available in the popular Hugging Face library (Footnote 3), approximately 25% are based on BERT or one of its variants. Of these BERT-based models, 18 were trained on Portuguese data.

In this work, we assessed two BERT-based models for rhetorical role identification by training and evaluating them on our proposed corpus. Both models are based on the BERTimbau Base model [23], a BERT language model pre-trained on Portuguese texts that is available in the Hugging Face library (Footnote 4). Fig. 2 depicts these two models. Both employ a BERT language model (LM) to compute a sentence representation; this part, which is the same for both models, is shown in black at the bottom of the figure. Before applying the BERT LM, the input sentence is first tokenized into N tokens \((T_1, \ldots , T_N)\), and then two special tokens, [CLS] and [SEP], are added at the beginning and at the end of the tokenized sentence, respectively. The resulting sequence of tokens ([CLS], \(T_1, \ldots , T_N\), [SEP]) is then fed into the language model, which outputs a 768-dimensional representation for each input token: \((E_\mathrm {[CLS]}, E_{T_1}, \ldots , E_{T_N}, E_\mathrm {[SEP]})\). For sentence classification tasks, as is the case for rhetorical role identification, one needs to compute a fixed-length representation of the variable-length input sentence. In this work, we use a common sentence representation approach, which consists of simply taking \(E_\mathrm {[CLS]}\) and ignoring the representations of all remaining tokens. Since BERT is composed of a stack of self-attention layers, the output \(E_\mathrm {[CLS]}\) can attend to all tokens of the sentence. Therefore, this representation is well suited to sentence classification tasks, regardless of which part of the input sentence is most important for the task at hand.

Fig. 2. Two BERT-based models for sentence classification. Black: the BERT language model computes a sentence representation \(E_\mathrm {[CLS]}\). This part is identical in both models, except that, for the model on the left, the BERT LM is fine-tuned during supervised training, whereas for the model on the right, the BERT LM is kept fixed and only the remaining layers are updated. Blue: the BERT-based sentence representation \(E_\mathrm {[CLS]}\) is fed to a single linear layer followed by a softmax function that outputs a probability distribution over the eight labels. This model is trained in an end-to-end fashion, i.e., the BERT model is fine-tuned together with the linear layer. Red: the sentence representation is fed to an MLP comprising two 512-neuron layers followed by a softmax function that outputs the label distribution. When training this model, the BERT LM is kept fixed and only the MLP parameters are updated.

In the upper part of Fig. 2, we present the remaining layers of the two sentence classification models used in our work. In blue (upper left of the figure), we represent the model that takes the BERT-based sentence representation \(E_\mathrm {[CLS]}\), passes it through a linear layer, and then applies a softmax activation function, which outputs a probability distribution over the eight rhetorical roles proposed in our work. This corresponds to the original BERT architecture for text classification problems. This model is trained in an end-to-end fashion, i.e., during supervised training, the parameters of both the output linear layer and the BERT language model are jointly learned. This approach performs so-called LM fine-tuning, because the BERT LM parameters are updated using labeled data for the final task.

The red part of Fig. 2 represents a model that feeds the BERT-based sentence representation \(E_\mathrm {[CLS]}\) to a two-layer Multi-Layer Perceptron (MLP). The output of this MLP is then passed to a softmax function that provides the output label distribution. When training this model, we keep the parameters of the BERT language model fixed. This allows us to evaluate the benefit of fine-tuning the BERT LM for the rhetorical role identification task.
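The two classification heads differ greatly in size. The sketch below counts their trainable parameters under stated assumptions: a 768-dimensional \(E_\mathrm {[CLS]}\), eight roles, fully connected layers with biases, and, for the red model, an 8-unit output layer between the second hidden layer and the softmax, which the figure does not show explicitly. It also spells out the blue head's forward pass, \(p = \mathrm{softmax}(W E_\mathrm {[CLS]} + b)\), in plain Python. The counts exclude the BERT parameters, which only the blue model updates.

```python
import math


def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


HIDDEN, N_ROLES = 768, 8

# Blue model: a single linear layer on E_[CLS] (trained jointly with BERT).
linear_head = dense_params(HIDDEN, N_ROLES)
# Red model: two 512-unit hidden layers plus an assumed 8-unit output layer
# (trained on top of a frozen BERT).
mlp_head = (dense_params(HIDDEN, 512) + dense_params(512, 512)
            + dense_params(512, N_ROLES))


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_softmax(e_cls, weights, bias):
    """p = softmax(W e_cls + b), with W given as one row per role."""
    logits = [sum(w * x for w, x in zip(row, e_cls)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


print(linear_head, mlp_head)  # 6152 660488
```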

5 Experimental Evaluation

In this section, we describe the experimental evaluation of the developed ML models on the proposed RRI corpus, modeling RRI as a sentence classification problem. To perform model selection and still report meaningful performance estimates, we randomly split the proposed corpus into three subsets: a training set containing 6,901 sentences, a validation set containing 1,726 sentences, and a test set containing 2,157 sentences.
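The reported sizes are consistent with two successive 80/20 random splits: first holding out 20% of the sentences for testing, then 20% of the remainder for validation, with each held-out size rounded up. This is a reading of the numbers rather than a documented procedure, and the function below is a hypothetical sketch of such a split.

```python
import math
import random


def three_way_split(items, test_frac=0.2, val_frac=0.2, seed=0):
    """Randomly split items into (train, validation, test) lists.

    The test set takes ceil(test_frac * n) items; the validation set then
    takes ceil(val_frac * remaining) of the items that are left.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(test_frac * len(shuffled))
    test, rest = shuffled[:n_test], shuffled[n_test:]
    n_val = math.ceil(val_frac * len(rest))
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


train, val, test = three_way_split(range(10_784))
print(len(train), len(val), len(test))  # 6901 1726 2157
```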

We use Naive Bayes and SVM models, combined with bag-of-words and TFIDF representations, to define meaningful baselines for this task. In these approaches, each sentence was preprocessed with NLTK (Footnote 5), the Natural Language Toolkit, to tokenize it and remove stopwords. We compute the BoW and TFIDF representations of the input sentences using the scikit-learn library (Footnote 6). Then, we train and evaluate NB and SVM models using both input representations, again making use of scikit-learn. The performance of these four baseline methods on our test set is reported in Table 2. We can observe that the best-performing baseline method is SVM with TFIDF representation, which obtains an F-score of 60.66. However, this method is only slightly better than NB with the same representation.
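To illustrate the linear SVM baseline without external dependencies, the sketch below trains one-vs-rest linear SVMs on sparse feature dicts (such as TFIDF weights) with the Pegasos stochastic subgradient method. Pegasos is a swapped-in solver chosen for brevity: scikit-learn's linear SVMs (e.g. `LinearSVC`, whose default loss is the squared hinge) use different solvers and also fit a bias term, which this sketch omits. The hyperparameter values are arbitrary assumptions.

```python
import random


def train_ovr_svm(vectors, labels, lam=0.01, epochs=20, seed=0):
    """One-vs-rest linear SVMs trained with Pegasos (no bias term).

    `vectors` are sparse feature dicts. For each class c, the weight vector
    w_c approximately minimizes lam / 2 * ||w_c||^2 + mean hinge loss, with
    target +1 for sentences of class c and -1 for all other sentences.
    """
    rng = random.Random(seed)
    models = {}
    for c in sorted(set(labels)):
        w, t = {}, 0
        for _ in range(epochs):
            order = list(range(len(vectors)))
            rng.shuffle(order)
            for i in order:
                t += 1
                eta = 1.0 / (lam * t)  # Pegasos step size
                y = 1.0 if labels[i] == c else -1.0
                x = vectors[i]
                margin = y * sum(w.get(f, 0.0) * v for f, v in x.items())
                # Regularization step: shrink all weights by (1 - eta * lam).
                for f in w:
                    w[f] *= 1.0 - eta * lam
                # Hinge-loss subgradient step, only for margin violations.
                if margin < 1.0:
                    for f, v in x.items():
                        w[f] = w.get(f, 0.0) + eta * y * v
        models[c] = w
    return models


def predict(models, x):
    """Return the class whose linear score w_c . x is highest."""
    return max(models, key=lambda c: sum(models[c].get(f, 0.0) * v
                                         for f, v in x.items()))
```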

Table 2. Test performances for different models and input representations. The column labeled F corresponds to micro-averaged F-score over the eight classes, and the remaining columns correspond to per-class F-score.

We also evaluated the four NB/SVM-based models with all words as input (including stopwords), but the results were substantially worse (F-score values were 5% lower on average). Thus, we do not report them.

To train and evaluate the two BERT-based methods, we did not employ any preprocessing step other than the corresponding tokenization. We trained and evaluated the end-to-end BERT-based model using the Hugging Face (HF) library. For the MLP-based classifier, we again employed the HF library to extract sentence representations, but then used the Keras library to train the MLP model, since the BERT LM is kept fixed. In Table 2, we refer to this fixed BERT LM as BERT\(^f\). The MLP network comprises two hidden layers with 512 neurons each. This network was trained with the NAdam optimizer for 500 epochs, using a batch size of 32 sentences and the categorical cross-entropy loss. The original (end-to-end) model was trained with the default HF AdamW optimizer for 10 epochs, using a batch size of 4 sentences. From the results in Table 2, we can observe that fine-tuning the BERT LM brings a 5-point increase in F-score. The best-performing method achieved an F-score of 80.50, which is almost 20 points higher than the performance of the best baseline method.
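The two training budgets differ considerably. Assuming both models are trained only on the 6,901 training sentences and that the last, smaller batch of each epoch is kept, the number of optimizer updates follows from simple arithmetic; the helper name below is hypothetical.

```python
import math


def optimizer_steps(n_examples, batch_size, epochs):
    """Parameter updates when the final partial batch of each epoch is kept."""
    return math.ceil(n_examples / batch_size) * epochs


TRAIN_SENTENCES = 6_901  # size of the training split
print(optimizer_steps(TRAIN_SENTENCES, 4, 10))    # end-to-end BERT: 17260
print(optimizer_steps(TRAIN_SENTENCES, 32, 500))  # frozen BERT + MLP: 108000
```

Although the MLP receives about six times as many updates, each update touches only the small head on precomputed \(E_\mathrm {[CLS]}\) vectors, whereas every fine-tuning step backpropagates through the full BERT LM.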

In Table 2, we also report per-class F-scores. Unexpectedly, the class with the best results is Remedy, which is the least frequent class. Although there are only 71 sentences labeled as Remedy in the whole corpus (including the three splits), such sentences have a very specific vocabulary (“Dá-se o valor da causa ...”, roughly “the value of the claim is set at ...”, for instance), which makes them easier to identify. We also observe that some frequent classes, namely Facts (2), Arguments (3), and Legal Basis (4), present the worst performances.

To investigate this issue, we present the confusion matrix of our best model’s predictions in Fig. 3. First, we can see that 73 sentences labeled as Facts (Fac) were classified by the model as Arguments (Arg), and another 73 sentences labeled as Arguments were classified as Facts. We also see that 181 sentences labeled as Arguments (Arg), Legal Basis (LB), or Precedents (Pre) are misclassified among these same three classes. As mentioned in Sect. 3, sentences from these four roles (Fac, Arg, LB, and Pre) are frequently interspersed with one another. Thus, these interspersed sentences are probably more difficult to classify.

Fig. 3. Confusion matrix of the best model’s predictions
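The per-class and micro-averaged F-scores of Table 2 can be derived from a confusion matrix such as the one in Fig. 3. A minimal sketch follows; the function names are illustrative. With exactly one label per sentence, every false positive for one class is a false negative for another, so the micro-averaged F-score reduces to accuracy.

```python
from collections import Counter


def confusion_matrix(gold, pred, labels):
    """Square matrix with rows for gold labels and columns for predicted labels."""
    counts = Counter(zip(gold, pred))
    return [[counts[(g, p)] for p in labels] for g in labels]


def per_class_f1(matrix):
    """F1 = 2TP / (2TP + FP + FN) for each class of a square confusion matrix."""
    scores = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                 # gold k, predicted as another class
        fp = sum(row[k] for row in matrix) - tp  # predicted k, gold is another class
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def micro_f1(matrix):
    """Micro-averaged F1; with a single label per sentence it equals accuracy."""
    total = sum(map(sum, matrix))
    return sum(matrix[k][k] for k in range(len(matrix))) / total
```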

6 Conclusions and Future Work

In this paper, we present a new corpus comprising 70 petitions (petição inicial) from civil lawsuits filed in TJMS. Based on the Brazilian Code of Civil Procedure, we proposed eight categories representing the rhetorical roles of sentences within petitions. We manually labeled the sentences of the proposed corpus using these eight categories. Since TJMS does not require any specific structure for the digital files of petitions, we had to deal with several issues to extract the textual content from these files. As far as we know, this is the first work to consider rhetorical role identification of petitions in Portuguese. The proposed corpus is available (Footnote 7), and we hope that it can foster research on this important subject.

We also analysed the performance of classic and deep learning methods on the proposed rhetorical role identification corpus. The best-performing method was an end-to-end-trained BERT-based sentence classifier, which achieved an F-score of 80.50 on our test set.

As future work, we intend to pre-train a BERT language model on judicial documents. The models evaluated in this work classify sentences independently, i.e. disregarding the sequence of sentences within a petition. Thus, we also intend to improve the obtained classification performance by employing deep learning models that consider the sequential structure of sentences. Nevertheless, the obtained results are promising. Therefore, we are working on the development of a new corpus for summarization of petitions, which will leverage the rhetorical role labels in the proposed corpus.

As we have stated, the contributions of this paper can open new research possibilities for judicial NLP. Additionally, our work can lead to solutions with potential to relieve the current bottleneck in Brazilian judges’ offices.