Abstract
Recently, Brazil’s National Council of Justice (CNJ) highlighted the importance of robust solutions to perform automated lawsuit classification. A correct lawsuit classification substantially improves the assertiveness of (i) distribution, (ii) organization of the agenda of court hearings and sessions, (iii) classification of urgent measures and evidence, (iv) identification of prescription and (v) prevention. This paper investigates different text classification methods and different combinations of embeddings, extracted from Portuguese language models, and information about the legislation cited in the initial documents. The models were trained with a Golden Collection of 16 thousand initial petitions and indictments from the Court of Justice of the State of Ceará, in Brazil, whose lawsuits were classified into the five most representative CNJ classes - Common Civil Procedure, Execution of Extrajudicial Title, Criminal Action - Ordinary Procedure, Special Civil Court Procedure, and Tax Enforcement. Our best result was obtained by the BERT model, which achieved a macro F1 score of 0.88 in the experiment scenario that represents a lawsuit by an embedding of the concatenated texts of all the petitions containing at least one citation to legislation. Legal documents have specific characteristics, such as long texts, specialized vocabulary, formal syntax, semantics grounded in a broad specific domain of knowledge, and citations to laws. Our interpretation is that the contextual embeddings generated by BERT, together with the model’s bidirectional architecture, make it possible to capture the specific context of the legal domain.
1 Introduction
Recently, Brazil’s National Council of Justice (CNJ) highlighted in a public call [1] the need to implement procedural automation with robotization resources and Artificial Intelligence (AI) techniques, such as machine learning, in legal proceedings in Brazil. These applications replicate human activity, freeing human resources to be reallocated to tasks that demand greater creativity and expertise. Particularly in the field of jurisdictional efficiency, the CNJ calls on the AI community in Brazil to develop robust solutions to perform automated lawsuit classification. A correct lawsuit classification substantially improves the assertiveness of (i) distribution, (ii) organization of the agenda of court hearings and sessions, (iii) classification of urgent measures and evidence, (iv) identification of prescription and (v) prevention, considering also issues such as categorization in the CNJ’s Unified Procedural Tables.
The classification of a lawsuit is initially assigned manually by lawyers and attorneys and recorded in the initial lawsuit documents. In the Brazilian justice system, these documents are commonly classified as initial petitions (civil lawsuits) and indictments (criminal lawsuits). Some benefits of the automated classification of petitions and indictments, highlighted by the CNJ in [1], are: simplifying citizens’ access to justice by freeing them and/or their representatives from carrying out the procedural classification; reducing the human-resource structure needed in notary offices and departments responsible for processing cases, in the different spheres of justice; drastically reducing the need for human intervention in the distribution and processing of cases; and providing greater detail in the lawsuit classification, resulting in better organization of the judgment guidelines and opening the way for the development of productivity-improvement tools.
To automate this task, text classification methods should be developed from language models [2, 3] from the Natural Language Processing (NLP) area and learning models [4, 5] from the AI area, trained on a Golden Collection of documents extracted from lawsuits. Text classification, i.e., the process of assigning one or multiple categories from a set of options to a document [6], is a prominent and well-researched task in NLP and text mining. Its variants include binary classification (for example, deciding whether a document is spam or not), multi-class classification (selection of one from a number of classes), and multi-label classification (where multiple labels can be assigned to a single document). Text classification methods have been successfully applied to a number of NLP tasks and applications, ranging from plagiarism [7] and pastiche detection [8] to estimating the period in which a text was published [9], with commercial or forensic goals (e.g. identifying potential criminals [10], crimes [11, 12], or antisocial behavior [13]). However, to the best of our knowledge, the task remains relatively under-explored in the legal domain, especially for Brazilian lawsuits.
Text classification in the legal domain is used in a number of different applications. Katz et al. [14] use extremely randomized trees and extensive feature engineering to predict whether a decision by the Supreme Court of the United States would be affirmed or reversed. In a similar fashion, [15] trained a model to predict, given the textual content of a case from the European Court of Human Rights, whether there has been a violation of human rights. In [16], the authors trained traditional classifiers on text descriptions of cases from the French Supreme Court, in order to predict with high accuracy the ruling of the French Supreme Court (six classes) and the law area to which a case belongs (eight classes), and to analyze the influence of the time period in which a ruling was made. Undavia et al. [17] evaluated a series of classifiers trained on a dataset of cases from the American Supreme Court.
In the Brazilian legal domain, Araújo et al. [18] present a novel dataset built from digitized legal documents of Brazil’s Supreme Court, which contains labeled text data and supports two types of tasks: document type classification, and theme assignment, a multilabel problem. Specifically for the problem stated by the CNJ in [1], we did not find any work that considers the initial documents of lawsuits (petitions and indictments), especially those filed in lower courts of the Brazilian justice system.
Considering the above, this paper investigates different text classification methods and different combinations of embeddings extracted from Portuguese language models, like BERTimbau [19], and information about the legislation cited in the initial documents, validated against a Brazilian Legal Knowledge Graph containing Brazilian federal and regional legislation. The models were trained with a Golden Collection of 16 thousand initial petitions and indictments from the Court of Justice of the State of Ceará, in Brazil, whose lawsuits were classified into the five most representative CNJ classes - Common Civil Procedure, Execution of Extrajudicial Title, Criminal Action - Ordinary Procedure, Special Civil Court Procedure, and Tax Enforcement. The lawsuits in these classes represent more than 80% of the lawsuits processed in 2019 by the Court of Justice of the State of Ceará. In addition, the Golden Collection contains a class “others”, with a mix of initial petitions and indictments from several classes, selected to provide negative examples for the learning model.
Our best result was obtained by the BERT model, which achieved a macro F1 score of 0.88 in the experiment scenario that represents a lawsuit by an embedding of the concatenated texts of all the petitions containing at least one citation to legislation. Legal documents have specific characteristics, such as long texts, specialized vocabulary, formal syntax, semantics grounded in a broad specific domain of knowledge, and citations to laws. Our interpretation is that the contextual embeddings generated by BERT, together with the model’s bidirectional architecture, make it possible to capture the specific context of the legal domain. We also emphasize that, by representing a lawsuit with only the text of the petitions that cite legislation, we reduce the size of the text while keeping the excerpts that argue for the content of the lawsuit, which are therefore relevant for its classification.
2 Related Works
Text classification methods are investigated and applied with commercial or forensic goals (e.g. identifying potential criminals [10], crimes [11, 12], or antisocial behavior [13]). In the legal domain, however, these methods have been under-explored, especially for Brazilian lawsuits.
Katz et al. [14] use extremely randomized trees and extensive feature engineering to predict whether a decision by the Supreme Court of the United States would be affirmed or reversed, achieving an accuracy of 69.7%. Aletras et al. [15], in a similar fashion, trained a model to predict, given the textual content of a case from the European Court of Human Rights, whether there has been a violation of human rights. The paper employed n-grams and topics as inputs to an SVM, reaching an accuracy of 79%.
Sulea et al. [16] trained a linear SVM on text descriptions of cases from the French Supreme Court in order to predict with high accuracy the ruling of the French Supreme Court and the law area to which a case belongs. The authors also investigate the influence of the time period in which a ruling was made. They report an average F1 score of 98% in predicting a case ruling, 96% in predicting the law area of a case, and 87.07% in estimating the date of a ruling.
Undavia et al. [17] classified US Supreme Court documents, comparing various combinations of feature representations and classification models. Their best results were obtained by applying a Convolutional Neural Network (CNN) to Word2Vec [3] representations, achieving an accuracy of 72.4% when classifying the cases into 15 broad categories and of 31.9% over 279 finer-grained classes.
In the Brazilian legal domain, Araújo et al. [18] present a novel dataset built from digitized legal documents of Brazil’s Supreme Court, which contains labeled text data and supports two types of tasks: document type classification, and theme assignment, a multilabel problem. A similar model was applied to documents received by the Supreme Court of Brazil in Silva et al. [20], who trained a classifier using an embedding layer and a CNN on a problem involving six classes.
3 Corpus and Data Preparation
In this section, we describe the preparation and structuring of the data used in the experiments reported in this paper. Initially, we present the dataset collected from lawsuits of the Court of Justice of the State of Ceará, in Brazil. Then, we present the knowledge graph of federal and regional legislation, the Named Entity Recognition (NER) of this legislation, and the validation process of the legislation recognized in the legal documents.
3.1 Corpus and Golden Collection
Data processing was divided into two phases: (i) a data cleaning phase and (ii) a data extraction phase. At the end of these phases, a sample was produced with the lawsuits completed in 2019.
During the data cleaning phase, records with corrupt or inaccurate metadata were detected and then corrected or removed. The metadata provides the document date, document type, lawsuit number and lawsuit page; this information was used to organize the data extraction phase.
In the data extraction phase, Amazon’s Optical Character Recognition (OCR) service was used to translate optically digitized images of printed or written text into manageable, editable text. We identified some errors after this phase, such as special characters and additional spacing, and carried out a data cleaning process to minimize them.
Finally, 7,103 lawsuits were selected, from which 16,668 petitions were used to build the dataset, distributed across 6 classes selected by experts of the Court of Justice of the State of Ceará, in Brazil. These classes express the type of the lawsuit and were defined by the National Council of Justice of BrazilFootnote 1.
3.2 Integration with the Brazilian Legal Knowledge Graph
A characteristic of legal documents is that they cite legislation, which is used to argue and present legal theories and arguments about the legal themes supporting the lawsuit. Considering that legislation can be important for classifying lawsuits, we use a Named Entity Recognizer (NER) to identify legislation in the text of the lawsuits. More specifically, we use a NERFootnote 2 based on the Conditional Random Field (CRF) classifier of the Stanford JavaNLP APIFootnote 3, trained with the LeNER-Br dataset [21] (a dataset for Named Entity Recognition in Brazilian legal texts).
Additionally, a knowledge graph was built with federal legislation and regional legislation of the state of Ceará, in Brazil. The laws were taken from the Federal Government Legislation PortalFootnote 4, the Legislative and Legal Information NetworkFootnote 5 and the Legislative Assembly of CearáFootnote 6. Table 1 shows the legislation types, years of creation and the quantity of legislation in the current Brazilian Legal Knowledge Graph.
The legislation extracted from the texts of the lawsuits by the NER was validated against the knowledge graph, so that only citations present in the knowledge base are considered “legislation” entities.
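This extraction-and-validation step can be sketched as follows; the regex is an illustrative stand-in for the CRF-based recognizer, and the knowledge graph is reduced to a set of statute identifiers (all names and patterns here are hypothetical):

```python
import re

# Illustrative stand-in for the CRF-based NER: a simple pattern for statute
# citations such as "Lei 8.078" or "Lei nº 11.343". The actual recognizer
# is a Stanford CRF model trained on LeNER-Br.
CITATION_PATTERN = re.compile(r"Lei\s+(?:n[ºo°.]*\s*)?([\d.]+)", re.IGNORECASE)

def extract_citations(text):
    """Return the statute numbers cited in a petition text."""
    return [m.group(1) for m in CITATION_PATTERN.finditer(text)]

def validate_citations(citations, knowledge_graph):
    """Keep only the citations that exist as 'legislation' entities in the KG."""
    return [c for c in citations if c in knowledge_graph]

# Hypothetical knowledge graph, reduced to a set of known statute numbers.
kg = {"8.078", "11.343", "10.406"}
text = "Nos termos da Lei nº 8.078 e da Lei 99.999, requer-se a condenação."
cited = extract_citations(text)        # both citations are recognized
valid = validate_citations(cited, kg)  # only 8.078 is a known law in the KG
```

Only the validated citations are kept as “legislation” entities for the lawsuit.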
From the identification and validation of the “legislation” entities in the text of the lawsuits, the lawsuits with at least one citation to legislation were selected, totaling 6,283 lawsuits with 11,131 petitions in all. Table 2 shows the names of the lawsuit classes of the dataset, their National Council of Justice of Brazil numbers (in parentheses in the Class column), and the distribution of the lawsuits across the 6 classes mentioned in Sect. 3.1.
We investigated the legislation cited in the lawsuits of each class individually. Table 3 shows the number of distinct laws cited and the number of classes in which these laws occur.
We identified 110,044 citations to Brazilian legislation in the texts of the lawsuits’ petitions. According to Table 3, these citations refer to 895 distinct Brazilian laws and legal norms. Only 1.2% (10 laws) are cited in lawsuits of all classes; these are more generic laws that do not distinguish the class of a lawsuit, such as the Brazilian Civil Code (Law 10,406, of 01/10/2002), the Brazilian Criminal Code (Decree-Law 2,848, of 12/07/1940), the Code of Civil Procedure (Law 13,105, of 03/16/2015) and the National Tax Code (Law 5,172, of 10/25/1966). Meanwhile, 71.84% (643 laws) are cited in lawsuits of a single class and can be considered laws that characterize the theme and, consequently, the class of the lawsuit. For example, Law 970, of 12/16/1949, which provides for the attributions, organization and functioning of the National Economy Council, is cited only in lawsuits of the “Common Civil Procedure” class.
Table 4 shows, per class, the total number of citations, the number of distinct laws cited, and the number of distinct laws cited only in that class. We analyzed the distinct laws cited in each class of lawsuits and verified that each class has laws that occur only in it. For example, the 59,720 citations in the lawsuits of the “Common Civil Procedure” class correspond to 536 distinct laws; in this class, Law 8,078, of 09/11/1990, for example, is cited 8,560 times. We emphasize that, of the 536 distinct laws cited in this class, 329 are cited only in this class.
Likewise, the 12,610 citations that occur in the “Criminal Action - Ordinary Procedure” class correspond to 125 distinct laws. For example, Law 11,343, of 08/23/2006, is cited 1,910 times in lawsuits of this class. We highlight that 43 distinct laws are cited only in this class.
To complement this analysis, we created a graph (shown in Fig. 1) that shows the relationship between the most frequently cited laws and the lawsuits in which they are cited. The nodes of this graph represent a law or a lawsuit, and the edges represent the citation of a law in a lawsuit. Some laws are cited in many lawsuits, while others are cited in few, which suggests that the laws cited in petitions can help identify the class of a lawsuit.
4 Lawsuit Classification
In this section, initially, we define the scenarios and algorithms used in the experiments for classifying lawsuits. Then, we present and discuss the results achieved.
4.1 Experiment Scenarios
Lawsuit classification experiments were conducted in different scenarios, each choosing different features to represent a lawsuit. Below we present each scenario used in the experiments.
Embeddings of the Lawsuit/Case Text (S1).
Each lawsuit is made up of a set of petition-type text documents. We define the text of a lawsuit as the concatenation of the texts of all the petitions that comprise it. In this scenario, the lawsuit is represented by the embedding of the lawsuit text, generated by the pre-trained BERTimbau model [19] (a BERT model pre-trained for Brazilian Portuguese).
Embeddings of the Lawsuit/Case Text with Citation (S2).
We assume that petitions that cite legislation are more relevant to the lawsuit. Thus, for each petition in a lawsuit, we verify whether it contains a citation to Brazilian legislation (according to the procedure described in Sect. 3.2). Then, a text is formed by concatenating the texts of all the petitions that contain at least one citation to legislation. In this scenario, the lawsuit is represented by the embedding of this text, generated by the pre-trained BERTimbau model [19].
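A minimal sketch of how the S2 text could be assembled, assuming petitions are plain strings; the names and the simple citation check are illustrative stand-ins for the NER and knowledge-graph validation of Sect. 3.2:

```python
def lawsuit_text_with_citations(petitions, has_citation):
    """Concatenate the texts of the petitions that cite at least one piece
    of legislation; petitions without citations are dropped."""
    return " ".join(p for p in petitions if has_citation(p))

# Hypothetical petitions; the lambda stands in for the NER + knowledge-graph
# validation described in Sect. 3.2.
petitions = [
    "Requer indenização nos termos da Lei 8.078.",
    "Junta procuração em anexo.",              # no citation: excluded
    "Invoca o art. 5 da Lei 13.105.",
]
text = lawsuit_text_with_citations(petitions, lambda p: "Lei" in p)
# `text` would then be encoded with BERTimbau to obtain the lawsuit embedding.
```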
Embeddings of the Cited Laws (S3).
We assume that, in legal documents, laws are cited to represent and substantiate the subject of the document; therefore, the content of these laws is relevant to define the class of a lawsuit. In this context, a text is formed by concatenating the summaries of the legislation cited in the petitions of a lawsuit (these summaries were retrieved from the Brazilian Legal Knowledge Graph). In this scenario, the lawsuit is represented by the embedding of this text, generated by the pre-trained BERTimbau model [19].
Embeddings of the Lawsuit/Case Text with Citation and Cited Laws (S4).
We assume that the laws cited are as important as the text of the lawsuit. In this scenario, a text is formed by the concatenation of the texts of all the petitions that contain at least one citation to the legislation and the texts of the summary of the cited legislations. The embedding of this text, generated from the pre-trained model BERTimbau [19], is used to represent the lawsuit.
Embeddings of the Lawsuit/Case Text with Citation and Embeddings of Cited Laws (S5).
In this scenario, a lawsuit is represented by two embeddings: (i) embedding of the lawsuit text, generated according to S2, and (ii) embedding of the summary of the cited legislations, generated according to S3.
TF-IDFLaw (S6).
We assume that the relevance of the cited legislation in lawsuit texts is important information for the learning model: legislation cited only in lawsuits of the same class should be considered more important than legislation cited in lawsuits of several classes. Based on this, we define a feature named TF-IDFLaw, a variant of TF-IDF (Term Frequency - Inverse Document Frequency), which aims to measure the relevance of a law for a lawsuit, according to the frequency of citations of the law in that lawsuit and in the other lawsuits of the corpus. TF-IDFLaw(l,d) is calculated according to Eq. (1) below:
\(\mathrm{TF\text{-}IDFLaw}(l,d)=\frac{f_{l,d}}{\sum_{l^{\prime}\in d}f_{l^{\prime},d}}\times \frac{N}{\sum_{d^{\prime}\in D}f_{l,d^{\prime}}}\)  (1)
where \(f_{l,d}\), the citation frequency of the law l in the lawsuit d, is divided by the total number of citations in the lawsuit d, and the number of lawsuits N in the corpus D is divided by the total citation frequency of the law l in all the lawsuits of the corpus. In this scenario, we represent a lawsuit as a bag-of-words with TF-IDFLaw features. In other words, given a vocabulary V = {cl1, cl2, …, clv} formed by the laws cited in the N lawsuits of the train set, we represent each lawsuit di as a length-|V| vector L(di), where L(di)j = TF-IDFLaw(clj, di), for each \({cl}_{j}\in V\).
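A small sketch of this feature, following the description above (lawsuits are modeled as lists of cited-law identifiers; all names and values are illustrative):

```python
def tf_idf_law(law, lawsuit, corpus):
    """TF-IDFLaw(l, d): the citation frequency of law l in lawsuit d, over
    the total citations in d, times the number of lawsuits N over the total
    citations of l in the whole corpus. Lawsuits are lists of cited-law ids."""
    tf = lawsuit.count(law) / len(lawsuit)
    idf = len(corpus) / sum(d.count(law) for d in corpus)
    return tf * idf

def lawsuit_vector(lawsuit, vocabulary, corpus):
    """Bag-of-laws representation L(d): one TF-IDFLaw value per law in V.
    The vocabulary is built from the training lawsuits, so every law in it
    has at least one citation in the corpus (no division by zero)."""
    return [tf_idf_law(cl, lawsuit, corpus) for cl in vocabulary]

corpus = [["8.078", "8.078", "13.105"], ["11.343"], ["8.078"]]
vocab = sorted({law for d in corpus for law in d})  # ['11.343', '13.105', '8.078']
vec = lawsuit_vector(corpus[0], vocab, corpus)
# '13.105' is cited only in the first lawsuit, so its weight there is the
# highest; '8.078' appears in several lawsuits and is weighted down.
```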
Embeddings of the Lawsuit/Case Text with Citation and TF-IDFLaw (S7).
In this scenario, a lawsuit is represented by: (i) embedding of the lawsuit text with citation, generated according to S2, and (ii) bag-of-words with TF-IDFLaw features, generated according to S6.
Embeddings of the Cited Laws and TF-IDFLaw (S8).
In this scenario, a lawsuit is represented by: (i) the embedding of the summaries of the cited laws, generated according to S3, and (ii) a bag-of-words with TF-IDFLaw features, generated according to S6.
Embeddings of the Lawsuit/Case Text with Citation and Cited Laws and TF-IDFLaw (S9).
In this scenario, a lawsuit is represented by: (i) embedding of the lawsuit text and the summary of the cited laws, generated according to S4, and (ii) bag-of-words with TF-IDFLaw features, generated according to S6.
Embeddings of the Lawsuit/Case Text with Citation, Embeddings of Cited Laws and TF-IDFLaw (S10).
In this scenario, a lawsuit is represented by: (i) embedding of the lawsuit text, generated according to S2, (ii) embeddings of the summary of the cited laws, generated according to S3, and (iii) bag-of-words with TF-IDFLaw features, generated according to S6.
Embeddings of the Lawsuit/Case Text and Topics (S11).
Given a set of text documents, a topic model is applied to find interpretable semantic concepts, or topics, present in the documents. We assume that the topics can suggest the main theme of a document, thereby helping to infer the class of the lawsuit. We chose Latent Dirichlet Allocation (LDA) [22] as the topic-generation method. LDA is a generative probabilistic model of a corpus, where each document is represented as a mixture of latent topics and each topic is, in turn, a distribution over words. We ran LDA on the corpus D to find 6 topics (we set the number of topics to the number of classes). For each lawsuit, we choose its most likely topic and the top-10 most representative words of that topic. In this scenario, each lawsuit is represented by: (i) the embedding of the lawsuit text, generated according to S1, and (ii) a bag-of-words with the word distribution of the topics.
4.2 Models
We train four models on the training split of the data in the scenarios presented in Sect. 4.1: (i) Random Forest [23], (ii) Support Vector Machine (SVM) [24], (iii) Extreme Gradient Boosting (XGBoost) [25], and (iv) Bidirectional Encoder Representations from Transformers (BERT)Footnote 7 [2]. We train each model for a multiclass classification task, using the following parameters (obtained from an empirical evaluation):
Random Forest.
We applied the Random Forest algorithm using Scikit-learn’s Random Forest Classifier packageFootnote 8. We trained 1,000 trees with a maximum of 200 features to consider when looking for the best split.
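A sketch of this configuration with scikit-learn; the synthetic embedding-like features and the random seed are illustrative stand-ins for the scenario features of Sect. 4.1:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Random Forest with the hyperparameters reported above: 1,000 trees and at
# most 200 features considered when looking for the best split.
clf = RandomForestClassifier(n_estimators=1000, max_features=200, random_state=0)

# Synthetic stand-in for the lawsuit features (e.g. a 256-dim embedding).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 256))
y = rng.integers(0, 2, size=60)

clf.fit(X, y)
pred = clf.predict(X)
```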
Support Vector Machine (SVM).
We applied the SVM algorithm using Scikit-learn’s SVM packageFootnote 9 with default parameters.
Extreme Gradient Boosting (XGBoost).
We applied the XGBoost algorithm using XGBoost Classifier packageFootnote 10. We trained 1,000 trees with a maximum depth of 4 and a learning rate of 0.1.
Bidirectional Encoder Representations from Transformers (BERT).
We used the fine-tuning approach with the pre-trained BERTimbau model [19], 4 epochs and a batch size of 4 samples for the classification task. For the optimizer, we used Adam with a learning rate of 1e-5.
4.3 Results and Discussion
To carry out the experiments in the scenarios and models described in Sects. 4.1 and 4.2, we divided the dataset into 85% for the train set and 15% for the test set. Table 5 shows the performance of the models in each experiment scenario, in terms of F1 score (macro) for the test set.
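For reference, the macro F1 score averages per-class F1 values without weighting by class size; a minimal sketch:

```python
def macro_f1(y_true, y_pred, classes):
    """Macro-averaged F1: compute one-vs-rest F1 per class, then take the
    unweighted mean, so small classes count as much as large ones."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels, just to exercise the metric.
y_true = ["civil", "civil", "tax", "penal"]
y_pred = ["civil", "tax", "tax", "penal"]
score = macro_f1(y_true, y_pred, ["civil", "tax", "penal"])  # 7/9 ≈ 0.78
```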
Based on the Kruskal-Wallis statistical test, we can claim that the results are statistically significant at a 99% confidence level, with a p-value of 0.0008. The BERT model outperforms the other models in all scenarios, achieving the best result, a macro F1 score of 0.88, in the experiment scenario that represents the lawsuit by the texts of the petitions that contain citations. Legal documents have specific characteristics, such as long texts, specialized vocabulary, formal syntax, semantics grounded in a broad specific domain of knowledge, and citations to laws. Our interpretation is that the contextual embeddings generated by BERT, together with the model’s bidirectional architecture, make it possible to capture the specific context of the legal domain. We also emphasize that, by representing a lawsuit with only the text of the petitions that cite legislation, we reduce the size of the text while keeping the excerpts that argue for the content of the lawsuit, which are therefore relevant for its classification.
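A Kruskal-Wallis test of this kind can be run with SciPy; the per-group scores below are made-up placeholders, not the paper's measurements:

```python
from scipy.stats import kruskal

# Hypothetical per-scenario F1 scores for the four models (placeholders).
bert    = [0.88, 0.87, 0.86, 0.88]
xgboost = [0.82, 0.81, 0.83, 0.80]
svm     = [0.78, 0.77, 0.79, 0.76]
rf      = [0.75, 0.74, 0.76, 0.73]

statistic, p_value = kruskal(bert, xgboost, svm, rf)
significant = p_value < 0.01   # 99% confidence level, as used in the paper
```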
The second-best model is XGBoost, in the S7, S9 and S10 scenarios. These scenarios use the TF-IDFLaw feature, in addition to the embeddings of the texts of the petitions with legislation (in S7, S9 and S10) and the summaries of the cited legislation (in S9 and S10). The characterization of the lawsuit by the petitions that contain citations is complemented by the relevance of the legislation for the lawsuits, through the TF-IDFLaw feature. This suggests that the cited legislation is informative about the class, helping to classify the lawsuits.
Finally, in order to evaluate the results in a cross-validation approach, we employed a stratified 5-fold cross-validation setup for the best experiment (the BERT model in scenario S2) and obtained a macro F1 score of 0.88 (the same result as with the 85%/15% train/test split).
Figure 2 shows the confusion matrix for the best result achieved in the experiments, a scenario with the representation of text embeddings of the petitions of the lawsuits that have citations with the model trained with BERT. We observe some patterns which may help us understand these results. The confusion matrix shows us that the lawsuits of the “Tax Enforcement” class have the best results, with a recall close to 1.0, while the lawsuits of the “Others” class are the most difficult to predict, with a recall close to 0.71. Lawsuits of class “Others” correspond to lawsuits of mixed classes grouped into a single class, thus making it difficult to classify these lawsuits. The second most difficult class to predict is the “Execution of Extrajudicial Title” class. This class contains the smallest number of samples in the dataset.
Table 6 presents the F1 score results of the BERT model trained with the texts of the petitions of the lawsuits without (S1) and with (S2) citation to the legislation, for each class.
The classes’ F1 scores vary from 0.78 to 1.00 in both scenarios. The S2 scenario presents better F1 scores for the “Execution of Extrajudicial Title” and “Tax Enforcement” classes, with increases of 3.6% and 2.0%, respectively, whereas the S1 scenario has better results for the “Special Civil Court Procedure” and “Criminal Action - Ordinary Procedure” classes, with increases of 2.4% and 1.5%, respectively. In the other classes, “Common Civil Procedure” and “Others”, the F1 scores are the same in both scenarios.
5 Conclusions
In this paper we investigate the application of different models and scenarios for classifying legal texts using a dataset of lawsuits from the Court of Justice of the State of Ceará, in Brazil. The best results are achieved by the BERT model (using the pre-trained BERTimbau model [19]) in the scenario in which the lawsuit is represented by the text of the petitions that have citations to Brazilian legislation.
The legal text has specific characteristics; accordingly, we represent the text contextually with BERTimbau (a model pre-trained for Brazilian Portuguese) and provide only the petitions of the lawsuit that cite a Brazilian law. We argue that the contextual representation of the text and the citations to legislation help to identify the class of a lawsuit.
As future work, we intend to investigate other features to characterize the lawsuits and to use other models, such as Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) for lawsuit classification. We also intend to develop a specific language model for the Brazilian legal domain, which can be used to improve the contextual representation of the texts of proceedings in the specific domain of the legal area. We also intend to investigate the accuracy of the model in lawsuits from other Courts of Justice in different Brazilian states.
Notes
- 7.
Given the specifics of BERT’s original architecture, this model was trained only in scenarios (S1), (S2), (S3) and (S4).
References
Conselho Nacional de Justiça: CONVOCAÇÃO nº 01/2021 – Desenvolvimento- piloto de soluções para a automação processual e uso de técnicas de inteligência artificial no Poder Judiciário. https://acessoexterno.undp.org.br/Public/Jobs/18062021164751_Resultado%20para%20publica%C3%A7%C3%A3o_Sinapses.pdf. Accessed 20 June 2021
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis (2019)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Representations in Vector Space. http://arxiv.org/pdf/1301.3781.pdf. Accessed 20 Nov 2015
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(12), 1735–1780 (1997)
Conneau, A., Schwenk, H., Barrault, L., Lecun, Y.: Very deep convolutional networks for text classification. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, vol. 1, Long Papers, pp. 1107–1116. Association for Computational Linguistics, Valencia, Spain (2017)
Shaheen, Z., Wohlgenannt, G., Filtz, E.: Large Scale Legal Text Classification Using Transformer Models. arXiv preprint arXiv:2010.12871 (2020)
Barrón-Cedeño, A., Vila, M., Martí, M.A., Rosso, P.: Plagiarism meets paraphrasing: insights for the next generation in automatic plagiarism detection. Comput. Linguist. 39(4), 917–947 (2013)
Dinu, L.P., Niculae, V., Sulea, O.-M.: Pastiche detection based on stopword rankings: exposing impersonators of a Romanian writer. In: Proceedings of the Workshop on Computational Approaches to Deception Detection (2012)
Niculae, V., Zampieri, M., Dinu, L.P., Ciobanu, A.M.: Temporal text ranking and automatic dating of texts. In: Proceedings of EACL (2014)
Sumner, C., Byers, A., Boochever, R., Park, G.J.: Predicting dark triad personality traits from twitter usage and a linguistic analysis of tweets. In: Proceedings of ICMLA (2012). https://doi.org/10.1109/ICMLA.2012.218
Pérez-Rosas, V., Mihalcea, R.: Experiments in open domain deception detection. In: Lluís, M., Chris, C.B., Jian, S., Daniele, P., Yuval, M. (eds.) Proceedings of EMNLP. Association for Computational Linguistics (2015). https://aclweb.org/anthology/D/D15/D15-1133.pdf
Pinheiro, V., Pequeno, T., Furtado, V., Nogueira, D.: Information extraction from text based on semantic inferentialism. In: Andreasen, T., Yager, R.R., Bulskov, H., Christiansen, H., Larsen, H.L. (eds.) FQAS 2009. LNCS (LNAI), vol. 5822, pp. 333–344. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-04957-6_29
Cheng, J., Danescu-Niculescu-Mizil, C., Leskovec, J.: Anti-social behavior in online discussion communities. In: Proceedings of ICWSM (2015)
Katz, D.M., Bommarito, M.J.I., Blackman, J.: Predicting the behavior of the Supreme Court of the United States: a general approach. arXiv preprint arXiv:1407.6333 (2014)
Aletras, N., Tsarapatsanis, D., Preotiuc-Pietro, D., Lampos, V.: Predicting judicial decisions of the European Court of Human Rights: a natural language processing perspective. PeerJ Comput. Sci. 2, e93 (2016)
Sulea, O.M., Zampieri, M., Vela, M., van Genabith, J.: Predicting the law area and decisions of French Supreme Court cases. In: Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP, pp. 716–722. INCOMA Ltd. (2017)
Undavia, S., Meyers, A., Ortega, J.E.: A comparative study of classifying legal documents with neural networks. In: Federated Conference on Computer Science and Information Systems (FedCSIS), pp. 515–522 (2018)
Araújo, P.H.L., Campos, T.E., Braz, F.A., Silva, N.C.: VICTOR: a dataset for Brazilian legal documents classification. In: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 11–16 May, pp. 1449–1458. Marseille (2020)
Souza, F., Nogueira, R., Lotufo, R.: BERTimbau: pretrained BERT models for Brazilian Portuguese. In: 9th Brazilian Conference on Intelligent Systems, BRACIS, 20–23 October. Rio Grande do Sul, Brazil (2020)
Silva, N., Braz, F., Campos, T.: Document type classification for Brazil's Supreme Court using a convolutional neural network. In: The Tenth International Conference on Forensic Computer Science and Cyber Law-ICoFCS, vol. 10, pp. 7–11 (2018)
Luz de Araujo, P.H., de Campos, T.E., de Oliveira, R.R.R., Stauffer, M., Couto, S., Bermejo, P.: LeNER-Br: a dataset for named entity recognition in Brazilian legal text. In: Villavicencio, A., et al. (eds.) PROPOR 2018. LNCS (LNAI), vol. 11122, pp. 313–323. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99722-3_32
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Hearst, M.A.: Support vector machines. IEEE Intell. Syst. 13(4), 18–28 (1998)
Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. New York (2016)
© 2021 Springer Nature Switzerland AG
Aguiar, A., Silveira, R., Pinheiro, V., Furtado, V., Neto, J.A.: Text Classification in Legal Documents Extracted from Lawsuits in Brazilian Courts. In: Britto, A., Valdivia Delgado, K. (eds.) Intelligent Systems. BRACIS 2021. Lecture Notes in Computer Science, vol. 13074. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-91699-2_40
Print ISBN: 978-3-030-91698-5
Online ISBN: 978-3-030-91699-2