1 Introduction

As legal documents become increasingly digitized, Natural Language Processing (NLP) has gained importance for automating tasks in the legal field. NLP tools are now commonly used to address real-world legal problems, including the identification of participants in legal proceedings [24], the classification of legal documents [4], named-entity recognition [3], and legal text summarization [13]. These solutions employ traditional machine learning paradigms (e.g., TF-IDF and classifiers) or deep learning techniques to achieve their objectives [28].

Language models built on the Transformers architecture, such as BERT [11] and its variants [7, 33], have achieved promising results in several downstream NLP tasks on generic benchmark datasets. In recent years, there have been efforts to pretrain language models for Brazilian Portuguese, ranging from traditional word embeddings (Word2Vec, FastText, etc.) to BERT-style models.

However, these models were trained on general-domain documents and were not designed to represent the Brazilian legal language [28]. The effectiveness of a language model on domain-specific tasks can be limited by its lack of specialization for that domain. The development of language models for legal texts is justified because the language used in legal documents has its own vocabulary, a formal style, semantics grounded in a wide spectrum of knowledge, and frequent citations of laws. Evidence shows that pretraining language models on a domain-specific corpus can significantly improve performance on domain-specific tasks [31, 37].

This work describes the process of producing a language model for the legal domain in Portuguese. The model was pretrained to acquire specialization for the domain and can later be fine-tuned for specific tasks. Two versions of the model were created: one by further pretraining the BERTimbau model [33], and the other from scratch. The effectiveness of the BERTimbau-based model was evident when analyzing the perplexity of the models. Experiments were also carried out on the tasks of identifying legal entities and classifying legal petitions. The results show that the domain-specific language models outperform the generic language model on the tasks studied, suggesting that specializing the language model for the legal domain is an important factor for improving the accuracy of learning algorithms.

2 Related Work

Progress in the field of NLP is closely linked to advancements in machine learning models, primarily due to the emergence of word embeddings. However, these models have a drawback: they generate static representations of words, meaning the same word always produces the same embedding, even in sentences with different contexts. The ELMo model [27] and, subsequently, Transformers [34] began to generate contextualized embeddings. As a result, the same word has different representations in different situations, depending on the context of the entire sentence.

By employing self-attention mechanisms, Transformers are able to capture long-range relationships [19]. Models built upon the Transformers architecture, such as BERT [11] and GPT [6], have emerged as the state of the art in NLP tasks. The core principle behind using these pretrained language models lies in combining transfer learning with self-supervised learning. In other words, the model is initially trained on a sizable unlabeled corpus to learn a universal representation of the language [14]. Subsequently, the knowledge gained during pretraining can be applied by fine-tuning the model on downstream tasks, reducing the reliance on labeled data and enhancing performance [25].

Although pretrained language models, such as BERT, exhibit strong performance on generic texts, they may yield inferior results on domain-specific texts [7]. Therefore, enhancing training by incorporating texts from a specific domain is a widely used technique in various fields. Examples include: (i) BioBERT [16], which was trained on biomedical texts and outperformed BERT as well as other state-of-the-art models; (ii) FinBERT [38], which utilized a large corpus of financial communication comprising 4.9 billion tokens and demonstrated superior performance to BERT in sentiment classification tasks; (iii) SciBERT [5], which employed full texts from 1.14 million Semantic Scholar articles and showcased improved performance compared to BERT-Base in NLP tasks within the scientific domain; (iv) CodeBERT [12], which made use of open-source code from public GitHub repositories across six programming languages, achieving state-of-the-art results in natural language code search and code-to-documentation generation tasks.

The legal field serves as an excellent example of a domain that could benefit from generating pretrained language models, given the vast amounts of data produced daily in courts and legal digital platforms. LEGAL-BERT [7] was among the pioneers in developing legal language models, utilizing a corpus of approximately 12 GB with texts from European and North American legislation and cases. The strategies employed in LEGAL-BERT’s pretraining included: (i) LEGAL-BERT-FP, which considered additional training from BERT, and (ii) LEGAL-BERT-SC, which focused on training from scratch exclusively within the legal corpora. Both strategies outperformed BERT and achieved state-of-the-art results in three end-tasks.

However, pretraining must also take into account not only the specific domain but also the language in which the downstream tasks need to be addressed. A model trained on multiple languages may yield inferior results in languages that lack adequate representation in the dataset [10]. Consequently, efforts have been made to create monolingual pretrained language models that tackle tasks in the legal domain, such as Lawformer [37], ITALIAN-LEGAL-BERT [18], InLegalBERT [26], and JurisBERT [35].

3 LegalBert-pt: A BERT Model for the Brazilian Legal Domain

This section provides a detailed account of the steps taken to pretrain LegalBert-pt, a language model for the Portuguese legal domain.

3.1 Pretraining Data

To pretrain various versions of the LegalBert-pt language model, we collected a total of 1.5 million legal documents in Portuguese from ten Brazilian courts. These documents consisted of four types: initial petitions, petitions, decisions, and sentences. Table 1 shows the distribution of these documents.

The data were obtained from the Codex system of the Brazilian National Council of Justice (CNJ), which maintains the largest and most diverse set of legal texts in Brazilian Portuguese. As part of an agreement established with the researchers who authored this article, the CNJ provided these data for our research.

Our use of this corpus allowed us to pretrain variations of the LegalBert-pt model that are well-suited to handling the nuances and complexities of legal language in the Brazilian context. We drew upon previous research [3, 4, 35] that demonstrated the importance of using large and diverse datasets for training language models, particularly in domain-specific contexts.

Table 1. Statistics of Legal Documents by Data Source.

To minimize errors in the texts of the documents, we employed a two-stage preparation process consisting of pre-processing and cleaning. In the pre-processing stage, we removed documents with fewer than 50 words or with less than 80% valid words to ensure that the remaining corpus was of high quality. In the cleaning step, we removed special characters and extra spacing to further improve the corpus. Table 2 provides statistics on the types of legal documents that remained after the preparation steps. These documents were then used to train the various versions of the LegalBert-pt language model.
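To make the preparation rules concrete, the following minimal sketch applies both filters and the cleaning step. The definition of a "valid word" (here, a purely alphabetic token) and the set of characters preserved during cleaning are our assumptions, since the text does not specify them.

```python
import re

def keep_document(text: str, min_words: int = 50, min_valid_ratio: float = 0.80) -> bool:
    """Pre-processing filter: drop short or noisy documents."""
    words = text.split()
    if len(words) < min_words:
        return False
    # "Valid word" approximated as a purely alphabetic token (assumption).
    valid = sum(1 for w in words if re.fullmatch(r"[A-Za-zÀ-ÖØ-öø-ÿ]+", w))
    return valid / len(words) >= min_valid_ratio

def clean_text(text: str) -> str:
    """Cleaning step: remove special characters and extra spacing."""
    # The preserved punctuation set is illustrative (assumption).
    text = re.sub(r"[^\w\s.,;:!?()ºª§/-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```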

Table 2. Details of the training corpus used to pretrain the different variations of LegalBert-pt.

After the preparation steps, the documents were divided into sentences with a maximum size of 512 tokens, generating a total of about 12,000,000 sentences.
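As a rough illustration of this step, the sketch below greedily packs sentences into sequences of at most 512 tokens. The tokenizer checkpoint and the naive sentence splitter are illustrative assumptions; the text does not detail the exact segmentation procedure.

```python
from transformers import AutoTokenizer

# Published checkpoint, used here only for illustration.
tokenizer = AutoTokenizer.from_pretrained("raquelsilveira/legalbertpt_fp")

def split_document(document: str, max_len: int = 512):
    """Greedily pack sentences into sequences of at most max_len tokens."""
    sequences, current, current_len = [], [], 0
    for sentence in document.split(". "):  # naive sentence splitter (assumption)
        n = len(tokenizer.tokenize(sentence))
        if current and current_len + n > max_len - 2:  # reserve [CLS] and [SEP]
            sequences.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        sequences.append(" ".join(current))
    return sequences  # over-long single sentences would still need truncation
```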

3.2 Vocabulary Generation

Since BERTimbau [33] was trained on general-domain data, we believe that a language model trained for a specific domain, with a specialized vocabulary, can improve performance on domain-specific tasks. To this end, we generated a vocabulary consisting of 30,000 subword units using the SentencePiece library [15] and the Byte-Pair Encoding (BPE) algorithm [30]. We used 2 million random sentences from 1 million Wikipedia articles in Portuguese and 2 million random sentences from the 1.5 million legal documents in our pretraining dataset described in Sect. 3.1.
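A minimal sketch of this vocabulary training step, assuming the mixed Wikipedia-plus-legal sentences have been written to a single text file (one sentence per line); the file names and the `character_coverage` setting are illustrative, not the authors' exact configuration.

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="wiki_plus_legal_sentences.txt",  # 2M Wikipedia + 2M legal sentences
    model_prefix="legalbertpt_bpe",
    vocab_size=30000,
    model_type="bpe",
    character_coverage=0.9995,  # common setting for alphabetic languages (assumption)
)
```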

To ensure compatibility with the original BERT code, we converted the resulting vocabulary to the WordPiece format, following the BERT tokenization rules. First, we added all special BERT tokens ([CLS], [MASK], [SEP], and [UNK]) and punctuation characters to the vocabulary. Then, we split SentencePiece tokens that contain punctuation characters, removed the punctuation, and added the resulting subword units to the vocabulary. Finally, we prefixed subword units that do not begin with the SentencePiece metacharacter “_” with “##” and removed the “_” symbol from the remaining tokens.
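The conversion can be sketched as follows; this is our reading of the rules above, not the authors' exact script. `\u2581` is the SentencePiece word-boundary metacharacter (rendered as “_” in the text).

```python
import re

SP_META = "\u2581"  # SentencePiece word-boundary symbol

def sentencepiece_to_wordpiece(sp_tokens):
    """Sketch of the SentencePiece-to-WordPiece conversion described above."""
    vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
    for tok in sp_tokens:
        # Split tokens containing punctuation, discarding the punctuation itself.
        for piece in re.split(rf"[^\w{SP_META}]+", tok):
            if not piece:
                continue
            if piece.startswith(SP_META):
                vocab.append(piece[len(SP_META):])  # word-initial subword
            else:
                vocab.append("##" + piece)          # word-internal subword
    return list(dict.fromkeys(v for v in vocab if v))  # dedupe, keep order
```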

Additionally, we included 5,977 identifiers of Brazilian legislation in the vocabulary. The resulting LegalBert-pt vocabulary consists of 36,345 subwords.

Figure 1 compares the number of subwords in the LegalBert-pt and BERTimbau vocabularies. We found that 16,885 subwords are common between the two vocabularies, while 19,460 subwords are specific to LegalBert-pt.

Fig. 1. Number of subword units in the BERTimbau and LegalBert-pt vocabularies.

3.3 Variations of the LegalBert-pt Model

The use of BERT in downstream tasks involves two stages: pretraining and model fine-tuning. In the pretraining stage, the model is trained from scratch or with additional steps from an existing model to learn bidirectional context between tokens. This step is computationally intensive and should only be performed once. In the fine-tuning stage, a pretrained model is further trained on a specific task of interest, such as text classification or named entity recognition.

For our study, we developed two variations of the pretraining of legal domain language models in Brazilian Portuguese: (i) pretraining from scratch using a specific domain corpus (LegalBert-pt SC) and (ii) an adaptation of BERTimbau with pretraining using a specific domain corpus (LegalBert-pt FP). Both models were pretrained as case-sensitive, as we focused on developing general-purpose models and capitalization is relevant for tasks such as named-entity recognition.

We pretrained the models using only the Masked Language Model (MLM) task, as recent research [20] has suggested that the Next Sentence Prediction (NSP) task is not effective. The MLM objective allows the representation to learn both left and right context, which makes it possible to pretrain a deep bidirectional Transformer. The masked language model randomly chooses some tokens from the input and replaces each chosen token with the special [MASK] token with 80% probability, with a random vocabulary token with 10% probability, or keeps the original token with 10% probability. The goal is to predict the original vocabulary ID of each masked token based on its context.
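The masking procedure can be sketched as follows, mirroring the usual BERT recipe. The 15% selection rate is the standard setting and an assumption here, as is the handling of special tokens, which a full implementation would exclude from masking.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Apply the 80/10/10 MLM corruption scheme described above."""
    labels = input_ids.clone()
    selected = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~selected] = -100  # loss is computed only on selected positions

    # 80% of selected tokens: replace with [MASK]
    masked = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & selected
    input_ids[masked] = mask_token_id

    # 10% of selected tokens: replace with a random vocabulary token
    randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & selected & ~masked
    input_ids[randomized] = torch.randint(vocab_size, labels.shape)[randomized]

    # Remaining 10% of selected tokens: keep the original token unchanged.
    return input_ids, labels
```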

During training, we used the AdamW optimizer [21] with the following parameters: \(\beta _1 = 0.9\), \(\beta _2 = 0.999\), \(\epsilon = 10^{-6}\), and a learning rate of \(10^{-4}\).
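For reference, the reported configuration corresponds to the following sketch; the learning-rate schedule and warm-up, if any, are not specified in the text.

```python
import torch
from transformers import BertConfig, BertForMaskedLM

model = BertForMaskedLM(BertConfig())  # BERT-Base-sized model, see Sect. 3.3
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-6)
```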

The LegalBert-pt SC model has the same architecture as BERTimbau-Base, with 12 layers, 768 hidden units, and 12 attention heads (a total of 110 million parameters). We used this architecture in all of our experiments. For this model, we used the specialized vocabulary generated for the legal domain as described in Sect. 3.2. We pretrained this model for 7.5 million steps on the legal domain corpus described in Sect. 3.1.

For the LegalBert-pt FP model, we followed the approach outlined in [11], initializing the weights from the pretrained BERTimbau-Base checkpoint [33] and then performing additional pretraining steps on the legal domain corpus described in Sect. 3.1. [11] suggests performing up to 1,000,000 additional pretraining steps. In our case, we pretrained the model for up to 2.4 million steps to evaluate the effect of prolonged pretraining on downstream tasks. BERTimbau-Base was extensively pretrained on generic domains such as health, sports, technology and computing, laws and policies, among others, using a vocabulary of 30,000 subwords best suited to these generic domains. Therefore, we expect that domain-specific pretraining will result in better accuracy on specific tasks.

4 Evaluation

To evaluate the effectiveness of the pretrained LegalBert-pt language models, LegalBert-pt SC and LegalBert-pt FP, we conducted intrinsic and extrinsic evaluations, comparing them with a generic model, BERTimbau-Base [33], and with a Portuguese legal-domain language model, Legal-BERTimbau-base [17]. Intrinsic evaluation measures the quality of a model independent of any application, while extrinsic evaluation measures the usefulness of the model in a specific task.

We evaluated the models on two specific NLP tasks: Named Entity Recognition (NER) and Text Classification. These tasks were chosen to evaluate the application of the model at both the token level (NER) and the sentence level (text classification), and due to the availability of labeled datasets for them. For each task, we fine-tuned the pretrained model. In sentence-level tasks, classification was performed using the encoded representation of the special token [CLS], while in token-level tasks, the encoded representation of each token was used.
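With the published checkpoints, the two fine-tuning setups correspond to the standard Hugging Face heads sketched below; the label counts are assumptions drawn from the tasks described later (163 third-level subjects; 6 LENER-BR entity types in IOB, i.e., 13 tags).

```python
from transformers import AutoModelForSequenceClassification, AutoModelForTokenClassification

# Sentence-level task: a classifier over the encoded [CLS] representation.
classifier = AutoModelForSequenceClassification.from_pretrained(
    "raquelsilveira/legalbertpt_fp", num_labels=163)

# Token-level task: a classifier over each token's encoded representation.
tagger = AutoModelForTokenClassification.from_pretrained(
    "raquelsilveira/legalbertpt_fp", num_labels=13)
```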

We measured the performance of the pretrained models using perplexity and F1-score. Perplexity measures how well a language model predicts a sample of text and reflects the model’s ability to generate coherent and natural-sounding sentences. The F1-score measures the accuracy of the model in identifying named entities in text or classifying text into predefined categories.

4.1 Perplexity

Perplexity is an important intrinsic evaluation metric used to assess language model performance by quantifying the degree of uncertainty a model has about the predictions it makes. Low perplexity indicates that the model assigns high probability to the evaluation text, but it does not by itself guarantee accuracy on downstream tasks. Nevertheless, perplexity is often correlated with a model’s ultimate performance on specific tasks. Therefore, in addition to standard evaluation metrics, NLP researchers have started looking at perplexity to test how well language models capture language [23].
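For reference, perplexity over a text \(X = (x_1, \ldots, x_N)\) is conventionally defined as the exponentiated average negative log-likelihood; for masked language models such as BERT, it is typically computed in the pseudo-perplexity form, where each token \(x_i\) is predicted with the rest of the sentence \(x_{\setminus i}\) as context. We state the conventional formulation here; the text does not give the authors' exact computation:

\(\mathrm{PPL}(X) = \exp\left(-\frac{1}{N}\sum _{i=1}^{N}\log p_\theta (x_i \mid x_{\setminus i})\right)\)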

To analyze the perplexity of our language models, we built a corpus consisting of 750 legal documents obtained from various sources. This corpus included 250 legal documents representing initial petitions and complaints from the Court of Justice of the State of Ceará (TJ-CE) in Brazil, 250 legal documents of various types from the Public Ministry of the State of Ceará (MP-CE) in Brazil, and 250 legal documents from Extraordinary Appeals of the STF obtained randomly from the VICTOR dataset [4]. By evaluating perplexity on this corpus, we were able to assess how well our pretrained models captured the language used in legal documents in Brazilian Portuguese.

4.2 Named Entity Recognition

Named Entity Recognition (NER) is the task of identifying snippets of text that mention named entities (NEs) and classifying them into predefined categories such as person, organization, and location. Given a sequence of tokens, the model must output the entity tag of each token. We framed NER as a token-labeling task, performing entity identification and classification using the IOB tagging scheme [29].
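For illustration, a short Portuguese span tagged in IOB with a LENER-BR-style legislation label (the tag name is abbreviated here):

```python
tokens = ["O", "autor", "invocou", "a", "Lei",   "nº",    "8.112/90"]
tags   = ["O", "O",     "O",       "O", "B-LEG", "I-LEG", "I-LEG"]
```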

To date, there are few gold-standard datasets for named entities in the legal domain in Portuguese. To evaluate the NER model, we used two datasets separately, LENER-BR [3] and CDJUR [22], both with annotated entities in legal documents. The LENER-BR [3] dataset was built by manually annotating 66 legal documents from several Brazilian courts. Additionally, four legislative documents were included, totaling 70 annotated documents. The entities were categorized as “ORGANIZATION”, “PERSON”, “TIME”, “LOCATION”, “LEGISLATION”, and “LEGAL CASES”, resulting in a total of 12,248 entity annotations. The CDJUR [22] dataset contains 1,074 manually annotated legal documents, with a total of 44,526 labeled entities, and provides a detailed annotation of entities specific to the legal domain. For example, the “PERSON” category was subdivided into 9 entity types typically present in a judicial process, such as plaintiff, lawyer, defendant, victim, witness, judge, prosecutor, police authority, and others. “ADDRESSES” were subdivided into 6 entity types to identify the different addresses present in a lawsuit. The “LAWS” category was subdivided into three entity types: Main Law, Accessory Law, and Jurisprudence. Similarly, specifications were made for “EVIDENCE”, “PENALTY”, and “SENTENCE”.

We trained a NER model for each of these datasets, attaching a linear classifier layer on top of each language model to predict each token’s tag independently. The models’ performance was evaluated in terms of F1-score at the entity level, taking into account the partial correspondence between the predicted entity and the actual entity based on the Partial metric defined by MUC [9]. A partial correspondence is considered correct when the entity type of the model’s prediction matches the entity type of the gold annotation, but not necessarily with the same position limits. For example, if the annotation is {“entity_type”: “MAIN-LAW”, “text”: “Law n\(^\circ \) 8.112/90, of 12/11/1990”} and the model prediction is {“entity_type”: “MAIN-LAW”, “text”: “Law n\(^\circ \) 8.112/90”}, it is considered correct.
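A minimal sketch of this partial-match criterion, assuming entities are represented as (type, start, end) character spans with exclusive ends (the representation is our assumption):

```python
def partial_match(pred, gold):
    """MUC-style partial match: same entity type and overlapping spans."""
    p_type, p_start, p_end = pred
    g_type, g_start, g_end = gold
    return p_type == g_type and p_start < g_end and g_start < p_end

# The example from the text: the prediction covers a prefix of the annotation.
gold = ("MAIN-LAW", 0, 31)  # "Law nº 8.112/90, of 12/11/1990"
pred = ("MAIN-LAW", 0, 15)  # "Law nº 8.112/90"
assert partial_match(pred, gold)
```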

4.3 Text Classification

Text classification is a widely researched task in NLP and text mining, involving the assignment of one or more categories to a document from a set of options. There are different variants, such as binary classification, multiclass classification, and multi-label classification. NLP language models and machine learning models can be trained from a gold collection of documents to automate this task.

In the legal domain, text classification has a crucial application in the initial stages of the judicial process, when a petitioner submits a petition to the Justice [1, 2]. At this stage, the petitioner is required to specify the matter to which the claim pertains. In Brazil, the petitioner has to choose the topic from a hierarchy of over 4,000 subjects defined in the Unified Procedural Tables (TPU system) [32], maintained by the CNJ. Making the correct association with the hierarchy is not trivial and is often done incorrectly, causing delays in the judicial process and negative financial and societal impacts by generating rework and a sense of impunity.

We evaluated the classification performance of the language models using a gold collection of 64,000 textual petitions obtained from various Brazilian courts and maintained by the CNJ’s Codex system. Each petition is associated with the legal matter that the case addresses under the TPU. Each judicial process is associated with a hierarchy of matters represented by three levels. More specifically, the gold collection used in the experiment contains 213 legal matters associated with the initial petitions: 9 matters at the first level of the hierarchy, 41 at the second level, and 163 at the third level.

The evaluation of the models in the classification task was carried out in three scenarios. In the first scenario, the task was modeled as Hierarchical Text Classification (HTC), which categorizes text into a set of labels organized in a hierarchical structure; we used a contrastive learning approach to hierarchical text classification [36] with the pretrained language model to train the classifier. In the second scenario, the task was modeled as multiclass classification, in which each process is associated only with its third-level subject, therefore contemplating 163 classes. In these two scenarios, the text was truncated to the initial 512 tokens of the court case text. Complementarily, a third scenario was evaluated, in which the task was modeled as multiclass classification using the initial 8,192 tokens of the judicial process text (following [8]). In all scenarios, we evaluated the models in terms of the F1-score for the third-level subject of the hierarchy.
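In the two 512-token scenarios, the truncation amounts to the standard tokenizer call sketched below (checkpoint name as published; the petition text is a placeholder). The 8,192-token scenario instead requires a long-input setup following [8], which plain BERT cannot handle natively.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("raquelsilveira/legalbertpt_fp")

petition_text = "EXCELENTÍSSIMO SENHOR DOUTOR JUIZ DE DIREITO ..."  # placeholder
encoding = tokenizer(petition_text, truncation=True, max_length=512, return_tensors="pt")
```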

5 Experimental Results

The initial evaluation of the language models involved intrinsic evaluation using perplexity as a metric. A lower perplexity score indicates a better model: if a model assigns a high probability to the test set, it is not surprised by it, indicating a good understanding of how the language works. The results of the language models regarding perplexity are shown in Table 3. In the tables, the values in bold are the best values among the experiments with the models.

Table 3. Results of language models in terms of perplexity.

The LegalBert-pt FP language model performed best with a perplexity of 3.700, followed by the LegalBert-pt SC language model with a perplexity of 3.822, indicating that pretraining with a domain-specific corpus that includes diverse legal documents leads to a better understanding of the specific language. Models with lower perplexity are also expected to perform better on specific tasks. Therefore, the language models were fine-tuned for the specific tasks, namely NER and Text Classification, following [33].

The results of applying the language models to the Named Entity Recognition task, using the LENER-BR and CDJUR corpora described in Sect. 4.2, are presented in Tables 4 and 5, respectively. Both experiments were run 5 times for 10 epochs each to avoid bias, and the F1-score results are displayed in terms of the mean and the Standard Error of the Mean (SEM) of the runs. Values marked in bold indicate the best results, considering the mean and SEM.

Table 4. Results of the language models in terms of the mean ± standard error of the F1-score in Named Entity Recognition in the LENER-BR corpus.
Table 5. Results of the language models in terms of the mean ± standard error of the F1-score in Named Entity Recognition in the CDJUR corpus.

The LegalBert-pt FP model demonstrated superior results, outperforming the BERTimbau-Base by 0.65% in F1-score micro, 1.75% in F1-score macro, and 0.65% in F1-score weighted for NER in the LENER-BR dataset, and by 0.89% in F1-score micro, 3.78% in F1-score macro, and 1.49% in F1-score weighted for NER in the CDJUR dataset. This reinforces the benefits of using domain-specific pretrained models compared to generic ones, as LegalBert-pt FP can understand both the generic context and specific legal language.

Moreover, when compared with Legal-BERTimbau-base, the LegalBert-pt FP model outperformed by 0.65% in F1-score micro, 1.31% in F1-score macro, and 0.65% in F1-score weighted for NER in the LENER-BR dataset, and by 1.49% in F1-score micro, 4.48% in F1-score macro, and 2.41% in F1-score weighted for NER in the CDJUR dataset. These results reinforce the effectiveness of pretraining a legal domain model with a diverse set of legal documents, enabling a wider coverage of specific terms and acquiring greater domain specialization.

The superiority of the language models in NER with the LENER-BR dataset compared to the CDJUR dataset is also notable. This difference can be attributed to the specificity and number of entities present in each dataset. The LENER-BR dataset has only 6 specific entities of the legal domain (organization, person, time, place, legislation, and jurisprudence), while the CDJUR dataset has 21 specific entities of the legal domain (author, lawyer, defendant, victim, witness, judge, prosecutor, police authority and other persons, author’s address, offense address, defendant’s address, witness address, victim’s address, other addresses, main law, accessory law, jurisprudence, evidence, penalty, and sentence). Thus, it is more challenging for the model to recognize the more specific entities present in the CDJUR dataset.

Tables 6, 7 and 8 present the results in terms of F1-score for the application of the language models to the Text Classification task, in the different scenarios described in Sect. 4.3. To avoid biasing the results, all experiments were run 5 times, and the F1-score results are displayed in terms of the mean and SEM of the runs. All models were trained for 30 epochs, with early stopping and a patience of 5 epochs.

Table 6. Results of the language models in terms of the mean ± standard error of the F1-score in hierarchical text classification.

In the text classification task, in all evaluated scenarios, the LegalBert-pt models demonstrate superior results compared to the other models. In the scenario that evaluates hierarchical text classification, LegalBert-pt FP outperforms the generic BERTimbau-Base model by 1.19% in F1-micro and 0.4% in F1-weighted. When compared with Legal-BERTimbau-base, LegalBert-pt FP also outperforms it by 1.59% in F1-micro and 0.8% in F1-weighted.

It is important to note, however, that the relatively close results between the generic and specific models in the text classification task suggest that the use of a generic textual structure in the composition of the initial petitions may not require in-depth domain-specific knowledge. Thus, relevant linguistic features for classification may be more universal and captured by a generic language model. It is also important to recognize that the performance of a model may vary depending on the task and dataset, and a specific model may be better suited for a particular task.

Table 7. Results of the language models in terms of the mean ± standard error of the F1-score in text classification with 512 tokens.
Table 8. Results of the language models in terms of the mean ± standard error of the F1-score in text classification with 8,192 tokens.

In the multiclass classification approach, the LegalBert-pt models outperform the BERTimbau-Base results in all experiments. When the text is represented by its first 512 tokens, the LegalBert-pt FP result exceeds BERTimbau-Base by 2.6% in F1-micro, 4.6% in F1-macro, and 3.0% in F1-weighted. When the text is represented by its first 8,192 tokens, the LegalBert-pt FP result exceeds BERTimbau-Base by 3.0% in F1-micro, 3.7% in F1-macro, and 3.2% in F1-weighted.

One possible reason for LegalBert-pt FP’s better performance in text classification is that it was pretrained on a dataset with language similar to the language of the texts used in the classification task. This could have allowed the model to learn relevant linguistic features that contribute to improved performance.

The results of our experiments demonstrate that domain-specific language models outperform domain-general language models on domain-specific tasks such as NER and text classification. This suggests that pretraining with domain-specific texts allows the language model to learn richer and more specific representations of the domain-specific language compared to domain-general language models. However, it is important to note that specialization in a single domain limits the model’s applicability to tasks in that domain. Therefore, in scenarios where the application needs to handle multiple domains, a broader approach may be necessary, such as using generic language models.

6 Conclusion and Future Work

This article outlined the process of training, using, and evaluating a language model specifically designed for the legal domain. It demonstrated that the language model enhances accuracy in particular tasks within the legal field.

Experiments were conducted to determine the optimal strategy for generating a language model for a new domain, weighing the options of additional pretraining from a pre-existing generic domain model or starting from scratch. Consequently, two versions of the language model pretrained on legal domain documents were developed: LegalBert-pt SC (trained from scratch) and LegalBert-pt FP (pretrained using the BERTimbau-Base model). For the LegalBert-pt SC model, legal documents and articles from the Portuguese Wikipedia were utilized to create a vocabulary that includes both generic and legal domain-specific terms in Portuguese.

The legal domain language models were compared to the BERTimbau-Base (a generic domain language model for the Portuguese language) using intrinsic evaluation (measured through perplexity) and extrinsic evaluation (by fine-tuning the language model for specific tasks such as Named Entity Recognition and Text Classification). In both evaluations, the language models pretrained on legal domain documents yielded superior results compared to the BERTimbau-Base. Notably, the most significant performance gains were observed in the most challenging final tasks. The language models developed in this study are available under the OpenRAIL license and can be accessed at http://huggingface.co/raquelsilveira/legalbertpt_sc and http://huggingface.co/raquelsilveira/legalbertpt_fp. The models can be further fine-tuned using other types of legal documents, which presents an opportunity for ongoing improvement. This contribution is significant for both the scientific and practical communities, as it advances our understanding of how language models can enhance legal tasks.

In future research, we aim to examine the performance of LegalBert-pt on additional datasets and explore the model’s application in other tasks within the legal domain. This will help assess the extent to which understanding the specific language of these models impacts the accuracy of domain-specific tasks. Comparisons with GPT models and the utilization of the pretrained model to distill LLMs are also planned for future endeavors.