1 Introduction

The ability to reason over laws is essential for legal professionals, enabling them to interpret and apply legal principles to complex real-world situations. Legal questions often lack straightforward answers, requiring thorough analysis, comprehensive research, and synthesis of multiple sources to develop well-founded arguments or solutions. Tax law, in particular, is crucial because it influences how governments fund public services and impacts economic activity by shaping investment decisions and individual spending. However, interpreting tax law presents significant challenges for Natural Language Processing (NLP) due to the inherent complexity and ambiguity of legal language, constant updates, amendments, and the need to contextualize regulations within specific jurisdictions.

Large Language Models (LLMs) show significant potential in enhancing the legal reasoning process [23]. These models can process extensive legal texts, including statutes, case law, and legal opinions, to extract relevant information and address the complexities of tax law. By leveraging advanced generation techniques, LLMs can answer legal questions using specific legal datasets, such as court cases and legal precedents, thus providing comprehensive and relevant information to legal professionals [29]. However, there is a gap in understanding how LLMs reason over legal texts, as existing question-answering tasks typically contain answers directly extractable from the provided texts, whereas legal reasoning often requires deeper comprehension and application of legal principles to nuanced scenarios [2].

To address this gap, we developed a novel dataset comprising real questions posed by legal entities in the domain of tax law, answered by legal experts with supporting legal texts (gold passages). This dataset allows us to assess the legal reasoning abilities of LLMs, focusing on their capacity to understand complex legal questions, use relevant law articles, and generate accurate and coherent responses. The evaluation compares LLM-generated answers to expert responses using metrics such as ROUGE [19], BLEU [27], and semantic similarity [40], alongside assessments by a strong LLM [41], contributing to a deeper understanding of LLMs’ legal reasoning capabilities.

This research evaluates both open-source and proprietary LLMs in scenarios requiring comprehensive understanding and application of the law, distinct from the extractive approaches used in datasets like SQuAD [28] and TriviaQA [14]. Our dataset requires LLMs to comprehend and apply the law to generate appropriate answers, often involving complex vocabulary and contexts not directly mirrored in the texts [2].

This paper presents two significant contributions to the field of legal NLP, particularly within the challenging domain of tax law. First, it introduces a novel dataset consisting of real-world tax law questions, expert-crafted answers, and supporting legal texts, moving beyond extractive question-answering tasks and requiring models to demonstrate legal reasoning abilities. Second, using this dataset, we evaluate how well LLMs understand complex tax law questions and generate accurate, well-supported answers, providing a better understanding of current LLM capabilities and limitations in handling legal reasoning tasks.

2 Related Works

While there are numerous works utilizing Large Language Models (LLMs) in the legal domain [5, 22, 23, 25, 39], our interest lies in those that apply LLMs for question and answer (Q&A) tasks. These works can be classified into three main categories: those using retrieval-augmented generation, those evaluating LLMs based on prior knowledge, and those performing fine-tuning and testing the models on Q&A tasks in the legal domain. Below, we discuss the key works found in each of these categories.

2.1 Tax Law Applications of LLMs

Evaluating Q&A LLMs with Retrieval-Augmented Generation. The LLeQA [21] dataset includes 1,868 legal questions annotated by experts, containing answers and legal references. This work applies the Retrieval-Augmented Generation (RAG) technique, retrieving statutory articles from an extensive corpus of Belgian legislation. The model’s effectiveness is evaluated using the METEOR metric, demonstrating the feasibility of integrating information retrieval with LLMs to enhance the accuracy of legal responses. ChatLaw [5] addresses the creation of a large-scale language model for the legal domain, specifically in the Chinese context. This work combines vector database retrieval methods with keyword-based retrieval to increase the accuracy of responses. Integrating these techniques enables the model to provide more precise and contextually relevant answers.

Evaluating Q&A Legal Reasoning of LLMs. LAiW [6] proposes a benchmark for evaluating the capabilities of LLMs in the Chinese legal context. The aim of this work is to test how well models can handle specific legal tasks. The results show that some legal-specific LLMs perform better than their general counterparts, although there remains a significant gap compared to GPT-4 [26]. LawBench [8] offers a comprehensive assessment of LLM capabilities in legal tasks, including Q&A. This work extensively tested 51 popular LLMs, including 20 multilingual, 22 focused on Chinese, and 9 specific to law. The conclusion is that while fine-tuning LLMs on specific legal texts brings some improvements, the models still fall short of being usable and reliable for complex legal tasks.

Fine-Tuning and Evaluating Q&A Large Language Models. FedJudge [37] uses Federated Learning (FL) to overcome data privacy challenges. This framework optimizes federated legal LLMs, allowing the models to be trained locally on clients, with their parameters aggregated and distributed on a central server. FedJudge is evaluated on Q&A tasks using metrics such as ROUGE, BLEU, and BertScore to compare the quality of generated answers. This work demonstrates that the model provides more precise and relevant answers in different legal contexts. DISC-LawLLM [38] employs large language models trained on supervised datasets in the legal domain and incorporates a retrieval module to access and utilize external legal knowledge. This system assesses objective and subjective perspectives using DISC-Law-Eval, a benchmark that includes legal question answering. Additionally, subjective evaluation is carried out using the GPT-3.5 model as a judge.

3 Methodology

This section outlines the methodology utilized in our study, with a particular emphasis on the model selection process. We also detail the data collection process, the creation of a relevant corpus, and the experimental setups of the selected models, including the specific prompts and parameters used. Furthermore, we detail the evaluation approach, discussing both the metrics employed and the strategy for subjective evaluation.

3.1 Dataset Collection

Our dataset consists of a series of tax law questions related to legal entities. The questions were selected from a collection that is annually updated by the General Coordination of Taxation (Cosit) [9] of the Brazilian Federal Revenue Service. The dataset includes over a thousand question-answer pairs, with most answers being supported by a relevant normative or legal basis. The granularity of the references in the answers is as detailed as possible, citing the specific articles of law or other regulations used to formulate the responses. The questions represent real taxpayer doubts, and experts in the Brazilian tax field craft the answers. Below, we will discuss how the dataset was created.

Selection of Questions. We extracted a subsample from the comprehensive set of questions and answers provided by Cosit. In this selection process, we focused on questions that included responses with legal references rather than the entire regulation. Although the majority of responses included legal references, they were often elaborated by experts in a way that extended beyond the scope of the question or included excessive details such as tables and numerous examples. This complexity made them unsuitable for use in contexts like Retrieval-Augmented Generation (RAG). We excluded these overly detailed responses to ensure a fair evaluation with the LLMs. The initial outcomes of this selection process are depicted in the first three columns of Table 1.
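The selection step described above can be sketched as a simple filter. The field names and heuristics below are illustrative only, not the actual Cosit schema or the exact criteria used:

```python
# Illustrative sketch of the question-selection step; field names and
# heuristics are hypothetical, not the actual Cosit data schema.
def has_legal_reference(answer: str) -> bool:
    """Heuristic: keep answers that cite a specific article or regulation."""
    markers = ("art.", "artigo", "lei", "decreto", "instrução normativa")
    return any(m in answer.lower() for m in markers)

def is_overly_detailed(answer: str, max_chars: int = 2000) -> bool:
    """Heuristic: drop answers with tables or excessive length."""
    return "|" in answer or len(answer) > max_chars

qa_pairs = [
    {"question": "Qual a alíquota aplicável?",
     "answer": "Conforme art. 3 da Lei 9.249, a alíquota é de 15%."},
    {"question": "Como declarar?",
     "answer": "Preencha o formulário."},
]
selected = [p for p in qa_pairs
            if has_legal_reference(p["answer"]) and not is_overly_detailed(p["answer"])]
```

Here only the first pair survives, since it cites a specific article while the second contains no legal reference.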

Table 1. Questions and Answers with Legal References and Gold Passages

Collection of Regulations (Gold Passages). After selecting the questions and their corresponding legal references, laws, and articles, we gathered each regulatory document referenced by the experts in their responses to the questions posed by legal entities. Although this task was time-intensive, it was essential for assessing the reasoning capabilities of LLMs in relation to legal texts. Upon completion of this stage, the dataset comprised the question, answer, reference to the regulation, and the regulation itself (gold passages). Table 1 presents the final dataset.

Legislation Corpus. In this stage, we collected over 30 documents, which included laws, instructions, decrees, and opinions. Each document contains up to several thousand articles, each comprising multiple provisions. These documents represent a fraction of the Brazilian tax legislation and include the regulations that underpin the experts’ responses in the dataset. It is important to note that these regulations are constantly being amended, and many provisions have been revoked. All revoked provisions were excluded up to the dataset creation date to ensure a high-quality corpus. Additionally, any questions that had their regulatory basis revoked were eliminated during the question selection phase. Figure 1 shows the corpus documents.

3.2 Experimental Setup

In this study, we conducted a comprehensive evaluation of large language models (LLMs) in terms of their ability to reason about laws, focusing specifically on corporate taxation for legal entities. We evaluated the LLMs using the dataset created in this paper, built from real-world questions and answers about the taxation of legal entities provided by subject-matter experts.

We selected over 20 LLMs for evaluation, encompassing both proprietary and open-source models. The chosen models include notable examples such as Mistral AI, Llama, Gemma, Qwen, various community fine-tuned versions of these models, and a proprietary model. Each model possesses unique characteristics and capabilities, providing a diverse range of perspectives for our assessment.

In order to maintain consistency in our evaluations, we standardized the temperature parameter at 0.1 for all chosen models. This low-temperature setting was chosen to reduce randomness in the output, thereby encouraging more deterministic responses. Additionally, we did not impose a maximum token limit, allowing the models to generate responses without any constraints on their length.
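The standardized settings can be summarized in a small configuration sketch. The request-building helper is hypothetical; only the parameter values (temperature 0.1, no token limit) reflect the setup described above:

```python
# Illustrative generation settings; the request wrapper is a hypothetical
# sketch, but the parameter values match the experimental setup.
GENERATION_PARAMS = {
    "temperature": 0.1,   # low temperature to reduce randomness in outputs
    "max_tokens": None,   # no explicit limit on response length
}

def build_request(model_name: str, prompt: str) -> dict:
    """Assemble a request payload with the standardized parameters."""
    return {"model": model_name, "prompt": prompt, **GENERATION_PARAMS}

req = build_request("Qwen2-72B-Instruct", "Pergunta: ...")
```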

A specific prompt (see Prompt Question Answer in Appendix A) was crafted to guide the models in reasoning about the law and generating appropriate responses. The prompt explicitly instructs the models to reason through the legal context provided and formulate an answer. If a model is unable to generate a satisfactory response, it is instructed to state that it does not know the answer.

Fig. 1. Documents from the legislative corpus

The legal information needed to answer each question, such as an article from a law or a legal document, is included in the prompt. This is the same information used by the experts to craft the reference answers, ensuring a fair basis for comparison. By using standardized prompts and incorporating the relevant legal provisions, we ensure that the models have access to the same information as the human experts, enabling a thorough evaluation of their reasoning capabilities.
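Prompt assembly can be sketched as follows. The template wording here is illustrative only; the actual prompt appears in Appendix A:

```python
# Hedged sketch of prompt assembly; the instruction wording is illustrative
# (the actual prompt used in the experiments is given in Appendix A).
def build_qa_prompt(question: str, gold_passages: list[str]) -> str:
    """Combine the gold passages and the question into a single prompt."""
    context = "\n\n".join(gold_passages)
    return (
        "Considere os dispositivos legais abaixo e responda à pergunta.\n"
        "Se não souber a resposta, diga que não sabe.\n\n"
        f"Legislação:\n{context}\n\n"
        f"Pergunta: {question}\nResposta:"
    )
```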

It is important to note that the questions and the reference answers are presented in Brazilian Portuguese. This aspect of the study tests the models’ reasoning abilities and evaluates their proficiency in generating accurate and contextually appropriate responses in Portuguese. Given that many LLMs are primarily trained on English-language datasets, assessing their performance on Brazilian Portuguese legal texts is essential for understanding the applicability and limitations of these models in non-English-speaking jurisdictions.

Although the dataset used in this experiment contains a corpus suitable for Retrieval-Augmented Generation (RAG), our evaluation focused solely on the tasks of generation and reasoning. This decision was inspired by other prominent datasets, such as SQuAD 2.0 and HotpotQA, which also provide the expected passages alongside the ground truth answers, allowing for a direct assessment of the model’s generation capabilities without the retrieval step. By concentrating on these aspects, we aimed to isolate and thoroughly evaluate the LLMs’ ability to generate accurate and reasoned responses based solely on the provided legal context.

3.3 Evaluation Metrics

Our study evaluated large language models (LLMs) using a comprehensive approach integrating quantitative and qualitative methods. For the quantitative assessment, we employed the BLEU (Bilingual Evaluation Understudy) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics [19, 27]. In the field of natural language processing, these metrics are crucial for assessing the quality of text generation by comparing the models’ responses to a predefined set of reference answers. Specifically, in the domain of questions and answers related to corporate taxation, these metrics provide a quantitative measure of how closely the generated responses align with the ideal answers.
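As a minimal illustration of what these lexical-overlap metrics measure, the sketch below computes a simplified sentence-level ROUGE-L F1 from the longest common subsequence of whitespace tokens. Production implementations add tokenization rules, stemming, and corpus-level aggregation, so this is an assumption-laden approximation, not the exact metric used:

```python
# Simplified ROUGE-L F1 based on the longest common subsequence (LCS);
# real implementations add tokenization and stemming refinements.
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """F1 over LCS-based precision and recall of whitespace tokens."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, a candidate that reproduces the reference exactly scores 1.0, while one sharing no tokens scores 0.0.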

Despite their widespread use, metrics such as BLEU, ROUGE [37], and METEOR [21] primarily provide a quantitative perspective and may not fully capture the accuracy of responses in question-answering scenarios [20]. This limitation arises because these metrics do not adequately assess the factual accuracy or relevance of the generated responses, which is critical in determining whether the questions were answered correctly.

In order to address this gap, we adopted a more nuanced qualitative approach, utilizing the capabilities of a powerful language model as a surrogate for human judgment. Specifically, we employed GPT-4 to evaluate the performance of other models. This approach is premised on the notion that a robust LLM, such as GPT-4, can effectively emulate human judgment in evaluating responses [7, 10, 17, 20, 31, 35, 41] to open-ended questions, thereby providing a closer approximation to human evaluative criteria.

For the qualitative evaluation, we used a carefully designed prompt to assess the factual accuracy of the models’ responses. The accuracy of each model was then calculated based on this assessment. The specific prompt used for this evaluation can be found in Prompt Evaluation in Appendix A.
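Turning per-question judge verdicts into an accuracy score is straightforward. The verdict labels below are illustrative; the actual judge prompt and output format are given in Appendix A:

```python
# Minimal sketch of computing judge-based accuracy; the "correct"/"incorrect"
# labels are hypothetical stand-ins for the judge's actual output format.
def judge_accuracy(verdicts: list[str]) -> float:
    """Fraction of responses the LLM judge marked as factually correct."""
    if not verdicts:
        return 0.0
    return sum(v == "correct" for v in verdicts) / len(verdicts)

acc = judge_accuracy(["correct", "incorrect", "correct", "correct"])  # 0.75
```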

Table 2. Model Performance Metrics

4 Results

In this section, we present the results of the experiments described in the Experimental Setup section, using the metrics outlined in the Evaluation Metrics section (both in Sect. 3). We selected the main language models operating in Portuguese to evaluate how well they can reason and answer questions in the context of tax law, with the aim of identifying potential improvements (Table 2).

4.1 Model Performance Analysis

The latest versions of the Llama, Qwen, and Mistral families exhibit significant advancements compared to their predecessors. These models incorporate several architectural enhancements, including SwiGLU activation [30] and Grouped Query Attention (GQA) [4]. Both the Qwen2-72B-Instruct [1] and Llama-3-70b-chat-hf [3] models benefited from these improvements, particularly the modifications to the tokenizer and the inclusion of GQA, leading to notable performance gains. As a result, the Qwen2-72B-Instruct  [1] model achieved the highest accuracy. Similar results have been observed in other LLM evaluation benchmarks  [1], highlighting the superior performance of models incorporating these techniques.

The performance analysis of the models revealed that model size significantly impacts the results, but this impact is not always straightforward. Larger models, such as Qwen2-72B-Instruct [1] and Mixtral-8x22B-Instruct-v0.1 [13], achieved superior performance, exhibiting the highest ROUGE-L, BLEU, Bert Score F1, and GPT-4 evaluated accuracy metrics. However, we observed that smaller models, such as Mistral-7B-Instruct-v0.3 [12] and OpenHermes-2p5-Mistral-7B [32], outperformed some larger models in specific metrics. For instance, Mistral-7B-Instruct-v0.3 attained a Bert Score F1 of 0.71, surpassing several larger models, and OpenHermes-2p5-Mistral-7B demonstrated remarkable performance with accuracy comparable to significantly larger models. These findings suggest that while larger models generally deliver better results due to their ability to capture more complex information, well-trained and fine-tuned smaller models can offer competitive performance in specific contexts. This trend indicates that the training quality and the model’s suitability to the particular dataset are crucial factors that can mitigate the size disparity among models.

Although the volume of Portuguese data used in training these models has yet to be verified, the architectural and training improvements suggest the enhanced performance of the LLMs in Q&A tasks in corporate tax law. In the Mistral family, the Mixtral-8x22B-Instruct-v0.1 [24] model stood out with the highest scores in ROUGE-L, BLEU, and Bert Score F1, indicating the potential of the mixture of experts architecture [13] for legal texts in Portuguese.

The analysis of fine-tuned open-source models reveals significant improvements over the base models. The openchat-3.5-1210 [34] and OpenHermes-2p5-Mistral-7B [32], both derived from the Mistral-7B-v0.1 [12], showed notable increases in accuracy after fine-tuning. Similarly, the vicuna-13b-v1.5 and vicuna-7b-v1.5 models [41], fine-tuned from Llama 2 [33], also demonstrated advances in response accuracy. Furthermore, models such as WizardLM-13B-V1.2 [36], SOLAR-10.7B-Instruct-v1.0 [15, 16], and Platypus2-70B-instruct [18], derived from Llama 2, improved the results of their base models. Notably, these fine-tuning processes were conducted on diverse datasets, not on the experimental dataset itself, yet still led to enhanced metrics within the experimental dataset. These improvements suggest that fine-tuning can effectively enhance the capabilities of Q&A and legal text generation tasks when applied to specific datasets.

4.2 Evaluation Metrics Analysis

Traditional metrics such as BLEU and ROUGE may not fully capture the nuances needed for accurate question answering in Q&A tasks. The Bert Score F1 metric, in contrast, is widely recognized for its alignment with human evaluation due to its capacity to capture deep semantic similarities between texts, surpassing the lexical matching of ROUGE-L and BLEU [40].

While this study does not aim to prove that LLM as a judge for evaluation is aligned with human evaluation, recent studies have been exploring this alignment [20, 31, 35, 41]. Our research evaluates the quality of responses generated by LLMs in legal domain Q&A tasks. The strong correlation between the LLM (GPT-4) Accuracy Evaluation and Bert Score F1, as evidenced by the Pearson (0.657) and Kendall (0.491) correlations (see Table 3), suggests that both metrics capture semantic aspects relevant to human-perceived quality. The results are in line with studies [40] recommending using Pearson and Kendall correlations to evaluate metric quality.
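The two correlation coefficients used above can be computed directly from paired metric scores. The sketch below implements Pearson's r and Kendall's tau-a in plain Python (assuming non-constant inputs and no tie correction; library implementations such as SciPy's handle ties and edge cases):

```python
from itertools import combinations
from statistics import mean

def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; assumes non-constant inputs of equal length."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def kendall_tau(x: list[float], y: list[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total pairs, no tie correction."""
    pairs = list(combinations(range(len(x)), 2))
    conc = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) > 0)
    disc = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) < 0)
    return (conc - disc) / len(pairs)
```

Perfectly linearly related scores yield a Pearson correlation of 1.0, and fully reversed rankings yield a Kendall tau of -1.0.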

Table 3. Correlation Matrices for Evaluation Metrics
Fig. 2. Bland-Altman Plots for Metrics vs LLM (GPT-4) Accuracy Evaluation

Furthermore, the Bland-Altman analysis, which is particularly suitable for comparing measurement methods [11], confirms that Bert Score F1 and LLM (GPT-4) Accuracy Evaluation are more concordant, as shown by the lower variation in point dispersion and the narrower width of the limits of agreement (see Fig. 2). In contrast, ROUGE-L and BLEU demonstrated higher mean differences and wider limits of agreement. Bert Score F1 exhibited a mean difference close to zero and narrower limits of agreement, indicating better concordance with LLM Accuracy Evaluation measurements.
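Concretely, the quantities read off a Bland-Altman plot are the mean of the pairwise differences and the 95% limits of agreement (mean ± 1.96 times the standard deviation of the differences). A minimal sketch, with plotting omitted:

```python
from statistics import mean, stdev

def bland_altman(a: list[float], b: list[float]) -> tuple[float, float, float]:
    """Mean difference and 95% limits of agreement between two paired measurements."""
    diffs = [x - y for x, y in zip(a, b)]
    md, sd = mean(diffs), stdev(diffs)
    return md, md - 1.96 * sd, md + 1.96 * sd
```

A metric in close agreement with the reference measurement shows a mean difference near zero and narrow limits, which is the pattern reported above for Bert Score F1.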

These findings imply that LLM (GPT-4) Accuracy Evaluation, like Bert Score F1, could be a valuable and representative metric for assessing the actual performance of language models. Although ROUGE-L and BLEU show higher correlations with Bert Score F1, the stronger correlation and concordance of LLM Accuracy Evaluation with Bert Score F1 indicate its potential alignment with human evaluation. This supports the development of evaluation metrics that more accurately reflect human-perceived quality, in line with current research investigating the use of LLMs as proxies for human evaluation [20, 31, 35, 41].

5 Conclusion

This study underscores the importance of tax law in society and the potential of language models to assist in its understanding and application. We developed a novel dataset of real-world tax law questions and expert answers in Brazilian Portuguese and conducted a rigorous evaluation of various language models. While our findings suggest that these models show promise in comprehending and reasoning about complex legal texts, further research is necessary to fully demonstrate their effectiveness in legal reasoning across a broader range of scenarios and tasks.

Our evaluation showed that advancements in model architecture have a noticeable impact on performance, and fine-tuning open-source models, even when done on diverse datasets rather than those specific to the legal domain, can still improve their ability to generate relevant and accurate responses. This suggests that continuous improvements and adaptations are valuable in enhancing the capabilities of language models in legal tasks.

For assessing model performance, we used Bert Score F1, known for its strong correlation with human evaluations in tasks involving descriptive and structural understanding, and a newer metric, LLM Accuracy Evaluation. While Bert Score F1 is already established as an effective measure aligned with human judgment, especially in descriptive tasks, our results showed that LLM Accuracy Evaluation also demonstrated strong correlation with Bert Score F1 through Pearson and Kendall correlations. The Bland-Altman analysis further confirmed that the LLM metric aligns closely with Bert Score, suggesting its potential as a reliable alternative in evaluations. However, it is important to note that while these findings are encouraging, the use of these metrics for reasoning-based tasks, such as those in this study, still requires further validation. The LLM metric is a promising tool, but more research is needed to establish its effectiveness fully, particularly in capturing the nuances of legal reasoning.

Limitations and Future Work. A limitation of our study is that while it focused on evaluating the generation and reasoning capabilities of LLMs, it did not require the models to identify specific legal provisions as part of their responses. Our dataset includes a comprehensive corpus containing the necessary laws, enabling the application of Retrieval-Augmented Generation (RAG) techniques. This allows models to retrieve relevant legal provisions and incorporate them into their answers, which is essential for a more complete statutory reasoning process. By providing the relevant legal articles, which do not directly mirror the answers to the questions, our study did assess a portion of the statutory reasoning by testing the models’ ability to apply the law to generate accurate responses. Future work could leverage this corpus to explore the integration of RAG, aiming to enhance the models’ ability to not only generate correct answers but also to identify and cite the appropriate legal provisions, thereby achieving a more robust and comprehensive statutory reasoning.

Dataset and Code Availability. The dataset used in this study, as well as the code for reproducing the experiments and analyses, are publicly available. The dataset can be accessed at the following link: https://github.com/joaopaulopresa/dataset. The code can be found here: https://github.com/joaopaulopresa/code.