1 Introduction

Recent advances in Language Models (LMs) have generated significant interest due to their demonstrated capabilities on a wide range of language tasks, including text classification, language translation, and text generation [3, 7]. LM performance has been particularly impressive on standardized tests, which present challenging questions requiring high levels of domain-specific knowledge and reasoning. For instance, recent evaluations [16] showed that GPT-4 can achieve human-level performance on a variety of graduate-level benchmarks.

Despite the impressive performance of LMs on standardized tests, few evaluations have been conducted in Portuguese [15], partly due to the scarcity of available datasets in the language. This lack of high-quality, standardized datasets poses a significant challenge for researchers interested in developing and evaluating LMs in Portuguese. To address this gap for Brazilian Portuguese, we introduce BLUEX, a dataset consisting of entrance exams from the two leading universities in Brazil. Our dataset offers a rich source of high-quality, high school-level questions annotated with their respective subjects, as well as flags indicating the capabilities needed to answer each question accurately, such as knowledge of Brazilian culture or the application of mathematical reasoning. These annotations can be used to evaluate LM performance across a variety of subjects and capabilities, including domain-specific knowledge and reasoning. Additionally, BLUEX includes a collection of recently administered entrance exams that are unlikely to be included in the training data of many currently popular LMs.

In anticipation of the emergence of multimodal models that combine text and image understanding, we have annotated BLUEX to indicate the position of images in each question. Additionally, we have included all necessary images with the dataset to facilitate research on multimodal language tasks. We believe that this resource will be essential in evaluating the performance of models that reason with both text and image inputs to solve complex problems.

In this paper, we describe the creation and characteristics of BLUEX and establish a benchmark through experiments with state-of-the-art LMs. Our findings suggest that BLUEX provides a valuable resource for benchmarking and advancing the state-of-the-art in natural language understanding and reasoning in Portuguese. This is particularly relevant since even the current state-of-the-art models, such as GPT-4, still have considerable room for improvement and do not achieve the highest cutoff grades for both universities.

2 Related Work

Publicly available Portuguese Natural Language Processing (NLP) datasets remain scarce.

For question-answering tasks, Faquad [21] is available, which exhibits an extractive style akin to SQuAD [18]. It features questions concerning Brazilian higher education institutions, with documents sourced from a federal university and supplemented by Wikipedia articles. Another option is the Multilingual Knowledge Questions and Answers (MKQA) dataset, which covers 26 languages [12]. This dataset was generated by selecting 10,000 queries from the Natural Questions dataset [10] and acquiring new passage-independent answers for each question. Subsequently, human translators translated the questions and answers into 25 non-English, typologically diverse languages, including Portuguese.

Regarding sentence entailment tasks, ASSIN 1 and 2 [5, 19] are available. These datasets encompass Recognizing Textual Entailment (RTE), also referred to as Natural Language Inference (NLI), and Semantic Textual Similarity (STS) tasks. The former involves predicting if a given text (premise) implies another text (hypothesis), while the latter quantifies the semantic equivalence between two sentences.

The Portuguese Language Understanding Evaluation (PLUE) benchmark [6] provides Portuguese translations of the GLUE [26], SNLI [1], and SciTAIL [8] datasets. These translations have been generated using automatic translation tools including Google Translate and OpusMT [24].

The Winograd Schema Challenge (WSC) dataset [9] contains pairs of sentences with minimal differences, featuring an ambiguous pronoun that is resolved divergently between the two sentences. Melo et al. [13] manually translated and adapted this dataset to Portuguese.

For sentiment analysis tasks, the TweetsentBr dataset [2] consists of 15,000 tweets related to the TV show domain, collected between January and July 2017. The tweets were manually annotated by seven annotators into three classes: positive, neutral, and negative.

The Multilingual Amazon SLU resource package (SLURP) for Slot-filling, Intent classification, and Virtual assistant Evaluation (MASSIVE) [4] is a 1M-example dataset containing realistic virtual assistant utterances in 51 languages, including Portuguese. Professional translators translated the dataset from English, and it is annotated for slot (55 classes) and intent (60 classes) prediction tasks.

A dataset more closely related to BLUEX is the ENEM-challenge dataset [22], which includes the editions of the Brazilian national exam, Exame Nacional do Ensino Médio (ENEM), from 2009 to 2017. Additionally, Nunes et al. [15] introduced a dataset containing the 2022 ENEM exam; the same work evaluated the performance of LMs such as GPT-3.5-Turbo and GPT-4 on both the ENEM-challenge and ENEM 2022 datasets.

3 The BLUEX Dataset

3.1 Dataset Creation

BLUEX is a dataset comprising more than 1,000 multiple-choice questions from the entrance exams of the two leading universities in Brazil, Unicamp and USP, administered between 2018 and 2023. The dataset was created by automatically extracting each question's text, alternatives, and related images using scripts; each example was then manually reviewed to correct extraction errors and to add metadata such as image positioning.

3.2 Annotated Question Metadata

The annotated metadata is described below.

  • Prior Knowledge (PRK) - Indicates whether the question requires knowledge from outside of what has been provided in the question, such as familiarity with a particular author’s work or a specific mathematical formula.

  • Text Understanding (TU) - Indicates whether the question requires understanding of a particular text.

  • Image Understanding (IU) - Indicates whether the question requires understanding of an image. Note that not all questions containing images require understanding them to be answered.

  • Mathematical Reasoning (MR) - Indicates whether the question requires mathematical reasoning, such as the ability to perform calculations and symbolic manipulations.

  • Multilingual (ML) - Indicates whether the question requires knowledge of two or more languages, such as questions designed to test the English skills of Portuguese speakers.

  • Brazilian Knowledge (BK) - Indicates whether the question involves knowledge specific to Brazil, such as Brazilian history, literature, geography, or culture.

  • Subjects - A list of subjects related to the question, such as geography, physics, etc.

  • Related Images - A list of all the related images for the question.

  • Alternative Type - Indicates whether the answer choices are presented as text or as images. This is important because some questions may use images as answer choices, which requires different processing techniques than questions with only textual answers.
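To make the schema concrete, a single annotated entry could be represented as follows. This is an illustrative sketch only; the field names and value encodings below are hypothetical, not necessarily BLUEX's actual keys.

```python
# Hypothetical BLUEX-style record; field names are illustrative,
# not the dataset's actual schema.
question = {
    "id": "UNICAMP_2021_12",            # hypothetical identifier
    "question": "Considere o texto a seguir...",
    "alternatives": ["A) ...", "B) ...", "C) ...", "D) ..."],
    "answer": "B",
    "subjects": ["geography"],
    "capabilities": {                    # boolean flags from Sect. 3.2
        "PRK": True, "TU": True, "IU": False,
        "MR": False, "ML": False, "BK": True,
    },
    "related_images": [],                # empty for text-only questions
    "alternative_type": "text",          # "text" or "image"
}

def is_text_only(q):
    """True if a text-only LM can, in principle, process the question."""
    return not q["related_images"] and q["alternative_type"] == "text"

print(is_text_only(question))  # True for this example
```

A filter like `is_text_only` is how one would carve out the image-free subset used in the experiments of Sect. 4.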

By providing such annotations along with the questions we aim to facilitate research into language understanding and reasoning in Portuguese for both pure language models and multimodal models. We believe that BLUEX will be a valuable resource for researchers to evaluate and improve the performance of future language models in the context of Portuguese-language standardized tests.

3.3 Image Positioning

Many of the questions in the exams require a contextual or informational understanding of images. Despite active research in the field of multimodal models, models that can adeptly process both text and image data and yield satisfactory results remain scarce in the public domain. We believe that BLUEX can serve as an essential evaluation tool for such models. Anticipating the use of models that will process images and text in an interleaved manner, we also provide precise information regarding the placement of images within the question, as illustrated in Fig. 1.

Fig. 1. Example of image annotation in BLUEX.
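The placement annotations can be consumed by splitting a question into interleaved text and image segments. The sketch below assumes positions are stored as character offsets, which is one plausible encoding; BLUEX's actual format may differ.

```python
# Hypothetical sketch: splice image references into a question at the
# annotated positions, producing reading-order segments for a
# multimodal model. Assumes character-offset positions (an assumption,
# not necessarily BLUEX's encoding).
def interleave(text, image_positions):
    """Return a list of ("text", ...) / ("image", ...) segments.

    image_positions: list of (char_offset, image_file) pairs.
    """
    segments, cursor = [], 0
    for offset, image in sorted(image_positions):
        if offset > cursor:
            segments.append(("text", text[cursor:offset]))
        segments.append(("image", image))
        cursor = offset
    if cursor < len(text):
        segments.append(("text", text[cursor:]))
    return segments

parts = interleave("Observe a figura. O que ela mostra?", [(17, "fig1.png")])
# → [("text", "Observe a figura."), ("image", "fig1.png"),
#    ("text", " O que ela mostra?")]
```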

3.4 Dataset Distribution

The BLUEX dataset covers a wide range of high school subjects, including Mathematics, Physics, Chemistry, Biology, History, Geography, English, Philosophy, and Portuguese, as well as multidisciplinary questions that involve two or more subjects. The distribution of questions is shown in Table 1, where we also provide the distribution for the subset of questions without images, which accounts for approximately 58% of the total dataset.

Furthermore, Table 2 shows the distribution of the dataset across annotated categories, as explained in Sect. 3.2. We observe that the majority of questions require specific knowledge and the ability to comprehend text, two expected capabilities in students taking these exams. Note that any given question can be part of multiple categories.

Table 1. Distribution over subjects.
Table 2. Distribution over categories.

4 Results

To enable future comparisons, we evaluated our dataset using several language models ranging from 6B to 66B parameters, including OpenAI's GPT-4 and GPT-3.5-Turbo. None of the models received task-specific training. Each model was given one in-context example and then asked to answer a question from the test set. The example was randomly selected from an exam of the same university as the current question, but from a different year; for instance, if the current question is from UNICAMP 2019, the in-context example would come from a UNICAMP exam of another year. Since the language models we used can only process text, we excluded all questions containing images, leaving a total of 638 questions, approximately 60% of the dataset.
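The in-context example selection described above can be sketched as follows; the record fields ("university", "year") are illustrative, not the dataset's actual keys.

```python
import random

# Sketch of the one-shot example selection: pick a random question from
# the same university as the target, but from a different year.
# Field names are hypothetical.
def pick_example(target, pool, rng=random):
    candidates = [
        q for q in pool
        if q["university"] == target["university"]
        and q["year"] != target["year"]
    ]
    return rng.choice(candidates)

pool = [
    {"university": "UNICAMP", "year": 2018, "id": 1},
    {"university": "UNICAMP", "year": 2019, "id": 2},
    {"university": "USP", "year": 2020, "id": 3},
]
target = {"university": "UNICAMP", "year": 2019}
example = pick_example(target, pool)
print(example["id"])  # 1: the only UNICAMP question from another year
```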

Table 3. Accuracy in the BLUEX dataset.

Table 3 summarizes our experimental findings, including the mean score achieved by exam-taking students, as well as the mean cutoff score of the most competitive major, which is medicine at both universities. The BLUEX column shows accuracy over the entire evaluation subset, while the UNICAMP and USP columns cover only questions from the respective universities. The MR and BK columns include only questions annotated with those categories.

Among the language models tested in the 7B-parameter range, Sabiá [17], a model further pretrained on Portuguese, consistently outperformed all other models, coming close to matching the average human score. Among the open-source models in the 60B-parameter range, LLaMA 65B [25] significantly outperformed OPT 66B [30] and achieved performance similar to GPT-3.5-Turbo. Sabiá 65B performed better than GPT-3.5-Turbo but still lagged behind GPT-4 by ten points. GPT-4 was by far the best model in our evaluations but did not achieve an average score high enough to meet the cutoff for medicine, the most competitive major. It is worth noting that the average and cutoff scores in Table 3 are computed over the whole exam, including questions with images, whereas the language models were scored only on the subset of questions without images.

We also conducted a more detailed analysis of the models' performance by examining their ability to handle specific question types. Table 3 presents the findings for questions that required Mathematical Reasoning (MR) and Brazilian Knowledge (BK). We observe that, with the exception of GPT-4, all models struggled to perform significantly better than random chance on questions requiring Mathematical Reasoning; even GPT-4 only achieved an accuracy of 44% on MR questions. On the other hand, on questions requiring Brazilian Knowledge, Sabiá greatly outperformed all other models in the 7B-parameter range, indicating that the extra pretraining on Portuguese data provided the model with additional regional knowledge. In the 60B-parameter range, Sabiá also improved over LLaMA, increasing accuracy on these questions by 10 points and slightly outperforming GPT-3.5-Turbo. Nevertheless, it could not match the remarkable performance of GPT-4.
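The per-category columns of Table 3 amount to a filter-then-score pass over the annotated questions. A minimal sketch, under the assumption that capability flags are stored as a boolean dict (an illustrative schema, not necessarily BLUEX's):

```python
# Compute accuracy over all questions, or restricted to one capability
# flag (e.g. "MR" or "BK"). Field names are hypothetical.
def accuracy(questions, predictions, category=None):
    """Fraction of correct predictions on the (optionally filtered) set."""
    subset = [
        q for q in questions
        if category is None or q["capabilities"].get(category, False)
    ]
    if not subset:
        return 0.0
    correct = sum(predictions[q["id"]] == q["answer"] for q in subset)
    return correct / len(subset)

qs = [
    {"id": 1, "answer": "A", "capabilities": {"MR": True}},
    {"id": 2, "answer": "B", "capabilities": {"MR": False}},
]
preds = {1: "A", 2: "C"}
print(accuracy(qs, preds))        # 0.5 (one of two correct)
print(accuracy(qs, preds, "MR"))  # 1.0 (the single MR question is correct)
```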

Fig. 2. Accuracy of the best models over the years of the exams.

Moreover, Fig. 2 displays the performance of the top four models on the exams from each year. The models show small variance across years, which is expected, as the difficulty of each exam and the number of questions in the subset vary from year to year. A surprising result, however, is the improved performance that all models exhibit in 2023. The average and highest cutoff scores also increased slightly over the years, indicating that the exams became somewhat easier. Since the 2023 exams were administered very recently, they are unlikely to be part of any of the studied models' training data. The fact that performance on the most recent exams is comparable to that on older ones therefore suggests that the models are not merely memorizing the answers to the questions in the dataset.

5 Conclusion

This work introduced BLUEX, a new dataset consisting of 13 college entrance exams, administered between 2018 and 2023, from two of the leading Brazilian universities, UNICAMP and USP. Each question was extensively annotated to help measure different abilities across multiple subjects in Portuguese. Moreover, by providing images and their corresponding positions within the text, BLUEX is one of the few Portuguese datasets ready for evaluating multimodal models. We provide results from multiple LMs as baselines, together with reference scores based on students' performance, to facilitate future comparisons. We believe that BLUEX will be an important benchmark for evaluating the Portuguese capabilities of future models.

6 Future Work

The models used in this study employed a single in-context example. However, there is room for further investigation, such as determining whether increasing the number of few-shot examples could boost the performance of each model, as well as assessing their zero-shot performance. Furthermore, Nunes et al. [15] showed that GPT-4's performance on ENEM questions was significantly boosted when chain-of-thought prompts [28] were used. Adopting a similar approach here could potentially lead to performance improvements.

Finally, the performance of multimodal models can also be assessed with the BLUEX dataset. This provides an opportunity for researchers to investigate such models' capabilities in integrating visual and textual information to answer high school-level questions.