1 Introduction

Recent advances in Language Models (LMs) have generated significant interest due to their demonstrated capabilities on a wide range of language tasks, including text classification, language translation, and text generation [3, 7]. LM performance has been particularly impressive on standardized tests, which present challenging questions requiring high levels of domain-specific knowledge and reasoning. For instance, recent evaluations [16] showed that GPT-4 can achieve human-level performance on a variety of graduate-level benchmarks.

Despite the impressive performance of LMs on standardized tests, few evaluations have been conducted in Portuguese [15], partly due to the scarcity of available datasets in the language. This lack of high-quality, standardized datasets poses a significant challenge for researchers interested in developing and evaluating LMs in Portuguese. To address this gap for Brazilian Portuguese, we introduce BLUEX, a dataset consisting of entrance exams from the two leading universities in Brazil. Our dataset offers a rich source of high-quality, high school-level questions annotated with their respective subjects, as well as flags indicating the capabilities needed to answer each question accurately, such as knowledge of Brazilian culture or the application of mathematical reasoning. These annotations can be used to evaluate LM performance across a variety of subjects and capabilities, including domain-specific knowledge and reasoning. Additionally, BLUEX includes a collection of recently administered entrance exams that are unlikely to be included in the training data of many currently popular LMs.

In anticipation of the emergence of multimodal models that combine text and image understanding, we have annotated BLUEX to indicate the position of images in each question. Additionally, we have included all necessary images with the dataset to facilitate research on multimodal language tasks. We believe that this resource will be essential in evaluating the performance of models that reason with both text and image inputs to solve complex problems.

In this paper, we describe the creation and characteristics of BLUEX and establish a benchmark through experiments with state-of-the-art LMs. Our findings suggest that BLUEX provides a valuable resource for benchmarking and advancing the state-of-the-art in natural language understanding and reasoning in Portuguese. This is particularly relevant since even the current state-of-the-art models, such as GPT-4, still have considerable room for improvement and do not achieve the highest cutoff grades for both universities.

2 Related Work

Publicly available Portuguese Natural Language Processing (NLP) datasets remain scarce.

For question-answering tasks, Faquad [21] is available, which exhibits an extractive style akin to SQuAD [18]. It features questions concerning Brazilian higher education institutions, with documents sourced from a federal university and supplemented by Wikipedia articles. Another option is the Multilingual Knowledge Questions and Answers (MKQA) dataset, which covers 26 languages [12]. This dataset was generated by selecting 10,000 queries from the Natural Questions dataset [10] and acquiring new passage-independent answers for each question. Subsequently, human translators translated the questions and answers into 25 non-English, typologically diverse languages, including Portuguese.

Regarding sentence entailment tasks, ASSIN 1 and 2 [5, 19] are available. These datasets encompass Recognizing Textual Entailment (RTE), also referred to as Natural Language Inference (NLI), and Semantic Textual Similarity (STS) tasks. The former involves predicting if a given text (premise) implies another text (hypothesis), while the latter quantifies the semantic equivalence between two sentences.

The Portuguese Language Understanding Evaluation (PLUE) benchmark [6] provides Portuguese translations of the GLUE [26], SNLI [1], and SciTAIL [8] datasets. These translations have been generated using automatic translation tools including Google Translate and OpusMT [24].

The Winograd Schema Challenge (WSC) dataset [9] contains pairs of sentences with minimal differences, featuring an ambiguous pronoun that is resolved divergently between the two sentences. Melo et al. [13] manually translated and adapted this dataset to Portuguese.

For sentiment analysis tasks, the TweetsentBr dataset [2] consists of 15,000 tweets related to the TV show domain, collected between January and July 2017. The tweets were manually annotated by seven annotators into three classes: positive, neutral, and negative.

The Multilingual Amazon SLU resource package (SLURP) for Slot-filling, Intent classification, and Virtual assistant Evaluation (MASSIVE) [4] is a 1M-example dataset containing realistic virtual assistant utterances in 51 languages, including Portuguese. Professional translators translated the dataset from English, and it is annotated for slot (55 classes) and intent (60 classes) prediction tasks.

A dataset more closely related to BLUEX is the ENEM-challenge dataset [22], which includes the editions of the Brazilian national exam, Exame Nacional do Ensino Médio (ENEM), from 2009 to 2017. Additionally, Nunes et al. [15] introduced a dataset containing the 2022 ENEM exam; the same work evaluated the performance of LMs such as GPT-3.5-Turbo and GPT-4 on both the ENEM-challenge and ENEM 2022 datasets.

3 The BLUEX Dataset

3.1 Dataset Creation

BLUEX is a dataset comprising more than 1,000 multiple-choice questions from the entrance exams of the two leading universities in Brazil, Unicamp and USP, administered between 2018 and 2023. The dataset was created by automatically extracting each question's text, alternatives, and related images using scripts; each example was then manually reviewed to correct extraction errors and to add metadata such as image positioning.

3.2 Annotated Question Metadata

The annotated metadata is described below.

  • Prior Knowledge (PRK) - Indicates whether the question requires knowledge from outside of what has been provided in the question, such as familiarity with a particular author’s work or a specific mathematical formula.

  • Text Understanding (TU) - Indicates whether the question requires understanding of a particular text.

  • Image Understanding (IU) - Indicates whether the question requires understanding of an image. Note that not all questions containing images require understanding them to be answered.

  • Mathematical Reasoning (MR) - Indicates whether the question requires mathematical reasoning, such as the ability to perform calculations and symbolic manipulations.

  • Multilingual (ML) - Indicates whether the question requires knowledge of two or more languages, such as questions designed to test the English skills of Portuguese speakers.

  • Brazilian Knowledge (BK) - Indicates whether the question involves knowledge specific to Brazil, such as Brazilian history, literature, geography, or culture.

  • Subjects - A list of subjects related to the question, such as geography, physics, etc.

  • Related Images - A list of all the related images for the question.

  • Alternative Type - Indicates whether the answer choices are presented as text or as images. This is important because some questions may use images as answer choices, which requires different processing techniques than questions with only textual answers.
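To make the schema concrete, a single annotated entry could be represented as follows. This is an illustrative sketch only; the field names and value encodings below are hypothetical, not necessarily BLUEX's actual keys.

```python
# Hypothetical BLUEX-style record; field names are illustrative,
# not the dataset's actual schema.
question = {
    "id": "UNICAMP_2021_12",            # hypothetical identifier
    "question": "Considere o texto a seguir...",
    "alternatives": ["A) ...", "B) ...", "C) ...", "D) ..."],
    "answer": "B",
    "subjects": ["geography"],
    "capabilities": {                    # boolean flags from Sect. 3.2
        "PRK": True, "TU": True, "IU": False,
        "MR": False, "ML": False, "BK": True,
    },
    "related_images": [],                # empty for text-only questions
    "alternative_type": "text",          # "text" or "image"
}

def is_text_only(q):
    """True if a text-only LM can, in principle, process the question."""
    return not q["related_images"] and q["alternative_type"] == "text"

print(is_text_only(question))  # True for this example
```

A filter like `is_text_only` is how one would carve out the image-free subset used in the experiments of Sect. 4.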

By providing such annotations along with the questions we aim to facilitate research into language understanding and reasoning in Portuguese for both pure language models and multimodal models. We believe that BLUEX will be a valuable resource for researchers to evaluate and improve the performance of future language models in the context of Portuguese-language standardized tests.

3.3 Image Positioning

Many of the questions in the exams require a contextual or informational understanding of images. Despite active research in the field of multimodal models, models that can adeptly process both text and image data and yield satisfactory results remain scarce in the public domain. We believe that BLUEX can serve as an essential evaluation tool for such models. Anticipating the use of models that will process images and text in an interleaved manner, we also provide precise information regarding the placement of images within the question, as illustrated in Fig. 1.

Fig. 1. Example of image annotation in BLUEX.
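The placement annotations can be consumed by splitting a question into interleaved text and image segments. The sketch below assumes positions are stored as character offsets, which is one plausible encoding; BLUEX's actual format may differ.

```python
# Hypothetical sketch: splice image references into a question at the
# annotated positions, producing reading-order segments for a
# multimodal model. Assumes character-offset positions (an assumption,
# not necessarily BLUEX's encoding).
def interleave(text, image_positions):
    """Return a list of ("text", ...) / ("image", ...) segments.

    image_positions: list of (char_offset, image_file) pairs.
    """
    segments, cursor = [], 0
    for offset, image in sorted(image_positions):
        if offset > cursor:
            segments.append(("text", text[cursor:offset]))
        segments.append(("image", image))
        cursor = offset
    if cursor < len(text):
        segments.append(("text", text[cursor:]))
    return segments

parts = interleave("Observe a figura. O que ela mostra?", [(17, "fig1.png")])
# → [("text", "Observe a figura."), ("image", "fig1.png"),
#    ("text", " O que ela mostra?")]
```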

3.4 Dataset Distribution

The BLUEX dataset covers a wide range of high school subjects, including Mathematics, Physics, Chemistry, Biology, History, Geography, English, Philosophy, and Portuguese, as well as multidisciplinary questions that involve two or more subjects. The distribution of questions is shown in Table 1, where we also provide the distribution for the subset of questions without images, which accounts for approximately 58% of the total dataset.

Furthermore, Table 2 shows the distribution of the dataset across annotated categories, as explained in Sect. 3.2. We observe that the majority of questions require specific knowledge and the ability to comprehend text, two expected capabilities in students taking these exams. Note that any given question can be part of multiple categories.

Table 1. Distribution over subjects.
Table 2. Distribution over categories.

4 Results

To enable future comparisons, we evaluated our dataset using several language models ranging from 6B to 66B parameters, including OpenAI's GPT-4 and GPT-3.5-Turbo. None of the models received task-specific training. Each model was given one in-context example and then asked to answer a question from the test set. The example was randomly selected from an exam of the same university as the current question, but from a different year; for instance, if the current question is from UNICAMP 2019, the in-context example would come from a UNICAMP exam of another year. Since the language models we used can only process text, we excluded all questions containing images, leaving a total of 638 questions, approximately 60% of the dataset.
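The in-context example selection described above can be sketched as follows; the record fields ("university", "year") are illustrative, not the dataset's actual keys.

```python
import random

# Sketch of the one-shot example selection: pick a random question from
# the same university as the target, but from a different year.
# Field names are hypothetical.
def pick_example(target, pool, rng=random):
    candidates = [
        q for q in pool
        if q["university"] == target["university"]
        and q["year"] != target["year"]
    ]
    return rng.choice(candidates)

pool = [
    {"university": "UNICAMP", "year": 2018, "id": 1},
    {"university": "UNICAMP", "year": 2019, "id": 2},
    {"university": "USP", "year": 2020, "id": 3},
]
target = {"university": "UNICAMP", "year": 2019}
example = pick_example(target, pool)
print(example["id"])  # 1: the only UNICAMP question from another year
```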

Table 3. Accuracy in the BLUEX dataset.

Table 3 summarizes our experimental findings, including the mean score achieved by exam-taking students, as well as the mean cutoff score of the most competitive major, which is medicine at both universities. The BLUEX column shows accuracy over the entire evaluation subset, while the UNICAMP and USP columns cover only questions from the respective universities. The MR and BK columns include only questions annotated with those categories.

Among the language models tested in the 7B-parameter range, Sabiá [17], a model further pretrained on Portuguese, consistently outperformed all other models, coming close to matching the average human score. Among the open-source models in the 60B-parameter range, LLaMA 65B [25] significantly outperformed OPT 66B [30] and achieved performance similar to GPT-3.5-Turbo. Sabiá 65B performed better than GPT-3.5-Turbo but still lagged behind GPT-4 by ten points. GPT-4 was by far the best model in our evaluations but did not achieve an average score high enough to meet the cutoff for medicine, the most competitive major. It is worth noting that the average and cutoff scores in Table 3 are computed over the whole exam, including questions with images, whereas the language models were scored only on the subset of questions without images.

We also conducted a more detailed analysis of the models' performance by examining their ability to handle specific question types. Table 3 presents the findings for questions that required Mathematical Reasoning (MR) and Brazilian Knowledge (BK). We observe that, with the exception of GPT-4, all models struggled to perform significantly better than random chance on questions requiring Mathematical Reasoning; even GPT-4 only achieved an accuracy of 44% on MR questions. On the other hand, on questions requiring Brazilian Knowledge, Sabiá greatly outperformed all other models in the 7B-parameter range, indicating that the extra pretraining on Portuguese data provided the model with additional regional knowledge. In the 60B-parameter range, Sabiá also improved over LLaMA, increasing accuracy on these questions by 10 points and slightly outperforming GPT-3.5-Turbo. Nevertheless, it could not match the remarkable performance of GPT-4.
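The per-category columns of Table 3 amount to a filter-then-score pass over the annotated questions. A minimal sketch, under the assumption that capability flags are stored as a boolean dict (an illustrative schema, not necessarily BLUEX's):

```python
# Compute accuracy over all questions, or restricted to one capability
# flag (e.g. "MR" or "BK"). Field names are hypothetical.
def accuracy(questions, predictions, category=None):
    """Fraction of correct predictions on the (optionally filtered) set."""
    subset = [
        q for q in questions
        if category is None or q["capabilities"].get(category, False)
    ]
    if not subset:
        return 0.0
    correct = sum(predictions[q["id"]] == q["answer"] for q in subset)
    return correct / len(subset)

qs = [
    {"id": 1, "answer": "A", "capabilities": {"MR": True}},
    {"id": 2, "answer": "B", "capabilities": {"MR": False}},
]
preds = {1: "A", 2: "C"}
print(accuracy(qs, preds))        # 0.5 (one of two correct)
print(accuracy(qs, preds, "MR"))  # 1.0 (the single MR question is correct)
```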

Fig. 2. Accuracy of the best models over the years of the exams.

Moreover, Fig. 2 displays the performance of the top four models on the exams from each year. The models show small variance across years, which is expected, as the difficulty of each exam and the number of questions in the subset vary from year to year. A surprising result, however, is the improved performance that all models exhibit in 2023. The average and highest cutoff scores also increased slightly over the years, indicating that the exams became somewhat easier. Since the 2023 exams were administered very recently, they are unlikely to be part of any of the studied models' training data. The fact that performance on the most recent exams is comparable to that on older ones therefore suggests that the models are not merely memorizing the answers to the questions in the dataset.

5 Conclusion

This work introduced BLUEX, a new dataset consisting of 13 college entrance exams, administered between 2018 and 2023, from two of the leading Brazilian universities, UNICAMP and USP. Each question was extensively annotated to help measure different abilities across multiple subjects in Portuguese. Moreover, by providing images and their corresponding positions within the text, BLUEX is one of the few Portuguese datasets ready for evaluating multimodal models. We provide results from multiple LMs as baselines, together with reference scores based on students' performance, to facilitate future comparisons. We believe that BLUEX will be an important benchmark for evaluating the Portuguese capabilities of future models.

6 Future Work

The models used in this study employed a single in-context example. However, there is room for further investigation, such as determining whether increasing the number of few-shot examples could boost the performance of each model, as well as assessing their zero-shot performance. Furthermore, Nunes et al. [15] showed that GPT-4's performance on ENEM questions was significantly boosted when chain-of-thought prompts [28] were used. Adopting a similar approach here could potentially lead to performance improvements.

Finally, the performance of multimodal models can also be assessed with the BLUEX dataset. This provides an opportunity for researchers to investigate such models' capabilities in integrating visual and textual information to answer high school-level questions.