1 Introduction

Automated Essay Scoring (AES) is the computer technology that evaluates and scores written prose [27]. It aims to provide computational models for automatically grading essays with minimal human involvement [21]. This research area began with Page [21] in 1966 with the Project Essay Grader system, which, according to Ke and Ng [15], remains in use to this day.

In Brazil, most of the work for the automatic essay grading task focuses on the High School National Exam (ENEM - Exame Nacional do Ensino Médio) [5]. ENEM is used to assess the quality of high school education and as an admission test for most public and private universities. It provides an essay test and requires the production of a text in the dissertation-argumentative genre on a specific prompt (topic). Each ENEM essay is graded according to five traits (competencies) detailed below.

  1. Adherence to the formal written norm of Portuguese.

  2. Conform to the argumentative text genre and the proposed topic (prompt) to develop a text using knowledge from different areas.

  3. Select, relate, organize, and interpret data and arguments to defend a point of view.

  4. Usage of argumentative linguistic structures.

  5. Develop an intervention proposal to solve the problem in question.

Although existing works attempt to grade an essay considering the five traits, they do not handle off-topic essays, i.e., essays that do not adhere to the expected prompt [1, 9, 20]. Off-topic essays are related to the second trait of the ENEM exam and, according to Higgins et al. [12], may be classified into two types:

  1. Unexpected topic. Possibly well-written essays that do not address the expected topic.

  2. Bad-faith. Essays that mainly consist of text copied from the prompt or with irrelevant musings, such as purposely inserted chunks of text unrelated to the topic and the essay itself.

These essay types are challenging and considered a problem for automated essay scoring systems, as a well-written essay that does not address the proposed topic may receive an overestimated score from an automated grader because of linguistic features, such as text structure and surface characteristics [23]. Moreover, Kabra et al. [14] have shown that different state-of-the-art automated essay scoring methods fail to assign a low score to poor-quality essays. In addition, off-topic essays occur infrequently, making them difficult to detect automatically. For example, in the 2022 ENEM exam, of the 2,355,395 produced essays, only 31,734 (1.34%) were off-topic.

We investigated three strategies for detecting off-topic essays to tackle the above-mentioned challenges. The first is based on handcrafted features for feeding supervised machine learning algorithms to detect off-topic essays, while the second adopts a fine-tuned BERT model. The last approach explores prompt engineering, examining whether the language proficiency retained by a large language model is useful in detecting off-topic essays. These methods were evaluated using the Essay-BR corpus [19]. The fine-tuned BERT model achieved the best result with 75% balanced accuracy.

The rest of the paper is organized as follows: Sect. 2 briefly presents related work. Section 3 details the corpus used to evaluate and compare the approaches. Section 4 details the developed strategies. Section 5 reports and analyzes the achieved results. Finally, Sect. 6 concludes the paper by indicating future directions.

2 Related Work

Due to the difficulty of dealing with off-topic essays and the lack of a large corpus of such essays, most works for the Portuguese language have focused only on on-topic essays [9, 20]. As a result, there are only a few works on this theme; we briefly discuss them here.

Passero et al. [23] presented a systematic literature review on automatically detecting off-topic essays. They identified five papers, all for the English language. The studies dealt with off-topic essays mainly using textual features, such as Latent Semantic Analysis [17], n-grams, and Latent Dirichlet Allocation [4], to classify an essay as on- or off-topic. From this review, the authors found that the approaches had high error rates and that the studies mostly used artificial essay sets for validation.

Passero et al. [22] adapted the five studies identified by Passero et al. [23] to Portuguese. The authors evaluated the adaptations on a corpus of 2,164 essays. As this corpus has only 12 off-topic essays, the authors treated randomly selected on-topic essays as off-topic ones: for each set of N essays of a prompt (negative, or on-topic, examples), N essays are randomly selected from other prompts (positive, or off-topic, examples). The adapted work of Beigman Klebanov et al. [2] achieved the best result. It detects off-topic essays by estimating the topicality of each word in the essay, comparing its occurrence in essays from the same prompt to essays from other prompts.

Pinho et al. [25] compared several supervised machine learning methods to classify an essay as on- or off-topic. To compare the classifiers, they used a private corpus of 1,320 essays, 230 of which were off-topic. A convolutional neural network achieved the best result with 89.4% accuracy with three-fold cross-validation.

In the following section, we present the corpus used in this study.

3 Corpus

We used the Essay-BR corpus [19] to evaluate and compare the developed strategies. This corpus contains 4,570 argumentative essays graded according to the five traits of the ENEM exam. Out of the total number of essays, 82 (1.79%) are graded with a score of zero, a proportion similar to that of zero-scored essays in the 2022 ENEM exam. The corpus is organized according to Table 1.

Table 1. The number of on-topic and off-topic essays in the corpus.

To understand why these 82 essays received a zero score, we asked two linguists with experience in grading ENEM essays to evaluate them. For that, we provided the essays and prompts, and each reviewer evaluated all essays independently, following the ENEM criteria. The reviewers agreed that all 82 essays did not follow the expected prompt and received the zero score for not adhering to it.

Table 1 presents the number of on-topic and off-topic essays in the corpus for each set. As we can see, the corpus is highly imbalanced, indicating the need to apply sampling strategies to adjust the class distribution.

In what follows, we detail our approaches to handling off-topic essay detection.

4 Off-Topic Detection Strategies

Due to the class imbalance of the corpus, we first evaluated over- and under-sampling methods. For oversampling, we analyzed approaches that generate synthetic data, such as SMOTE (Synthetic Minority Over-Sampling Technique) [6] and ADASYN (Adaptive Synthetic Sampling) [11]. Our second oversampling strategy uses random on-topic essays as off-topic ones: we selected 1,543 on-topic essays from the training set and switched their prompts to transform them into off-topic ones. This way, we balanced the training set, resulting in 1,599 essays of each class. For undersampling, we used the random undersampling strategy from the Imbalanced-learn library [18]. It is important to note that these sampling methods were applied only to the training set.
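The prompt-swapping oversampling strategy can be sketched as follows; this is a minimal illustration, and the record fields (`text`, `prompt`, `off_topic`) are hypothetical, not the actual corpus schema.

```python
import random


def oversample_by_prompt_swap(essays, n_needed, seed=42):
    """Turn randomly chosen on-topic essays into synthetic off-topic ones
    by swapping each essay's prompt for a prompt from a different topic.

    `essays` is a list of dicts with 'text' and 'prompt' keys (illustrative
    field names); returns `n_needed` synthetic off-topic records."""
    rng = random.Random(seed)
    prompts = sorted({e["prompt"] for e in essays})
    synthetic = []
    for essay in rng.sample(essays, n_needed):
        # Pick any prompt different from the essay's own prompt.
        other = rng.choice([p for p in prompts if p != essay["prompt"]])
        synthetic.append({"text": essay["text"], "prompt": other, "off_topic": True})
    return synthetic
```

Appending the synthetic records to the on-topic training examples yields the balanced training set described above.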

After balancing the corpus, we developed two methods to detect off-topic essays, as depicted in Fig. 1. In addition to these methods, we checked whether a large language model may detect off-topic essays. In the following subsections, we detail these approaches.

Fig. 1. Approaches to detect off-topic essays.

4.1 Features

We manually extracted six features to detect off-topic essays. We extracted these features by comparing each paragraph of an essay to each paragraph of its corresponding prompt. We then compute the cosine similarity for each compared paragraph pair using the extracted feature values, applying Eq. 1, where \(\overrightarrow{u} \cdot \overrightarrow{v}\) is the dot product of the two vectors.

$$\begin{aligned} cos(\theta ) = \frac{\overrightarrow{u} \cdot \overrightarrow{v}}{\Vert \overrightarrow{u}\Vert \Vert \overrightarrow{v}\Vert } \end{aligned}$$
(1)

Finally, we calculate the arithmetic mean among all cosine similarity values to obtain a similarity between an essay and its prompt. We describe the features in what follows.

  • Boolean frequency (BF). This feature assigns a value of one if a word in an essay occurs in its corresponding prompt; otherwise, a value of zero is assigned, according to Eq. 2.

    $$\begin{aligned} BF_{i, j} = 1 \text { if } i \text { occurs in } j \text { and } 0 \text { otherwise} \end{aligned}$$
    (2)
  • Term frequency (TF). This feature computes how often a word in an essay occurs in its corresponding prompt. It is the relative frequency of a word in an essay within a prompt, as presented in Eq. 3.

    $$\begin{aligned} TF_{i, j} = \frac{f_{i, j}}{\displaystyle \sum _{i^{'} \in j} f_{i^{'}, j}} \end{aligned}$$
    (3)
  • Term Frequency-Inverse Document Frequency (TF-IDF). According to Eq. 4, this feature assigns the TF-IDF value to each word in an essay for the analyzed prompt, where \(TF_{i, j}\) is the term frequency of i in j, \(df_i\) is the number of prompts containing i, and N is the total number of prompts.

    $$\begin{aligned} W_{i, j} = TF_{i, j} \times \log \left( \frac{N}{df_{i}} \right) \end{aligned}$$
    (4)
  • Cosine of Word Embeddings (COS). To calculate the cosine similarity between an essay paragraph and a prompt paragraph, we obtained the embeddings for the words, computed the average of the word embeddings for each paragraph, and calculated the cosine similarity between these vectors. We used a 300-dimensional GloVe model pre-trained for Portuguese to obtain the embedding values [10].

  • Word Mover’s Distance (WMD). This feature assesses the distance between two documents even when they have no words in common [16]. It measures the dissimilarity between two text documents as the minimum cumulative distance that the embedded words of one document need to “travel” to reach the embedded words of the other. Note that WMD is a distance function, i.e., the lower the distance value, the more similar the documents. To compute the WMD, we first tokenized the paragraphs and removed stopwords using the Natural Language Toolkit (NLTK) [3]; next, we obtained the embeddings for the words of the paragraphs; finally, we used the method from the Gensim library [30] that receives a paragraph pair encoded as word embeddings and returns the WMD value.

  • Sentence Transformers (ST). Sentence Transformers is a modification of the pre-trained BERT network [8] that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity [26]. For this feature, we used a multilingual pre-trained model to obtain the embedding values for each paragraph pair. Then, we computed the cosine similarity between these vectors.
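The per-paragraph comparison described above can be sketched as follows, using the boolean frequency (BF) feature with Eq. 1 and averaging over all paragraph pairs; tokenization here is a naive whitespace split for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts (Eq. 1)."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def essay_prompt_similarity(essay_paragraphs, prompt_paragraphs):
    """Arithmetic mean of the cosine similarities over all essay/prompt
    paragraph pairs, using boolean word frequency (BF) as the feature."""
    sims = []
    for ep in essay_paragraphs:
        u = {w: 1.0 for w in ep.lower().split()}
        for pp in prompt_paragraphs:
            v = {w: 1.0 for w in pp.lower().split()}
            sims.append(cosine(u, v))
    return sum(sims) / len(sims) if sims else 0.0
```

The same scheme applies to the other features: only the paragraph vectors change (TF or TF-IDF weights, averaged embeddings, or sentence-transformer embeddings), while WMD replaces the cosine with a distance.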

We evaluated some classifiers to assess our approach after the feature extraction step. We used Support Vector Machines (SVM), Naïve Bayes (NB), Decision Tree (DT), MultiLayer Perceptron (MLP), Random Forest (RF) and XGBoost (XGB) from the Scikit-Learn library [24].
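A minimal sketch of this classifier comparison with Scikit-Learn follows; the data is synthetic, standing in for the six similarity features, and MLP and XGBoost (the latter from the separate xgboost package) are omitted for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: one row of six feature values per essay, with a toy
# labeling rule; the real X comes from the feature extraction in Sect. 4.1.
rng = np.random.default_rng(0)
X = rng.random((200, 6))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

classifiers = {
    "SVM": SVC(random_state=0),
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
}
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = balanced_accuracy_score(y_test, clf.predict(X_test))
```

Balanced accuracy is used as the comparison metric here because, as discussed in Sect. 3, the corpus is heavily imbalanced.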

4.2 BERT

We fine-tuned the BERTimbau model [29] to detect off-topic essays. For that, we adopted a method developed by de Sousa et al. [28] in which the essays and prompts are used to tune the model so that it may identify whether an essay adheres to a prompt. We used the large version of the BERTimbau model with the parameters presented in Table 2.

Table 2. Training parameters for the BERTimbau large model.

One can see that we employed a linear layer to make predictions. This layer takes BERT’s hidden representation (768 inputs) and produces a single output predicting whether the essay is on- or off-topic. We trained the model for six epochs and chose the best model by the lowest error on the validation set. We computed the loss using the Binary Cross-Entropy With Logits Loss, which combines a Sigmoid layer and the Binary Cross-Entropy Loss in one class and is appropriate for binary classification problems.
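For reference, a numerically naive sketch of what this loss computes for a single logit and target; PyTorch's BCEWithLogitsLoss implements the same quantity in a numerically stable form.

```python
import math


def sigmoid(z):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def bce_with_logits(z, y):
    """Binary cross-entropy applied to a raw logit z and target y in {0, 1}:
    the sigmoid and the cross-entropy are fused into a single function, as
    in the loss used to fine-tune the model."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))
```

For example, an uninformative logit of 0.0 against a positive target gives a loss of ln 2 ≈ 0.693, while a confident correct logit drives the loss toward zero.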

4.3 Prompt Engineering

We investigated whether the language proficiency retained by a large language model may identify an off-topic essay. For that, we used Gemini 1.5 Pro, a multimodal large language model developed by Google. This model has demonstrated remarkable capabilities on various tasks and has gained significant attention in diverse domains. Furthermore, it has a free-of-charge option with limits of 15 requests per minute, 1 million tokens per minute, and 1,500 requests per day, which is sufficient for our experiments.

Our prompt design is straightforward: we instructed Gemini to state whether an essay is on- or off-topic based on a prompt. We provided delimiters, such as <prompt> and ```essay```, to distinguish our inputs, as depicted in Fig. 2.

Fig. 2. A simple prompt asking to classify the input essay based on a prompt.

As shown in the above figure, we instructed the model to classify an essay as on- or off-topic given an essay and prompt as input.
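A sketch of how such a delimited prompt can be assembled; the instruction wording here is illustrative, and the exact phrasing used in Fig. 2 may differ.

```python
def build_prompt(topic, essay):
    """Assemble a classification prompt with explicit delimiters around
    the two inputs, in the spirit of Fig. 2."""
    return (
        "Classify the essay below as on-topic or off-topic "
        "with respect to the given prompt.\n"
        f"<prompt>{topic}</prompt>\n"
        f"```essay\n{essay}\n```\n"
        "Answer with exactly one label: on-topic or off-topic."
    )
```

The assembled string is then sent to the model as a single request, and the returned label is compared against the gold on-/off-topic annotation.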

In what follows, we detail the obtained results.

5 Results and Analysis

We evaluated our approaches using the test set of the Essay-BR corpus [19]. We obtained the best results with the undersampling strategy, as shown in Table 3.

Table 3. Results with undersampling strategy.

Our feature-based approach achieved its best result with the Random Forest classifier, while the fine-tuned BERTimbau model [29] achieved the best overall result. To our surprise, the Gemini model failed to identify off-topic essays. Since this model reports that it can detect essays that do not adhere to a topic, we expected it to achieve good results for this task.

After assessing the approaches, we performed a feature selection to identify the most important features in our feature-based strategy. To calculate the importance of each feature, we computed the Gini importance (Fig. 3), which measures how much a feature decreases node impurity in the decision trees, assigning more substantial weights to the most important features.
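For reference, the quantity behind the Gini importance is the impurity decrease of a split; summed over all splits on a feature (weighted by node size) and averaged over the trees of the forest, it yields that feature's importance. A minimal sketch:

```python
def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_decrease(parent, left, right):
    """Impurity decrease achieved by splitting `parent` into `left` and
    `right`; accumulated per splitting feature, this is the basis of the
    Gini importance reported by Random Forest implementations."""
    n = len(parent)
    weighted_child = (
        len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_child
```

A split that perfectly separates the classes achieves the maximum decrease, while a split that leaves both children as mixed as the parent achieves none.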

Fig. 3. Importance of each feature.

From this figure, we can see that TF-IDF contributes the least to the accuracy of the model, while the features that contribute most are the sentence transformer, the cosine of word embeddings, and the word mover’s distance. This result is in line with the study of Huang et al. [13], in which the authors’ approach achieved good results in detecting off-topic essays.

Based on the feature selection, we removed the TF-IDF feature and re-evaluated our feature-based approach. With this, the recall (off-topic) and balanced accuracy metrics improved slightly. On the other hand, the recall (on-topic) and f-score metrics were slightly worse, as shown in Table 4.

Table 4. Results after feature selection.

We also investigated the results obtained by the BERTimbau model through a confusion matrix, presented in Table 5. From this table, we can see a high number of false negatives (27), justifying the low precision value in the off-topic class.

Table 5. Confusion matrix of the BERTimbau approach.
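The per-class precision, recall, and balanced accuracy discussed in this section can be recovered from the raw confusion-matrix cells as follows; the cell values in the example call are illustrative, not the paper's actual matrix.

```python
def class_metrics(tp, fp, fn, tn):
    """Precision and recall for the positive class, plus the balanced
    accuracy (mean of the two per-class recalls), from raw
    confusion-matrix cells."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    balanced_acc = (recall + specificity) / 2
    return precision, recall, balanced_acc


# Illustrative cell values only (off-topic taken as the positive class).
precision, recall, balanced_acc = class_metrics(tp=5, fp=2, fn=27, tn=100)
```

As the example shows, a large false-negative count drags down the off-topic recall even when the on-topic class is predicted almost perfectly, which in turn caps the balanced accuracy.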

We further analyzed the false negative predictions, performing an error analysis to understand the misclassifications. We found that, out of the twenty-seven false negatives, only four were good essays with scores greater than 800. The remaining essays had grades below 300, with issues such as wrong textual genre (six essays), poor and unstructured text (fifteen essays), and being only tangent to the topic (two essays).

Our analysis revealed that the strategy of tuning BERTimbau with essays and prompts makes the automatic classification more rigorous than human review. Even so, this result may be a contribution, as different state-of-the-art automated essay scoring methods fail to assign a low score to low-quality essays [14]. However, it is important to analyze the extent to which BERTimbau is strict with poor-quality essays; a deeper inspection of such essays remains necessary for future work.

The developed methods are publicly available at https://github.com/liara-ifpi/Off-topic-essay.

6 Conclusion and Future Work

This paper investigated methods for detecting off-topic essays: handcrafted features feeding supervised machine-learning algorithms, a fine-tuned BERTimbau model, and the Gemini language model. These strategies were evaluated on a Brazilian Portuguese corpus, yielding lessons learned, challenges, and contributions. As a lesson learned, we highlight the potential of the sentence-transformer feature to identify off-topic essays. As a challenge, we may mention the need to create a corpus with more off-topic essays to improve the precision of classifiers. Finally, one contribution of our experiments is the demonstrated potential of the fine-tuned BERTimbau model to score or identify low-quality essays.

Future work includes incorporating the sentence-transformer feature into the BERTimbau model, investigating unsupervised learning strategies, as very recent work has achieved good results for the English language [7], and building a more balanced corpus.