Abstract
Automated Essay Scoring is one of the most important educational applications of natural language processing. It helps teachers with automatic assessments, providing a cheaper, faster, and more deterministic approach than humans when scoring essays. Nevertheless, off-topic essays pose challenges in this area, causing an automated grader to overestimate the score of an essay that does not adhere to a proposed topic. Thus, detecting off-topic essays is important for dealing with unrelated text responses to a given topic. This paper explored approaches based on handcrafted features to feed supervised machine-learning algorithms, tuning a BERT model, and prompt engineering with a large language model. We assessed these strategies in a public corpus of Portuguese essays, achieving the best result using a fine-tuned BERT model with a 75% balanced accuracy. Furthermore, this strategy was able to identify low-quality essays.
1 Introduction
Automated Essay Scoring (AES) is the computer technology that evaluates and scores written prose [27]. It aims to provide computational models for grading essays automatically or with minimal human involvement [21]. This research area began with Page [21] in 1966 with the Project Essay Grader system, which, according to Ke and Ng [15], remains in use to this day.
In Brazil, most of the work for the automatic essay grading task focuses on the High School National Exam (ENEM - Exame Nacional do Ensino Médio) [5]. ENEM is used to assess the quality of high school education and as an admission test for most public and private universities. It provides an essay test and requires the production of a text in the dissertation-argumentative genre on a specific prompt (topic). Each ENEM essay is graded according to five traits (competencies) detailed below.
-
1.
Adherence to the formal written norm of Portuguese.
-
2.
Conform to the argumentative text genre and the proposed topic (prompt) to develop a text using knowledge from different areas.
-
3.
Select, relate, organize, and interpret data and arguments to defend a point of view.
-
4.
Usage of argumentative linguistic structures.
-
5.
Develop an intervention proposal to solve the problem in question.
Although existing works attempt to grade an essay considering the five traits, they do not approach off-topic essays, i.e., essays that do not adhere to the expected prompt [1, 9, 20]. Off-topic essays are related to the second trait of the ENEM exam, and according to Higgins et al. [12], they may be classified into two types:
-
1.
Unexpected topic. Possibly well-written essays that do not address the expected topic.
-
2.
Bad-faith. Essays that mainly consist of text copied from the prompt or with irrelevant musings, such as purposely inserted chunks of text unrelated to the topic and the essay itself.
These essay types are challenging and considered a problem for automated essay scoring systems, as a well-written essay that does not address the proposed topic may receive an overestimated score from an automated grader because of linguistic features, such as text structure and surface [23]. Moreover, Kabra et al. [14] have shown that different state-of-the-art automated essay scoring methods fail to assign a low score to poor-quality essays. In addition, off-topic essays occur infrequently, making them difficult to detect automatically. For example, of the 2,355,395 essays produced in the 2022 ENEM exam, only 31,734 (1.34%) were off-topic.
We investigated three strategies for detecting off-topic essays to tackle the above-mentioned challenges. The first is based on handcrafted features for feeding supervised machine learning algorithms to detect off-topic essays, while the second adopts a fine-tuned BERT model. The last approach explores prompt engineering, examining whether the language proficiency retained by a large language model is useful in detecting off-topic essays. These methods were evaluated using the Essay-BR corpus [19]. The fine-tuned BERT model achieved the best result with 75% balanced accuracy.
The rest of the paper is organized as follows: in Sect. 2, we briefly present related work. Section 3 details the corpus used to evaluate and compare the approaches. In Sect. 4, we detailed the developed strategies. In Sect. 5, we reported and analyzed the achieved results. Finally, Sect. 6 concludes the paper by indicating future directions.
2 Related Work
Due to the difficulty of dealing with off-topic essays and the lack of a large corpus of such essays, most works for the Portuguese language have focused only on on-topic essays [9, 20]. Consequently, there are only a few works on this theme, which we briefly discuss here.
Passero et al. [23] presented a systematic literature review on automatically detecting off-topic essays. They identified five papers, all of them for the English language. The studies dealt with off-topic essays, mainly using textual features, such as Latent Semantic Analysis [17], n-grams, and Latent Dirichlet Allocation [4], among others, to classify an essay as on- or off-topic. From this study, the authors found that the approaches had high error rates, and the studies mostly used artificial essay sets for validation.
Passero et al. [22] adapted the five studies identified by Passero et al. [23] to Portuguese. The authors evaluated the adaptations on a corpus of 2,164 essays. As this corpus has only 12 off-topic essays, the authors adopted the following strategy to treat random on-topic essays as off-topic ones: for each set of N essays of a prompt (negative or on-topic examples), N essays are randomly selected from other prompts (positive or off-topic examples). The adapted work of Beigman Klebanov et al. [2] achieved the best result. It detects off-topic essays by estimating the topicality of each word in the essay, comparing its occurrence in essays from the same prompt against essays from other prompts.
Pinho et al. [25] compared several supervised machine learning methods to classify an essay as on- or off-topic. To compare the classifiers, they used a private corpus of 1,320 essays, 230 of which were off-topic. A convolutional neural network achieved the best result with 89.4% accuracy with three-fold cross-validation.
In the following section, we present the corpus used in this study.
3 Corpus
We used the Essay-BR corpus [19] to evaluate and compare the developed strategies. This corpus contains 4,570 argumentative essays graded according to the five traits of the ENEM exam. Of these, 82 essays (1.79%) are graded with a score of zero, a proportion similar to that of zero-scored essays in the 2022 ENEM exam. The corpus is organized according to Table 1.
To understand why these 82 essays received a zero score, we asked two linguists with experience in grading ENEM essays to evaluate them. For that, we provided the essays and their prompts, and each reviewer evaluated all essays independently, following the ENEM criteria. The reviewers agreed that all 82 essays did not follow the expected prompt and received a zero score for not adhering to it.
Table 1 presents the number of on-topic and off-topic essays in the corpus for each set. As we can see, the corpus is very unbalanced, indicating the need to apply sampling strategies to adjust the class distribution of the corpus.
In what follows, we detail our approaches to handling off-topic essay detection.
4 Off-Topic Detection Strategies
Due to the imbalance of the corpus, we first evaluated over- and under-sampling methods. For oversampling, we analyzed approaches that generate synthetic data, such as SMOTE (Synthetic Minority Over-Sampling Technique) [6] and ADASYN (Adaptive Synthetic Sampling) [11]. Our second oversampling strategy turns random on-topic essays into off-topic ones: we selected 1,543 on-topic essays from the training set and switched their prompts to transform them into off-topic examples, balancing the training set at 1,599 on- and off-topic essays. For undersampling, we used the random undersampling strategy from the Imbalanced-learn library [18]. It is important to note that these sampling methods were applied only to the training set.
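The prompt-swapping oversampling strategy can be sketched as follows. This is a minimal illustration with a hypothetical record structure (the actual Essay-BR fields may differ): each synthetic off-topic example keeps an essay's text but pairs it with a prompt from a different topic.

```python
import random

def oversample_by_prompt_swap(on_topic, n_needed, seed=42):
    """Create synthetic off-topic examples by pairing on-topic essays
    with a prompt drawn from a *different* topic."""
    rng = random.Random(seed)
    sampled = rng.sample(on_topic, n_needed)
    off_topic = []
    for essay in sampled:
        # pick a prompt from any other essay whose prompt differs
        other = rng.choice([e for e in on_topic if e["prompt"] != essay["prompt"]])
        off_topic.append({"text": essay["text"], "prompt": other["prompt"], "label": 1})
    return off_topic

# toy corpus: 9 on-topic essays over 3 topics
corpus = [{"text": f"essay {i}", "prompt": f"topic {i % 3}", "label": 0}
          for i in range(9)]
synthetic = oversample_by_prompt_swap(corpus, n_needed=4)
```

Because each swapped prompt is guaranteed to differ from the essay's original one, every synthetic record is off-topic by construction.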
After balancing the corpus, we developed two methods to detect off-topic essays, as depicted in Fig. 1. In addition to these methods, we checked whether a large language model may detect off-topic essays. In the following subsections, we detail these approaches.
4.1 Features
We manually extracted six features to detect off-topic essays, comparing each paragraph of an essay to each paragraph of its corresponding prompt. For each compared paragraph pair, we compute the cosine similarity over the extracted feature values, applying Eq. 1, where \(\overrightarrow{u} \cdot \overrightarrow{v}\) is the dot product of the two vectors.
$$\begin{aligned} \cos (\overrightarrow{u}, \overrightarrow{v}) = \frac{\overrightarrow{u} \cdot \overrightarrow{v}}{\Vert \overrightarrow{u}\Vert \, \Vert \overrightarrow{v}\Vert } \end{aligned}$$(1)
Finally, we calculate the arithmetic mean among all cosine similarity values to obtain a similarity between an essay and its prompt. We describe the features in what follows.
-
Boolean frequency (BF). This feature assigns one value if a word in an essay occurs in its corresponding prompt. Otherwise, a zero value will be assigned, according to Eq. 2.
$$\begin{aligned} BF_{i, j} = 1 \text { if } i \text { occurs in } j \text { and } 0 \text { otherwise} \end{aligned}$$(2)
-
Term frequency (TF). This feature computes how often a word in an essay occurs in its corresponding prompt. It is the relative frequency of a word in an essay within a prompt, as presented in Eq. 3.
$$\begin{aligned} TF_{i, j} = \frac{f_{i, j}}{\displaystyle \sum _{i^{'} \in j} f_{i^{'}, j}} \end{aligned}$$(3)
-
Term Frequency-Inverse Document Frequency (TF-IDF). According to Eq. 4, this feature assigns the TF-IDF value to each word in an essay for the analyzed prompt, where \(TF_{i, j}\) is the term frequency of i in j, \(df_i\) is the number of prompts containing i, and N is the total number of prompts.
$$\begin{aligned} W_{i, j} = TF_{i, j} \times \log \left( \frac{N}{df_{i}} \right) \end{aligned}$$(4)
-
Cosine of Word Embeddings (COS). To calculate the cosine similarity between an essay paragraph and a prompt paragraph, we obtained the embeddings for the words, averaged the word embeddings of each paragraph, and computed the cosine similarity between the resulting vectors. We used a 300-dimensional GloVe model pre-trained for Portuguese to obtain the embedding values [10].
-
Word Mover’s Distance (WMD). This feature assesses the distance between two documents even when they have no words in common [16]. It measures the dissimilarity between two text documents as the minimum cumulative distance that the embedded words of one document need to “travel” to reach the embedded words of the other. Note that WMD is a distance function: the lower the value, the more similar the documents. To compute it, we first tokenized the paragraphs and removed stopwords using the Natural Language Toolkit (NLTK) [3]; next, we obtained the embeddings for the words of the paragraphs; finally, we used the method from the Gensim library [30] that receives a paragraph pair encoded as word embeddings and returns the WMD value.
-
Sentence Transformers (ST). This is a modification of the pre-trained BERT network [8] that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity [26]. For this feature, we used a multilingual pre-trained model to obtain the embedding values for each paragraph pair and then computed the cosine similarity between these vectors.
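The paragraph-pair comparison underlying these features can be sketched as follows. This is a simplification using raw term-frequency vectors as the representation (the paper combines six different representations, several embedding-based): every essay paragraph is compared against every prompt paragraph via Eq. 1, and the similarities are averaged.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors (Eq. 1)."""
    dot = sum(u[w] * v.get(w, 0) for w in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def essay_prompt_similarity(essay_paragraphs, prompt_paragraphs):
    """Arithmetic mean of cosine similarity over all essay/prompt paragraph pairs."""
    sims = [cosine(Counter(e.lower().split()), Counter(p.lower().split()))
            for e in essay_paragraphs for p in prompt_paragraphs]
    return sum(sims) / len(sims)

score = essay_prompt_similarity(
    ["the essay discusses education policy"],
    ["education policy in brazil", "a completely unrelated sentence"])
```

A lower mean similarity suggests an essay that drifts away from its prompt; the supervised classifiers below learn a decision boundary over such feature values rather than using a fixed threshold.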
We evaluated several classifiers to assess our approach after the feature extraction step: Support Vector Machines (SVM), Naïve Bayes (NB), Decision Tree (DT), MultiLayer Perceptron (MLP), Random Forest (RF), and XGBoost (XGB), using the Scikit-Learn library [24].
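A comparison along these lines can be sketched with scikit-learn. The synthetic feature matrix below is a stand-in for the six similarity features (on-topic essays are assumed to score higher similarities than off-topic ones); the actual experiments use the extracted Essay-BR features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# synthetic stand-in: 6 similarity features, higher values for on-topic (label 0)
X_on = rng.normal(0.6, 0.1, size=(300, 6))
X_off = rng.normal(0.3, 0.1, size=(300, 6))
X = np.vstack([X_on, X_off])
y = np.array([0] * 300 + [1] * 300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
bal_acc = balanced_accuracy_score(y_te, clf.predict(X_te))
```

The same fit/predict interface applies to the other classifiers evaluated (SVM, NB, DT, MLP, XGB), so swapping estimators requires changing only one line.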
4.2 BERT
We fine-tuned the BERTimbau model [29] to detect off-topic essays. For that, we adopted a method developed by de Sousa et al. [28] in which both the essays and their prompts are used to tune the model, so that the model may identify whether an essay adheres to a prompt. We used the large version of the BERTimbau model with the parameters presented in Table 2.
We employed a linear layer to make predictions. This layer takes BERT's hidden representation (768 inputs) and produces a single output predicting whether the essay is on- or off-topic. We trained the model for six epochs and chose the best model based on the lowest error on the validation set. We computed the loss using Binary Cross-Entropy With Logits Loss, which combines a Sigmoid layer and Binary Cross-Entropy Loss in a single class and is appropriate for binary classification problems.
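The prediction head and loss can be sketched numerically in pure Python. This corresponds to the linear layer plus what PyTorch's BCEWithLogitsLoss computes; the weights and the pooled representation here are illustrative placeholders, not the fine-tuned values.

```python
import math

def linear_head(hidden, weights, bias):
    """Single-output linear layer over the pooled BERT representation."""
    return sum(h * w for h, w in zip(hidden, weights)) + bias

def bce_with_logits(logit, target):
    """Binary cross-entropy computed directly on the logit, in the
    numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return max(logit, 0) - logit * target + math.log1p(math.exp(-abs(logit)))

# sanity check: for target 1 this equals -log(sigmoid(z))
z = 1.5
sigmoid = 1 / (1 + math.exp(-z))
loss = bce_with_logits(z, 1.0)
```

Folding the sigmoid into the loss avoids computing exp on large positive logits, which is why the combined class is preferred over a separate Sigmoid layer followed by plain binary cross-entropy.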
4.3 Prompt Engineering
We investigated whether the language proficiency retained by a large language model can identify an off-topic essay. For that, we used Gemini 1.5 Pro, a multimodal large language model developed by Google. This model has demonstrated remarkable capabilities on various tasks and has gained significant attention across diverse domains. Furthermore, it has a free-of-charge option with limits of 15 requests per minute, 1 million tokens per minute, and 1,500 requests per day, which was sufficient for our experiments.
Our prompt design is straightforward. We instructed Gemini to state whether an essay is on- or off-topic based on a prompt. We wrapped our inputs in delimiters, such as <prompt> and '''essay''', to distinguish them, as depicted in Fig. 2.
As shown in the above figure, we instructed the model to classify an essay as on- or off-topic given an essay and prompt as input.
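The instruction can be sketched as a simple template. The wording and delimiters below are illustrative, not the exact prompt sent to Gemini (which is shown in Fig. 2):

```python
def build_classification_prompt(prompt_text, essay_text):
    """Assemble the instruction with delimited inputs for the LLM."""
    return (
        "Classify the essay below as ON-TOPIC or OFF-TOPIC with respect "
        "to the given prompt. Answer with a single word.\n"
        f"<prompt>\n{prompt_text}\n</prompt>\n"
        f"'''\n{essay_text}\n'''"
    )

message = build_classification_prompt(
    "Challenges of public health in Brazil",
    "An essay about soccer tactics...")
```

Explicit delimiters keep the model from confusing the graded essay with the topic description, which matters here because bad-faith essays may copy text from the prompt itself.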
In what follows, we detail the obtained results.
5 Results and Analysis
We evaluated our approaches using the test set of the Essay-BR corpus [19]. We obtained the best results with the undersampling strategy, as shown in Table 3.
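Balanced accuracy, the metric reported throughout, is the arithmetic mean of per-class recalls, which keeps the dominant on-topic class from masking errors on the rare off-topic class. A minimal computation:

```python
def balanced_accuracy(y_true, y_pred, classes=(0, 1)):
    """Mean of per-class recall: robust to class imbalance."""
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# 8 on-topic (0) and 2 off-topic (1) examples; one off-topic essay is missed
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [1, 0]
score = balanced_accuracy(y_true, y_pred)  # (1.0 + 0.5) / 2 = 0.75
```

Note that plain accuracy on the same predictions would be 90%, illustrating why it is misleading when off-topic essays make up under 2% of the corpus.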
Our feature-based approach achieved its best result with the Random Forest classifier, while the fine-tuned BERTimbau model [29] achieved the best overall result. To our surprise, the Gemini model did not identify off-topic essays; since this model reports that it can detect essays that do not adhere to a topic, we expected it to achieve good results for this task.
After assessing the approaches, we performed a feature selection to identify the most important features in our feature-based strategy. To calculate the importance of each feature, we computed the Gini importance (Fig. 3), which measures how much each feature reduces node impurity across the decision trees, assigning larger weights to the most important features.
From this figure, we can see that the TF-IDF contributes the least to the accuracy of the model. At the same time, the features that contribute most to accuracy are the sentence transformer, the cosine of word embeddings, and the word mover’s distance. This result is in line with the study of Huang et al. [13], in which the approach developed by the authors achieved good results in detecting off-topic essays.
Based on the feature selection, we removed the TF-IDF feature and re-evaluated our feature-based approach. With this, our results in recall (off-topic) and balanced accuracy metrics improved a little. On the other hand, the results in the recall (on-topic) and f-score metrics were slightly worse, as shown in Table 4.
We also investigated the results obtained by the BERTimbau model through a confusion matrix, presented in Table 5. From this table, we can see a high number of false negatives (27), justifying the low precision value in the off-topic class.
We further analyzed the false negative predictions, performing an error analysis to understand the misclassifications. With this, we realized that out of twenty-seven false negatives, only four were good essays with scores greater than 800. The remaining essays had grades below 300, with issues such as textual genre (six essays), poor and unstructured text (fifteen essays), and tangent to the topic (two essays).
Our analysis revealed that the strategy of tuning BERTimbau with essays and prompts makes the automatic classification more rigorous than human review. Even so, this result may be a contribution, as different state-of-the-art automated essay scoring methods fail to assign a low score to low-quality essays [14]. However, it is important to analyze the extent to which BERTimbau is strict with poor-quality essays; a deeper inspection of such essays remains as future work.
The developed methods are publicly available at https://github.com/liara-ifpi/Off-topic-essay.
6 Conclusion and Future Work
This paper investigated methods for detecting off-topic essays: handcrafted features feeding supervised machine-learning algorithms, a fine-tuned BERTimbau model, and the Gemini language model. These strategies were evaluated on a Brazilian Portuguese corpus, yielding lessons learned, challenges, and contributions. As a lesson learned, we highlight the potential of the sentence-transformer feature to identify off-topic essays. As a challenge, we note the need for a corpus with more off-topic essays to improve the precision of classifiers. Finally, one contribution of our experiments is the potential of the fine-tuned BERTimbau model to score or identify low-quality essays.
Future work includes incorporating the sentence-transformer feature into the BERTimbau model, investigating unsupervised learning strategies, as very recent work has achieved good results for the English language [7], and building a more balanced corpus.
References
Amorim, E., Veloso, A.: A multi-aspect analysis of automatic essay scoring for Brazilian Portuguese. In: Proceedings of the Student Research Workshop at the 15th Conference of the European Chapter of the Association for Computational Linguistics, pp. 94–102. Association for Computational Linguistics, Valencia, Spain (2017)
Beigman Klebanov, B., Flor, M., Gyawali, B.: Topicality-based indices for essay scoring. In: Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 63–72. Association for Computational Linguistics, San Diego, CA (2016)
Bird, S., Klein, E., Loper, E.: Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc. (2009)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3(Jan), 993–1022 (2003)
Caseli, H.M., Nunes, M.G.V. (eds.): Processamento de Linguagem Natural: Conceitos, Técnicas e Aplicações em Português. BPLN, 2 edn. (2024). ISBN 978-65-00-95750-1
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
Das, S.D., Vadi, Y.A., Yadav, K.: Transformer-based joint modelling for automatic essay scoring and off-topic detection. In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 16751–16761. ELRA and ICCL, Torino, Italia (2024)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (2019)
Fonseca, E., Medeiros, I., Kamikawachi, D., Bokan, A.: Automatically grading Brazilian student essays. In: Villavicencio, A., et al. (eds.) PROPOR 2018. LNCS (LNAI), vol. 11122, pp. 170–179. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99722-3_18
Hartmann, N., Fonseca, E., Shulby, C., Treviso, M., Silva, J., Aluísio, S.: Portuguese word embeddings: evaluating on word analogies and natural language tasks. In: Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology, pp. 122–131. Sociedade Brasileira de Computação, Uberlândia, Brazil (2017)
He, H., Bai, Y., Garcia, E.A., Li, S.: Adasyn: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp. 1322–1328. IEEE, Hong Kong (2008)
Higgins, D., Burstein, J., Attali, Y.: Identifying off-topic student essays without topic-specific training data. Nat. Lang. Eng. 12(2), 145–159 (2006)
Huang, P., Li, L., Wu, C., Zhang, X., Liu, Z.: A study of sentence-bert based essay off-topic detection. In: Proceedings of the 4th International Conference on Computing, Networks and Internet of Things, pp. 515–519. Association for Computing Machinery, Xiamen, China (2023)
Kabra, A., Bhatia, M., Singla, Y.K., Jessy Li, J., Ratn Shah, R.: Evaluation toolkit for robustness testing of automatic essay scoring systems. In: Proceedings of the 5th Joint International Conference on Data Science & Management of Data (9th ACM IKDD CODS and 27th COMAD), pp. 90–99. Association for Computing Machinery, Bangalore, India (2022)
Ke, Z., Ng, V.: Automated essay scoring: a survey of the state of the art. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 6300–6308. AAAI Press, Macao, China (2019)
Kusner, M., Sun, Y., Kolkin, N., Weinberger, K.: From word embeddings to document distances. In: Proceedings of the 32nd International Conference on Machine Learning, pp. 957–966. Proceedings of Machine Learning Research, PMLR, Lille, France (2015)
Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Process. 25(2–3), 259–284 (1998)
Lemaître, G., Nogueira, F., Aridas, C.K.: Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning. J. Mach. Learn. Res. 18(17), 1–5 (2017)
Marinho, J.C., Anchiêta, R.T., Moura, R.S.: Essay-BR: a Brazilian corpus of essays. In: XXXIV Simpósio Brasileiro de Banco de Dados: Dataset Showcase Workshop. SBBD 2021, pp. 53–64. SBC, Online (2021)
Marinho, J.C., C., F., Anchiêta, R.T., Moura, R.S.: Automated essay scoring: an approach based on enem competencies. In: Anais do XIX Encontro Nacional de Inteligência Artificial e Computacional, pp. 49–60. SBC, Campinas, Brazil (2022)
Page, E.B.: The imminence of... grading essays by computer. The Phi Delta Kappan 47(5), 238–243 (1966)
Passero, G., Ferreira, R., Dazzi, R.L.S.: Off-topic essay detection: a comparative study on the Portuguese language. Revista Brasileira de Informática na Educação 27(03), 177–190 (2019)
Passero, G., Ferreira, R., Haendchen Filho, A., Dazzi, R.: Off-topic essay detection: a systematic review. In: Proceedings of the XXVIII Brazilian Symposium on Computers in Education, pp. 51–60. SBC, Recife, Brazil (2017)
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Pinho, C.M.D.A., Gaspar, M.A., Sassi, R.J.: Aplicação de técnicas de inteligência artificial para classificação de fuga ao tema em redações. Educação em Revista 40, e39773 (2024)
Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992. Association for Computational Linguistics, Hong Kong, China (2019)
Shermis, M.D., Barrera, F.D.: Exit assessments: evaluating writing ability through automated essay scoring. In: Annual Meeting of the American Educational Research Association, pp. 1–30. ERIC, New Orleans, LA (2002)
de Sousa, R.F., Marinho, J.C., Neto, F.A.R., Anchiêta, R.T., Moura, R.S.: PiLN at PROPOR: a BERT-based strategy for grading narrative essays. In: Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 2, pp. 10–13. Association for Computational Linguistics, Santiago de Compostela, Galicia/Spain (2024)
Souza, F., Nogueira, R., Lotufo, R.: BERTimbau: pretrained BERT models for Brazilian Portuguese. In: Cerri, R., Prati, R.C. (eds.) BRACIS 2020. LNCS (LNAI), vol. 12319, pp. 403–417. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-61377-8_28
Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of LREC 2010 workshop New Challenges for NLP Frameworks, pp. 46–50. University of Malta, Valletta, Malta (2010)
Acknowledgments
The authors are grateful to Fundação de Amparo à Pesquisa do Estado do Piauí (FAPEPI) and Virtex Telecom for supporting this work.
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Silva, J.M., Anchiêta, R.T., de Sousa, R.F., Moura, R.S. (2025). Investigating Methods to Detect Off-Topic Essays. In: Paes, A., Verri, F.A.N. (eds) Intelligent Systems. BRACIS 2024. Lecture Notes in Computer Science(), vol 15414. Springer, Cham. https://doi.org/10.1007/978-3-031-79035-5_24