1 Introduction

In the literature on artificial intelligence in education, certain behaviors have received significant attention in the context of students’ interactions with online learning environments. Recent research has identified two main types of behavior observed in students during these interactions: learning-oriented behavior and non-learning-oriented behavior. Non-learning-oriented behavior often involves students who do not effectively use the help resources available in virtual learning environments [7], resulting in an unproductive search for assistance [2]. This type of behavior is considered a form of “gaming the system” [12], defined as “trying to succeed in an educational environment by exploiting the properties of the system rather than learning the material and trying to use that knowledge to respond correctly”. Furthermore, this behavior can lead students to produce rushed or incomplete work [15], has been associated with lower academic performance [6], and can even result in almost zero learning [29]. Understanding these student behaviors in virtual learning environments is crucial for developing effective support and intervention strategies to improve student engagement and academic performance. More specifically, an important challenge is to find an effective mechanism for identifying gaming the system behavior, an issue that has been addressed by various proposals in the literature but remains largely unexplored in the programming domain.

Given this context, previous research has predominantly focused on two different approaches for modeling “gaming the system” behaviors. The first, known as knowledge engineering, involves the development of explicit rules by domain experts to identify “gaming” behaviors [17, 21, 23]. The second is machine learning, where the model designer first creates a set of features and a supervised learning algorithm then selects among these features to predict labels of human-coded “gaming” [12, 27].

Additionally, a recent study compared gaming detectors across multiple systems, including a knowledge engineering model and a machine-learned model [20]. The knowledge engineering model was developed using cognitive task analysis in a tutor in the algebra domain, identifying 13 student behavior patterns. Comparisons focused on the predictive performance of the models on new data sets and their interpretability. The results indicated that the knowledge engineering model achieved better generalization and interpretability than the machine-learned model. However, this model was developed by studying students interacting with an algebra tutor on a specific type of task and does not generalize to other contexts, as demonstrated by [13]. Despite the important contribution of that work [13] in considering some types of tasks, the diversity of tasks remains limited and does not cover other domains, such as computer programming.

This behavior has been particularly observed in the area of programming, although it is a relatively recent and little-documented phenomenon in the literature [25]. Even so, some recent studies have indicated that students taking introductory programming courses report high frequencies of this behavior [24, 28]. However, although previous studies have made progress in identifying and understanding this behavior using machine learning and knowledge engineering models, previously validated detectors have not fully captured the “gaming the system” attitudes observed in our learning context for beginning programmers. This challenge may be related to the limited transferability of gaming motivations, and of detectors, between different contexts. This limitation in accounting for domain characteristics may restrict the generalization of “gaming the system” detectors to new contexts.

With the growing demand for effective and engaging learning systems, the generalizability, interpretability, and development cost of “gaming the system” detectors are becoming increasingly important. Therefore, in this article we adopt an approach focused on the exploration and specific detection of “gaming the system” attitudes expressed by novice programmers interacting with a learning environment. We propose a hybrid approach to detect this behavior, combining knowledge engineering and machine learning algorithms. Furthermore, aiming to contribute to generalization, we used this approach to detect the behavior of beginning programmers in a specific demographic context: a rural school in the Northeast of Brazil, a region historically marked by high social inequality [19]. The research aimed to understand the attitudes of beginner programmers who exhibit “gaming the system” behavior and to detect such behavior. As a result, in addition to the hybrid detection approach, we identified two new gaming attitudes and incorporated them into the knowledge-based model, using cognitive task analysis to understand how experts encode these attitudes.

2 Background and Related Work

Research into detecting “gaming the system” behavior generally focuses on identifying and characterizing such behavior. However, it is crucial to distinguish between “gaming the system” and cheating. While gaming the system involves taking advantage of flaws or loopholes in the system, cheating implies directly violating the established rules [6]. It is therefore essential not to simplistically label students who manifest these behaviors as “gamers” or “cheaters”. They may have complex motivations that lead them to act in an unproductive or undesirable way [6].

To mitigate this type of behavior, many authors have proposed simply redesigning the tutor to eliminate the behavior [8]. However, if the behavior is symptomatic and involves other aspects and unknown motivations, such a solution could mask the problem rather than eliminate it. Therefore, when designing systems that meet the objectives, attitudes, and behaviors of students in order to positively impact learning, it is necessary to investigate the motivations and characterize the profile or profiles of students who manifest the behavior of “gaming the system”. In this sense, some authors have carried out studies seeking to understand the motivations and beliefs of students who manifest this behavior.

In a study conducted by [27], the authors investigated “gaming the system” behavior through a 32-question survey. They found that students who manifested this behavior had distinct characteristics: they were less likely to own a computer at home, had lower confidence in their mathematical abilities, frequently avoided doing math homework, showed fewer problems concentrating on the computer, expressed frustration when faced with difficult tasks, preferred learning with a computer over other ways, sought to solve problems quickly, had less interest in math classes, preferred facts and data to concepts and ideas, perceived hints as more useful in understanding similar problems, prioritized getting through as many items as possible and had the goal of learning new things. These results highlight the association between “gaming the system” behavior and various student attitudes and beliefs about learning.

In previous studies, it has been observed that students may adopt “gaming the system” behavior due to the belief that they cannot succeed otherwise [4, 6]. In addition, a relationship has been observed between this behavior and performance goals pursued to the detriment of learning goals [6, 16]. However, some motivations, like help-seeking behavior itself, may not transfer from one country to another. Some research has shown that models of effective help-seeking do not apply uniformly across different countries, such as Costa Rica, the Philippines, and the USA [18]. This finding is in line with the results of previous studies in various educational contexts, highlighting the influence of racial and gender interactions on help-seeking behaviors, which in turn can influence subsequent performance patterns [22]. In addition, several researchers have investigated how demographic variables can influence help-seeking behavior [14].

On the other hand, the authors of [28] studied students in a flipped CS1 course, examining survey data to identify factors that contribute to novice programmers’ engagement in three inappropriate behaviors, two of which are related to “gaming the system”. The authors found that a constructive mindset is correlated with less “gaming the system” behavior, while students who perceive the computing course as fun have an increased risk of engaging in such behavior. In addition, discomfort with speaking and listening to English was also associated with a greater propensity to game the system [28]. These findings highlight the importance of considering not only academic aspects but also psychological and socio-emotional factors in understanding “gaming the system” behavior. However, that study found no correlation between gaming attitudes and students’ prior knowledge, contradicting previous reports in the literature [6].

2.1 “Gaming the System” Behavior Detectors

One of the most recent works on variable discovery for detection proposes a latent variable model for detecting “gaming the system” behavior based on item response theory (IRT-GD) in the algebra domain [13]. The model estimates a latent “gaming” tendency for each student, taking contextual factors into account, such as how frequently students game in less common gaming contexts compared to more common ones, as observed from students’ action patterns and the predictability of human labels. The IRT-GD builds on a previously validated knowledge engineering gaming detector (KE-GD), which focuses on students’ action characteristics and the predictability of human labels. Although the study does not explicitly list the variables used to implement the detector, the authors state that they used a previously validated knowledge engineering gaming detector containing 13 interpretable patterns that model the behavior [20, 21].

3 Research Methods

In this section, we describe the materials and methods used to conduct the research presented in this article. The construction of our “gaming the system” behavior detection model follows a hybrid approach, combining knowledge engineering and machine learning algorithms. To this end, we detail the experiments carried out to generate our database, followed by the elaboration of the model based on knowledge engineering, and, finally, we explain the method used to develop the model based on machine learning.

3.1 Experiment

The experiment carried out in this research took place in the first semester of 2023 and involved students from two second-year technical high school classes located in the interior of Northeast Brazil. These students were new to programming and had recently completed an introductory course on the subject. They were currently enrolled in a second course, focused on Java programming (CS2). CS2 students were chosen because they were new to programming and had some familiarity with basic concepts. The study was carried out at this time, as the students were about to start learning object-oriented programming in the CS2 course. The experiment was carried out in four sessions, with two classes each, totaling 67 participating students, 29 from the afternoon class and 38 from the morning class, who participated in all problem-solving sessions. Of these participants, 35 were female and 32 were male, aged between 16 and 21 years. Of these, 61% were white and 39% were of African descent.

Materials. The ADA environment was created as a learning tool for introductory programming courses, following a conceptual structure composed of elements such as classes, sessions, problems, alternatives, attempts, hints, code, and subjects. Each lesson is linked to students and problem-solving sessions. Each problem includes four hints and a set of alternatives, and students can make multiple attempts to solve it. Features like the problem solver allow students to submit partial or complete solutions, while hints are offered at different levels, from abstract suggestions to direct answers, to guide students through each stage of the problem. Students have the freedom to request hints at any time while solving the problem.
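The conceptual structure described above can be sketched as a minimal data model. The class and field names below are illustrative assumptions for exposition, not ADA’s actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Hint:
    level: int   # 1 = abstract suggestion ... 4 = direct answer (assumed scale)
    text: str

@dataclass
class Problem:
    statement: str
    alternatives: list[str]                           # multiple-choice options
    hints: list[Hint] = field(default_factory=list)   # four hints per problem

@dataclass
class Attempt:
    student_id: str
    problem_id: str
    kind: str        # "partial" (chosen alternative) or "full" (submitted code)
    payload: str     # the alternative chosen or the code submitted
    timestamp: float

# A hypothetical problem with its four hint levels
problem = Problem(
    statement="Print the numbers 1 to 10.",
    alternatives=["for", "while", "if", "switch"],
    hints=[Hint(level, f"hint at level {level}") for level in range(1, 5)],
)
```

Modeling attempts as timestamped events is what later makes it possible to measure the pauses between actions that the detector relies on.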

Procedure. The problem-solving sessions were held in person in two computer labs, with one student per computer. Each lab was organized into five vertical rows of eight computers. In addition to the students, three researchers were present in each lab. The experiment took place during computer programming classes (sessions) in three stages: pre-test, intervention, and post-test. In the pre-test, the first session was subdivided into 30 min for presentation and 70 min for problem-solving. During this period, the objectives of the study were explained to the students, consent forms for participation in the experiment were distributed following the guidelines of the ethics committee, and a comprehensive explanation of the environment’s functionalities was provided. In addition, students used ADA to register and answer the socioeconomic questionnaire.

After the introductory session, participants carried out two more problem-solving sessions (100 min) using the ADA environment, observed by three researchers. Through these observations, it was possible to record occurrences of “gaming the system” behavior in a manner similar to some previous works [3, 5, 21]. Field observations were carried out such that each observer watched each student’s behavior several times during each class period, coding the frequency of “gaming the system” behavior and its nature.

Table 1. Gaming the System Attitudes

3.2 Model Development

Based on the findings of the previous model [21], a new model for coding “gaming the system” behavior was designed and adapted to the type of task provided by the ADA environment. In this model, the initial approach is derived from the previous proposal, which groups the student’s actions during interaction with the virtual learning environment into sequences called text replays, with up to five actions each. Based on these text replays, the authors developed patterns for sequences of student actions characterized as “gaming the system”. Thus, if a pattern is identified in the analysis of the dataset, all actions involved are labeled as gaming. To develop the model, the authors initially created a list of coders’ interpretations of student behavior, whether related to gaming behavior or not, and these interpretations were used to create the patterns (Fig. 1).
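The grouping of actions into text replays of up to five actions can be sketched as follows. The fixed, non-overlapping segmentation and the action names are simplifying assumptions, since the exact windowing policy is not specified here:

```python
def to_text_replays(actions, max_len=5):
    """Split a chronological list of student actions into text replays
    of up to `max_len` consecutive actions each (assumed non-overlapping)."""
    return [actions[i:i + max_len] for i in range(0, len(actions), max_len)]

# Hypothetical action log for one student in one session
log = ["hint", "hint", "attempt", "hint", "attempt", "hint", "attempt"]
replays = to_text_replays(log)
# yields one replay of five actions followed by one of two
```

If a pattern matches a replay, every action inside it would then be labeled as gaming, as described above.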

Fig. 1. Confusion Matrix: Decision Tree, Neural Network (MLP) and K-Nearest Neighbors (KNN)

The development of our knowledge-based model was conducted in four stages: analysis of the previous model, study of attitudes, investigation of specialized coding, and elaboration of the new model. In the first stage, the 13 patterns presented in the previous model [21] were categorized into two groups: compatible and incompatible with the ADA environment. Patterns considered compatible were kept, while those considered incompatible were removed or adjusted. Next, we analyzed the coverage of the five attitudes related to the manifestation of “gaming the system” behavior, presented in Table 1, and observed that attitudes 4 and 5 were not addressed by the patterns of the previous model. Therefore, similar to a previous study [21], we chose to elicit experts’ knowledge through an active participation approach [11].

Applying the active participation approach [11], knowledge elicitation sessions were conducted with experts. In the first session, the elicitor observed and took separate notes for each expert as they coded and thought aloud [26] about gaming behavior using text replays. In the second session, the elicitor coded some excerpts while thinking aloud, allowing each expert to comment on and correct the elicitor’s reasoning. This method gave the elicitor a deeper understanding of the process, resulting in the first version of the rules. Next, a new knowledge elicitation session was conducted, following each expert as they coded and expressed their thoughts aloud. Subsequently, the experts held an evaluation session of the rules, we made adjustments as necessary, and we presented the final version in a closing session. It is worth noting that the preliminary model was substantially influenced by previous work [21]. The final version of the model is detailed and discussed in the next section.

After developing our knowledge-based model, we moved on to developing the machine learning-based model. Using the knowledge-based model, we annotated the data generated during the experiment. The process then involved three stages: pre-processing, classification, and evaluation. In pre-processing, we performed two crucial tasks: data cleaning and data transformation. In data cleaning, we removed missing data to avoid inconsistencies and errors during analysis and modeling. In data transformation, we applied normalization and standardization: normalization adjusts attribute values to a common scale, generally between 0 and 1, allowing each attribute to contribute equally to the final result, while standardization adjusts the data to a mean of zero and a standard deviation of one, which is useful for comparing deviations from the mean.
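The two transformations described above can be sketched in a few lines of NumPy (scikit-learn’s `MinMaxScaler` and `StandardScaler` implement the same operations); the input matrix is a toy stand-in for our feature data:

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each column to the [0, 1] range."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)

def standardize(X):
    """Rescale each column to zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 600.0]])
Xn = min_max_normalize(X)   # every column now spans [0, 1]
Xs = standardize(X)         # every column now has mean 0 and std 1
```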

In the classification step, we used 15-fold cross-validation. This method divides the data into 15 parts and, in each iteration, trains the model on 14 parts and tests it on the remaining part. By repeating this process 15 times, we obtain a more reliable estimate of the algorithm’s generalization performance [9]. We then applied three algorithms: decision tree, k-nearest neighbors (KNN), and a neural network using the Multi-Layer Perceptron (MLP) [10]. This selection was based on previous studies [3]. The implementation used the scikit-learn library, configuring the decision tree with the CART algorithm, the KNN with 3 neighbors, and the neural network with MLP. In the evaluation stage, we analyzed the performance of each algorithm on the training and testing subsets using six metrics: area under the ROC curve (AUC), accuracy, precision, recall, F1, and Kappa, with a significance level of \(p < 0.05\). AUC evaluates the model’s ability to make accurate classifications; precision, recall, and F1 provide a complete view of performance; and Kappa checks whether the model determines the correct sequences for reasons not based on chance.
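The classification step can be sketched with scikit-learn as below. The synthetic dataset stands in for our labeled text replays, and any hyperparameters beyond those stated in the text (3 neighbors for KNN; scikit-learn trees implement CART) are illustrative defaults, not the study’s exact configuration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the labeled gaming / non-gaming dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 15-fold cross-validation, stratified to keep class balance per fold
cv = StratifiedKFold(n_splits=15, shuffle=True, random_state=0)

models = {
    "CART": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
}

# One AUC score per fold, per algorithm
scores = {name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
          for name, model in models.items()}
```

The per-fold scores produced here are also what a paired statistical test (such as the Wilcoxon test used later) would compare between algorithms.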

4 Results and Discussion

Table 2. Gaming the System Interpretations
Table 3. Gaming the System Patterns

4.1 Model Based on Knowledge Engineering

Similar to the model proposed in previous work [21], our cognitive study of identifying gaming behavior can be subdivided into two parts: interpretation of student actions and identification of gaming patterns. Although these two parts are normally performed simultaneously by the specialist, the cognitive model is capable of executing them separately without compromising the integrity of the process.

Interpreting Student Actions. According to the functionalities of the ADA environment, detailed in Section 3.1, students can request help at various levels and submit their solutions to a problem in two stages: a partial solution and a full solution. The partial solution involves a multiple-choice activity, while the full solution is presented as code. To create text replays, the specialist begins by observing the pauses between each of the student’s actions in the environment, building a probable interpretation of the student’s mental process. In Table 2, we offer a list and description of the elements of student behavior identified during the knowledge elicitation process with the expert, in line with the findings of previous work [21].

The length of pauses before asking for help can indicate whether the student took time to think about the problem before seeking assistance. Likewise, the length of the pause after receiving help indicates whether the student took time to read the help message, whether they were scanning it for specific information, or whether they were expressing the Att2 or Att3 attitudes in Table 1. The pauses before trying each alternative may indicate whether the student took time to reflect before choosing another option, or whether they were expressing the Att1 attitude from Table 1.

When a partial submission is considered incorrect, the presence of a pause after this action suggests that the student is reflecting on the error made. Given that this type of action often involves expected errors, a prolonged pause before the mistake usually indicates a sincere, albeit unsuccessful, attempt to resolve the problem. In contrast, a short pause often indicates that students are trying to guess the answer by submitting several partial solutions to the problem.

Additionally, when a student makes several mistakes by submitting incorrect alternatives, they may exit the problem, quickly select another problem, and then return to the original problem to submit the correct answer, which may manifest the Att4 attitude. As for the full solution, if the student makes an empty submission or presents useless code for the proposed problem, it is an indication that they may be expressing the Att5 attitude.

Identifying “Gaming the System” Patterns. When identifying potential student actions during interaction with the environment, whether related to the manifestation of “gaming the system” attitudes or not, it is crucial to reference Table 2. That table describes interpretations of the student’s individual actions, but it is essential to understand that analyzing each action in isolation is not enough to determine whether the student is gaming the system. Therefore, it is necessary to find patterns of actions that serve as robust evidence of such behavior. In Table 3, we present the patterns identified during the knowledge discovery process, in collaboration with experts in the field.

In the first column of Table 3, we assign a tag to each identified pattern. In the second column, we describe the rules derived from the interpretations of Table 2. Finally, in the third column, we relate each rule to a possible student attitude during interaction with the environment. Each pattern can be represented by Eq. 1, where a pattern is composed of a set of up to five text replays (r) associated with up to five types of “gaming the system” attitudes (a).

$$\begin{aligned} P = \left (\bigcup _{j \le 5} r_j,\ a_{1..5}\right ) \end{aligned}$$
(1)

In conclusion, our knowledge elicitation process encompassed literature review, dataset analysis, and expert observations. As a result, we identified eight essential patterns for encoding a clip as a gaming instance in the programming learning system, as presented in Table 3; these derive from the interpretations of Table 2, our database, and interactions with experts. Notably, some clips were linked to more than one pattern.

Table 4. Results of the algorithm comparison

4.2 Machine Learning Based Model

After the development of the knowledge-based model was completed, we proceeded with the development of the machine-learning model. To do this, we used the patterns presented in Table 3 to label our dataset in a two-step process, led by two expert coders. In the first stage, the coders performed the task independently, yielding substantial but imperfect inter-rater reliability, with a Cohen’s Kappa coefficient of 0.72 for distinguishing between the classes class0 and class1. In a second round of labeling, the coders identified instances of disagreement, participated in discussions to align their interpretations of the behaviors, and subsequently conducted supplemental labeling. This collaborative effort resulted in a notably higher Cohen’s Kappa coefficient of 0.93 for distinguishing between gaming and non-gaming behaviors in the second set of text replays. As a result, we obtained our dataset divided according to Table 5.
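Inter-rater agreement of this kind can be computed with scikit-learn’s `cohen_kappa_score`; the two label vectors below are hypothetical coder outputs for illustration, not our actual annotations:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two independent coders (1 = gaming, 0 = not)
coder_a = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
coder_b = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]

kappa = cohen_kappa_score(coder_a, coder_b)
# 8/10 raw agreement with 0.5 chance agreement gives kappa = 0.6
```

Unlike raw percent agreement, Kappa discounts the agreement expected by chance, which is why it is the standard reliability measure for this kind of coding task.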

Using the labeled data, we applied three machine learning algorithms (CART, KNN, and MLP) and evaluated their performance through confusion matrix analysis, along with the previously mentioned metrics: accuracy (ACU), precision (PRE), recall (REC), F1 score (FMe), Kappa coefficient (KAP), and area under the ROC curve (ROC). Additionally, we conducted a statistical analysis to assess the significance of each algorithm’s performance. This procedure provides a deeper understanding of the effectiveness of each algorithm in detecting the behavior, considering the combination of experimental data.

Table 5. Dataset Distribution by Class

Table 4 shows the results of the comparison of algorithms, considering various performance metrics during the training and testing phases. The algorithms compared are KNN, CART, and MLP. During the training phase, it can be seen that the CART algorithm obtained perfect results (value of 1.00) for all the metrics evaluated, indicating excellent and consistent performance. On the other hand, KNN and MLP showed slightly lower values for all metrics, with MLP showing the lowest values compared to the other two algorithms. In the test phase, the results were different. KNN showed a significant drop in all metrics compared to the training phase, suggesting a possible overfitting tendency during training. CART, although it showed a slight decrease in some metrics, maintained a generally high performance close to the training results. MLP also saw a drop in its performance metrics but managed to maintain higher values compared to KNN. This analysis suggests that the CART algorithm may have a superior generalization capacity compared to the other two algorithms, maintaining consistent performance during both training and testing.

We performed the Wilcoxon test to compare the algorithms on the classification task, with a significance level of p = 0.05. Several performance metrics were compared between the CART, KNN, and MLP algorithms during the training stage. For all metrics evaluated (ACU, PRE, REC, FMe, KAP, and ROC), the p-value was consistently 0.00006 when comparing CART to KNN and CART to MLP, indicating a statistically significant difference. In the testing stage, the results show variation in p-values across combinations of algorithms and metrics. The comparison between CART and KNN resulted in p = 0.00006 for all metrics, indicating a significant difference. Comparing MLP with KNN, p-values ranged from 0.0067 to 0.0102, depending on the metric, indicating that this comparison was also statistically significant.
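A paired, non-parametric comparison of per-fold scores can be sketched with SciPy’s `wilcoxon`; the fold scores below are hypothetical stand-ins for the per-fold AUC values produced by cross-validation. As a sanity check on the recurring 0.00006: with 15 folds and every paired difference favoring the same algorithm, the exact two-sided signed-rank test bottoms out at \(2/2^{15} \approx 0.000061\), consistent with the value reported above.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
cart = 0.95 + 0.02 * rng.random(15)   # hypothetical per-fold scores
knn  = 0.80 + 0.02 * rng.random(15)

# Paired two-sided Wilcoxon signed-rank test over the 15 folds
stat, p = wilcoxon(cart, knn, alternative="two-sided")
# every difference is positive, so p hits the exact-test floor 2 / 2**15
```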

Analyzing the data from the confusion matrix of the test data, for the Decision Tree, 218 true negatives, 22 false negatives, 31 false positives, and 140 true positives were recorded. For KNN, there were 197 true negatives, 72 false negatives, 52 false positives, and 90 true positives. For the Neural Network, only 196 true negatives, 68 false negatives, 53 false positives, and 94 true positives were identified. These results provide a detailed overview of each algorithm’s performance in classifying data.
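For reference, the standard metrics can be derived directly from such confusion-matrix counts; the sketch below applies the textbook formulas to the Decision Tree counts reported above:

```python
def metrics_from_confusion(tn, fn, fp, tp):
    """Derive accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Decision Tree counts from the test data: 218 TN, 22 FN, 31 FP, 140 TP
acc, pre, rec, f1 = metrics_from_confusion(tn=218, fn=22, fp=31, tp=140)
# acc ≈ 0.871, pre ≈ 0.819, rec ≈ 0.864
```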

Overall, although the Neural Network showed a high number of true positives, this came at the cost of a large number of false positives, indicating a lack of accuracy in its predictions. Meanwhile, the Decision Tree and KNN demonstrated a more balanced ability to correctly classify instances, with the Decision Tree performing slightly better at identifying true negatives and Neural Network achieving mixed results. Ultimately, the choice of the most appropriate algorithm will depend on the specific characteristics of the data set and the objectives of the application. Therefore, considering all the analyses, the CART decision tree algorithm performed best.

Comparing the results of this research with previous work: in [21], the authors state that the full model, applied to the training set, accurately detected 340 (64.03%) gaming clips and misdiagnosed 551 (7.07%) non-gaming clips, obtaining a Kappa coefficient of 0.430. On the test set, the model correctly detected 93 (52.54%) gaming clips while misdiagnosing 210 (8.67%) non-gaming clips, resulting in a Kappa coefficient of 0.330; this performance was comparable to the Baker [3] model. In this research, the best algorithm evaluated was the CART decision tree, achieving between 76% and 100% in all metrics evaluated. This algorithm identified 45.99% true positives, 41.82% true negatives, 6.38% false positives, and 5.78% false negatives. Furthermore, our classifier performed better than that of previous work [21].

5 Conclusion

In this study, we investigated “gaming the system” behavior among novice programmers in a public rural school in Northeast Brazil, employing an experimental approach and developing a detector using a hybrid approach. We used cognitive task analysis to understand how experts encode sequences of actions in a learning environment, distinguishing “gaming the system” from other behavior, and implemented a cognitive model of the process of encoding these behaviors. The identified patterns revealed rules for five “gaming the system” attitudes, including two new ones not present in previous models. However, our machine learning models showed mixed results: the Neural Network had a high number of true positives but at the cost of a high number of false positives, while the Decision Tree and KNN demonstrated a more balanced classification ability. Overall, the CART decision tree algorithm performed best. Although our current model was built with data from a specific introductory programming context, it represents a significant contribution to understanding and detecting this behavior, given the difficulty of generalizing “gaming the system” detectors reported in the literature.

As a future perspective, we plan to conduct a comparative analysis of data from public school students in rural and urban environments, looking for similar socioeconomic and demographic contexts. By replicating our study in different settings, we hope to better understand the observed behaviors and identify possible variations or distinct patterns across these contexts. This will allow us to explore how factors such as geographic location, access to educational resources, and demographic characteristics can influence the manifestation of “gaming the system” behavior. Furthermore, by testing and evaluating our detector with additional datasets from different environments, we will be able to verify its effectiveness and generalizability across a variety of educational contexts. This broad, comparative approach will help us improve our understanding of “gaming the system” behavior and develop more effective strategies for detecting and addressing it in different educational settings.