1 Introduction

The National Examination for Admission to Graduate Programs in Computing (POSCOMP) is an annual exam conducted in Brazil. Its purpose is to assess the knowledge of computer science candidates who wish to enroll in graduate programs offered in the country [19]. Institutions offering these programs use the participants’ scores as a selection criterion, either as a requirement or as part of the admission process for master’s or doctoral programs [19].

Students can be evaluated through various tests administered in Brazil, such as POSCOMP, the National Student Performance Exam (ENADE), and the Basic Education Evaluation System (SAEB), among others. Additionally, several studies have analyzed exam data using data mining and machine learning techniques [5, 9, 22]. Some of these studies, for instance, analyze the performance of students in computer science courses using exploratory data analysis and clustering techniques. Furthermore, the study in [17] uses machine learning techniques to predict ENADE grades for Computer Science courses in Brazil. These studies highlight the importance of analyzing, identifying, and exploring national exam data.

Educational data analysis has become a crucial tool in developing methodological strategies to improve teaching and learning [11]. Using computational tools in this analysis offers significant benefits for teaching methodologies and processes. It also allows administrators and coordinators to evaluate student performance across various topics and specialties within computer science courses. Consequently, these tools assist in developing strategies and improving areas where students face difficulties [10, 23].

In this context, POSCOMP microdata provides an opportunity to extract useful and non-trivial information, particularly regarding the specialties desired by the participants [3]. POSCOMP includes information about the specialty that candidates wish to pursue in their master’s or doctoral studies. Identifying these specialties is essential for graduate programs in Brazil, enabling them to develop research projects aligned with the knowledge of both students and advisors, thereby ensuring research quality and academic progress [1].

The creation of an automated and intelligent system is crucial in assisting participants in choosing specialties for research development [12]. As defined in [16], decision-based agents seek a sequence of actions to achieve their goals. Therefore, using intelligent tools to guide specialty choices is vital for both institutions and students, contributing to the progress of research during their academic journey [6].

This study aims to develop an automated and intelligent system for predicting the specialties of participants based on their characteristics and the scores obtained in the POSCOMP exam. This tool aids in classification using machine learning techniques, helping participants make informed decisions about which specialty or research area to pursue in their master’s or doctoral studies. Ultimately, this can lead to improved academic performance and the quality of research produced at the institution.

2 Related Work

Studies in the field of artificial intelligence in education employ machine learning and data mining techniques to enhance educational processes. These techniques are essential for knowledge discovery from exam data, such as POSCOMP and ENADE [15]. The application of AI techniques in education, like educational data mining, allows for the analysis and evaluation of student performance to identify profiles, determine which subjects pose difficulties, and predict student dropout rates [4, 5, 9]. These analyses are performed using publicly available data or data from private institutions.

In searching for studies related to this research, several works were found that address student performance analysis in educational institutions, as well as the use of data mining and machine learning techniques for classification and prediction in education. These studies contribute significantly to education, particularly in identifying student performance and profiles and addressing dropout rates in higher education courses.

The authors in [13] surveyed 1,239 participants, including professors, researchers, and professionals in the field of computer science, to assess the relevance of the content covered by POSCOMP. The quantitative results indicated that certain exam contents have varying degrees of relevance, providing insights for improving educational programs and preparation strategies for the exam based on the most valued areas of knowledge.

The study conducted in [14] presents the web platform POSCOMP Coach, developed to assist candidates in preparing for the graduate entrance exam in computer science. The platform offers a database of 1,120 distinct questions from POSCOMP exams conducted between 2002 and 2017, allowing for timed practice tests with automatic correction. The study also includes an evaluation of the platform, involving a comparative analysis with similar solutions and usability assessment. The results highlight positive aspects of the proposed solution, including the number of available questions and high usability scores from students nationwide.

The research in [21] presents a process for evaluating undergraduate courses by analyzing ENADE questions from 2008 to 2017 and POSCOMP questions from 2014 to 2018, aiming to determine which subjects are required in the questions. The results demonstrated the need for a comprehensive curriculum in computing fundamentals to provide a solid foundation for students.

The research by [18] developed a mobile application using Android to assist candidates in preparing for the POSCOMP exam by the SBC. The app includes a collection of exam questions organized by topic, each with detailed solutions, enabling users to study autonomously. The use of mobile computing to support education has proven attractive, allowing users to access study materials and knowledge anytime, anywhere.

The authors in [3] conducted a comparative analysis of the 2014 to 2019 POSCOMP editions to evaluate computer science graduates and the Reference Curriculum (CR) of the SBC, approved in 2016. This comparison identified that approximately 60% of the CR content was not covered in POSCOMP exams. Only 14 topics showed significant and consistent presence across the exam editions, with significant differences in the number of questions related to the curriculum axes. This comparative analysis highlights the discrepancy between the CR content and the coverage in POSCOMP exams over the years, suggesting the need to better align the exam content with the proposed curriculum.

The authors in [8] evaluated student performance to improve courses using educational data mining techniques. They employed methods to extract useful information from student performance data and to predict future outcomes using machine learning techniques, specifically decision trees. The Knowledge Discovery in Databases (KDD) process was used for data preparation, knowledge extraction, and the identification of useful patterns in educational datasets.

The research by [2] applied educational data mining approaches to model student performance using classification models such as Naïve Bayes, Decision Tree, and Deep Learning. They reported that the Deep Learning classifier outperformed others, with an overall prediction accuracy of 95%. This work was conducted using data from an information technology bachelor’s course over six semesters.

The authors in [20] reported that dropout rates are higher in computer science courses. To address this issue, they applied data mining techniques, which are gaining traction in education. They used decision tree techniques and employed three execution options: Use Training Set, Supplied Test Set, and Cross-validation. They concluded that the Use Training Set option provided the best classification accuracy, but noted that it was not realistic for training. The J4.8 algorithm and Cross-validation showed good classification accuracy.

In [1], data mining techniques were used to analyze and predict student academic performance to propose interventions for improvement. The authors aimed to calculate the academic performance of undergraduate students using data mining techniques, specifically classification algorithms, on records of 800 Computer Science students. They used four feature selection methods (genetic algorithms, gain ratio, relief, and information gain) and five classification algorithms (K-Nearest Neighbor, Naïve Bayes, Bagging, Random Forest, and J48 Decision Tree). The experimental results showed that genetic algorithms provided the best accuracy, 91.37%, with the KNN classifier.

Among the various performance analysis approaches, many studies focus on identifying student profiles and predicting dropout rates. However, there is a need for intelligent decision-making tools that identify participant specialties. This study proposes developing a vocational guidance system using machine learning algorithms for specialty classification. The goal is to assist POSCOMP participants in choosing specialties for master’s or doctoral programs based on their exam scores. This work addresses gaps left by previous studies and opens new possibilities for vocational guidance in the field of computer science.

3 Methodology

This study aims to develop an intelligent tool to assist students, professors, and educational institutions, primarily in identifying specialties for conducting postgraduate research. For this purpose, data from participants who took the POSCOMP exam between 2016 and 2019 were used. Data from 2020, 2021, and 2022 were not collected due to the pandemic. Thus, the development and implementation of the vocational guidance system employed supervised machine learning techniques and algorithms to build the specialty classifier.

Data were obtained through an official request to the Brazilian Computer Society (SBC). These data pertain to candidates who graduated from computer science-related courses [22]. A total of 14,575 records with 35 attributes per participant were identified. The KDD process was used to carry out the knowledge discovery stages and implement the data analysis. As defined by [7], KDD is a non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

3.1 Preprocessing

During attribute selection, participant characteristics and exam scores were considered. Among the 35 attributes provided by the SBC, age, gender, state, region, specialties, and participant scores were selected. These attributes were used by the machine learning algorithms so that the classifier could identify specialties based on participant characteristics and knowledge.

With the selected attributes, preprocessing was performed to identify duplicate and inappropriate information for the model application. For instance, candidates who took the exam only for self-evaluation, without intending to enter postgraduate programs, were removed. Similarly, participants residing in Peru were also removed. Participants who registered incorrectly were excluded since no information was found for replacing and correcting this data. Additionally, participants with an average score below 17 points in the final grade were removed based on descriptive analysis and normal distribution.

In the POSCOMP data, there is a specialties attribute containing the areas of concentration and research lines that participants choose for their master’s and doctoral studies. When registering, candidates can select more than one specialty as an alternative for possible postgraduate studies. Procedures were implemented to identify candidates who chose more than one option, and only the specialty listed as the first option by each participant was retained. Because some specialties are highly similar across postgraduate programs, such as artificial intelligence and computational intelligence, or software engineering and software quality, these were merged.
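The removal and merging steps above can be sketched with pandas. The column names and toy records below are assumptions for illustration; the actual SBC microdata schema has 35 attributes and is not reproduced here.

```python
import pandas as pd

# Toy records with hypothetical column names standing in for the microdata.
df = pd.DataFrame({
    "country": ["Brazil", "Brazil", "Peru", "Brazil"],
    "self_evaluation_only": [False, True, False, False],
    "final_grade": [45.0, 30.0, 50.0, 25.0],
    "specialty": [
        "Artificial Intelligence; Software Engineering",
        "Computer Networks",
        "Databases",
        "Computational Intelligence",
    ],
})

# Remove self-evaluation candidates and participants residing outside Brazil.
df = df[(~df["self_evaluation_only"]) & (df["country"] == "Brazil")].copy()

# Remove participants below the 17-point cutoff in the final grade.
df = df[df["final_grade"] >= 17].copy()

# Keep only the first specialty listed (the participant's first option).
df["specialty"] = df["specialty"].str.split(";").str[0].str.strip()

# Merge near-duplicate specialties across programs.
merge_map = {
    "Computational Intelligence": "Artificial Intelligence",
    "Software Quality": "Software Engineering",
}
df["specialty"] = df["specialty"].replace(merge_map)
print(df[["country", "final_grade", "specialty"]])
```

After these filters, only valid Brazilian participants above the cutoff remain, each with a single, normalized first-option specialty.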

Table 1. POSCOMP areas and respective topics.

Each exam question addresses one of the areas of computer science covered by the exam: mathematics, computing fundamentals, and computing technology. Mathematics covers topics such as differential and integral calculus and linear algebra, among others. The computing fundamentals area includes topics such as algorithm analysis and data structures, among others. Computing technology encompasses topics such as databases, compilers, computer graphics, and artificial intelligence. Table 1 shows the other topics covered in the exam.

Two new attributes were created from the specialties attribute based on those most requested by participants. With more than 400 specialties identified in the POSCOMP data, a grouping procedure produced two new attributes: research line and concentration area. In the research line, specialties with the same research purpose were grouped; for example, software engineering encompasses computation methodologies and techniques as well as computational modeling. Similarly, specialties such as theoretical computing and formal methods, among others, were grouped alongside areas such as artificial intelligence, computing systems, and optimization. Figure 1 details the grouping process.
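A minimal sketch of that grouping, assuming a hypothetical label-to-group mapping; the study’s real mapping covers more than 400 specialties, so the entries below are illustrative only.

```python
# Illustrative mapping from raw specialty labels to the research line;
# the real mapping in the study covers 400+ labels.
RESEARCH_LINE = {
    "computation methodologies and techniques": "software engineering",
    "computational modeling": "software engineering",
    "formal methods": "theoretical computing",
    "machine learning": "artificial intelligence",
}

# Illustrative mapping from research line to concentration area.
CONCENTRATION_AREA = {
    "software engineering": "information",
    "artificial intelligence": "computing",
    "theoretical computing": "computing",
}

def derive_attributes(specialty):
    """Map a raw specialty label to (research line, concentration area)."""
    line = RESEARCH_LINE.get(specialty.lower(), specialty.lower())
    area = CONCENTRATION_AREA.get(line, "other")
    return line, area

print(derive_attributes("Computational Modeling"))
# -> ('software engineering', 'information')
```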

Fig. 1. Creation of new attributes based on specialties.

3.2 Vocational Guidance System Proposal

This work proposes a vocational guidance system for computer science students and participants, enabling the identification of specialties according to their knowledge and skills. The tool provides research options based on knowledge and characteristics extracted from the POSCOMP microdata, so that research during master’s and doctoral studies can benefit, rather than hinder, the student. The system classifies the specialty based on knowledge acquired over the years, whether as a student or as a professional.

For the system implementation, three classification models were used. Because the data proved to be complex and non-linear, the three models were chained to make the final prediction more effective. Figure 2 presents the vocational guidance system’s execution process.

Fig. 2. Vocational system application model.

Initially, new data, including participant characteristics and scores, are inserted into the system. These data go through a transformation process before being passed to Model 1, which acts as an initial classifier and determines whether the data belong to the computing or the information category. Depending on the result of this classification, the data are directed to Model 2 or Model 3.

If the data is classified as Computing, it is directed to Model 2. This model predicts whether the participant’s specialty is artificial intelligence or computing systems. Conversely, if the data is classified as Information, it is directed to Model 3. This model predicts whether the participant’s specialty is Software Engineering or Computer Networks. The prediction result is then presented by the system, providing an indication of the participant’s specialty based on the input data. This process allows for the classification of specialties, guiding their studies in postgraduate programs in Brazil.
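The cascade described above can be sketched in plain Python, assuming each model exposes a scikit-learn-style `predict` method; the stand-in models below are for illustration only, not the trained classifiers from the study.

```python
# Minimal sketch of the three-model cascade (names are illustrative).
def classify_specialty(model1, model2, model3, features):
    """Route a participant's feature vector through the cascade."""
    branch = model1.predict([features])[0]  # "computing" or "information"
    if branch == "computing":
        # Model 2 separates artificial intelligence from computing systems.
        return model2.predict([features])[0]
    # Model 3 separates software engineering from computer networks.
    return model3.predict([features])[0]

# Stand-in models for demonstration; any estimator with predict() works.
class ConstantModel:
    def __init__(self, label):
        self.label = label
    def predict(self, X):
        return [self.label for _ in X]

m1 = ConstantModel("computing")
m2 = ConstantModel("artificial intelligence")
m3 = ConstantModel("computer networks")
print(classify_specialty(m1, m2, m3, [0.7, 0.5, 0.9]))
# -> artificial intelligence
```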

In the model implementation, the aim was to achieve satisfactory performance for each algorithm used: decision trees, random forests, SVM, and neural networks. For each algorithm, the dataset was divided into 70% for training and 30% for testing, which allowed the best hyperparameters to be identified. Subsequently, 10-fold cross-validation was used to evaluate each model’s performance.
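This evaluation protocol (70/30 split followed by 10-fold cross-validation) might look as follows in scikit-learn, with synthetic data standing in for the POSCOMP features (age, gender, state, region, and scores).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic binary-classification data standing in for the real features.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# 70% training / 30% testing split, as used for hyperparameter search.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.3f}")

# 10-fold cross-validation to evaluate the tuned model.
scores = cross_val_score(clf, X, y, cv=10)
print(f"10-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```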

To obtain good model performance, hyperparameter tuning was conducted. This adjustment was crucial because hyperparameters directly control the structure, function, and performance of the algorithms. The process involved experimenting with different values and parameters during training to find the best combination for each algorithm. Cross-validation was then performed to determine the best algorithm for Models 1, 2, and 3. The source code is available for review in an online repository (Footnote 1).
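One common way to implement such a search is scikit-learn’s `GridSearchCV`; the grid and data below are illustrative, not the values used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the POSCOMP features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Illustrative hyperparameter grid for a decision tree.
param_grid = {"max_depth": [3, 5, 10], "min_samples_leaf": [1, 5, 10]}

# Exhaustive search over the grid, scored by 10-fold CV accuracy.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=10, scoring="accuracy")
search.fit(X, y)
print("best params:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```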

4 Results and Discussion

In this section, we present the results obtained from exploratory data analysis and the application of machine learning algorithms.

Through exploratory data analysis, we identified the most sought-after research areas among candidates for master’s and doctoral programs. The data revealed that artificial intelligence, software engineering, computer systems, and information systems are the most popular fields. This interest is driven by market demand, particularly in artificial intelligence, which encompasses data science, data engineering, and data mining, among other areas [24].

Figure 3 shows the dominant specialties in each state based on the participants’ choices. In Alagoas, computer systems are most in demand; in Espírito Santo, information systems have the highest representation; in Mato Grosso do Sul and Pará, software engineering predominates; while in Minas Gerais and São Paulo, artificial intelligence and software engineering, respectively, are the most sought-after. This information is valuable for educational institutions’ planning and vocational guidance, helping to offer courses and specializations in each state according to student demand.

Fig. 3. Predominant specialties by state, classified by region in Brazil.

We categorized the specialties present in the participants’ data into four classes for algorithm application, as outlined in Sect. 3.1. The identified and applied classes for the classifiers were: Artificial Intelligence, Software Engineering, Computer Systems, and Computer Networks. Figure 4 shows the number of instances in each class.

Fig. 4. Sample distribution by specialties in the POSCOMP exam.

For specialty classification, the samples were divided into binary subsets to generate the predictions of each model. This ensures that each model is trained with a specific data set, supporting the effectiveness and accuracy of the predictions. The size of the data set used for each model reflects the variation in the number of participants involved in its construction and evaluation: Model 1 used 8,613 data points, Model 2 used 5,451, and Model 3 used 3,162.
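The derivation of the three binary data sets from the four specialty classes (Sect. 3.1) can be sketched as follows; the label strings are illustrative.

```python
# The four classes split into the two branches of the cascade.
COMPUTING = {"artificial intelligence", "computer systems"}
INFORMATION = {"software engineering", "computer networks"}

def model1_label(specialty):
    """Binary label for Model 1: computing vs. information."""
    return "computing" if specialty in COMPUTING else "information"

specialties = ["artificial intelligence", "computer networks",
               "software engineering", "computer systems"]

# Model 1 sees every sample; Models 2 and 3 only see their branch.
model1_data = [(s, model1_label(s)) for s in specialties]
model2_data = [s for s in specialties if s in COMPUTING]
model3_data = [s for s in specialties if s in INFORMATION]
print(len(model1_data), len(model2_data), len(model3_data))
# -> 4 2 2
```

This is why Model 1’s data set is the largest in the study, while Models 2 and 3 each train on only the samples routed to their branch.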

4.1 Model Analysis

The four classification algorithms applied to the POSCOMP data sets, based on participants’ characteristics and scores, are presented below. The best hyperparameters were selected to achieve optimal performance for Models 1, 2, and 3.

In Model 1, participants identified with “computing” or “information” were directed to Model 2 or Model 3, respectively. The best hyperparameters for each algorithm are shown in Table 2.

Table 2. The best hyperparameters of each algorithm with the training and testing accuracies for model 1.

After determining the hyperparameters, cross-validation was performed to evaluate the algorithms’ performance. The random forest algorithm yielded the best results, likely due to its ability to capture data complexity. Therefore, the random forest algorithm was chosen for implementing Model 1.

Model 2 aims to classify participants in the computing field, distinguishing between artificial intelligence and computer systems. The best hyperparameters for each algorithm for Model 2 are presented in Table 3. Following cross-validation, the random forest algorithm again outperformed others and was selected for Model 2.

Table 3. The best hyperparameters of each algorithm with the training and testing accuracies for model 2.

In Model 3, participants classified in the information field by Model 1 were directed here, where the final prediction focused on software engineering or computer networks. The best hyperparameters for each algorithm are shown in Table 4. Post-cross-validation, the random forest algorithm demonstrated superior performance and was selected for this model.

Table 4. The best hyperparameters of each algorithm with the training and testing accuracies for model 3.

To analyze the vocational guidance system’s performance, a test with 8,613 samples was conducted to calculate the models’ metrics for classifying POSCOMP participants’ data. A sample was counted as a success only if it was classified correctly at every stage of the three-model cascade; otherwise, it was counted as a failure. A confusion matrix was generated, and the machine learning metrics were calculated.

The system achieved an overall accuracy of 52.59%, suggesting that non-linearities in the data compromised classification. Participants retaking the exam with different specialty choices could also confuse the models, hindering correct classification.

The metrics evaluating the class performance showed that artificial intelligence and computer networks had accuracies of 67% and 86.66%, respectively, while software engineering and computer systems recorded 76.97% and 77.89%, respectively. This indicates variable performance among classes, with the classifier being more accurate in classifying computer networks, despite having fewer samples. Table 5 shows the accuracy of each class along with other relevant metrics.
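Per-class metrics of this kind can be computed from a confusion matrix, for example with scikit-learn; the toy labels and predictions below are illustrative and do not reproduce the study’s results.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy ground truth and predictions; labels abbreviate the four classes
# (AI = artificial intelligence, SE = software engineering,
#  CS = computer systems, CN = computer networks).
y_true = ["AI", "AI", "SE", "CS", "CN", "SE", "CS", "AI"]
y_pred = ["AI", "SE", "SE", "CS", "CN", "CN", "CS", "AI"]

labels = ["AI", "SE", "CS", "CN"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
# Precision, recall (sensitivity), and F1-score per class.
print(classification_report(y_true, y_pred, labels=labels))
```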

Table 5. Metrics by class of the guidance system.

However, the analysis of precision and sensitivity metrics revealed significant variations among the classes. The artificial intelligence class had a high precision of 81.86%, indicating many participants’ scores clustered in a specific feature space. For computer networks, precision was only 18.94%, suggesting classification difficulties, possibly due to the low sample size and score dispersion.

Sensitivity showed that computer systems had the highest value, indicating effectiveness in identifying true positives in this class. However, its low precision of 38.02% suggests that some participants may be incorrectly classified in this class.

The classifier’s specificity, measuring the ability to correctly identify negative classes, exceeded 80%. However, the variability in specificity among the classes indicates that performance differs depending on the class.

The F1-score, combining precision and sensitivity, also showed significant variations. Artificial intelligence and computer systems classes had relatively high scores, while software engineering and computer networks had lower scores, reflecting disparities in individual metrics. This could be due to the small sample size of some classes relative to others.

The vocational guidance system’s classifier results indicate some classes excel in the evaluated metrics. However, lower-than-expected performance in some specialties may be due to the similarity between areas and the disparity in class sizes in the data, suggesting the need for class balancing to improve the classifier’s predictive ability.

5 Conclusions

The aim of this study was to develop a vocational guidance system using POSCOMP data from 2016 to 2019 to assist in decision-making when choosing specialties that align with candidates’ knowledge and skills.

This work aims to help students and professionals select specialties for their master’s or doctoral programs. Additionally, it assists in selecting advisors for academic research. This study allows computer science students to avoid doubts and frustrations when deciding on an advisor. It also helps advisors understand their mentees’ skills and knowledge, using the tool to identify which specialties align with their research interests. Thus, this tool aids in the development of research projects, significantly reducing dropout rates and enhancing the quality of research by allowing studies aligned with participants’ preferences.

For future work, we suggest employing deep learning algorithms to handle the complexity of participants’ non-linear data. It is also recommended to use a unified model for the system, which could enable more effective classification.