The application of adaptive minimum match k-nearest neighbors to identify at-risk students in health professions education

Authors: Kumar, Anshul; DiJohnson, Taylor; Edwards, Roger; Walker, Lisa

Date: 2021-08-05

Purpose: When a learner fails to reach a milestone, educators often wonder if there had been any warning signs that could have allowed them to intervene sooner. Machine learning can predict which students are at risk of failing a high-stakes certification exam. If predictions can be made well in advance of the exam, then educators can meaningfully intervene before students take the exam to reduce the chances of a failing score.

Method: Using already-collected, first-year student assessment data from five cohorts in a Master of Physician Assistant Studies program, the authors implement an "adaptive minimum match" version of the k-nearest neighbors algorithm (AMMKNN), using changing numbers of neighbors to predict each student's future exam scores on the Physician Assistant National Certifying Examination (PANCE). Validation occurred in two ways: leave-one-out cross-validation (LOOCV) and evaluation of the predictions in a new cohort.

Results: AMMKNN achieved an accuracy of 93% in LOOCV. AMMKNN generates a predicted PANCE score for each student, one year before they are scheduled to take the exam. Students can then be classified into extra support, optional extra support, or no extra support groups. The educator then has one year to provide the appropriate customized support to each category of student.

Conclusions: Predictive analytics can identify at-risk students, so they can receive additional support or remediation when preparing for high-stakes certification exams. Educators can use the included methods and code to generate predicted test outcomes for students. The authors recommend that educators use this or similar predictive methods responsibly and transparently, as one of many tools used to support students.

Students in health professions programs must typically pass a certification exam before they can practice in their field. Obviously, it is the goal of a) students to pass the exam on their first attempt and b) educators to have tools to maximize student success on exams. We use predictive analytics and machine learning methods to develop one such tool and demonstrate how it can be applied within a health professions education program. The purpose of this tool is to predict (guess) each student's score on a high-stakes certification exam one year before the student actually takes the exam. These predicted scores-if they are accurate enough-can help identify students who might benefit from additional support or remediation from educators before the exam. We use machine learning to look for patterns in the data of previous cohorts of students, whose certification exam scores are known. We then use these established patterns to make predictions about the future certification exam scores of currently enrolled students. Our data come from five now-graduated cohorts of students in a Master of Physician Assistant (PA) Studies program. Such predictive analytic approaches are used in a variety of educational settings. 1 As Ekowo and Palmer 2 describe, these approaches can be used either to discriminate against or to uplift and support vulnerable students, depending on the intentions of the user.
Existing scholarship tends to a) identify at-risk students using standard available machine learning approaches, and b) focus more on model-building than on how to practically leverage predictive modeling when working with students in a health professions program. We build on this work by a) developing and applying a new type of "adaptive minimum match" k-nearest neighbors (AMMKNN) algorithm that uses different numbers of neighbors for each prediction, and b) presenting disaggregated confusion matrices to evaluate predictive results within the context of how they will be applied in practice. Additionally, we have made the AMMKNN algorithm available in an open source R package, called AdaptiveLearnalytics, 3 for others to use and modify. A typical approach to educational analytics is to use student performance or behavior data at intermediate stages to predict a final outcome of interest. For example, student results on homework assignments or quizzes as well as student activity records in online learning management systems during a semester-long course can be used to make guesses about final exam performance. Prediction of subsequent outcomes based on earlier performance has been demonstrated with students studying a number of topics, including informatics, 4 human computer interaction, 5 computer hardware, 6 and mathematics. 7 Liz-Domınguez et al. 8 refer to this as grade prediction. The same approach has also been used to predict which students will withdraw from an educational program. 9, 10 These studies train and evaluate the usefulness of machine learning models using a number of metrics such as accuracy, sensitivity, specificity, and AUC (area under the curve). Such machine learning models can be used on future groups of students to create early warning systems that identify at-risk students and intervene, as described by Arnold and Pistilli. 11 Within health professions specifically, commentaries on machine learning abound, [12] [13] [14] while new empirical studies that apply machine learning are less common. Black et al. 15 take a similar approach to ours within the physician assistant education context (which we build on by testing additional predictive models, validating our results on an entire new cohort, and addressing practical applications of analytics results). Predictive analytic methods have been used on data from students learning oral pathology, 16 blended medicine, 17 and psychomotor skills. 18 It has also been used to evaluate surgical competence, 19, 20 essays written by medical students, 21, 22 and residency applicants. 23 Chan and Zary, 24 Tolsgaard et al., 25 and ten Cate et al. 26 review studies related to analytics in health professions education. All of these examples show that predictive techniques can be used to help educators make guesses about future student outcomes. The physician assistant (PA) profession is currently undergoing tremendous growth in the United States. To become a certified PA in the United States, students typically study for two or more years and then must achieve a score of at least 350 on the Physician Assistant National Certifying Exam (PANCE; graded on a 200-800 scale). In a typical PA program, the first year of education primarily features coursework with regular assessments and the second year consists of clinical rotations. Given these characteristics, PA programs are a good example of a health professions program that might benefit from predictive analytic tools. 
Our data come from a PA program that mostly uses team-based learning (TBL), which is relatively new in PA education and has been growing in popularity in medical and other health professions programs. 27 TBL appears to be used at some medical schools at pertinent stages of the curriculum. 28, 29 TBL relies on small group interaction, creating opportunities for students to solve problems together and practice key concepts. 30 The learning process consists of three stages: student preparation, readiness assurance, and application. 31 Wallace and Walker 27 describe the following key details of TBL: a) Create TBL assessments and learning activities using backwards design, b) Form diverse teams of 5-7 students, c) Students study content before class. In class, they engage in individual and team assessments, listen to minilectures, and participate in application activities. TBL assessments include iRATs (individual readiness assurance test) and tRATs (team readiness assurance test). Students in this PA program also take the PACKRAT (Physician Assistant Clinical Knowledge Rating and Assessment Tool) exam at the end of their first year in the program. The PACKRAT is a cumulative, nationally-benchmarked assessment that covers important PA knowledge areas. The frequency of assessments in TBL curricula generates data that can be easily included in predictive models. However, such an abundance of data is by no means necessary for using our methods. In fact, we ended up combining or eliminating many pieces of information. Health professions education programs that do not generate as much data can still use predictive analytics. Our study uses data on student performance from a Master of Physician Assistant (PA) Studies program to create a predictive machine learning model to identify students at potential risk of failing the Physician Assistant National Certifying Exam (PANCE). Table 1 shows the process of data collection, analysis, and application of results as we plan to use it in practice. We used R version 3.6.3 (R Foundation for Statistical Computing, Vienna, Austria) for all analysis. We also developed an open source R package called AdaptiveLearnalytics 3 to carry out key portions of our analysis. Our analysis can be found in the supplemental Data and Code Appendix file. [ Table 1 ] We utilized de-identified student data from five successive cohorts starting in 2015-a total of 224 students-from a single PA program. The data set was organized such that each row is a student and each column is a variable (containing a measured grade or other characteristic). We utilize iRAT quiz, final exam, final course, and PACKRAT scores from students' first year in the PA program, as well as undergraduate grade point average (GPA) data. Even though regular assessments do occur throughout the second year of the PA program, we do not use this data or any demographic data. We exclude second-year data so that an entire year is available to provide additional support to students who are at risk of failing the PANCE. Our dependent variable is the PANCE score from each student's first attempt at this exam, after the student completes the 25-month PA program. PANCE scores can range from 200-800 points, and students must score 350 or above to pass. We use student performance variables in the data set, described below, as independent variables to predict students' PANCE scores. 
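To make the data layout concrete, the following sketch (in R) shows the expected structure, with one row per student. The column names other than PANCE and PACKRAT are illustrative placeholders rather than the program's actual variable names, and the values are invented for illustration only.

# Illustrative layout: one row per student, columns for first-year assessment
# summaries and the PANCE score (known only for already-graduated cohorts)
students <- data.frame(
  CourseA.iRAT.mean  = c(82, 74, 90),    # placeholder course-level iRAT mean
  CourseA.final.exam = c(85, 70, 93),    # placeholder final exam score
  undergrad.GPA      = c(3.6, 3.1, 3.8),
  PACKRAT            = c(155, 128, 170),
  PANCE              = c(430, 340, 510)
)
# A PANCE score of 350 or above is a passing result
students$pass <- ifelse(students$PANCE >= 350, 1, 0)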
This data set, collected as part of this PA program's mostly-TBL curriculum, gives us more information about each student than we would expect to have in a traditional (non-TBL) curriculum. For each TBL course, we calculated the mean of each student's iRAT scores and eliminated the individual iRAT score variables. For example, if students completed seven iRAT quizzes during Course A, we made a new variable with the mean of those seven iRAT quiz scores. We then eliminated the seven individual iRAT quiz scores from our data set. After these preprocessing steps, we had the following independent variables for each student for 18 first year courses: iRAT mean (when applicable), final exam score, final grade. We also had the following additional variables: overall undergraduate GPA, undergraduate science GPA, first-year PACKRAT score. As described in the Data and Code Appendix, we tried two different preparations (A and B) of these independent variables to see which one yielded better results for each type of predictive model. Once this manual preprocessing of variables was complete, we further narrowed down our independent variables by calculating the Pearson correlation coefficient of each independent variable with PANCE score. Any variable correlated with PANCE below a selected threshold was eliminated. A full explanation and list of variables is in the Data and Code Appendix. Prior to training any predictive models, all independent variables were standardized. Since a numeric score is used to determine if a student passes or fails the PANCE, ours is both a classification and regression problem. It is classification because we are trying to predict if a student passes or fails. It is regression because test results are numeric scores. As educators, we decided it would be most useful to classify students into three groups: very likely to fail (predicted PANCE score <350), moderate risk of failing (predicted PANCE score 350-375), low risk of failing (predicted PANCE score >375). The "moderate risk" category ensures that we were able to identify all "at-risk" students and not only those who would fail. This additional category supports our goal of using predictive analytics to help students advance toward examination with the best preparation resources that we can provide. Before settling on a single predictive technique, we started by using a number of standard approaches, such as random forest (RF), support vector machine (SVM), and standard K-nearest neighbors (KNN). These attempts are presented in the Data and Code Appendix. As shown in our results, none of these allowed us to create a predictive model that would achieve our educational goals of prioritizing the detection of students who might fail the PANCE while still minimizing incorrect (false positive) predictions. We developed a modified version of KNN called "adaptive minimum match" KNN (AMMKNN), summarized below. If we imagine a student named X who has just finished their first year of the PA program, this is how AMMKNN predicts X's PANCE score: 1. Start with a training data set of all previously-graduated students from the program, in which rows are students and columns are the independent variables (the dependent variable, PANCE score, is omitted). 2. Rank the training students from closest to X to farthest from X, based on Euclidean distance. 3. Select the 20 closest matches to X out of the ranked training students. This gives us a list of X's 20 nearest neighbors, to be used in subsequent steps. 
So far, this procedure is the same as standard KNN with K = 20. 4. We will now use the (known) PANCE scores of these 20 closest matches to make a guess about X's (unknown) future PANCE score. 5. Calculate the mean PANCE score of X's 1 closest neighbor (which is just that neighbor's score itself). Calculate the mean PANCE score of X's 2 closest neighbors. Calculate the mean PANCE score of X's 3 closest neighbors. Continue this process until there are 20 means, each calculated on a different number of X's nearest neighbors. Find the lowest out of the 20 calculated means, called X's minimum of means. 6. Identify the lowest PANCE score out of X's 20 nearest neighbors. This is X's minimum match. Determine if X's PACKRAT (the most important independent variable) score is less or greater than 2 standard deviations below the mean PACKRAT score. (This is to flag students who are possible outliers on the lower end). a. If greater (which is common): X's predicted PANCE score is their minimum of means. b. If less (which is rare): X's predicted PANCE score is their minimum match. We can do this procedure each year for Student X and all of their classmates who are in between years 1 and 2 of the PA program. The training data set contains all students from previous cohorts who have completed their first attempt of the PANCE (after they have graduated from the PA program). Standard KNN uses the same value of K-the number of matches-for each student's score prediction. It would then treat the mean of those first K matches as the predicted value for every student. Our adaptive approach differs because it uses a different value of K for each prediction. In the example above, our approach uses a different value of K-based on either the minimum of means or minimum match-to make a prediction for Student X and each of their classmates. Other adaptive KNN procedures have been used in different contexts. [32] [33] [34] The maximum possible value of K was set at 20 in our study. This means that anywhere between 1 and 20 nearest neighbors could be used to make a prediction. This is further discussed in the Data and Code Appendix. The "adaptive minimum match" modification to KNN makes our predicted PANCE scores lower than with standard KNN (and other commonly-used predictive models). This decision was deliberately made to prioritize the detection of students who might fail the PANCE, since standard approaches tend to predict that most students will pass. We used 181 students' data in our first four cohorts for training and cross validation. We then used our fifth and most recent cohort of 43 students for final validation of our best model (only 42 students are used with variable preparation B, due to missing data for one student). We evaluated the results of all predictive models using leave-one-out cross validation (LOOCV), which has been used or recommended before in similar situations with small sample sizes. 15, 35 We adhered as closely as possible to recommendations from Rao et al. 36 For each predictive model, to execute LOOCV on our sample of 181 students, we trained the model on 180 students and tested it on the remaining one student. We repeated this 181 times, such that each student was the testing student one time. This gave us a predicted numeric PANCE score for each student that the computer calculated without "knowing" the true PANCE score of that student. We then classified students as predicted to pass (if their predicted score was 350 or greater) or fail (lower than 350). 
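The prediction rule described above can be summarized in a short function. The sketch below is illustrative only: it follows the steps listed in this section but is not the AdaptiveLearnalytics implementation, the object names (alumni_x, alumni_y, alumni_packrat) are placeholders, and we assume the independent variables have already been standardized.

# Predict one student's PANCE score using the adaptive minimum match rule
ammknn_predict_one <- function(train_x, train_y, new_x,
                               packrat_new, packrat_train, max_k = 20) {
  # Steps 1-3: rank training students by Euclidean distance and keep the 20 nearest
  d <- sqrt(rowSums(sweep(as.matrix(train_x), 2, as.numeric(new_x))^2))
  neighbor_scores <- train_y[order(d)][1:max_k]

  # Step 5: means of the 1, 2, ..., max_k closest neighbors; keep the lowest
  running_means <- cumsum(neighbor_scores) / seq_len(max_k)
  min_of_means  <- min(running_means)

  # Step 6: lowest single score among the neighbors ("minimum match"), used only
  # when the student's PACKRAT score is more than 2 SD below the mean PACKRAT
  min_match   <- min(neighbor_scores)
  low_outlier <- packrat_new < (mean(packrat_train) - 2 * sd(packrat_train))

  if (low_outlier) min_match else min_of_means
}

# Leave-one-out cross-validation over the alumni data (sketch)
loocv_preds <- sapply(seq_len(nrow(alumni_x)), function(i) {
  ammknn_predict_one(alumni_x[-i, ], alumni_y[-i], alumni_x[i, ],
                     packrat_new   = alumni_packrat[i],
                     packrat_train = alumni_packrat[-i])
})
predicted_to_fail <- loocv_preds < 350   # 350 or above is a passing score

In the study itself this procedure is carried out by the adaptiveMinMatchKNNregression() function in the AdaptiveLearnalytics package, as shown in the Data and Code Appendix.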
For each predictive model, we first created a standard 2-by-2 confusion matrix to show each student's predicted value (from when they were the testing student) and their actual PANCE result. From the confusion matrices for each model, we calculate the following standard metrics: true positives, false positives, true negatives, false negatives, accuracy, sensitivity, specificity. Definitions for these metrics are in our Data and Code Appendix. To make our predictions more useful, we further disaggregated them into three groups: predicted to fail (PANCE <350), at-risk of failing (350-375), and likely to pass (>375). This is analogous to the "traffic signal" strategy. 11 We argue that sorting students into these three groups is most useful for our goals of optimizing potential remediation well in advance of the exam. In addition to traditional 2-by-2 confusion matrix data in Table 2 , we present 3-by-3 confusion matrices in Figure 1 that show all three groups. We discuss its practical application in our educational context. Table 2 shows the LOOCV results of multiple predictive models, based on standard 2-by-2 confusion matrices. AMMKNN has the highest accuracy with both variable preparations (0.93 with preparation B and 0.91 with preparation A). Since our priority as educators is to minimize false negatives (students who "fall through the cracks," because they need our help but we fail to detect this), we want to maximize sensitivity while keeping false positives ("unnecessary support" students who are flagged by the model but do not truly need remediation) within reasonable limits. The model with the highest sensitivity is RF(A), which predicts that 35 students will fail, 10 of which are correct predictions and 25 of which are incorrect "unnecessary support" predictions. Providing remediation to 35 students when only 10 of them (less than one-third) need it is not reasonable, we argue. The next highest sensitivity is 0.69, shared by AMMKNN(B) and SVM(B). AMMKNN also has higher accuracy and specificity. AMMKNN(B) predicts that 18 students will fail, with 9 of these predictions being correct and 9 being incorrect. In this scenario, half of all students flagged for extra support would truly require it. SVM(B) predicts that 27 students will fail, with 9 of these predictions being correct and 18 incorrect. [ Table 2 ] To more clearly break down the predictions and how we can use them in practice, Figure 1 shows 3-by-3 confusion matrices for selected models. The accuracy for AMMKNN(B) is calculated as the number of correct predictions divided by the number of total students: (9 + 3 + 124)/181 = 0.75. Even though this accuracy is lower than the accuracy of the same model when a 2-by-2 matrix is used (0.93), we argue that the 3-by-3 version-which views results as a spectrum of three classification categories-is more useful when considering the use of the model in practice. We can build a framework based on the 3-by-3 AMMKNN(B) results: 18 students are predicted to fail the PANCE. Out of these 18 predictions, 9 would be completely correct, 1 would be justifiable (because it would be acceptable to remediate a student who would otherwise almost fail), and 8 would be "unnecessary support" students who receive remediation even though it would not have been needed. The model would identify 26 students as "at risk." Out of these 26 predictions, 5 would be useful (2 students who truly fail and 3 students "at-risk") and 21 would be "unnecessary support." 
The model would identify 137 students as likely to pass. Out of these 137 predictions, 124 would be "predictable no support" students who pass without risk, 11 would pass with an "at-risk" score, and 2 would "fall through the cracks" because we failed to identify them as needing remediation. The results for AMMKNN(A) are similar, while SVM(B) and standard KNN(B) each have 16 false positives, which is likely too many "unnecessary support" students to reasonably remediate. Our recommended use of these results is to remediate many students with predicted "fail" scores, consider remediating predicted "at-risk" students on a student-by-student basis, and not remediate students with predicted "pass" scores (unless extenuating circumstances or other information suggest otherwise). [ Figure 1 ] After completing the validation and inspection of results above, the final step is to further validate our models on an entire cohort of students, the way we intend to do in practice in the future. We do this with our fifth cohort of students, which was not used to train or cross validate the models. Figure 2 shows these results in our 3-by-3 framework. AMMKNN(A) makes the best predictions: Out of 6 students who failed the PANCE, the model correctly predicted that 2 would fail, identified 2 as at-risk, and failed to identify 2 (incorrectly predicting that they would score above 375). Out of 5 students who passed with at-risk scores, the model failed to detect all 5, but this oversight would have been acceptable to us given that the students did not fail the exam. Finally, out of 32 students who passed without risk, 2 were incorrectly predicted to fail, 0 were predicted at-risk, and 30 were correctly predicted to pass. These results mean that if we had used this model in practice-ignoring other inputs in our discussion below-we would have remediated the 6 students classified as failing or at-risk and we would have been right to remediate 4 out of these 6 students. AMMKNN(B) is the only other model which identifies 4 of the 6 students who fail as either failing or at-risk. SVM and standard KNN fail to detect 4 and 3 students, respectively, out of the 6 total who fail. The results from both LOOCV and entire-cohort validation suggest that AMMKNN might be more useful in practice than standard SVM and KNN models, because a) AMMKNN performs better in both A and B preparations of the independent variables, and b) AMMKNN has fewer false positive predictions at desired levels of sensitivity. [ Figure 2 ] Our results demonstrate a) the value of an adaptive machine learning method like AMMKNN that has high sensitivity while minimizing false positives, b) the utility of a 3-by-3 confusion matrix approach, and c) that accidental remediation of false positive "unnecessary support" students would be unavoidable if results from predictive models were trusted blindly and without taking into account other relevant factors. We argue that the AMMKNN model and 3-by-3 approach can provide additional information to be used in conjunction with other programmatic information to assist educators in identifying at-risk students. We further argue that the use of this and similar predictive models needs to be considered from three perspectives-educator, student, and program administrator-which together can lead to a practical and ethically-sound student support strategy. Balancing these perspectives involves tradeoffs that remind us that analytics are best used as one among multiple inputs to decision-making. 
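For readers who want to reproduce this kind of tabulation, a 3-by-3 matrix like those in Figure 1 and Figure 2 can be produced by banding both the predicted and the actual scores and cross-tabulating them. The sketch below reuses the illustrative loocv_preds and alumni_y placeholders from the earlier sketch; the cut() boundaries mirror those used in our Data and Code Appendix.

# Band predicted and actual PANCE scores into the three groups, then cross-tabulate
bands  <- c(-Inf, 350, 375, Inf)
labels <- c("<350", "350-375", ">375")
predicted_band <- cut(loocv_preds, breaks = bands, labels = labels)
actual_band    <- cut(alumni_y,    breaks = bands, labels = labels)
(cm3 <- table(Predicted = predicted_band, Actual = actual_band))

# 3-by-3 accuracy: proportion of students whose predicted band matches their
# actual band, e.g., (9 + 3 + 124) / 181 for AMMKNN(B) as reported above
sum(diag(cm3)) / sum(cm3)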
Our top priority when building a predictive model was to reduce the number of false negatives, meaning students who "fall through the cracks." Accomplishing this comes at the cost of having some false positive "unnecessary support" students also identified by the model. Therefore, it is critical to proceed cautiously when applying results from this or any other predictive model to our educational work. When we apply this analytics process to predict student PANCE outcomes, the predictive model will label our students as likely to fail, be at-risk, or pass. Even though similar educational analytics success stories abound, 2, 11 there is also concerning evidence that labeling students as "failures" can be problematic. 37, 38 Framing of the results when communicating with the students is something that the educator must consider, so that students understand the motivation and uncertainties associated with suggested actions. While providing comprehensive recommendations for this process is beyond the scope of our current work, we note that established scholarship exists related to breaking bad news 39, 40 and remediation 41, 42 in contexts where students are preparing for high-stakes exams. This scholarship lays a foundation for communicating with students about predictive results. Transparency is a critical component of this process, and we recommend that educators choosing to use predictive models make students aware of the methods used and share with them that false positive predictions are common. An analogy to disease testing could be a simple way to communicate this clearly to health professions students, especially given the current ubiquity of COVID-19 testing. Students can be informed that these predictive models tend to have tradeoffs associated with false positives, even though they can be useful to educators to identify at-risk students. Another way to act upon predictive results could be to recommend that students engage in a remedial program, but leave the choice up to each student. Students who are not identified at being at-risk could also be given the option of self-selecting into a remediation program. Given these hazards, we propose a student support strategy that uses predictive results in conjunction with other possible indicators of future student outcomes. We recommend considering the model's predictions along with other processes established by the program to ensure success of students, including programmatic efforts to support students identified as having unmet needs for a variety of reasons or self-identify as requiring extra support. When we apply this method in practice, we plan to individually review all information that we have about students who are predicted to fail or be at-risk. Then, taking all information into account, we plan to recommend remediation support for only a subset of these students, as they prepare for the PANCE exam. These recommendations are initially made after the first year. Recommendations could be updated part-way through the second year as well. This study does not include an empirical examination of student reactions to receiving analytics-based results. Therefore, we can only speculate about possible reactions and considerations from the student perspective. We believe that this is important to address, even though it may be incomplete. Further research is required on: a) student reactions to (their own) analytics results and b) student recommendations on how to incorporate analytics. 
We hope to incorporate considerations of student mental health and long-term outcomes into our proposed student support strategy, with a focus on transparency. When students are told that they have been identified for receiving remedial support, we anticipate that this could cause distress. We might be able to mitigate some of this distress by telling each student that a) other factors-such as course grades, concern from instructors, or known extenuating circumstances-also factored into our decision to offer remedial support, and b) the predictive model can make mistakes and identify "unnecessary support" (false positive) students accidentally, the same way that a patient can receive a false positive disease test and be subjected to additional subsequent unnecessary testing or treatment. We can also include student and alum feedback in the process. Additionally, student experiences of framing effects 43, 44 needs to be investigated and optimized. The term "remedial" might sound harsh. We might instead choose to use a term like "extra support" or "supplemental curriculum." Further work is required to solicit and incorporate student input into the use of analytics. From the perspective of a program administrator, leveraging predictive models to make a successful student support strategy can be challenging. Administrators will need to create streamlined data collection and archiving processes, achieve buy-in from students and employees, and allocate employee and student time for remediation. We used only first-year data when building a predictive model, so that students still have their second year in the program to prepare for the PANCE exam. This one-year period gives program administrators more time to arrange for appropriate support for all students to the extent that resources and logistics allow. Additional analytics could also be conducted with second-year assessment data as students move closer to the exam. While our analytics approach is useful as an in-house tool and might be applicable to broader educational settings, we acknowledge a number of limitations in our study and results. The ability of our AMMKNN model to generalize to other types of health professions programs is unknown. Therefore, we recommend that others using this approach conduct multiple stages of validation after training the model and work to transparently apply the results when used for making predictions. Given our small sample size of five cohorts of students in a single program, we also cannot yet comment on whether our model's performance will improve or not after additional data are added. The predictive model depends on collecting the same independent variables on students every year. If curriculum changes or instructional practices lead to non-comparable data being collected across cohorts, our predictive approach might not work as effectively. Furthermore, the cohorts included in our study involved some students that were and others who were not affected by the COVID-19 pandemic. Some students completed the PANCE exam in the middle of the pandemic, while most completed it prior to the pandemic. We do not currently measure and include variables to account for these potential cohort-specific differences. Our study also has some strengths to build upon in future work. Educational analytics are often plagued by small sample sizes. Our AMMKNN method addresses this problem and was able to make improvements over standard models, without the need for extensive fine tuning of model parameters. 
Our ability to make reasonable predictions on an entire new cohort of students, to demonstrate application of the method, is another strength. The method now needs to be tested with other data and contexts. Many educational studies have cohorts in these size ranges and our methods reflected this common constraint. As educators, we hope to create a safety net around our students that can support them as needed, especially in health professions education with a) patient outcomes at stake and b) delays in certification leading to unwanted professional consequences. The "adaptive minimum match" KNN model we have developed and validated serves as one additional tool that-along with already-existing tools-makes this safety net stronger when used responsibly and transparently. We recommend that analytics should coexist with and bolster other approaches to student support, and that educator, student, and program administrator perspectives should all be incorporated for the practical and ethical implementation of educational analytics. Readers can use the code above to install and use the package. The following code can be run to see more information: help(package = AdaptiveLearnalytics ) • Below, we set our working directory: setwd("path/to/working/directory") The code above is for illustrative purposes. We set our real working directory in a hidden code chunk. We are unfortunately not able to make our data publicly available. However, the code below combined with the examples within the AdaptiveLearnalytics package should be sufficient to reproduce our analysis methods on a different data set. 2 Prepare and explore data The code below subsets our data and produces output to help us confirm that the subsetting worked. The dependent variable in this study is student results on the Physician Assistant National Certifying Examination (PANCE). The independent variables are the results of quizzes, tests, and other assessments that physician assistant students engage in during their studies to become a physician assistant. PANCE results by year: dpartial$pass <-ifelse(dpartial$PANCE>349,1,0) dpartial$fail <-ifelse(dpartial$PANCE>349,0,1) library(dplyr) dplyr::group_by(dpartial, Year student began program ) %>% dplyr::summarise( count = n(), Note that in some cohorts, not all students went on to take the PANCE. In our study, we use the four years of student data from 2015 through 2018 to build and cross-validate our models. We refer to students in the 2015-2018 cohorts as "alumni." We then use the 2019 cohort-who we often refer to as the "current students," because this model is eventually meant to be applied to a cohort of students who have not yet graduated-for further validation. 2019 students are never used to train a predictive model; they are only used for validation of the various models we create, to simulate the way that we hope to apply our best model(s) to an entire cohort at once. There are two possible ways to prepare the data, which we will call preparations A and B (these might also be referred to as .a and .b and be used as suffixes within the code). The .a preparation does not use the CAT variables (the final exams for the courses) overall, with one exception, as shown below. The .a preparation then uses a lower correlation threshold for variable selection than the .b preparation. 
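The correlation-threshold step mentioned above can be written compactly. This is an illustrative sketch rather than the appendix's exact code: it assumes a dataframe (called df later in this appendix) of standardized independent variables plus the unstandardized PANCE column, and it uses the absolute correlation, whereas the original selection may use the signed correlation.

# Keep only independent variables whose correlation with PANCE meets the threshold
CorrelationThreshold <- .19   # .1 is used for preparation A
iv_names <- setdiff(names(df), "PANCE")
cors <- sapply(df[iv_names], function(x) cor(x, df$PANCE, use = "pairwise.complete.obs"))
keep <- iv_names[abs(cors) >= CorrelationThreshold]
FinalAlumni.b <- df[, c(keep, "PANCE")]

The per-cohort summary call shown earlier in this appendix is cut off in this version of the document. A minimal sketch of what such a summary might look like, assuming the dpartial columns created above (the exact columns reported in the original output may differ):

library(dplyr)
dpartial %>%
  dplyr::group_by(`Year student began program`) %>%
  dplyr::summarise(count      = n(),
                   took_PANCE = sum(!is.na(PANCE)),  # not all students took the PANCE
                   passed     = sum(pass, na.rm = TRUE),
                   failed     = sum(fail, na.rm = TRUE))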
Courses in final semester before rotations: Based on the summary statistics above, we see that PA770 is likely the hardest final semester class (large range and standard deviation), so we include the CAT (final exam) for this class as an independent variable in our model, as an additional indicator from each student's final semester of courses. Below, we remove a few variables that we know we definitely don't want to include (due to data being unavailable, specific variable coding, or our own hypotheses and experience about which variables will lead to the best predictions). We want to remove variables from the data set that are correlated at lower than 0.1 with the dependent variable PANCE. Following standard practice in machine learning approaches, we tried many different configurations of independent variables and many different correlation thresholds and found 0.1 to yield the best list of variables to make predictions. We decided which list of variables was "best" by comparing predictive model results with different lists of variables, using accuracy, sensitivity, and specificity to decide. We are not able to show every single attempt in this file, but the two variable configurations that we present in this file demonstrate our general approach: try different sets of variables in all of the machine learning models, then choose only the best performing models (regardless of the variables) to validate with our 2019 cohort of "current students" and use in our own educational work on new students. This method of comparing and selecting models appears to be a standard procedure, also practiced by many of the scholars we cite in our main article's literature review.

This version of the data preparation, called .b, leaves in the CAT exams (the final exams in the various courses) and has a higher correlation threshold for variable selection than the .a preparation.

names(d) <- make.names(names(d))
library(jtools)
df <- jtools::standardize(d)
dfcopystd <- df
# put unstandardized dependent variable back into data
df$PANCE <- masteralumnidata$PANCE
CorrelationThreshold <- .19

We want to remove variables from the data set that are correlated at lower than 0.19 with the dependent variable PANCE.

Import and prepare:

library(readxl)
current <- read_excel("AI Research Template downloaded 20220207.xlsx")
names(current) <- make.names(names(current))
current.copy <- current
library(jtools)
current <- jtools::standardize(current)

Now we have a dataframe called current which contains the latest graduating cohort only, meaning the current students on whom we need to make predictions. We also have other dataframes containing alumni data, which is all of the previous students on whom we will do LOOCV to build and compare models and with whom we will then train a model to run on the cohort of current students. Below, two versions of the dataframe current will be made, each with the variables in either preparation A or B. We also made a dataframe called alumniForCurrentPredictions. Two versions of this will also be made below for preparations A and B. We cannot just use FinalAlumni.a and FinalAlumni.b for the validation on current students because those files were standardized without the inclusion of the 2019 current student validation cohort. Standardization must occur when both the training and testing data are together-before they are divided-which is what we achieve in this section, in preparation for running validation models later in this appendix. Main details of our predictive modeling approach are in our main article text.
This section contains some supplementary details. The independent variables that we manually select and then give to the computer to further narrow down are shown in outputs within this document. Here, we provide a few additional notes regarding our independent variable preparation • iRAT quiz scores: As noted in our article, we took the mean of all iRATs within each course for each student (separately for each course), and included that mean iRAT score variable for each course for eligibility in our final model. We did attempt to use individual iRAT scores as separate independent variables in other modeling attempts (not shown) and did not find these to be useful. Just like averaging a patient's blood pressure to get a true picture of their trends and tendencies, averaging iRATs seems to give a better view of a student's academic and test-taking abilities. Furthermore, since iRAT quizzes are given many times to the students each week, they stand to be changed from year to year (both in content and number). Since our goals rely upon data that can be compared year to year, using the mean of the iRAT scores within each course for each student is a simple way to make sure that there will not be any missing data but that we can still include daily or weekly iRAT data as a metric of formative assessment. • In a few cases, there are courses in the curriculum that were not taken by all students in all years. We did not use data from such courses. • AMMKNN -Adaptive minimum match k-nearest neighbors. This is the predictive approach that we developed ourselves-a modification to standard K-nearest neighbors-to provide more accurate and sensitive predictions for our data and applied context. • KNN -K-nearest neighbors. A standard and traditional, matching-based approach to generating predictions. • RF -Random Forest. A standard and traditional machine learning approach that is based on collections of decision trees. • SVM -Support Vector Machine. A standard and traditional machine learning approach that draws boundaries between observations to make predictions. • LOOCV -Leave-one-out cross validation. Defined in the main text of our article. • PANCE -The dependent variable that we are trying to predict in our study. This is addressed in the main text of our article. Students must score 350 points or higher on PANCE to pass. Running KNN, RF, and SVM are all standard practices within the field of educational predictive analytics as demonstrated by the literature we cite in the main text of our article. LOOCV is a common evaluation approach for data with small sample sizes; we cite some of the literature that sets a precendent for this. We also go a step further than much of this literature by holding out our "current" 2019 cohort for further model validation. In all of the predictive models that we created-other than the adaptive minimum match KNN procedure that we developed to combat this very problem-we find that the model over-estimates PANCE score (our dependent variable). This is highly visible in the results we present below in this appendix. When we take predictions from standard RF, SVM, and KNN models and compare them to students' actual results on the PANCE, the predictions are always too high. We present standard 2-by-2 confusion matrices for all of these scenarios. 
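As a companion to the confusion matrices presented below, here is a small helper that computes the 2-by-2 counts and the metrics defined in the next section from numeric predicted and actual PANCE scores, with an adjustable cutoff applied to the predictions. It is an illustrative sketch; the appendix's own tabulations are produced differently.

# "Fail" is the positive class; actual scores below 350 are true failures,
# and predicted scores below `cutoff` are treated as predicted failures
confusion_2x2 <- function(pred, actual, cutoff = 350) {
  TP <- sum(pred <  cutoff & actual <  350)
  FP <- sum(pred <  cutoff & actual >= 350)
  TN <- sum(pred >= cutoff & actual >= 350)
  FN <- sum(pred >= cutoff & actual <  350)
  c(TP = TP, FP = FP, TN = TN, FN = FN,
    accuracy    = (TP + TN) / length(pred),
    sensitivity = TP / (TP + FN),
    specificity = TN / (TN + FP))
}

# Example call on the LOOCV random forest output created below:
# confusion_2x2(df.rf.LOOCV$rf.pred, df.rf.LOOCV$PANCE, cutoff = 390)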
When we take these standard models' predictions and classify students predicted to score below 350 as failing and 350 or above as passing, many models are unable to correctly predict even a single student who truly failed the PANCE. To make these standard models' results more meaningful, we manually adjust the predictions made by these models. For example, if a standard KNN model predicts that a student will score 385 points on the PANCE, we recode that score of 385 to be less than 350, such that it now predicts a failing result. The modification of the prediction threshold that separates failing and passing for the predicted exam outcomes, as described above, makes each model's results more sensitive, meaning that it will classify more students as positives, meaning those who will fail the exam. Overall, this is what we want (to identify those who will fail), but the problem is that increasing the sensitivity also leads to more false positive predictions. In most cases, as shown below in the results, the number of false positives are too many for the model to be usable. Our best predictive approach, AMMKNN, also suffers from false positive predictions, but much fewer than the other approaches. AMMKNN achieves high sensitivity while still keeping the number of false positives within tolerable bounds, as the results later in this file show and as we discuss in the main article. Our code and results show both untuned results and the modifications we made. As the results show, the results are only meaningful when we treat 390 and often even higher numbers as failing grades. Note that these adjustments are only being made to the predictions of student scores. The true values of their scores are never adjusted, of course. Also note that changing the cutoff between failing and passing from 350 to 390 for predictions is similar to simply subtracting 40 points from everybody's predicted scores. Our code shows our attempts to fine-tune the results with different artificial cutoff thresholds for the predictions of standard machine learning models and how those yield different results. AMMKNN does not require this type of fine-tuning. Main details of our evaluation approach of our predictive models are in our main article. This section contains some supplementary details. Here is how we define all of our key evaluation metrics: • True positives (TP)-The number of students who truly failed PANCE and were correctly predicted by the model to fail PANCE. To us as educators, these are "predictable support" students who truly need additional remedial support and our model succeeded in identifying them as such. • False positives (FP)-The number of students who truly passed PANCE but were incorrectly predicted to fail. These are "unnecessary support" students who do not truly need our support but our model tells us that they do need remediation. • True negatives (TN)-The number of students who truly passed PANCE and were correctly predicted to pass. These "predictable no support" students did not need our support and the model correctly predicted this. • False negatives (FN)-The number of students who truly failed PANCE but were incorrectly predicted to pass. These students "fell through the cracks" because the model would ideally identify them as needing remedial support, but it did not. We argue that it is our duty to reduce this number as much as possible. • Accuracy-The total number of correct predictions divided by the total number of students, (TP+TN)/181. 
• Sensitivity-The proportion of students who truly failed who were correctly predicted to fail, TP/(TP+FN). • Specificity-The proportion of students who truly passed who were correctly predicted to pass, TN/(TN+FP). Note that we define "positive" and "negative" outcomes this way to maintain consistency with the use of these terms in healthcare: A positive outcome is unwanted, like failing a test or having a disease; a negative outcome is desired, meaning passing a test or not having a disease.

When developing/testing our predictive models, we attempted to follow the same approach that we would use when applying predictive modeling in practice. In practice, we plan to use the following procedure: 1. Use alumni data to train predictive models. 2. Determine if predictive models on alumni data are accurate, sensitive, and specific enough to trust to make predictions on new cohorts of students. 3. Use the best predictive model(s) to make predictions on new students. Following this procedure, we treat our first four cohorts of students' data (2015-2018 starting cohorts) as alumni data and we treat our most recent cohort (starting in 2019) as the "current" cohort of students. We train and test (using LOOCV) the best model that we can on the alumni cohorts. We then use the best model to make predictions all at once on the entire 2019 "current" cohort, as if we had done so prior to their taking the PANCE. These two levels of testing/validation (first LOOCV among alumni and then full-cohort validation) show that the model has the potential to hold up and be useful in practice, rather than just as a model-building exercise. All of the models we build are first tested with LOOCV and then validated on the "current" students cohort, with varying results that are shown below. AMMKNN performs the best for our purposes (best detection of failing and at-risk students in both LOOCV and full-cohort validation), with standard KNN coming close if it is heavily manually fine-tuned/adjusted. In some cases, we are able to find a potentially-useful result from a standard (non-AMMKNN) model in our current cohort validation, but these often come from trial and error within the current cohort validation itself (instead of following directly from an LOOCV result), meaning that we would have no way of producing this result in practice in the future (when we do not yet know the true PANCE outcomes for the current cohort at the time when we generate the predictions and identify at-risk students).

library(randomForest)
df.rf.LOOCV <- crossValidate(FinalAlumni.a,
  "rf1 <- randomForest(PANCE~., data=traindata, proximity=TRUE, ntree=1000);
   testdata$rf.pred <- predict(rf1, newdata = testdata)",
  nrow(FinalAlumni.a))

Above, we see that when we take the results of the RF model as-is and do not make any modifications, it does not correctly predict any of the 13 students who failed the PANCE. Manual fine-tuning/adjustment is needed to make the results even remotely useful. Below, we change the pass-fail threshold (only for the predicted values, of course) to 390 (rather than 349 like above):

Above, we now see that the model is more sensitive and is able to detect 5 of the students who failed PANCE. But this is still a very low number, and this change also led to 8 false positive predictions. Therefore, this RF model is not practical to use or explore further. Below, we raise the prediction threshold even higher, which improves the predictions for those who truly failed, but also leads to many more false positives.
Above, we see that if we validate on our current students and manually make our model so sensitive that anyone predicted by the random forest model to score below 420 is treated as a predicted failure (below 350), then we can detect 1 of the 6 students who fail the exam. This comes at the expense of 4 false positives.

library(randomForest)
df.rf.LOOCV <- crossValidate(FinalAlumni.b,
  "rf1 <- randomForest(PANCE~., data=traindata, proximity=TRUE, ntree=1000);
   testdata$rf.pred <- predict(rf1, newdata = testdata)",
  nrow(FinalAlumni.b))

Above, we see that if we validate on our current students and manually make our model so sensitive that anyone predicted by the random forest model to score below 420 is treated as a predicted failure, then we can detect 3 of the 6 students who fail the exam. This comes at the expense of 5 false positives. The SVM results above are not promising.

library(e1071)
df.svm.LOOCV <- crossValidate(FinalAlumni.b,
  'svm1 <- svm(PANCE~., data = traindata, kernel = "polynomial", gamma = 1, cost = 10, scale = FALSE);
   testdata$svm1.pred <- predict(svm1, newdata = testdata)',
  nrow(FinalAlumni.b))

Since the result above, with a 390 prediction threshold, has 9 true positives, we will look at it in a 3-by-3 matrix:

`Actual Values` <- cut(df.svm.LOOCV$PANCE, breaks=c(-Inf,350,375,Inf), labels=c("<350","350-375",">375"))
`Predicted Values` <- cut(df.svm.LOOCV$svm1.pred, breaks=c(-Inf,390,400,Inf), labels=c("<350","350-375",">375"))
(cm1 <- table(`Predicted Values`, `Actual Values`))

These SVM results with Preparation B are better than with Preparation A, but we still have too low sensitivity (too few of those who truly fail are detected correctly) and too many false positives. Note that in a future section in this document, when we run adaptive minimum match KNN (AMMKNN), in the dataframe adaKNNminMatch.LOOCV, we have a variable called EucMatchDVMean.12 which should be the predicted value for standard KNN with K=12 for each observation. And we also have a dataframe adaKNNminMatch.current with the variable EucMatchDVMean.12, which would be the predictions from KNN with K=12 for the validation cohort of students. So, making a KNN model using the FNN package is not essential, given that we will be making an AMMKNN model as well and our AMMKNN results include K=12 results for standard KNN, as explained. However, for the sake of completeness, we present standard KNN in a more conventional way below.

Train model with LOOCV, k = 12:

library(dplyr)
library(FNN)
df.knn12.LOOCV <- crossValidate(FinalAlumni.a,
  "dtrain.x <- traindata %>% select(-PANCE);
   dtrain.y <- traindata %>% select(PANCE);
   dtest.x <- testdata %>% select(-PANCE);
   dtest.y <- testdata %>% select(PANCE);
   pred.a <- FNN::knn.reg(dtrain.x, dtest.x, dtrain.y, k = 12);
   testdata$knn12.pred <- pred.a$pred",
  nrow(FinalAlumni.a))

The result above on the validation cohort of current students is actually quite good. But cutoff points as high as 410 or 420 with this model during LOOCV showed far too many false positive predictions, so there would be no way to know that cutoff points of 415 and 420 (the ones used above) might be useful for a cohort of new students (because in practice, true PANCE results will not be known for the current student cohort at the time when the model is applied; therefore the LOOCV results are the only guide we have regarding which model to use on a new cohort). Since there is no systematic way to arrive at the result above, it is not presented in our main article.
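To see how these manual cutoffs trade sensitivity against false positives during LOOCV, one could sweep a range of candidate cutoffs with the confusion_2x2() helper sketched earlier. This is an illustrative sketch; df.knn12.LOOCV and its knn12.pred column are the objects created above.

# Sweep candidate pass/fail cutoffs applied to the LOOCV KNN predictions
t(sapply(c(350, 390, 400, 410, 415, 420), function(k)
  c(cutoff = k,
    confusion_2x2(df.knn12.LOOCV$knn12.pred, df.knn12.LOOCV$PANCE, cutoff = k))))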
Train model with LOOCV, k = 12:

library(dplyr)
library(FNN)
df.knn12.LOOCV <- crossValidate(FinalAlumni.b,
  "dtrain.x <- traindata %>% select(-PANCE);
   dtrain.y <- traindata %>% select(PANCE);
   dtest.x <- testdata %>% select(-PANCE);
   dtest.y <- testdata %>% select(PANCE);
   pred.b <- FNN::knn.reg(dtrain.x, dtest.x, dtrain.y, k = 12);
   testdata$knn12.pred <- pred.b$pred",
  nrow(FinalAlumni.b))

Above, when k = 12 and we manually change the prediction cutoffs to 400 and 410 for the predicted values, the predictions are not bad, but 16 students who truly score above 375 are predicted to fail, which is too many. Adaptive Minimum Match KNN-our selected best model-predicts fewer students in this category. Furthermore, the standard KNN result above required careful manual adjustment of the prediction cutoffs before the results were useful. Adaptive Minimum Match KNN doesn't require such adjustment. These characteristics were also present in the disaggregated confusion matrix when k was set to a variety of numbers, as low as 6 and as high as 18, for other standard KNN models (which are not shown). The reason for the manually adjusted prediction cutoffs above is that the standard KNN model gives everybody or nearly everybody a predicted score above 350. By changing the cutoff for passing or failing from 350 to 400 or more, we essentially subtract 50 points from each student's predicted score. Only after this subtraction do the results have some utility (although there are still too many false-positive predictions). Below, we try to disaggregate again, with cutoffs at 415 and 420:

`Actual Values` <- cut(df.knn12.LOOCV$PANCE, breaks=c(-Inf,350,375,Inf), labels=c("<350","350-375",">375"))
`Predicted Values` <- cut(df.knn12.LOOCV$knn12.pred, breaks=c(-Inf,415,420,Inf), labels=c("<350","350-375",">375"))
# (cm <- addmargins(table(`Predicted Values`, `Actual Values`)))

Above, we see that there are 45 false positives, which is too many for this version of our standard KNN result to be viable.

Prepare training data:

dtrain.x <- alumniForCurrentPredictions.b %>% select(-PANCE) # remove DV
dtrain.y <- alumniForCurrentPredictions.b %>% select(PANCE) # keep only DV
dtrain.x <- as.data.frame(dtrain.x)
dtrain.y <- as.data.frame(dtrain.y)

We then prepare the validation data, view the results in a 3-by-3 confusion matrix (when actual values are known), and repeat the table with cutoffs for the predictions at 415 and 420. [The cut() calls and 3-by-3 confusion matrices for this current-cohort validation are truncated in this version; they follow the same pattern as the LOOCV tables above, first with prediction cutoffs at 350 and 375 and then at 415 and 420.]

The result above is decent when we change the cutoff values to 415 and 420, but there is no way that we would have known to do this before knowing the true PANCE scores of the current students.

[Two scatterplots: actual PANCE for the 2019 cohort on the vertical axis, plotted against predicted PANCE from adaptive minimum match KNN and against PACKRAT score on the horizontal axis; the commented-out calls abline(v=375, col="green") and abline(h=375, col="green") mark reference lines at 375.]

In the two scatterplots above, we see that adaptive minimum match KNN gives us more useful predictions than the PACKRAT (a nationally benchmarked cumulative exam commonly used to prepare for the PANCE). True PANCE results are shown on the vertical axis for the 2019 current student cohort. Predicted results (either from adaptive minimum match KNN or PACKRAT) are on the horizontal axis. Our goal is to draw a vertical line which has students who fail to the left and those who pass to the right. With adaptive minimum match KNN, we are able to draw this vertical line where the predicted PANCE score is equal to 375. This allows us to identify (to the left of the line) most of the students who fail without making too many mistakes.
However, if we attempt to do the same on the second scatterplot with PACKRAT scores, it is not possible to draw a vertical line anywhere that identifies most of the students who fail without making too many wrong predictions. Train model with LOOCV: Prepare training data: dtrain. Machine Learning in Education -a Survey of Current Research Trends The Promise and Peril of Predictive Analytics in Higher Education: A Landscape Analysis AdaptiveLearnalytics: Adaptive Predictive Learning Analytic Tools Use of machine learning techniques for educational proposes: a decision support system for forecasting students' grades Assessing Intervention Timing in Computer-Based Education Using Machine Learning Algorithms Using learning analytics to develop early-warning system for at-risk students Comparative Study of Prediction Models on High School Student Performance in Mathematics Mikic-Fonte FA. Predictors and early warning systems in higher education-A systematic literature review Predicting student dropout in higher education. ArXiv Prepr ArXiv160606364 Predicting academic performance of students from VLE big data using deep learning models Course signals at Purdue: using learning analytics to increase student success Why We Needn't Fear the Machines: Opportunities for Medicine in a Machine Learning: The Next Paradigm Shift in Medical Education AI-ssessment: Towards Assessment As a Sociotechnical System for Learning Using Data Mining for the Early Identification of Struggling Learners in Physician Assistant Education Exploring viewing behavior data from whole slide images to predict correctness of students' answers during practical exams in oral pathology How learning analytics can early predict under-achieving students in a blended medical education course The relation of online learning analytics, approaches to learning and academic achievement in a clinical skills course Artificial intelligence distinguishes surgical training levels in a virtual reality spinal task Machine Learning Identification of Surgical and Operative Factors Associated With Surgical Expertise in Virtual Reality Simulation Automated essay scoring and the future of educational assessment in medical education Machine Scoring of Medical Students' Written Clinical Reasoning: Initial Validity Evidence Development and Validation of a Machine Learning-Based Decision Support Tool for Residency Applicant Screening and Review Applications and Challenges of Implementing Artificial Intelligence in Medical Education: Integrative Review The role of data science and machine learning in Health Professions Education: practical applications, theoretical contributions, and epistemic beliefs Ten caveats of learning analytics in health professions education: A consumer's perspective Team-based learning Team-based learning at ten medical schools: two years later Geographic Trends in Team-based Learning (TBL) Research and Implementation in Medical Schools The essential elements of team-based learning An Introduction to Medical Teaching Adaptive k-Nearest-Neighbor Classification Using a Dynamic Number of Nearest Neighbors An adaptive k-nearest neighbor algorithm Adaptive kNN using expected accuracy for classification of geo-spatial data Prediction of Student's performance by modelling small dataset size On the Dangers of Cross-Validation. 
An Experimental Evaluation Academic deficiency: Student experiences of institutional labeling Cultures of learning in developing education systems: Government and NGO classrooms in India Communicating sad, bad, and difficult news in medicine Communication Strategies and Cultural Issues in the Delivery of Bad News Twelve tips for developing and maintaining a remediation program in medical education Guidelines: The dos, don'ts and don't knows of remediation in medical education The Framing of Decisions and the Psychology of Choice On the Elicitation of Preferences for Alternative Therapies

CorSelectedVars.b <- names(FinalAlumni.b)
saveRDS(CorSelectedVars.b, file = "CorSelectedVars.b.rds")

We would like to thank Valay Maskey for his involvement in preparing and formatting this article as well as assistance with putting the AdaptiveLearnalytics package online. This research was approved by the Mass General Brigham institutional review board in 2020 (protocol 2020P000514). This article was submitted to the following pre-print servers: https://edarxiv.org/wtuv6/ & https://arxiv.org/abs/2108.07709

adaKNNminMatch.LOOCV <- crossValidate(FinalAlumni.a,
  "dtrain.x <- traindata %>% select(-PANCE);
   dtrain.y <- traindata %>% select(PANCE);
   dtest.x <- testdata %>% select(-PANCE);
   pred.a <- adaptiveMinMatchKNNregression(dtrain.x, dtrain.y, dtest.x, maxK = 20);
   testdata <- pred.a",
  nrow(FinalAlumni.a))

• In this and many other situations, the value of maxK does not appear to matter too much, as long as it is high enough. We suspect that this is because the mean of all matches will stabilize as the number of matches increases. In other words: the difference between the mean of the first 20 matches and the mean of the first 19 matches is likely smaller than the difference between the mean of the first 11 matches and the mean of the first 10 matches.

The same call is repeated for preparation B:

adaKNNminMatch.LOOCV <- crossValidate(FinalAlumni.b,
  "dtrain.x <- traindata %>% select(-PANCE);
   dtrain.y <- traindata %>% select(PANCE);
   dtest.x <- testdata %>% select(-PANCE);
   pred.b <- adaptiveMinMatchKNNregression(dtrain.x, dtrain.y, dtest.x, maxK = 20);
   testdata <- pred.b",
  nrow(FinalAlumni.b))

• The same note about maxK applies to this preparation B run as well.
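Finally, a minimal sketch of the in-practice step described in the article: training on all alumni and scoring the current cohort. The object current.b and the structure of the value returned by adaptiveMinMatchKNNregression() are assumptions made for illustration; consult the AdaptiveLearnalytics package documentation for the actual interface.

# Train on all alumni (preparation B objects prepared above) and predict the
# current cohort; current.b is an assumed name for the current-cohort dataframe
dtest.x <- current.b %>% select(-PANCE)
pred.current <- adaptiveMinMatchKNNregression(dtrain.x, dtrain.y, dtest.x, maxK = 20)

# Band the predicted scores into the three groups used throughout the article
# (the name of the prediction column in the returned object is assumed):
# cut(pred.current$prediction, breaks = c(-Inf, 350, 375, Inf),
#     labels = c("<350", "350-375", ">375"))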