Learning Patterns of University Student Retention

Ashutosh Nandeshwar (a,∗), Tim Menzies (b), Adam Nelson (c)

(a) Kent State University, 126 Lowry Hall, Kent, OH 44242. Phone: 330-672-82222
(b) West Virginia University, 841a Engineering Sciences Building, Morgantown, WV 26505
(c) West Virginia University, Engineering Sciences Building, Morgantown, WV 26505

∗Corresponding author. Email addresses: anandesh@kent.edu (Ashutosh Nandeshwar), tim@menzies.us (Tim Menzies), rabituckman@gmail.com (Adam Nelson). URLs: www.nandeshwar.info (Ashutosh Nandeshwar), www.menzies.us (Tim Menzies)

Abstract

Learning predictors for student retention is very difficult. After reviewing the literature, it is evident that there is considerable room for improvement in the current state of the art. As shown in this paper, improvements are possible if we (a) explore a wide range of learning methods; (b) take care when selecting attributes; (c) assess the efficacy of the learned theory not just by its median performance, but also by the variance in that performance; (d) study the delta of student factors between those who leave and those who are retained. Using these techniques, for the goal of predicting if students will remain for the first three years of an undergraduate degree, the following factors were found to be informative: family background, the family's socio-economic status, high school GPA, and test scores.

Keywords: data mining, student retention, predictive modeling, financial aid

1. Introduction

This article uses data mining to find patterns of student retention at American universities. Such an analysis is urgently required. In our work, we have seen a disconnect between accepted best practices and the data available to support those practices:

• Based on our discussions with university administrators, we assert that there is much informal agreement on the factors that influence retention (for the most part, the financial status of the student is considered to be the most important factor after the student's high school GPA).

• However, as shown below, when we look at recent experiments with student records, we find little clear support for that informal belief. In fact, we know of many universities that try to improve retention with a wide range of programs such as:

– attracting students with high performance indicators (such as their test scores)
– developing student success programs (such as first-year experience)
– or encouraging tenured faculty to teach undergraduates

Given the large levels of public support allocated to universities, it is important that we check the validity of these informal intuitions as well as the utility of the various retention programs such as those conducted at Kent State. This article applies data mining methods to the problem of studying student retention. Our general conclusions will be:

• It is possible to find patterns of student retention using data mining;

• Previous data mining studies on these student records can be greatly improved using discretization, attribute selection, and cross-validation over various algorithms.

More specifically, we will show that data mining can uncover a rich level of detail about particular universities. For example, while mining data from Kent State, we found:

• A small and specific population of students at high risk of dropping out of university.
• That the above programs (using tenured faculty for lecturing and focusing on student performance data) are far less important than the financial status of a student.

Hence, we would recommend:

• Focusing more resources on the high-risk group of students, in order to improve their chances of completing a university degree.

• Discontinuing the retention programs that primarily focus on student performance indicators or that advocate using tenured faculty for lecturing.

While these conclusions are specific to Kent State, the method for finding them is quite general and could be applied at other universities in order to find their most specific and most important student retention patterns. We welcome contacts from other researchers who wish to repeat our analysis on their local data.

2. Literature Review

It is no news that higher education institutions are facing the problem of student retention, which affects graduation rates as well. Colleges with a higher freshmen retention rate tend to have higher graduation rates within four years. The average national retention rate is close to 55%, and in some colleges fewer than 20% of the incoming student cohort graduates (Druzdzel & Glymour, 1994); approximately 50% of students entering an engineering program leave before graduation (Scalise et al., 2000). Tinto (1982) reported national dropout rates and BA degree completion rates for the past 100 years to be constant at 45 and 52 percent respectively, except for the World War II period (see Figure 1 for the completion rates from 1880 to 1980). Tillman & Burns (2000) at Valdosta State University (VSU) projected lost revenue per 10 students who do not persist past their first semester to be $326,811. Although the gap between private institutions and public institutions in terms of first-year students returning for the second year is closing, the retention rates have been constant for both types of institutions (ACT, 2007; see Figure 2). The National Center for Public Policy and Higher Education (NCPPHE) reported the U.S. average retention rate for the year 2002 to be 73.6% (NCPPHE, 2007). This problem is not limited to U.S. institutions; it also affects institutions in other countries such as the U.K. and Belgium. The U.K. national average freshmen retention rate for the year 1996 was 75% (Lau, 2003), and Vandamme (2007) found that 60% of the first-generation first-year students in Belgium fail or drop out.

[Figure 1 about here.]

[Figure 2 about here.]

Various researchers have studied this problem extensively, using theoretical models (Tinto, 1975, 1988; Spady, 1970, 1971; Bean, 1980), traditional models (Terenzini & Pascarella, 1980; Pascarella & Terenzini, 1979, 1980), and data mining techniques (Druzdzel & Glymour, 1994; Sanjeev & Zytkow, 1995; Massa & Puliafito, 1999; Stewart & Levin, 2001; Veitch, 2004; Barker et al., 2004; Salazar et al., 2004; Superby et al., 2006; Sujitparapitaya, 2006; Herzog, 2006; Atwell et al., 2006; Yu et al., 2007; DeLong et al., 2007). As shown below, we can improve those prior results by augmenting standard data mining with discretization, attribute selection, and cross-validation over various algorithms. As documented in Adam & Gaither (2005), the literature on retention in higher education is extensive, and although various researchers have tested theoretical models and noted attributes critical to student retention, these theories need to be tested from time to time.
A new generation of data miners makes that testing easier, and can possibly find new theories or reject old theories using state-of-the-art learning algorithms. In this section, we focus on the literature relating to data mining and the student retention problem. The lesson of this section is that learning patterns of student retention is very difficult and, despite decades of effort, there is much room for improvement in the current state of the art.

2.1. Data Mining for Student Retention

Druzdzel & Glymour (1994) were among the first researchers to apply a knowledge discovery algorithm to the student retention problem. The authors applied TETRAD II, a causal discovery program developed at Carnegie Mellon University, to the U.S. News college ranking data to find the factors that influenced student retention, and they found that the main factor of retention was the average test score. Using linear regression, the authors found that test scores alone explained 50.5% of the variance in the freshmen retention rate. In addition, they concluded that other factors such as student-faculty ratio, faculty salary, and the university's educational expense per student were not causally (directly) related to student retention, and suggested that, to increase student retention, universities should increase their student selectivity.

Sanjeev & Zytkow (1995) used 49er, a pattern discovery process developed by Żytkow & Zembowicz (1993), to find patterns in the form of regularities in student databases related to retention and graduation. The authors found that academic performance in high school was the best predictor of persistence and better performance in college, and that the high school GPA was a better predictor than the ACT composite score. In addition, they found that no amount of financial aid influenced students to enroll for more terms.

Massa & Puliafito (1999) applied a Markov chain modeling technique to create predictive models for the student dropout problem. By tracking students for 15 years, the authors created state variables for the number of exams taken, the average marks obtained, and the continuation decision. Using data mining, Stewart & Levin (2001) studied the effects of student characteristics on persistence and success in an academic program at a community college. They found that the student's GPA, cumulative hours attempted, and cumulative hours completed were the significant predictors of persistence, and that young males were a high-risk group. Veitch (2004) used decision trees (CHAID) to study high school dropouts. Using 25-fold cross-validation, the overall misclassification rates of dropouts who were classified as non-dropouts were 15.79% and 10.36%. In this study, GPA was the most significant predictor of persistence.

Salazar et al. (2004) used clustering algorithms and C4.5 to study graduate student retention at the Industrial University of Santander, Colombia. The authors found that high marks in the national pre-university test predicted a good academic performance, and that younger students had higher probabilities of a good academic performance. Barker et al. (2004) used neural networks and Support Vector Machines (SVM) to study graduation rates; the first-year advising center (University College at the University of Oklahoma) collected data via a survey given to all incoming freshmen. It is worthwhile to note that Barker et al. (2004) excluded all the missing data from the study, which constituted approximately 31% of the total data.
The overall misclassification rate was approximately 33% for various dataset combinations. The authors used principal component analysis to reduce the number of variables from 56 to 14; however, they reported that the results using the reduced datasets were "much worse" than those using the complete datasets.

Superby et al. (2006) applied discriminant analysis, neural networks, random forests, and decision trees to survey data at a university in Belgium to classify new students into low-risk, medium-risk, and high-risk categories. The authors found that scholastic history and socio-family background were the most significant predictors of risk. The overall classification rates for decision trees, random forests, neural networks, and linear discriminant analysis were 40.63%, 51.78%, 51.88%, and 57.35% respectively.

Using the National Student Clearinghouse (NSC) data, Sujitparapitaya (2006) differentiated between stopout, retained, and transfer students. The overall classification rates for the validation sets using logistic regression, neural networks, and C5.0 were 80.7%, 84.4%, and 82.1% respectively. Herzog (2006) used American College Test (ACT) student profile section data, NSC data, and institutional student information system data to compare the results from decision trees, neural networks, and logistic regression in predicting retention and degree-completion time. The author substituted mean ACT scores for missing scores. Decision trees created using C5.0 performed the best, with an 85% correct classification rate for freshmen retention, an 83% correct classification rate for degree completion time (three years or less), and a 93% correct classification rate for degree completion time (six years or more) on the validation datasets.

Atwell et al. (2006) used the University of Central Florida's student demographic and survey data to study the retention problem with the help of data mining. In this study, the university retained approximately 82% of the freshmen from the study, and 285 variables were used to create the data mining models. The authors used a nearest-neighbor algorithm to impute values for the more than 60% of observations with missing values. Using decision trees with the entropy split criterion, the authors obtained a precision of 88% for the not-retained outcome using the test data, and the actual retention rate for this test data set was 82.61%.

Yu et al. (2007) studied data from Arizona State University using decision trees, and included variables such as demographics, pre-college academic performance indicators, current curriculum, and academic achievement. Some of the important predictor variables were accumulated earned hours, in-state residence, and on-campus living. To study the retention problem using data mining on admissions data, DeLong et al. (2007) applied various attribute evaluation methods, such as chi-squared, gain ratio, and information gain, to rank the attributes. In addition, the authors tested various classifiers, such as naïve Bayes, AdaBoost M1, BayesNet, decision trees, and rules, and noted that AdaBoost M1 with a Decision Stump classifier performed the best in terms of precision and recall; hence, they used this classifier for further experimentation. The authors balanced the class variable (retained and not retained) and obtained over 60% classification rates for both the retained and not-retained outcomes.
The authors concluded that the number of programs the student applied to at that specific institution and the student's order of program admit preference were the most significant predictors of retention.

Pittman (2008) compared various data mining techniques (artificial neural networks, logistic regression, Bayesian classifiers, and decision trees) applied to the student retention problem, and also used attribute evaluators to generate rankings of important attributes. The author concluded that logistic regression performed the best in terms of area under the ROC curve.

[Table 1 about here.]

2.2. Assessing the State of the Art

Table 1 lists the techniques used in the studied literature, where the cohort sizes were available, along with the reported performance measures. Clearly, there is much room for improvement in the current state of the art:

• It is a recommended data mining practice to divide the data into a train and a test set, learn on the train set, then assess the learned theory on the test set (Witten & Frank, 2005). Otherwise, if a theory is tested via the data used to build that theory, this test can over-estimate theory performance. For example, the Glynn et al. result of Table 1 seems impressive (an 83% accuracy on a data set with a 49.08% retention rate); however, that result should be repeated using some hold-out test set.

• All the regression studies from 1971 to 1999 report R² values under 0.6. This R² value is a measure of how well future outcomes are likely to be predicted by the model. The maximum value of R² is one, and R² values under 0.6 indicate very weak predictive abilities.

• The accuracy reports are very close to the ZeroR theoretical lower bound on performance. ZeroR is a baseline classifier that simply returns the majority class. For example, Herzog studied a data set with an 83.5% retention rate (see Table 1). ZeroR, applied to this data set, would be correct in 83.5% of cases. Therefore, the 85.4% accuracy of Herzog's data miners is very close to the ZeroR lower bound; i.e. the sophisticated analysis of that paper could be very nearly replicated using the dumbest of learners (ZeroR).

The last three results of Table 1 do not report their accuracies. However, these can be calculated in the following way. Let A, B, C, D be the true negatives, false negatives, false positives, and true positives respectively of a predictor that some student will attend some year of university. From Zhang & Zhang (2007) and Menzies et al. (2007), we say that (A, B, C, D) can be used in the following performance measures:

pd = recall = D/(B + D)    (1)
pf = false alarm = C/(A + C)    (2)
prec = precision = D/(C + D)    (3)
acc = accuracy = (A + D)/(A + B + C + D)    (4)
neg/pos = (A + C)/(B + D)    (5)

Note that all these performance measures assess subtly different aspects of the performance of a data miner:

• "Recall" measures how much of the target was found.

• The "false alarm" rate measures what fraction of non-targets triggered the learned theory.

• "Precision" comments on how many targets are found in the data selected by the theory.

• "Accuracy" comments on how many of the targets and non-targets were accurately labeled by the learned theory.

In an ideal result, we can obtain high recall, low false alarms, high precision, and high accuracies. However, as discussed by Zhang & Zhang (2007) and Menzies et al. (2007), these values are inter-related. Hence, the ideal result is not possible. These inter-relationships are shown below:

prec = D/(D + C) = 1/(1 + C/D) = 1/(1 + (neg/pos) · (pf/recall))  ⇒  pf = (pos/neg) · ((1 − prec)/prec) · recall    (6)

If a publication misses a particular performance measure, it is possible to use these equations to infer the missing value. For example:

D = recall · pos    (7)
C = pf · neg    (8)
A = C · (1/pf − 1)    (9)
acc = (A + D)/(neg + pos)    (10)
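To make these measures concrete, the following minimal Python sketch (our own illustration, not part of any of the studies above) computes Equations 1-4 from the raw counts and uses Equations 6-10 to infer a missing false alarm rate and accuracy from a reported precision and an assumed recall:

```python
# Minimal sketch: performance measures (Equations 1-4) and the
# inference of missing measures (Equations 6-10).
def measures(A, B, C, D):
    # A, B, C, D = true negatives, false negatives,
    # false positives, true positives.
    pd = D / (B + D)                     # recall, Equation 1
    pf = C / (A + C)                     # false alarm, Equation 2
    prec = D / (C + D)                   # precision, Equation 3
    acc = (A + D) / (A + B + C + D)      # accuracy, Equation 4
    return pd, pf, prec, acc

def infer_missing(prec, recall, neg, pos):
    # Invert Equation 6, then apply Equations 7-10.
    pf = (pos / neg) * ((1 - prec) / prec) * recall   # Equation 6
    D = recall * pos                                  # Equation 7
    C = pf * neg                                      # Equation 8
    A = C * (1 / pf - 1)                              # Equation 9
    acc = (A + D) / (neg + pos)                       # Equation 10
    return pf, acc

# Example: a reported precision of 88% for the (minority) not-retained
# class, an assumed recall of 80%, and a cohort that is 80% retained
# imply a false alarm rate of only about 3%.
print(infer_missing(prec=0.88, recall=0.80, neg=8000, pos=2000))
```

Applying such estimates to the precisions reported in the literature yields the commentary below.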
Using these equations, we can comment that the last three results of Table 1 can be significantly improved:

• In Atwell et al. (2006), the precision varied from 73% to 88%. Using our equations, we can estimate false alarm values (pf) ranging from 2% to 8% (assuming recall values of 65% to 90%). In our experience, it is very rare to achieve such very low false alarm rates, especially from noisy data relating to student retention. Hence, the Atwell et al. results are somewhat surprising.

• In DeLong et al. (2007), the precision varied from 57% to 60%. From our equations, we can estimate their false alarm rates in the range of 49% to 63% (assuming recall values of 65% to 90%). Such high false alarm rates are deprecated.

• In Pittman (2008), the reported precision varied from 44% to 63%. Our equations comment that some of these values are numerically unobtainable. For 0.78 ≤ acc ≤ 0.81, neg = 17,139 and pos = 21,136 − neg, the equations only solve for prec ≤ 50%. That is, half the precision values reported by Pittman need to be reviewed.

In summary, learning predictors for student retention is very difficult. Despite decades of work, there is considerable room for improvement in the methods used to find patterns in student retention. As we show below, such improvements are possible if we augment standard data miners with some extra pre-processors.

3. Data

Data used in this study were from a mid-size public university, and were extracted from the student information system on official census dates. These data consisted of all first-year freshmen's demographic, academic, and financial aid information (more than 100 attributes), as of the census reporting dates (two weeks after the start of the semester). As higher education administrators may wish to design effective policies at the time students begin their studies, it is important to note that our emphasis was on detecting patterns based only on first-term data, and, moreover, only on beginning-of-term data. We created three dependent variables: RET1, if the student returned after one year; RET2, if the student returned after two years; and RET3, if the student returned after three years. The overall distribution of these dependent variables is given in Table 2. For the studied time period, the overall first-year retention rate was 71.3%, the second-year persistence rate was 60.4%, and the third-year persistence rate was 54.8%.

[Table 2 about here.]

In the Integrated Postsecondary Education Data System (IPEDS), restricted to U.S.-only, degree-granting, doctoral-degree-offering, 4-year-and-above institutions (excluding the University of Phoenix-Online Campus) with a cohort size greater than 3,000, we found that the full-time freshmen retention rate ranged from 59% to 96%, and the cohort size ranged from 3,117 to 8,025. In this list of institutions, Kent State University ranked 38th in full-time retention percentage and 26th in cohort size (Department of Education, 2010). Thus, the Kent State data are representative of other similar-size universities, and the data mining approach could be generalized to other universities.
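Returning to the dependent variables defined above, the following is a hedged sketch of how RET1, RET2, and RET3 can be derived from enrollment flags taken at successive census dates; the data frame and column names are hypothetical, not the university's actual schema:

```python
# Hypothetical sketch: deriving RET1/RET2/RET3 from census snapshots.
import pandas as pd

cohort = pd.DataFrame({
    "student_id":     [101, 102, 103],
    "enrolled_year2": [True, False, True],   # present at year-2 census
    "enrolled_year3": [True, False, False],  # present at year-3 census
    "enrolled_year4": [True, False, False],  # present at year-4 census
})

# RETn = Y if the student was still enrolled n years after entry.
for n, flag in enumerate(["enrolled_year2", "enrolled_year3",
                          "enrolled_year4"], start=1):
    cohort[f"RET{n}"] = cohort[flag].map({True: "Y", False: "N"})

print(cohort[["student_id", "RET1", "RET2", "RET3"]])
```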
3.1. Attribute Groups

The data mining methods discussed below used attribute selection to prune measurements that are poor predictors for the target class. Therefore, our data miners can be used to assess various hypotheses relating to student retention:

• If a hypothesis claims that attributes X, Y, Z are important...

• ... and if our learners prune those attributes ...

• ... then that is evidence against that hypothesis.

Accordingly, before applying our data miners, we take care to divide our attributes according to the active hypothesis that they support:

H1: The financial aid hypothesis. Sanjeev & Zytkow (1995) found that no amount of financial aid influenced students to enroll for more terms, whereas Herzog (2005) found that upper-income students had reduced dropout odds compared to those from middle and lower incomes. According to John (2000), "the research literature remains ambiguous" regarding the influence of financial aid on recruitment and retention.

H2: The academic performance hypothesis. Although there is no doubt that high school GPA and high school preparedness have a significant impact on persistence, researchers have often questioned the effects of standardized college entrance examinations (ACT/SAT). Waugh et al. (1994) found that SAT and ACT scores had no relationship with retention, whereas Murtaugh et al. (1999) found that SAT scores had some predictive value, although inferior compared to high school GPA. DesJardins et al. (2002) noted that a high GPA lowered the risk of dropout, but the effect diminished over time, and that financial aid was an insignificant factor for increasing graduation; however, it did indeed reduce student stopout. In their comprehensive literature review, Lotkowski et al. (2004) found that, among the academic factors, high school GPA had the strongest relationship with college retention, while ACT assessment scores had a moderate impact.

H3: The faculty tenure and experience hypothesis. Ehrenberg & Zhang (2005) found that for every 10 percentage point increase in the percentage of part-time faculty and non-tenure-track full-time faculty, there was a 3-5 percentage point reduction in the institution's graduation rate. Jacoby (2006) found similar results at community colleges: an increase in the ratio of part-time faculty had a negative impact on graduation rates.

In the sequel, we will return to these hypotheses to comment on which were most useful for predicting student retention. Table 3 lists the attributes that we grouped together under each hypothesis.

[Table 3 about here.]

4. Building the Experiment

In Section 2.2, we asserted that a good data miner should do better than the simplistic ZeroR learner. Table 2 tells us that this lower bound is:

• For first year retention: 71.3%.

• For second year retention: 60.4%.

• For third year retention: 54.8%.

As discussed below, we will be able to do much better than some, but not all, of these targets. This was achieved by:

• Removing spurious attributes using feature subset selection;

• Exploring a large range of classifiers;

• Assessing the learned theories by their variance, as well as their median performance;

• Studying the delta of student factors between those who leave and those who are retained.

4.1. Feature Subset Selection

Table 3 shows a sample of the 103 attributes used in this study. Our pre-experimental suspicion was that some of the attributes were "noisy"; i.e. contained signals not related to the prediction target, retention. Therefore, before we learn a theory, we first explored attribute selection. Note that the number of attributes to select is crucial in the analysis of the data, because it allows us to comment on the hypotheses shown in the last section. If removal of attributes from a hypothesis does not change the performance of the prediction, then that hypothesis is spurious.

In this experiment, we ranked the 103 attributes from most informative to least informative. We then built theories using the top n ∈ {5, 10, ..., 100, 103} ranked attributes. Attributes were then discarded if adding them in did not improve the performance of our retention predictors. A sketch of this loop is shown below.
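The sketch below is our own illustration, not the exact experiment harness: it uses scikit-learn's mutual information scorer as a stand-in for the information gain ranker described in the next section, and `evaluate` stands in for any of the cross-validated learners of Section 4.2.

```python
# Sketch of the top-n attribute selection loop (assumes X and y are
# numerically encoded NumPy arrays; evaluate() is any cross-validated
# learner returning a performance score).
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def top_n_experiment(X, y, evaluate):
    scores = mutual_info_classif(X, y)    # information-gain-style scores
    order = np.argsort(scores)[::-1]      # most to least informative
    results = {}
    for n in list(range(5, 101, 5)) + [X.shape[1]]:   # n = 5, 10, ..., 103
        results[n] = evaluate(X[:, order[:n]], y)
    # Keep the smallest n whose performance matches the best observed;
    # attributes outside that subset are discarded.
    return results
```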
The attributes were ranked using one of four methods: CFS, Information Gain, chi-squared, and One-R.

Correlation-based feature selection (CFS) constructs a matrix of feature-to-feature and feature-to-class correlations (Hall, 2000). CFS uses a best-first search, expanding the best subsets until no improvement is made, in which case the search falls back to the unexpanded subset having the next best evaluation, until a subset expansion limit is met.

Information Gain uses an information theory concept called entropy. Entropy measures the amount of uncertainty, or randomness, that is associated with a random variable. Thus, high entropy can be seen as a lack of purity in the data. Information gain, as described in Mitchell (1997), is the expected reduction of the entropy measure that occurs when splitting the examples in the data using a particular attribute. Therefore, an attribute that has a high purity (high information gain) is better at describing the data than one that has a low purity. The resulting attributes are then ranked by sorting their information gain scores in descending order.

The chi-squared statistic is used in statistical tests to determine how the distributions of variables differ from one another (Moore & Notz, 2006). Note that these variables must be categorical in nature. Thus, the chi-squared statistic can evaluate an attribute's worth by calculating the value of this statistic with respect to a class. Attributes can then be ranked based on this statistic.

The One-R classifier, described below, can be used to deliver top-ranking attributes. One-R constructs and scores rules using one attribute. Feature selectors using One-R sort the attributes based on these scores.

4.2. Classifiers

In data mining, classifiers are used to learn connections between the independent features and the dependent feature (called the class). Once these patterns are learned, we can predict outcomes in new data by reflecting on data that has already been examined. This study tried six different classifiers: One-R, C4.5, ADTrees, naive Bayes, Bayesian networks, and radial basis function networks. These are some of the well-known and standard classifiers in the machine learning field, except for ADTrees.

One-R, described in Holte (1993), builds rules from the data by iteratively examining each value of an attribute and counting the frequency of each class for that attribute-value pair. Each attribute-value pair is then assigned the most frequently occurring class. Error rates of each of the rules are then calculated, and the best rules are ranked based on the lowest error rates. A minimal sketch is shown below.
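The following is our own minimal reconstruction of One-R over discretized data, written from the description in Holte (1993) rather than taken from any reference implementation:

```python
# Sketch of One-R: for each attribute, map each of its values to the
# most frequent class, then keep the attribute with the fewest errors.
from collections import Counter, defaultdict

def one_r(rows, class_index):
    best = None
    for a in range(len(rows[0])):
        if a == class_index:
            continue
        counts = defaultdict(Counter)     # value -> class frequencies
        for row in rows:
            counts[row[a]][row[class_index]] += 1
        # The rule assigns each attribute value its majority class.
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - c[rule[v]]
                     for v, c in counts.items())
        if best is None or errors < best[0]:
            best = (errors, a, rule)
    return best   # (error count, attribute index, value -> class rule)
```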
A radial basis function network (RBFN) is an artificial neural network (ANN) that utilizes a radial basis function as an activation function (Bors, 2001). An ANN's activation function is used in order to offer non-linearity to the network. This is important for multi-layer networks containing many hidden layers, because their advantage lies in their ability to learn from non-linearly separable examples.

C4.5 (Quinlan, 1993) is an extension of the ID3 (Quinlan, 1986) algorithm. A decision tree (shown in Figure 3) is constructed by first determining the best attribute to make the root node of the tree (Mitchell, 1997). ID3 decides this root attribute by using the one that best classifies the training examples based upon the attribute's information gain (described above) (Quinlan, 1986). Then, for each value of the attribute representing any node in the tree, the algorithm recursively builds child nodes based on how well another attribute from the data describes that specific branch of its parent node. The learning stops when the tree perfectly classifies all training examples, or when all attributes have been used. C4.5 extends ID3 by making several improvements, such as the ability to operate on both continuous and discrete attributes, to handle training data that contains missing values for a given attribute, and to employ pruning techniques on the resulting tree.

[Figure 3 about here.]

ADTrees are decision trees that contain both decision nodes and prediction nodes (Freund & Mason, 1999). Decision nodes specify a condition, while prediction nodes contain only a number. Thus, as an example in the data follows paths in the ADTree, it only traverses branches whose decision nodes are true. The example is then classified by summing all prediction nodes that are encountered in this traversal. ADTrees thus differ from binary classification trees, such as those of C4.5, which traverse only a single path down the tree.

[Figure 4 about here.]

A naive Bayes classifier uses Bayes' theorem to classify training data. Bayes' theorem, as shown in Equation 11, determines the probability P of an event H occurring given an amount of evidence E. This classifier assumes feature independence; the algorithm examines features independently to contribute to probabilities, as opposed to assuming that features depend on other features. Surprisingly, even though feature independence is an integral part of the classifier, it often outperforms many other learners (Rish, 2001; Domingos & Pazzani, 1997).

Pr(H|E) = Pr(E|H) · Pr(H) / Pr(E)    (11)

Bayesian networks, illustrated in Figure 4, are graphical models that use a directed acyclic graph (DAG) to represent probabilistic relationships between variables. As stated in Heckerman (1996), Bayesian networks have four important elements to offer:

1. Incomplete data sets can be handled well by Bayesian networks. Because the networks encode a correlation between input variables, if an input is not observed, it will not necessarily produce inaccurate predictions, as would other methods.

2. Causal relationships can be learned about via Bayesian networks. For instance, we can find whether a certain action taken would produce a specific result and to what degree.

3. Bayesian networks promote the amalgamation of data and domain knowledge by allowing for a straightforward encoding of causal prior knowledge, as well as the ability to encode the strength of causal relationships.

4. Bayesian networks avoid overfitting of data, as "smoothing" can be used in a way such that all available data can be used for training.

4.3. Cross-Validation

The value of different attributes can be assessed using Equations 1 to 4. If we use multiple hold-out test sets, we can also discover the variance in these performance figures. In this experiment, we performed a 5 × 5 cross-validation; i.e. we partitioned the data five times into a testing set consisting of 1/5th of the data and a training set of the remaining 4/5ths. After the five rounds, we recorded the median values of the recall and false alarm rates. A sketch of this procedure is shown below.
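The sketch below is a hedged illustration of that procedure, assuming a scikit-learn-style classifier, NumPy arrays X and y, and the retained class labeled "Y"; the actual experiments used the learners listed above rather than this wrapper:

```python
# Sketch of 5 x 5 cross-validation collecting pd (recall) and pf.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def five_by_five(clf, X, y, pos_label="Y"):
    pds, pfs = [], []
    for seed in range(5):                              # five repetitions...
        folds = StratifiedKFold(5, shuffle=True, random_state=seed)
        for train, test in folds.split(X, y):          # ...of 5-fold CV
            clf.fit(X[train], y[train])
            pred = clf.predict(X[test])
            is_pos = (y[test] == pos_label)
            pds.append(np.mean(pred[is_pos] == pos_label))    # recall
            pfs.append(np.mean(pred[~is_pos] == pos_label))   # false alarm
    # Median performance, plus variance for the reliability check below.
    return np.median(pds), np.median(pfs), np.var(pds)
```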
4.4. Contrast Set Learning

After determining the subset of the attributes that best predict student retention, we conducted a contrast set study. Contrast set learners like TAR3 (Menzies & Hu, 2007) seek the attribute ranges that differ most across outcomes. One way to read these contrast sets is as treatments which promise that, if action X were applied to a domain, it would favor one outcome over another. In our case, we used TAR3 in two ways:

• Firstly, we used TAR3 to find which treatments most select for retention;

• Secondly, we ran TAR3 in the opposite direction to find the treatments that most select for students leaving university.

In the first case, TAR3 is being used to find the actions that most encourage retention. In the second case, TAR3 is being used to find the worst possible actions, those that most increase the probability of a student leaving.

5. Analysis of Experimental Results

5.1. Evaluation Metrics

The evaluation metrics used in this experiment are standard data mining performance measures of a method. They are:

• Probability of detection (PD);

• Probability of false alarm (PF);

• And the variance in PD and PF seen over our cross-validation study.

Variance in these values provides insight into how much reliability a classifier supports on the data. For example, if a method's PD values range from very low to very high, we can conclude that the particular method is inconsistent in its probabilities of detection. For our studies, we rejected anything with a variance greater than ±25%.

The above statistics were collected over 1,800 experiments, which were repeated 20 times (to check for conclusion stability). In all, we conducted 5 × 5 × 4 × 6 × 3 × 20 = 36,000 experiments; i.e. 5 × 5 cross-validation using four feature subset selectors and six different learners, for the three data sets of Section 3 (recall from Section 3 that those three data sets contained data about first, second, and third year retention). This was repeated 20 times using the top n ∈ {5, 10, 15, ..., 100, 103} attributes as found by the feature selector.

5.2. First Results

After rejecting all results with (1) a PD lower than the ZeroR limit; (2) a PD variance greater than ±25%; and (3) a PF higher than 25%, we found that we had no predictors for Year 1 or Year 2 retention. This is the first major finding of this research: it is very difficult to predict lower year retention. Note that this result is consistent with the prior results discussed above in our literature review.

For the rest of this study, we will focus only on third year retention. The case for focusing on third year retention is quite clear:

• If the goal is to provide a complete university education for a student, then predicting survival until the second year is less interesting than predicting survival until the third year.

• Third year retention implies second and first year retention.

5.3. Ranking with the Mann-Whitney Test

After pruning results with low PD, high PF, or high PD variance, we ranked the remaining results via a Mann-Whitney test (95% confidence). We determined the ranks by counting how many times a combination won compared to the other combinations. The method that won the most times was then given the highest rank; a sketch of this ranking is shown below.
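The following is a sketch of that win counting, using SciPy's Mann-Whitney U test; the dictionary of per-method PD samples is assumed to come from the cross-validation sketched above:

```python
# Sketch: rank methods by pairwise Mann-Whitney wins at 95% confidence.
from itertools import combinations
import numpy as np
from scipy.stats import mannwhitneyu

def rank_by_wins(pd_samples):          # {method name: [pd values, ...]}
    wins = {m: 0 for m in pd_samples}
    for a, b in combinations(pd_samples, 2):
        _, p = mannwhitneyu(pd_samples[a], pd_samples[b],
                            alternative="two-sided")
        if p < 0.05:                   # statistically distinguishable
            better = max((a, b), key=lambda m: np.median(pd_samples[m]))
            wins[better] += 1
    return sorted(wins.items(), key=lambda kv: -kv[1])   # most wins first
```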
Table 4 shows the top ten ranking combinations based on the PD performance measure. Note: we gave identical ranks to those treatments whose win count was equal in magnitude.

[Table 4 about here.]

Since similar results were achieved using 30 or 50 attributes, we applied Occam's razor and focused on the 30 attributes found to be best for oneR/bnet. For these 30 attributes, we studied all their ranges. Table 5 shows the ranges which, in isolation, select for retention at a probability greater than the ZeroR limit (for third year retention, that ZeroR limit is 55%). In terms of assessing the different hypotheses, the third column of Table 5 is most informative:

• The ranges shown at the top of the table are most predictive for third year retention. Note the dominance of the "Financial Aid" attributes of Table 3.

• Attributes related to student "Performance" are rarer.

• None of the attribute ranges include the "Faculty Type and Experience" attributes of Table 3.

From this analysis, we made two tentative conclusions:

• Using experienced faculty-level instructors is not predictive for third year retention.

• Issues relating to financial aid dominate over student performance.

5.4. Ranking with Contrast Set Learning

The counter-case to this conclusion might be that Table 5 only discusses the effect of attribute ranges in isolation. It is possible that combinations of factors might lead to different conclusions. The TAR3 treatment learner was used to test this possibility. We let TAR3 build rules of up to size 10 (i.e. combinations of up to ten attribute ranges) from the 30 attributes selected by the best learning combination of Table 4. It turned out that this maximum size of 10 ranges was much larger than necessary: TAR3 never found combinations larger than three ranges.

[Table 5 about here.]

6. Results

Table 5 lists the rankings of all attribute ranges which, in isolation, predict for third year retention at a probability higher than the ZeroR limit (55%), and are supported by a good number of records. The top six attributes affecting third-year retention were mostly from the financial aid hypothesis: student's wages, parents' adjusted gross income, student's adjusted gross income, mother's income, father's income, and high school percentile. Of those students who reported their wages, students who made between $7,850 and $9,958 had a 79% retention rate. Similar rules were found for parents' income and adjusted gross income. This means that students with stronger financial support stay in college more often than students with weaker financial support.

After these top six attributes, a high school percentile of 81 or greater was an important range, with 69% of such students returning after three years. Some other "performance" attributes were ACT scores and ranks. This supports the argument that scores do have some power to predict student retention.

The TAR3 results, given in Figure 5, produced simple theories (treatments) that combined ranges of various attributes to maximize student retention. For example, student retention was very high for students with an AGI in the range from $7,000 to $724,724 whose father's wages were in the range from $56,289 to $999,999. One more interesting theory that predicted high retention was where the father's education level was 3 (college) and the student's rank amongst the freshmen cohort was between 66.3 and 98.4.
Treatments that predicted student drop-out were based on the total number of classes in which the student was enrolled, on English 10000 (an introductory college writing and supplemental instruction class), and on on-campus living. Students who took fewer than five classes, enrolled in the English 10000 class, and did not live on campus were at high risk of dropping out. The chart at the bottom of Figure 5 shows the retention percentage for each treatment. For example, students enrolled in English 10000 had a 40% retention rate in their third year.

The key findings were:

• Students' and parents' income capacities and levels affected student retention. Third-year retention was higher for students with high income than for students with low income. According to treatment 1, approximately 82% of students who had at least $7,000 AGI and whose fathers' income was at least $56,289 returned after three years. Similarly, according to treatment 5, approximately 79% of students who made at least $5,383 and whose parents' AGI was at least $84,744 returned after three years.

• Students with better high school performance amongst their peers had higher chances of retention. According to treatment 2, approximately 81% of students who had at least $7,000 AGI and a high school percentile of 72 or better returned after three years. Approximately 79% of students who had at least a 3.34 high school GPA and whose parents had an AGI of at least $84,744 stayed after three years, as given in treatment 4.

• ACT scores, the rank of these scores amongst peers, and COMPASS scores affected student retention. Students with higher scores and ranks had higher chances of retention. According to treatment 3, approximately 80% of students who had at least $7,000 AGI and an ACT math score of 21 or better returned after three years. Similarly, 77% of students who had at least 23 on the ACT composite (or SAT equivalent) and an income of at least $5,383 and less than $561,500 returned after three years, as given in treatment 6.

• Parents' education level had a positive effect on student retention. Students whose parents did not attend college had a lower retention rate compared to students whose parents did attend college. As given in treatments 7 and 10, a student was highly likely (77%) to return after three years: (7) if the mother of that student attended college, the student had an ACT composite score of 22 or better, and the parents' AGI was at least $84,744; (10) if the father of that student attended college and the student's percentile rank amongst the other freshmen in the cohort was at least 66.3.

• Enrolling in fewer classes (less than five), enrolling in English 10000 (an introductory college writing class), and living off campus had a negative effect on student retention, as given in treatments 11, 12, and 13. It is important to note that enrolling in that English course is not itself a predictor of non-retention; rather, the sample of students that attended this class was at high risk of dropping out. Given funding for further investigation, we would focus more data collection on this high-risk group.

[Figure 5 about here.]

6.1. Strategic Actions

This study provides insights into the student retention domain using beginning-of-term data.
These insights can be used to design effective policies and strategic actions, such as:

• Most of the predictive attributes were related to the socio-economic levels and capacities of students and their parents. These cannot be controlled when admitting students, but better support programs and calculated financial-aid packaging can be created for students with lower economic capacities.

• First-year students should be encouraged to live on campus by providing some incentives, as on-campus students have higher chances of retention.

• Special guidance and supplemental instruction in writing and reading should be provided to first-generation students. Parents of first-generation students have considerably lower incomes than the parents of non-first-generation students, and according to the results of this study, the income of parents is a critical factor in student retention even when students have similar academic performance.

• Students are placed in supplemental instruction classes, such as English 10000, based on their COMPASS and ACT scores. As these students' scores indicated a lack of academic preparedness in some areas, academic advisers correctly place students in such classes; however, if the students fail or perform poorly in such classes, it leaves a lasting impression and sets the students up for future drop-out, even after three years. Therefore, it is paramount that advisers not only place students in supplemental instruction classes, but also ensure the success of students in these classes and improve the skills that students lack. Out of all the classes considered in this study, English seemed to have the greatest impact. Intuitive as it may be, to succeed in college, students need good writing and reading skills.

7. Conclusion

Although our techniques could not predict first or second year retention with significantly higher accuracy than the baseline, for third-year retention they obtained a probability of detection approximately 15% higher for the class value of Y and 20% higher for the class value of N than the baseline percentages, based on first-year beginning-of-term data. In the studied literature, we have not found any studies with such a significant improvement over the baseline for third-year retention. In addition, if policies are designed to improve the third-year retention rate (using this predictive model), they will improve not only the first and second year retention rates, but also the six-year graduation rates.

For the studied institution, family background and the family's socio-economic status are critical for students' third-year persistence. Using feature subset selection methods, we found that the attributes from the "financial aid" hypothesis were selected most often as predictors of retention, and although attributes from the "performance" hypothesis were selected, their predictability, in isolation, was lower than that of the attributes from the "financial aid" hypothesis. None of the attributes from the "faculty tenure and experience" hypothesis were selected by the feature subset selectors.

These results could very well be true only for the studied institution; however, if the approach detailed in this study is followed, other institutions can find their own top performing classifier and important attributes. We recommend: (a) data discretization; (b) feature subset selection with cross-validation, evaluating the performance over various learners; and (c) treatment learners, such as TAR3, to find succinct strategic actions in complex data.
We welcome the opportunity to study data from other institutions and are willing to share the experiment platform used in this study.

References

ACT (2007). ACT National Collegiate Retention and Persistence to Degree Rates. http://www.act.org/research/policymakers/reports/retain.html.

Adam, A. J., & Gaither, G. H. (2005). Retention in higher education: A selective resource guide. New Directions for Institutional Research, 2005, 107–122.

Atwell, R. H., Ding, W., Ehasz, M., Johnson, S., & Wang, M. (2006). Using data mining techniques to predict student development and retention. In Proceedings of the National Symposium on Student Retention.

Barker, K., Trafalis, T., & Rhoads, T. R. (2004). Learning from student data. Systems and Information Engineering Design Symposium (pp. 79–86).

Bean, J. P. (1980). Dropouts and turnover: The synthesis and test of a causal model of student attrition. Research in Higher Education, 12, 155–187.

Bors, A. (2001). Introduction of the radial basis function (RBF) networks. In Online Symposium for Electronics Engineers (pp. 1–7), volume 1.

Bresciani, M. J., & Carson, L. (2002). A study of undergraduate persistence by unmet need and percentage of gift aid. NASPA Journal, 40, 104–123.

DeLong, C., Radcliffe, P. M., & Gorny, L. S. (2007). Recruiting for retention: Using data mining and machine learning to leverage the admissions process for improved freshman retention. In Proceedings of the National Symposium on Student Retention.

Department of Education (2010). Integrated Postsecondary Education Data System (IPEDS). http://nces.ed.gov/ipeds/datacenter/.

DesJardins, S. L., Ahlburg, D. A., & McCall, B. P. (2002). A temporal investigation of factors related to timely degree completion. The Journal of Higher Education, 73, 555–581.

Dey, E. L., & Astin, A. W. (1993). Statistical alternatives for studying college student retention: A comparative analysis of logit, probit, and linear regression. Research in Higher Education, 34, 569–581.

Domingos, P., & Pazzani, M. J. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103–130.

Druzdzel, M. J., & Glymour, C. (1994). Application of the TETRAD II program to the study of student retention in U.S. colleges. In Working Notes of the AAAI-94 Workshop on Knowledge Discovery in Databases (KDD-94) (pp. 419–430). Seattle, WA.

Ehrenberg, R., & Zhang, L. (2005). Do tenured and tenure-track faculty matter? Journal of Human Resources, 40, 647.

Freund, Y., & Mason, L. (1999). The alternating decision tree learning algorithm. In Machine Learning: Proceedings of the Sixteenth International Conference (pp. 124–133). Morgan Kaufmann.

Glynn, J., Sauer, P., & Miller, T. (2003). Signaling student retention with prematriculation data. NASPA Journal, 41, 41–67.

Hall, M. (2000). Correlation-based feature selection for discrete and numeric class machine learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML-2000) (p. 359). Morgan Kaufmann.

Heckerman, D. (1996). A Tutorial on Learning With Bayesian Networks. Technical report, Learning in Graphical Models.

Herzog, S. (2005). Measuring determinants of student return vs. dropout/stopout vs. transfer: A first-to-second year analysis of new freshmen. Research in Higher Education, 46, 883–928.

Herzog, S. (2006). Estimating student retention and degree-completion time: Decision trees and neural networks vis-à-vis regression.
New Directions for Institutional Research, 131.

Holte, R. (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11, 63.

Jacoby, D. (2006). Effects of part-time faculty employment on community college graduation rates. Journal of Higher Education, 77, 1081–1103.

John, E. P. (2000). The impact of student aid on recruitment and retention: What the research indicates. New Directions for Student Services (pp. 61–76).

Lau, L. K. (2003). Institutional factors affecting student retention. Education, 124, 126–137.

Lotkowski, V., Robbins, S., & Noeth, R. (2004). The role of academic and non-academic factors in improving college retention. ACT Office of Policy Research.

Massa, S., & Puliafito, P. (1999). An application of data mining to the problem of the university students' dropout using Markov chains. In Principles of Data Mining and Knowledge Discovery, Third European Conference, PKDD'99 (pp. 51–60). Prague, Czech Republic.

Menzies, T., Dekhtyar, A., Distefano, J., & Greenwald, J. (2007). Problems with precision. IEEE Transactions on Software Engineering. http://menzies.us/pdf/07precision.pdf.

Menzies, T., & Hu, Y. (2007). Just enough learning (of association rules): The TAR2 treatment learner. Artificial Intelligence Review. Available from http://menzies.us/pdf/07tar2.pdf.

Mitchell, T. M. (1997). Machine Learning. New York: McGraw-Hill.

Moore, D., & Notz, W. (2006). Statistics: Concepts and Controversies. WH Freeman & Co.

Murtaugh, P. A., Burns, L. D., & Schuster, J. (1999). Predicting the retention of university students. Research in Higher Education, 40, 355–371.

NCPPHE (2007). Retention rates - first-time college freshmen returning their second year (ACT).

Pascarella, E. T., & Terenzini, P. T. (1979). Interaction effects in Spady and Tinto's conceptual models of college attrition. Sociology of Education, 52, 197–210.

Pascarella, E. T., & Terenzini, P. T. (1980). Predicting freshman persistence and voluntary dropout decisions from a theoretical model. The Journal of Higher Education, 51, 60–75.

Pittman, K. (2008). Comparison of data mining techniques used to predict student retention. Ph.D. thesis, Nova Southeastern University.

Quinlan, J. R. (1986). Induction of decision trees. (1st ed.).

Quinlan, J. R. (1993). C4.5: Programs for Machine Learning (Morgan Kaufmann Series in Machine Learning). (1st ed.). Morgan Kaufmann.

Rish, I. (2001). An empirical study of the naive Bayes classifier. In IJCAI-01 Workshop on Empirical Methods in AI. http://www.intellektik.informatik.tu-darmstadt.de/ tom/IJCAI01/Rish.pdf.

Salazar, A., Gosalbez, J., Bosch, I., Miralles, R., & Vergara, L. (2004). A case study of knowledge discovery on academic achievement, student desertion and student retention. In Information Technology: Research and Education (ITRE 2004), 2nd International Conference on (pp. 150–154).

Sanjeev, A., & Zytkow, J. (1995). Discovering enrolment knowledge in university databases. In First International Conference on Knowledge Discovery and Data Mining (pp. 246–251). Montreal, Que., Canada.

Scalise, A., Besterfield-Sacre, M., Shuman, L., & Wolfe, H. (2000). First term probation: Models for identifying high risk students. In 30th Annual Frontiers in Education Conference (pp. F1F/11–16, vol. 1). Kansas City, MO, USA: Stripes Publishing.

Spady, W. G. (1970). Dropouts from higher education: An interdisciplinary review and synthesis. Interchange, 1, 64–85.

Spady, W. G. (1971). Dropouts from higher education: Toward an empirical model.
Interchange, 2, 38–62.

Stage, F. (1989). Motivation, academic and social integration, and the early dropout. American Educational Research Journal, 26, 385–402.

Stewart, D. L., & Levin, B. H. (2001). A model to marry recruitment and retention: A case study of prototype development in the new administration of justice program at Blue Ridge Community College.

Sujitparapitaya, S. (2006). Considering student mobility in retention outcomes. New Directions for Institutional Research, 2006.

Superby, J. F., Vandamme, J. P., & Meskens, N. (2006). Determination of factors influencing the achievement of the first-year university students using data mining methods. In 8th International Conference on Intelligent Tutoring Systems (ITS 2006) (pp. 37–44). Jhongli, Taiwan.

Terenzini, P. T., & Pascarella, E. T. (1980). Toward the validation of Tinto's model of college student attrition: A review of recent studies. Research in Higher Education, 12, 271–282.

Tillman, C., & Burns, P. (2000). Presentation on First Year Experience. http://www.valdosta.edu/ cgtillma/powerpoint.ppt.

Tinto, V. (1975). Dropout from higher education: A theoretical synthesis of recent research. Review of Educational Research, 45, 89–125.

Tinto, V. (1982). Limits of theory and practice in student attrition. The Journal of Higher Education, 53, 687–700.

Tinto, V. (1988). Stages of student departure: Reflections on the longitudinal character of student leaving. Journal of Higher Education, 59, 438–455.

Vandamme, J. (2007). Predicting academic performance by data mining methods. Education Economics, 15, 405–419.

Veitch, W. R. (2004). Identifying characteristics of high school dropouts: Data mining with a decision tree model.

Waugh, G., Micceri, T., & Takalkar, P. (1994). Using ethnicity, SAT/ACT scores, and high school GPA to predict retention and graduation rates.

Witten, I., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques (2nd ed.). San Francisco: Morgan Kaufmann Publishers.

Yu, C. H., DiGangi, S., Jannasch-Pennell, A., Lo, W., & Kaprolet, C. (2007). A data-mining approach to differentiate predictors of retention between online and traditional students.

Zhang, H., & Zhang, X. (2007). Comments on 'data mining static code attributes to learn defect predictors'. IEEE Transactions on Software Engineering.

Żytkow, J., & Zembowicz, R. (1993). Database exploration in search of regularities. Journal of Intelligent Information Systems, 2, 39–81.

Figure 1: BA Degree Completion Rates for the period 1880 to 1980, where Percent Completion is the Number of BAs Divided by the Number of First-time Degree Enrollments Four Years Earlier (Tinto, 1982)

Figure 2: Percentage of First-Year Students at Four-Year Colleges Who Return for the Second Year (ACT, 2007)

Figure 3: A decision tree consists of a root node and descending child nodes that denote decisions to make in the tree's structure. This tree, for example, was constructed in an attempt to optimize investment portfolios by minimizing budgets and maximizing pay-offs. The top-most branch represents the best selection in this example.

Figure 4: In this simple Bayesian network, the variable Sprinkler is dependent upon whether or not it is raining; the sprinkler is generally not turned on when it is raining. However, either event can cause the grass to become wet: rain, or the sprinkler being turned on. Thus, Bayesian networks excel at investigating relationships between variables.
# | Treatment
1 | 7,000 ≤ FinAidSTUDENT AG < 724,724 and 56,289 ≤ FinAidFATHER WAG < 999,999
2 | 7,000 ≤ FinAidSTUDENT AG < 724,724 and HS PERCENT ≥ 72
3 | 7,000 ≤ FinAidSTUDENT AG < 724,724 and 21 ≤ ACT1 MATH < 36
4 | 84,744 ≤ FinAidPARENT AGI < 999,999 and HS GPA ≥ 3.34
5 | 84,744 ≤ FinAidPARENT AGI < 999,999 and 5,383 ≤ FinAidSTUDENT WA < 561,500
6 | 23 ≤ MaxACT < 35 and 5,383 ≤ FinAidSTUDENT WA < 561,500
7 | 22 ≤ ACT1 COMP < 35 and 84,744 ≤ FinAidPARENT AGI < 999,999 and FinAidMOTHER ED = 3
8 | 5,383 ≤ FinAidSTUDENT WA < 561,500 and 21 ≤ ACT1 MATH < 36
9 | HS GPA ≥ 3.34 and 32,570 ≤ FinAidMOTHER WAG < 533,395
10 | FinAidFATHER ED = 3 and 66.3 ≤ PercentileRankHSGPA < 98.4
11 | 1 ≤ TotalClass ≤ 5
12 | ENG10 = Y
13 | LIVE.ON.CAMP = N

[Bottom plot omitted: %stay (roughly 30% to 80%) per treatment number, 1 to 13.]

Figure 5: Treatments 1 to 10 are the top ten treatments found by this analysis that most increase the third year retention rates. Treatments 11, 12, and 13 are the worst three treatments found by this analysis, those that most decrease the third year retention rates. The effect of each treatment is shown in the bottom plot.

Author (Year) | Notes | Cohort Size | Retained (#) | Retained (%) | Measure of Accuracy | Coeffs Used? | Techniques Used
Spady (1971) | | 683 | 615 | 90.04% | R² of .3132 for men and .3879 for women | Yes | Multiple regression
Bean (1980) | | 906 | 769 | 84.88% | R² of .22 for women and .09 for men | Yes | Multiple regression
Terenzini (1980) | study 1 | 379 | 60 | 15.80% | R² of .246 | Yes | Discriminant analyses
Terenzini (1980) | study 3 | 518 | 428 | 82.63% | R² of .256 | Yes | Multiple regression
Terenzini (1980) | study 5 | 763 | 673 | 88.20% | R² of .309 | Yes | Discriminant analyses
Terenzini (1980) | study 6 | 763 | 673 | 88.20% | R² of .476 for men and .553 for women | Yes | Discriminant analyses
Stage (1989) | | 323 | 294 | 91.00% | | Yes | Logistic regression
Dey & Astin (1993) | | 947 | 152 | 16.00% | Multiple R of 0.354, 0.351, and 0.323 | Yes | Logit, probit, and regression
Murtaugh et al. (1999) | | 8,667 | 5,200 | 60% | Estimated retention probability of 59.3% | Yes | Survival analysis / hazard regression
Bresciani & Carson (2002) | | 3,535 | 3,121 | 88.30% | R² of 0.022 | Yes | Logistic regression
Glynn et al. (2003) | any dropout; not only first-year; accuracies based on the training data | 3,244 | 1,592 | 49.08% | Overall accuracy of 83% | Yes | Logistic regression
Herzog (2005) | | 5,261 | 4,014 | 76.30% | 77.4% accuracy | Yes | Logistic regression
Herzog (2005) | | 4,298 | 3,314 | 77.10% | | Yes |
Herzog (2005) | | 4,671 | 4,040 | 83.50% | 85.4% accuracy | Yes |
Sujitparapitaya (2006) | | 2,444 | 1,943 | 79.50% | 81.6% accuracy on training; 80.7% on validation | | Logistic regression
Sujitparapitaya (2006) | | 2,445 | 1,994 | 79.50% | 83.9% accuracy on training; 82.1% on validation | | Neural network
Sujitparapitaya (2006) | | 2,445 | 1,994 | 79.50% | 85.5% on training; 84.4% on validation | | C4.5
Herzog (2006) | | 8,018 | 6,037 | 75.29% | Accuracy close to 75% | | Neural networks; CHAID, C4.5, CR&T; logistic regression
|            | RET1 Count | RET1 Percentage | RET2 Count | RET2 Percentage | RET3 Count | RET3 Percentage |
| retained=Y | 24,039 | 71.3% | 18,055 | 60.4% | 14,362 | 54.8% |
| retained=N | 9,673  | 28.7% | 11,857 | 39.6% | 11,854 | 45.2% |
| Total      | 33,712 | 100%  | 29,912 | 100%  | 26,216 | 100%  |

Table 2: Distribution of Dependent Variables
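Table 2 also fixes the baseline used later in Table 5: a ZeroR learner, which always predicts the majority class, scores 54.8% on RET3 (rounded to the "ZeroR limit (55%)" in Table 5). A minimal sketch of that baseline computation, using the counts from Table 2:

```python
# Class counts from Table 2 for the three dependent variables.
counts = {
    "RET1": {"Y": 24_039, "N": 9_673},
    "RET2": {"Y": 18_055, "N": 11_857},
    "RET3": {"Y": 14_362, "N": 11_854},
}

def zeror_accuracy(class_counts):
    """ZeroR always predicts the majority class, so its accuracy
    equals the majority class's share of the data."""
    return max(class_counts.values()) / sum(class_counts.values())

for target, c in counts.items():
    print(f"{target}: ZeroR accuracy = {zeror_accuracy(c):.1%}")
# RET3 yields about 54.8%, the "ZeroR limit (55%)" used in Table 5.
```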
Financial Aid
| Attribute | Description |
| FinAidAwardTypeG | Financial aid amount of grants |
| FinAidAwardTypeJ | Financial aid amount of jobs |
| FinAidAwardTypeL | Financial aid amount of loans |
| FinAidAwardTypeS | Financial aid amount of scholarship |
| FinAidAwardTypeW | Financial aid amount of waiver |
| FinAidDEPENDENCY | Dependency status |
| FinAidFATHER ED | Father's education level |
| FinAidFATHER WAG | Father's income |
| FinAidMOTHER ED | Mother's education level |
| FinAidMOTHER WAG | Mother's income |
| FinAidOfferedInd | Financial aid offered indicator |
| FinAidPARENT AGI | Parent's adjusted gross income |
| FinAidPARENT HOU | Parents' household size |
| FinAidPARENT MAR | Parents' marital status |
| FinAidPARENT TAX | Parents' tax form type |
| FinAidSPOUSE WAG | Spouse's wages |
| FinAidSTUDENT AG | Student's adjusted gross income |
| FinAidSTUDENT HO | Student's household size |
| FinAidSTUDENT MA | Student's marital status |
| FinAidSTUDENT TA | Student's tax form type |
| FinAidSTUDENT WA | Student's wage |
| FirstGenInd | First-generation indicator |
| TotalFinAidOffered | Total financial aid offered |

Performance Indicators
| Attribute | Description |
| ACT COMP | ACT comprehensive score (old) |
| ACT ENGL | ACT English score (old) |
| ACT MATH | ACT math score (old) |
| ACT1 COMP | ACT comprehensive score (new) |
| ACT1 ENGL | ACT English score (new) |
| ACT1 MATH | ACT math score (new) |
| ACT EQUIV | ACT equivalent of the SAT score |
| MaxACT | Max of ACT score and ACT equivalent |
| COMP READ | Compass reading score |
| COMP WRITE | Compass writing score |
| SAT TOT | SAT total score |
| SAT VERB | SAT verbal score |
| HS CODE | High school code |
| HS GPA | High school GPA |
| HS PERCENT | High school percentile |
| HS RANK | High school rank |
| HS SIZE | High school class size |
| RankHSGPA | Percentile of HS GPA among all freshmen |
| RankMaxACT | Percentile of max ACT among all freshmen |
| ANTH18 | Enrolled in anthropology course |
| BSCI10 | Enrolled in biological science course |
| CHEM10 | Enrolled in chemistry course |
| ENG10 | Enrolled in English course |
| ENG11 | Enrolled in English course |
| GEOL11 | Enrolled in geology course |
| LEST16 | Enrolled in leisure studies course |
| MATH10 | Enrolled in math 100-level course |
| MATH11 | Enrolled in math 110-level course |
| MATH12 | Enrolled in math 120-level course |
| MATH14 | Enrolled in math 140-level course |
| PHY11 | Enrolled in physics 110-level course |
| PEP15 | Enrolled in physical education 150-level course |

Faculty Type & Experience
| Attribute | Description |
| FacExpLT1Cnt | Count of courses taught by faculty (CCTF) with less than 1 year's experience |
| FacExpLT1Ratio | Ratio of courses taught by faculty (RCTF) with less than 1 year's experience to the total courses |
| FacExp1to5Cnt | CCTF with experience between 1 and 5 years |
| FacExp1to5Ratio | RCTF with experience between 1 and 5 years to the total courses |
| FacExp6to10Cnt | CCTF with experience between 6 and 10 years |
| FacExp6to10Ratio | RCTF with experience between 6 and 10 years to the total courses |
| FacExp11to15Cnt | CCTF with experience between 11 and 15 years |
| FacExp11to15Ratio | RCTF with experience between 11 and 15 years to the total courses |
| FacExp16to20Cnt | CCTF with experience between 16 and 20 years |
| FacExp16to20Ratio | RCTF with experience between 16 and 20 years to the total courses |
| FacExp21to25Cnt | CCTF with experience between 21 and 25 years |
| FacExp21to25Ratio | RCTF with experience between 21 and 25 years to the total courses |
| FacExp25to30Cnt | CCTF with experience between 25 and 30 years |
| FacExp25to30Ratio | RCTF with experience between 25 and 30 years to the total courses |
| FacExpGT31Cnt | CCTF with experience greater than 31 years |
| FacExpGT31Ratio | RCTF with experience greater than 31 years to the total courses |
| NoTenureFacCnt | Count of courses taught by no-rank faculty |
| NoTenureFacRatio | Ratio of courses taught by no-rank faculty to the total courses |
| NTTFacCnt | Count of courses taught by non-tenure-track (NTT) faculty |
| NTTFacRatio | Ratio of courses taught by NTT faculty to the total courses |
| TTFacCnt | Count of courses taught by tenured/tenure-track faculty |
| TTFacRatio | Ratio of courses taught by tenured/tenure-track faculty to the total courses |

Table 3: List of Attributes by Stated Hypotheses

| Rank | Number of Attributes | FSS | Classifier |
| 61 | 30 | oneR | bnet |
| 61 | 50 | cfs | adtree |
| 57 | 50 | oneR | adtree |
| 56 | 30 | oneR | adtree |
| 55 | 30 | cfs | adtree |
| 52 | 50 | oneR | bnet |
| 51 | 30 | infogain | adtree |
| 51 | 30 | cfs | bnet |
| 48 | 50 | infogain | adtree |

Table 4: The top ten ranking treatments for third-year retention. Ranks represent how many times a particular treatment wins over all other treatments in the experiment.
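The ranks in Table 4 count pairwise wins across repeated cross-validation runs of each (feature subset selector, classifier) pairing. The following is a minimal sketch of such win counting; the comparison rule and the accuracy samples are assumptions for illustration, and the learner names (oneR, cfs, infogain, adtree, bnet) simply echo the Weka-style techniques listed in the table.

```python
from statistics import median

# Hypothetical cross-validation accuracy samples for four
# (FSS, classifier, #attributes) treatments from Table 4.
# The numbers are invented for illustration.
scores = {
    ("oneR", "bnet", 30):       [0.78, 0.80, 0.79, 0.81],
    ("cfs", "adtree", 50):      [0.77, 0.79, 0.80, 0.78],
    ("infogain", "adtree", 30): [0.74, 0.75, 0.76, 0.74],
    ("cfs", "bnet", 30):        [0.73, 0.75, 0.74, 0.76],
}

def beats(a, b):
    """Assumed comparison rule: treatment a 'wins' over b when its median
    accuracy is higher (a statistical test such as Mann-Whitney could be
    substituted here)."""
    return median(a) > median(b)

# Rank = number of wins against every other treatment.
ranks = {
    t: sum(beats(s, scores[u]) for u in scores if u != t)
    for t, s in scores.items()
}
for t, r in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(r, t)
```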
| #  | P(Ret3|X) (percent) | Support (#students) | Hypothesis | Feature | Range |
| 1  | 79 | 1,752  | Financial Aid | Student's Wage | 7,850 to 9,958 |
| 2  | 73 | 3,751  | Financial Aid | Parent's Adjusted Gross Income | 96,636 to inf |
| 3  | 71 | 4,152  | Financial Aid | Student's Adjusted Gross Income | 4,830 to 7,916 |
| 4  | 71 | 3,148  | Financial Aid | Mother's Income | 42,957 to inf |
| 5  | 70 | 5,873  | Financial Aid | Father's Income | 52,366 to inf |
| 6  | 69 | 5,622  | Financial Aid | Student's Wage | 4,093 to 7,851 |
| 7  | 69 | 5,838  | Performance | High School Percentile | 81 to inf |
| 8  | 68 | 2,523  | Financial Aid | Student's Dependency Status | I |
| 9  | 68 | 7,502  | Financial Aid | Father's Education Level | 3 |
| 10 | 67 | 6,045  | Financial Aid | Parent's Adjusted Gross Income | 58,551 to 96,636 |
| 11 | 66 | 12,215 | Financial Aid | Student's Tax Form | 2 |
| 12 | 66 | 7,710  | Financial Aid | Mother's Education Level | 3 |
| 13 | 66 | 2,057  | Financial Aid | Student's Wage | 1.5 to 1,000 |
| 14 | 66 | 10,370 | Financial Aid | First Generation Student | N |
| 15 | 65 | 7,082  | Performance | ACT Math Score (new) | 23 to inf |
| 16 | 65 | 2,780  | Financial Aid | Student's Adjusted Gross Income | 3,336 to 4,830 |
| 17 | 65 | 4,676  | Performance | ACT English Score (new) | 25 to inf |
| 18 | 65 | 5,669  | Performance | ACT Comprehensive Score (new) | 24 to inf |
| 19 | 65 | 13,101 | Financial Aid | Parent's Tax Form | 1 |
| 20 | 65 | 11,328 | Financial Aid | Parent's Marital Status | M |
| 21 | 64 | 6,658  | Performance | Percentile of Max ACT Among Freshmen | 71 to inf |
| 22 | 64 | 6,952  | Performance | Max of ACT Score and ACT Equivalent | 24 to inf |
| 23 | 64 | 2,697  | Financial Aid | Student's Tax Form | 1 |
| 24 | 63 | 17,254 | Financial Aid | Student's Marital Status | U |
| 25 | 63 | 13,063 | Financial Aid | Mother's Wages | -inf to 42,957 |
| 26 | 63 | 3,126  | Financial Aid | Parent's Tax Form | 2 |
| 27 | 63 | 1,022  | Financial Aid | Student's Adjusted Gross Income | 16,714 to inf |
| 28 | 62 | 15,154 | Financial Aid | Dependency | D |
| 29 | 62 | 9,459  | Financial Aid | Father's Income | -inf to 52,366 |
| 30 | 61 | 8,792  | Financial Aid | Mother's Education Level | 2 |
| 31 | 61 | 8,461  | Financial Aid | Father's Education Level | 2 |
| 32 | 61 | 4,176  | Financial Aid | Student's Wage | 1,904 to 4,093 |
| 33 | 60 | 14,523 | | Total Enrolled Hours | 15 to 19 |
| 34 | 60 | 2,540  | Financial Aid | Student's Adjusted Gross Income | 1,895 to 3,336 |
| 35 | 59 | 3,780  | Performance | Compass Writing Score | -inf to 10 |
| 36 | 59 | 8,271  | Performance | ACT English Score (new) | 20 to 25 |
| 37 | 59 | 7,311  | | FirstGenInd | Y |
| 38 | 59 | 6,980  | Performance | HS PERCENT | 61 to 81 |
| 39 | 59 | 15,021 | Performance | Total Number of Enrolled Classes | 6 to inf |
| 40 | 59 | 5,598  | Financial Aid | Parent's Adjusted Gross Income | 1,838 to 58,551 |
| 41 | 58 | 3,127  | Financial Aid | Parent's Marital Status | S |
| 42 | 58 | 8,667  | Performance | ACT Composite | 20 to 24 |
| 43 | 58 | 4,767  | Performance | ACT Math Score (new) | 20 to 23 |
| 44 | 58 | 13,887 | Performance | Compass Writing Score | 74 to inf |
| 45 | 58 | 5,769  | Performance | High School GPA | 3.02 to 3.4 |
| 46 | 58 | 10,281 | Performance | RankMaxACT | 31 to 71 |
| 47 | 58 | 10,044 | Performance | MaxACT | 20 to 24 |
| 48 | 57 | 20,087 | | On-Campus Indicator | Y |
| 49 | 57 | 11,36  | Financial Aid | Father's Education Level | 4 |
| 50 | 56 | 24,826 | | Age of Student at Matriculation | -inf to 19.5 |
| 51 | 56 | 24,407 | Performance | Enrolled in English Courses | N |
| 52 | 56 | 3,660  | Performance | Percentile of HS GPA Among Freshmen | 46 to 60 |

Table 5: Ranking of all attribute ranges which, in isolation, predict third-year retention at a probability higher than the ZeroR limit (55%). The strongest single predictor of third-year retention is a student's wage (at 79%); at the other end, the bottom row shows that the percentile of a student's high school GPA among the freshmen cohort is little better than ZeroR (at 56%). A sketch of how such a ranking can be computed follows.
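As a closing illustration, here is a minimal sketch of the ranking behind Table 5, assuming discretized student records (the field names, range labels, and data below are invented for illustration): for each (feature, range) pair, compute P(Ret3 | X) and its support, keep the pairs that beat the ZeroR baseline, and sort by probability.

```python
from collections import defaultdict

# Hypothetical discretized records: each student maps feature -> range label,
# plus the RET3 outcome. All data invented for illustration.
students = [
    {"StudentWage": "7850-9958", "HSPercentile": "81-inf", "RET3": "Y"},
    {"StudentWage": "4093-7851", "HSPercentile": "61-81",  "RET3": "Y"},
    {"StudentWage": "4093-7851", "HSPercentile": "61-81",  "RET3": "N"},
    {"StudentWage": "7850-9958", "HSPercentile": "81-inf", "RET3": "Y"},
]

ZEROR = 0.55  # majority-class baseline for RET3 (see Table 2)

def rank_ranges(students, baseline=ZEROR):
    """Score every (feature, range) pair by P(RET3=Y | feature=range),
    keeping only pairs whose probability exceeds the ZeroR baseline."""
    support = defaultdict(int)   # pair -> number of matching students
    retained = defaultdict(int)  # pair -> matching students retained
    for s in students:
        for feature, rng in s.items():
            if feature == "RET3":
                continue
            support[(feature, rng)] += 1
            retained[(feature, rng)] += (s["RET3"] == "Y")
    ranked = [
        (retained[p] / support[p], support[p], p)
        for p in support
        if retained[p] / support[p] > baseline
    ]
    return sorted(ranked, reverse=True)

for prob, n, (feature, rng) in rank_ranges(students):
    print(f"P(Ret3|{feature}={rng}) = {prob:.2f}  support = {n}")
```

On the full dataset this procedure yields the 52 rows of Table 5; the support column guards against ranges that look strong only because they match very few students.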