Input data for decision trees

Selwyn Piramuthu
Decision and Information Sciences, University of Florida, Gainesville, FL 32611-7169, United States

Expert Systems with Applications 34 (2008) 1220-1226

Abstract

Data Mining has been successful in a wide variety of application areas and for varied purposes, and it is carried out using several different methods. Decision trees are one of the popular methods used for Data Mining. Because the process of constructing these decision trees makes no distributional assumptions about the data (it is non-parametric), the characteristics of the input data are usually given little attention. We consider some characteristics of input data and their effect on the learning performance of decision trees. Preliminary results indicate that the performance of decision trees can be improved with minor modifications of the input data.

Keywords: Decision trees; Data characteristics

1. Introduction

Data Mining has been successful in a wide variety of application areas, including marketing, for varied purposes (Adomavicius & Tuzhilin, 2001; Kushmerick, 1999; van der Putten, 1999; Shaw, Subramaniam, Tan, & Welge, 2001; Thearling, 1999). Data Mining itself is done using several different methods, depending on the type of data as well as the purpose of the exercise (Ansari, Kohavi, Mason, & Zheng, 2000; Cooley, Tan, & Srivastava, 1999; Srivastava, Cooley, Deshpande, & Tan, 2000). For example, if the purpose is classification using real-valued data, feed-forward neural networks might be appropriate (Ragavan & Piramuthu, 1991). Decision trees might be appropriate if the purpose is classification using nominal data (Quinlan, 1993). Further, if the purpose is to identify associations in the data, association rules might be appropriate (Brijs, Swinnen, Vanhoof, & Wets, 1999).

Decision trees are one of the popular methods used for Data Mining, and they can be constructed using a variety of methods. For example, C4.5 (Quinlan, 1993) uses information-theoretic measures and CART (Breiman, Friedman, Olshen, & Stone, 1984) uses statistical methods. The usefulness as well as the classification and computational performance of Data Mining frameworks incorporating decision trees can be improved by (1) appropriate preprocessing of the input data, (2) fine-tuning the decision tree algorithm itself, and (3) better interpretation of the output. Several studies have addressed each of these scenarios.

Input data can be preprocessed (1) to reduce the complexity of the data for ease of learning, and (2) to reduce the effects of unwanted characteristics of the data. The former includes techniques such as feature selection and feature construction as well as other data modifications (see, for example, Brijs & Vanhoof, 1998; Kohavi, 1995; Ragavan & Piramuthu, 1991). The latter includes the removal of noisy, redundant, and irrelevant data used as input to decision tree learning.

We consider some characteristics of input data and their effect on the learning performance of decision trees. Specifically, we consider the effects of non-linearity, outliers, heteroscedasticity, and multicollinearity in the data.
These characteristics have been shown to have significant effects on regression analysis. However, there has been no published study that deals with these characteristics and their effects on the learning performance of decision trees. Using a few small data sets that are available over the Internet, we consider each of these characteristics and compare their effects on regression analysis as well as on decision trees. The regression analysis results reported here are taken from the Internet. The contribution of this paper is in studying the effects of these characteristics on decision trees, specifically See-5 (2001). Preliminary results suggest that the performance of decision trees can be improved with minor modifications of the input data.

The rest of the paper is organized as follows: an evaluation of some input data characteristics and their effects on the learning performance of decision trees, together with the experimental results, is provided in Section 2. Section 3 concludes the paper with a brief discussion of the results of this study and their implications, as well as future extensions.

2. Evaluation of input data characteristics for decision trees

Traditional statistical regression analysis assumes a certain distribution (e.g., Gaussian) of the input data, as well as other characteristics such as the data being independent and identically distributed. In most real-world data, some of these assumptions are violated, and there are several means to at least partially rectify some of the consequences that arise from these violations. We consider a few of these situations: non-linearity in the data, the presence of outliers, the presence of heteroscedasticity, and the presence of multicollinearity. The data sets used in this study are known to have these characteristics. The following subsections address each of these scenarios in turn. We use See-5 as the decision tree generator throughout this study.
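See-5 is a commercial tool, so as a rough illustration of the evaluation protocol used throughout this section (a decision tree learner wrapped in 10-fold cross-validation, reporting the mean and standard deviation of tree size and prediction error), the following Python sketch uses scikit-learn's CART-style DecisionTreeClassifier as a stand-in. The column names x1, x2, x3, and y follow the paper; the file name, the tree parameters, and in particular the binning of the continuous dependent variable into classes are assumptions of this sketch, since the paper does not state how y was discretized for See-5.

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import StratifiedKFold
    from sklearn.tree import DecisionTreeClassifier

    def evaluate_tree(X, y_class, seed=0):
        # 10-fold cross-validated error rate and tree size (node count),
        # each returned as a (mean, standard deviation) pair
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
        errors, sizes = [], []
        for train_idx, test_idx in cv.split(X, y_class):
            tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=seed)
            tree.fit(X.iloc[train_idx], y_class.iloc[train_idx])
            errors.append(1.0 - tree.score(X.iloc[test_idx], y_class.iloc[test_idx]))
            sizes.append(tree.tree_.node_count)
        return (np.mean(errors), np.std(errors)), (np.mean(sizes), np.std(sizes))

    # "nonlinear.csv" is a hypothetical file standing in for the paper's data set,
    # with columns x1, x2, x3 and a continuous response y
    df = pd.read_csv("nonlinear.csv")
    # See-5 needs a class label; binning y into terciles is our assumption
    y_class = pd.qcut(df["y"], q=3, labels=False)
    print(evaluate_tree(df[["x1", "x2", "x3"]], y_class))

The evaluate_tree helper, df, and y_class are reused in the sketch accompanying Section 2.1 below.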
2.1. Non-linearity in input data

Non-linearity is a problem in linear regression simply because it is hard to fit a linear model to non-linear data. Therefore, non-linear transformations are applied to the data before running regressions. We consider the effects of non-linear data on decision trees both before and after the appropriate transformations are made. This data set contains four variables: the independent variables x1, x2, and x3 and the dependent variable y. The results of an ordinary least squares (OLS) regression predicting y from x1, x2, and x3 are given below.

    Number of obs = 100    F(3, 96) = 2.21    Prob > F = 0.0915
    R-squared = 0.0647     Adj R-squared = 0.0355    Root MSE = 29.167

    Source            SS    df          MS
    Model     5649.25003     3  1883.08334
    Residual    81668.75    96  850.716146
    Total       87318.00    99      882.00

    y            Coef.   Std. Err.       t   P>|t|   [95% Conf. Interval]
    x1        .1134093    .3626687   0.313   0.755   -.6064824     .833301
    x2       -.0089643    .3757505  -0.024   0.981   -.7548232    .7368946
    x3        .5932696    .2560351   2.317   0.023    .0850430    1.101495
    _cons     20.09967    11.61974   1.730   0.087   -2.965335    43.16468

Higher order trend effects are identified in the data using the omitted variable test (ovtest with the rhs option in the statistical analysis software Stata). Inspection of scatter plots confirms a non-linear trend pattern in the variable x2. We therefore substitute x2 with its centered value x2cent (i.e., x2 with its mean subtracted from every value) and the square of the centered value, x2centsq. The results of this regression are given below.

    Number of obs = 100    F(4, 95) = 171.00    Prob > F = 0.0000
    R-squared = 0.8780     Adj R-squared = 0.8729    Root MSE = 10.587

    Source             SS    df          MS
    Model     76669.37633     4  19167.3441
    Residual   10648.6237    95  112.090775
    Total        87318.00    99      882.00

    y             Coef.   Std. Err.        t   P>|t|   [95% Conf. Interval]
    x1         .1706278    .1316641    1.296   0.198   -.0907856    .4320141
    x2cent    -.0262898    .1363948   -0.193   0.848   -.2970677     .244488
    x2centsq   .2954615     .011738   25.171   0.000    .2721586    .3187645
    x3         .2584843    .0938846    2.753   0.007    .0720997    .4448688
    _cons     -.8589132      1.3437   -0.639   0.524   -3.526496    1.808669

Further testing for still higher order terms turns out negative. By including the squared term, the new term is shown to be statistically significant, and the resulting model also fits the data better, as shown by the increase in the R-squared value.

Now let us consider the same two data sets, before and after incorporating the squared term, and evaluate the effect on the performance of the learned decision trees. We use 10-fold cross-validation in See-5 to reduce any bias due to sample selection. Both the mean values and the standard deviation values (in parentheses) are reported for the resulting decision trees.

Table 1
Results using non-linear data

    Input variables               Decision tree size    Prediction error (%)
    x1, x2, x3                    3.2 (0.6)             28.0 (4.2)
    x1, x2cent, x2centsq, x3      3.0 (0.5)             11.0 (2.8)

Here (Table 1), the addition of the two transformed x2 variables has resulted in a small reduction in the size of the decision trees and a significant decrease in the prediction error. The prediction error is the classification error on examples unseen during the generation of the decision trees.
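Continuing the earlier sketch, the centering-and-squaring step can be mirrored in Python with statsmodels standing in for Stata's OLS output (the omitted variable test itself is not reproduced here). The variable names x2cent and x2centsq follow the paper; the data file and the reuse of df, y_class, and evaluate_tree from the previous sketch are assumptions of the illustration.

    import statsmodels.api as sm

    # Centre x2 and add its square, mirroring the x2cent / x2centsq construction
    df["x2cent"] = df["x2"] - df["x2"].mean()
    df["x2centsq"] = df["x2cent"] ** 2

    X_raw = sm.add_constant(df[["x1", "x2", "x3"]])
    X_sq = sm.add_constant(df[["x1", "x2cent", "x2centsq", "x3"]])
    print(sm.OLS(df["y"], X_raw).fit().rsquared_adj)  # weak linear fit
    print(sm.OLS(df["y"], X_sq).fit().rsquared_adj)   # improved fit with the squared term

    # Effect of the transformed features on the decision tree (10-fold CV, as above)
    print(evaluate_tree(df[["x1", "x2", "x3"]], y_class))
    print(evaluate_tree(df[["x1", "x2cent", "x2centsq", "x3"]], y_class))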
2.2. Presence of outliers in input data

The presence of outliers in input data is a problem in any learning application because most methods used for learning patterns in the data over-fit the outliers. Depending on how serious the outliers are, the patterns learned could differ significantly from the actual patterns without the outliers. We consider the effects of outliers on decision trees both before and after the outliers are removed. The data set used here contains four variables: the independent variables x1, x2, and x3 and the dependent variable y. The results of an OLS regression predicting y from x1, x2, and x3 are given below.

    Number of obs = 100    F(3, 96) = 14.12    Prob > F = 0.0000
    R-squared = 0.3062     Adj R-squared = 0.2845    Root MSE = 12.25

    Source            SS    df          MS
    Model     6358.64512     3  2119.54837
    Residual  14406.3149    96   150.06578
    Total       20764.96    99  209.747071

    y            Coef.   Std. Err.       t   P>|t|   [95% Conf. Interval]
    x1        .1986327    .1523206   1.304   0.195   -.1037212    .5009867
    x2         .576853    .1578149   3.655   0.000    .2635928    .8901132
    x3        .3533915    .1075346   3.286   0.001    .1399371    .5668459
    _cons     32.33932    1.229643  26.300   0.000     29.8985    34.78014

We begin by examining the data for outliers using a scatter plot. On inspection, we identify a single point that stands out from the rest. After several attempts at evaluating this point, we find that it is indeed an outlier: the data point is due to a typographical error during data entry, and the error is therefore corrected. The result of the OLS regression on the data with the outlier corrected is given below.

    Number of obs = 100    F(3, 96) = 20.27    Prob > F = 0.0000
    R-squared = 0.3878     Adj R-squared = 0.3687    Root MSE = 9.8188

    Source            SS    df          MS
    Model     5863.70256     3  1954.56752
    Residual  9255.28744    96  96.4092442
    Total       15118.99    99  152.717071

    y            Coef.   Std. Err.       t   P>|t|   [95% Conf. Interval]
    x1         .325315    .1220892   2.665   0.009      .08297    .5676601
    x2        .4193103     .126493   3.315   0.001    .1682236     .670397
    x3        .3448104    .0861919   4.000   0.000    .1737208       .5159
    _cons     31.28053    .9855929  31.738   0.000    29.32415    33.23692

Now let us consider the same two data sets, before and after fixing the outlier, and evaluate the effect on the performance of the learned decision trees. Again, we use 10-fold cross-validation in See-5 to reduce any bias due to sample selection.

Table 2
Results using outlier data

    Input variables                 Decision tree size    Prediction error (%)
    x1, x2, x3 (with outlier)       4.8 (0.6)             46.0 (5.0)
    x1, x2, x3 (without outlier)    4.4 (0.6)             38.0 (4.2)

Here (Table 2), the correction of the input error has resulted in a small reduction in the size of the decision trees and a significant decrease in the prediction error.
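The paper identifies the outlier by inspecting a scatter plot and traces it to a data-entry error. A programmatic alternative, not used in the paper, is to flag observations with unusually large studentized residuals; the sketch below does this with statsmodels' Bonferroni-adjusted outlier test. The file name and the 0.05 cut-off are assumptions, and dropping the flagged row stands in for the paper's correction of the typographical error.

    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("outlier.csv")  # hypothetical file with columns x1, x2, x3, y
    X = sm.add_constant(df[["x1", "x2", "x3"]])
    res = sm.OLS(df["y"], X).fit()

    # Bonferroni-adjusted studentized residuals; small adjusted p-values flag outliers
    diag = res.outlier_test()
    suspects = diag[diag["bonf(p)"] < 0.05]
    print(suspects)

    # After verifying the flagged point, correct it (or, as here, drop it) and refit
    df_fixed = df.drop(index=suspects.index)
    X_fixed = sm.add_constant(df_fixed[["x1", "x2", "x3"]])
    res_fixed = sm.OLS(df_fixed["y"], X_fixed).fit()
    print(res.rsquared, res_fixed.rsquared)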
2.3. Heteroscedasticity in input data

Data in which the error term variance is not constant exhibit heteroscedasticity. The presence of heteroscedasticity in input data is a problem in regression analysis because the modeling depends on the assumption that the error term variance is constant. We consider the effects of heteroscedasticity on decision trees both before and after the problem has been alleviated in the input data. This data set contains four variables: the independent variables x1, x2, and x3 and the dependent variable y. The results of an OLS regression predicting y from x1, x2, and x3 are given below.

    Number of obs = 100    F(3, 96) = 65.68    Prob > F = 0.0000
    R-squared = 0.6724     Adj R-squared = 0.6622    Root MSE = 6.7334

    Source            SS    df          MS
    Model     8933.72373     3  2977.90791
    Residual  4352.46627    96  45.3381903
    Total       13286.19    99  134.203939

    y            Coef.   Std. Err.       t   P>|t|   [95% Conf. Interval]
    x1        .2158539     .083724   2.578   0.011    .0496631    .3820447
    x2        .7559357     .086744   8.715   0.000    .5837503    .9281211
    x3        .3732164    .0591071   6.314   0.000    .2558898     .490543
    _cons     33.23969    .6758811  49.180   0.000    31.89807     34.5813

We use the hettest command in Stata to test for heteroscedasticity and find that the errors are indeed heteroscedastic. We then attempt to stabilize the variance by applying a natural logarithmic transformation to the dependent variable. The OLS regression results after this transformation are given below.

    Number of obs = 100    F(3, 96) = 69.85    Prob > F = 0.0000
    R-squared = 0.6858     Adj R-squared = 0.6760    Root MSE = .19754

    Source            SS    df          MS
    Model     8.17710164     3  2.72570055
    Residual  3.74606877    96   .03902155
    Total     11.9231704    99  .120436065

    ln(y)        Coef.   Std. Err.        t   P>|t|   [95% Conf. Interval]
    x1        .0054677    .0024562    2.226   0.028    .0005921    .0103432
    x2        .0230303    .0025448    9.050   0.000    .0179788    .0280817
    x3        .0118223     .001734    6.818   0.000    .0083803    .0152643
    _cons     3.445503    .0198285  173.765   0.000    3.406144    3.484862

Now let us consider the same two data sets, before and after addressing the heteroscedasticity, and evaluate the effect on the performance of the learned decision trees. Again, we use 10-fold cross-validation in See-5 to reduce any bias due to sample selection.

Table 3
Results using heteroscedastic data

    Input variables                 Decision tree size    Prediction error (%)
    x1, x2, x3 (dep. var: y)        4.9 (0.6)             27.0 (3.0)
    x1, x2, x3 (dep. var: ln(y))    5.5 (0.4)             26.0 (4.0)

Here (Table 3), the transformation of the dependent variable has resulted in a small increase in the size of the decision trees and an insignificant decrease in the prediction error.
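Stata's hettest is a Breusch-Pagan test for non-constant error variance; the sketch below uses statsmodels' het_breuschpagan as an analogue (this substitution is ours, not the paper's) and then refits after the natural-log transformation of the dependent variable. The file name is a placeholder.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_breuschpagan

    df = pd.read_csv("hetero.csv")  # hypothetical file with columns x1, x2, x3, y
    X = sm.add_constant(df[["x1", "x2", "x3"]])
    res = sm.OLS(df["y"], X).fit()

    # Breusch-Pagan test: a small p-value signals heteroscedastic errors
    lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(res.resid, X)
    print(lm_pval)

    # Variance-stabilising log transform of the dependent variable, then refit and retest
    res_log = sm.OLS(np.log(df["y"]), X).fit()
    print(het_breuschpagan(res_log.resid, X)[1])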
2.4. Multicollinearity in input data

Data in which the independent variables are highly correlated are said to exhibit multicollinearity. In regression analysis, multicollinearity is a problem when we are interested in the exact values of the coefficients of the independent variables; when multicollinearity is present, these cannot be estimated reliably. Multicollinearity is identified by (1) high pairwise correlations among the independent variables, (2) a high R-squared value accompanied by low t-statistics, and (3) coefficients that change when variables are added to or dropped from the model. Multicollinearity is not a problem when the only purpose of the regression analysis is forecasting; however, if the analysis is meant to determine and evaluate the coefficients, multicollinearity is a problem. One way to alleviate the problem is to drop the variable with the highest pairwise correlations with the other independent variables. We consider the effects of multicollinearity on decision trees both before and after the problem has been alleviated in the input data.

This data set contains five variables: the independent variables x1, x2, x3, and x4 and the dependent variable y. The results of an OLS regression predicting y from x1, x2, x3, and x4 are given below.

    Number of obs = 100    F(4, 95) = 16.37    Prob > F = 0.0000
    R-squared = 0.4080     Adj R-squared = 0.3831    Root MSE = 9.5693

    Source            SS    df          MS
    Model     5995.66253     4  1498.91563
    Residual  8699.33747    95  91.5719733
    Total       14695.00    99  148.434343

    y            Coef.   Std. Err.       t   P>|t|   [95% Conf. Interval]
    x1        1.118277    1.024484   1.092   0.278   -.9155806    3.152135
    x2        1.286694    1.042406   1.234   0.220   -.7827429    3.356131
    x3        1.191635     1.05215   1.133   0.260   -.8971469    3.280417
    x4       -.8370988    1.038979  -0.806   0.422   -2.899733    1.225535
    _cons     31.61912    .9709127  32.566   0.000    29.69161    33.54662

Here, the R-squared value is significant while the t-statistics are not. We therefore examine the pairwise correlations among the independent variables; the resulting matrix is given below.

            x1      x2      x3      x4
    x1  1.0000
    x2  0.3553  1.0000
    x3  0.3136  0.2021  1.0000
    x4  0.7281  0.6516  0.7790  1.0000

Clearly, x4 is highly correlated with the rest of the independent variables, so this variable is removed from the data. The results of the OLS regression without x4 are given below.

    Number of obs = 100    F(3, 96) = 21.69    Prob > F = 0.0000
    R-squared = 0.4040     Adj R-squared = 0.3853    Root MSE = 9.5518

    Source            SS    df          MS
    Model     5936.21931     3  1978.73977
    Residual  8758.78069    96  91.2372989
    Total       14695.00    99  148.434343

    y            Coef.   Std. Err.       t   P>|t|   [95% Conf. Interval]
    x1         .298443    .1187692   2.513   0.014    .0626879    .5341981
    x2        .4527284    .1230534   3.679   0.000    .2084695    .6969874
    x3        .3466306    .0838481   4.134   0.000    .1801934    .5130679
    _cons     31.50512    .9587921  32.859   0.000    29.60194    33.40831

Here, all the variables turn out to be significant. Now let us consider the same two data sets, before and after removing x4, and evaluate the effect on the performance of the learned decision trees. Again, we use 10-fold cross-validation in See-5 to reduce any bias due to sample selection.

Table 4
Results using multicollinear data

    Input variables       Decision tree size    Prediction error (%)
    x1, x2, x3, x4        3.2 (0.5)             34.0 (3.1)
    x1, x2, x3            4.8 (0.5)             45.0 (2.7)

Here (Table 4), the removal of x4 has resulted in a significant increase in both the size of the decision trees and the prediction error. Removal of the variable therefore does not seem to help the performance of decision trees.
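The pairwise-correlation check above translates directly to code; the sketch below also computes variance inflation factors, a standard complementary diagnostic that the paper does not use. The file name is a placeholder, and the column names follow the paper.

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    df = pd.read_csv("collinear.csv")  # hypothetical file with columns x1, x2, x3, x4, y
    predictors = ["x1", "x2", "x3", "x4"]

    # Pairwise correlations among the independent variables, as inspected in the text
    print(df[predictors].corr().round(4))

    # Variance inflation factors (not used in the paper, but a common complementary check)
    X = sm.add_constant(df[predictors])
    for i, name in enumerate(X.columns):
        if name != "const":
            print(name, variance_inflation_factor(X.values, i))

    # Drop the variable most correlated with the others (x4) and compare fits
    res_full = sm.OLS(df["y"], X).fit()
    res_reduced = sm.OLS(df["y"], sm.add_constant(df[["x1", "x2", "x3"]])).fit()
    print(res_full.rsquared_adj, res_reduced.rsquared_adj)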
2.5. Data reduction

Data reduction is an important preprocessing step in any pattern recognition method. Its benefits include the removal of irrelevant attributes, alleviation of the effects of outliers, reduction of the effects of noise, parsimony in decision-making, and reduced data complexity for learning algorithms, among others. These benefits are even more critical when dealing with huge data sets, as is the case in most data mining applications. Data reduction essentially involves dimensionality reduction and/or example reduction. In the majority of cases, when dealing with data used as input to learning algorithms, the former deals with reducing the number of relevant attributes and the latter with effectively reducing the number of examples. In this study, we present and evaluate a framework to reduce the effective number of examples used as input to pattern learning algorithms.

Most example reduction methods use some form of sampling (e.g., random, stratified) to select the examples to be considered further. The underlying assumptions differ for each of these sampling methods; for example, simple random sampling assumes that every unit in the population provides the same amount of information. There is a vast literature on example reduction methods (e.g., Ishibuchi, Nakashima, & Nii, 2001; Liu & Motoda, 2001; Provost & Kolluri, 1999; Provost, Jensen, & Oates, 1999).

We utilize clustering as a pre-processing step for learning applications. Specifically, we use k-means clustering for data reduction along the example dimension, as a pre-processing step for decision trees. We use the fuzzy thresholds option in See-5 since the variables in the data set are real-valued. See-5 partitions a real-valued variable at a given threshold point, and small movements in the variable value near the threshold can change the branch taken. The fuzzy thresholds option softens this knife-edge behavior by constructing an interval close to the threshold; within this interval, both branches of the tree are explored and the results are combined to give a predicted class.

We use the Iris plants database (Anderson, 1935; Fisher, 1936) to illustrate the proposed method. We chose this database because it is among the best known pattern recognition databases (e.g., Duda & Hart, 1973), because of its simplicity, and because any results generated using it can be readily compared with those generated by other methods. The database consists of 150 examples covering three iris plant types (Iris setosa, Iris versicolor, and Iris virginica), with 50 examples of each type. One of the types is linearly separable from the other two, and the latter two are not linearly separable from each other. The four independent attributes (sepal length, sepal width, petal length, and petal width) are numeric, and there are no missing attribute values.

Since we want to compare the results obtained with clustering as a pre-processing step against those obtained without it, we randomly divided the data set into five parts (a, b, c, d, e) to alleviate problems due to sampling error. The five parts were then used in a leave-one-out fashion: in the first case, the first four parts (a, b, c, d) were used to generate the clusters, which were then used as input to See-5, and the resulting decision tree was tested on the last part (e). We repeated this five times, with a different part used to test the resulting decision tree each time.

Table 5 summarizes the results based on the number of examples used as input to the decision tree generator (See-5). For training error, testing error, and tree size, the mean value is followed by the standard deviation in parentheses.

Table 5
Summary results for data reduction

    # of training examples    Training error    Testing error    Tree size
    120                       2 (0.94)          4.66 (3.81)      4.6 (0.55)
    60                        3.68 (2.17)       0 (0)            3.4 (0.55)
    30                        3.32 (2.37)       4 (2.81)         3 (0)
    15                        0 (0)             0 (0)            3 (0)
    9                         0 (0)             0 (0)            3 (0)
    6                         0 (0)             5.34 (7.69)      3 (0)

As can be seen, neither the training error nor the testing error appears to be a uni-modal function of the number of clusters. However, there is a significant decrease in both training error (pair-wise two-tailed t-test, p < 0.005) and testing error (pair-wise two-tailed t-test, p = 0.026) as the number of clusters decreases. This could be because the benefits of clustering become apparent as the number of examples used to form any given cluster increases. In the example data set used, the best results are for the data sets with 15 and 9 clusters. Then again, as the number of clusters is reduced to its minimum (here, two for each iris type, resulting in 6 clusters overall), there tends to be a loss of information content, simply because the complexity of the data dictates the need for more points (here, clusters) to draw boundaries among examples belonging to different categories (here, iris types). The decision tree size also decreases (statistically significant with a pair-wise two-tailed t-test, p < 0.002) as the number of clusters is decreased.
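The example-reduction scheme of this subsection can be sketched as follows: within each training fold, every iris class is replaced by the centroids of a k-means clustering run on that class alone, and a decision tree is trained on the centroids and tested on the held-out part. scikit-learn's KMeans and DecisionTreeClassifier stand in for the clustering step and for See-5 (the fuzzy thresholds option is not reproduced). The per-class cluster counts (20, 10, 5, 3, 2 per class, i.e., 60 down to 6 centroids overall) are inferred from Table 5, and the stratified five-fold split is our approximation of the paper's five random parts.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris
    from sklearn.model_selection import StratifiedKFold
    from sklearn.tree import DecisionTreeClassifier

    def reduce_by_clustering(X, y, clusters_per_class, seed=0):
        # Replace each class's training examples by the centroids of a
        # k-means clustering run separately on that class
        Xr, yr = [], []
        for label in np.unique(y):
            km = KMeans(n_clusters=clusters_per_class, n_init=10, random_state=seed)
            km.fit(X[y == label])
            Xr.append(km.cluster_centers_)
            yr.extend([label] * clusters_per_class)
        return np.vstack(Xr), np.array(yr)

    X, y = load_iris(return_X_y=True)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # five parts, used leave-one-out
    for k in [20, 10, 5, 3, 2]:  # 60, 30, 15, 9, 6 training centroids overall
        errs, sizes = [], []
        for train_idx, test_idx in cv.split(X, y):
            Xr, yr = reduce_by_clustering(X[train_idx], y[train_idx], k)
            tree = DecisionTreeClassifier(random_state=0).fit(Xr, yr)
            errs.append(1.0 - tree.score(X[test_idx], y[test_idx]))
            sizes.append(tree.tree_.node_count)
        print(3 * k, round(100 * np.mean(errs), 2), round(np.mean(sizes), 2))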
3. Discussion

Even though decision trees constructed using information-theoretic measures are considered non-parametric, the distribution of the data does influence their classification performance. Preliminary results indicate that the performance of decision trees can be improved by considering the effects of non-linearity, outliers, heteroscedasticity, and multicollinearity in the input data, as well as data reduction. Both non-linearity and the presence of outliers affected the classification performance of the decision trees. The presence of heteroscedasticity did not affect classification performance significantly. The presence of multicollinearity is not of concern for decision trees; the attempt to remove multicollinearity in fact resulted in poorer classification performance. Data reduction resulted in improved performance both in terms of the resulting tree size and in terms of classification.

We are currently evaluating these results using larger and additional data sets, and studying why these data characteristics affect the classification performance of decision trees. In this study we were interested only in the size of the decision trees and their classification accuracy; the computational cost of the process is also important, and is left for a future study. In addition to the characteristics presented in this paper, we are also evaluating other data characteristics, including non-independence and non-normality of the data.

We have presented one possible means of improving the classification performance of decision trees. This, along with other pre-processing methods (such as feature selection and feature construction), methods for fine-tuning decision trees, and methods that enhance the interpretability of results, should help improve the overall performance of decision support tools incorporating decision trees.

References

Adomavicius, G., & Tuzhilin, A. (2001). Using data mining methods to build customer profiles. IEEE Computer (February), 74-82.
Anderson, E. (1935). The irises of the Gaspe Peninsula. Bulletin of the American Iris Society, 59, 2-5.
Ansari, S., Kohavi, R., Mason, L., & Zheng, Z. (2000). Integrating e-commerce and data mining: Architecture and challenges. In WEBKDD'2000 workshop on Web mining for e-commerce - challenges and opportunities, August.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Wadsworth.
Brijs, T., & Vanhoof, K. (1998). Principles of data mining and knowledge discovery. In J. M. Zytkow (Ed.), PKDD '98. Lecture notes in artificial intelligence (Vol. 1510, pp. 102-110). Berlin: Springer.
Brijs, T., Swinnen, G., Vanhoof, K., & Wets, G. (1999). Using association rules for product assortment decisions: A case study. In Proceedings of the 5th international conference on knowledge discovery and data mining (KDD'99) (pp. 254-260). San Diego, CA, August 15-18, 1999.
Cooley, R., Tan, P.-N., & Srivastava, J. (1999). Discovery of interesting usage patterns from Web data. Technical report, Department of Computer Science and Engineering, University of Minnesota.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. John Wiley & Sons (p. 218).
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(Part II), 179-188.
Ishibuchi, H., Nakashima, T., & Nii, M. (2001). Genetic-algorithm-based instance and feature selection. In H. Liu & H. Motoda (Eds.), Instance selection and construction for data mining. Kluwer Academic.
Kohavi, R. (1995). Wrappers for performance enhancement and oblivious decision graphs. Ph.D. dissertation, Computer Science Department, Stanford University.
Kushmerick, N. (1999). Learning to remove Internet advertisements. In Third international conference on autonomous agents.
Liu, H., & Motoda, H. (Eds.). (2001). Instance selection and construction for data mining. Kluwer Academic.
Provost, F., Jensen, D., & Oates, T. (1999). Efficient progressive sampling. In Proceedings of the 5th international conference on knowledge discovery and data mining (pp. 23-32). AAAI Press.
Provost, F., & Kolluri, V. (1999). A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery, 3(2), 131-169.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.
Ragavan, H., & Piramuthu, S. (1991). The utility of feature construction in back-propagation. In Proceedings of the twelfth IJCAI (pp. 844-848).
See-5 (2001). RuleQuest Research data mining tools.
Shaw, M., Subramaniam, C., Tan, G. W., & Welge, M. E. (2001). Knowledge management and data mining for marketing. Decision Support Systems, 31, 127-137.
Srivastava, J., Cooley, R., Deshpande, M., & Tan, P.-N. (2000). Web usage mining: Discovery and applications of usage patterns from Web data. SIGKDD Explorations, 1(2), 12-23, January.
Thearling, K. (1999). Data mining and CRM: Zeroing in on your best customers. DM Direct (December), 20.
van der Putten, P. (1999). Data mining in direct marketing databases. In W. Baets (Ed.), Complexity and management: A collection of essays. Singapore: World Scientific Publishers.