A regression tree approach using mathematical programming

Lingjian Yang a, Songsong Liu a,b, Sophia Tsoka c, Lazaros G. Papageorgiou a,*

a Centre for Process Systems Engineering, Department of Chemical Engineering, University College London, Torrington Place, London WC1E 7JE, UK
b School of Management, Swansea University, Bay Campus, Fabian Way, Swansea SA1 8EN, UK
c Department of Informatics, King's College London, Strand, London WC2R 2LS, UK
* Corresponding author. E-mail addresses: lingjian.yang.10@ucl.ac.uk (L. Yang), songsong.liu@swansea.ac.uk (S. Liu), sophia.tsoka@kcl.ac.uk (S. Tsoka), l.papageorgiou@ucl.ac.uk (L.G. Papageorgiou).

Expert Systems With Applications 78 (2017) 347–357. Received 13 July 2016; Revised 4 February 2017; Accepted 5 February 2017; Available online 9 February 2017.

Keywords: Regression analysis; Surrogate model; Regression tree; Mathematical programming; Optimisation

Abstract

Regression analysis is a machine learning approach that aims to accurately predict the value of continuous output variables from certain independent input variables, via automatic estimation of their latent relationship from data. Tree-based regression models are popular in the literature due to their flexibility to model higher order non-linearity and their great interpretability. Conventionally, regression tree models are trained in a two-stage procedure: recursive binary partitioning is employed to produce a tree structure, followed by a pruning process that removes insignificant leaves, with the possibility of assigning multivariate functions to terminal leaves to improve generalisation. This work introduces a novel methodology of node partitioning which, in a single optimisation model, simultaneously performs the two tasks of identifying the break-point of a binary split and assigning multivariate functions to either leaf, thus leading to an efficient regression tree model. Using six real world benchmark problems, we demonstrate that the proposed method consistently outperforms a number of state-of-the-art regression tree models and methods based on other techniques, with an average improvement of 7–60% on the mean absolute error (MAE) of the predictions.

© 2017 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

In machine learning, regression analysis seeks to estimate the relationships between output variables and a set of independent input variables by automatically learning from a number of curated samples (Sen & Srivastava, 2012). The primary goal of applying a regression analysis is usually to obtain precise prediction of the level of the output variables for new samples. Examples of methodologies for regression analysis in the literature include linear regression (Seber & Lee, 2012), automated learning of algebraic models for optimisation (ALAMO) (Cozad, Sahinidis, & Miller, 2014; Zhang & Sahinidis, 2013), support vector regression (SVR) (Smola & Schölkopf, 2004), the multilayer perceptron (MLP) (Hill, Marquez, O'Connor, & Remus, 1994), K-nearest neighbour (KNN) (Korhonen & Kangas, 1997), multivariate adaptive regression splines (MARS) (Friedman, 1991), Kriging (Kleijnen, 2015), and regression trees.
Quite often, one would also like to gain some useful insights into the underlying relationship between the input and output variables, in which case the interpretability of a regression method is also of great interest. Regression trees are a type of machine learning tool that can offer both good prediction accuracy and easy interpretation, and have therefore received extensive attention in the literature. A regression tree uses a tree-like graph or model and is built through an iterative process that splits each node into child nodes by certain rules, unless it is a terminal node that the samples fall into. A regression model is fitted to each terminal node to obtain the predicted values of the output variables of new samples.

The Classification and Regression Tree (CART) is probably the most well known decision tree learning algorithm in the literature (Breiman, Friedman, Olshen, & Stone, 1984). Given a set of samples, CART identifies one input variable and one break-point, before partitioning the samples into two child nodes. Starting from the entire set of available training samples (the root node), recursive binary partitioning is performed for each node until no further split is possible or a certain terminating criterion is satisfied. At each node, the best split is identified by exhaustive search, i.e. all potential splits on each input variable and each break-point are tested, and the one corresponding to the minimum deviation obtained by predicting each of the two child nodes with its mean output value is selected. After the tree growing procedure, an overly large tree is typically constructed, resulting in a lack of model generalisation to unseen samples. A pruning procedure is then employed to sequentially remove the splits that contribute insufficiently to training accuracy. The tree is pruned from the maximal-sized tree all the way back to the root node, resulting in a sequence of candidate trees. Each candidate tree is tested on an independent validation sample set and the one corresponding to the lowest prediction error is selected as the final tree (Breiman, 2001; Wu et al., 2008). Alternatively, the optimal tree structure can be identified via cross validation. After building a tree, an enquiry sample is first assigned to one of the terminal leaves (non-splitting leaf nodes) and then predicted with the mean output value of the samples belonging to that leaf node.
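For illustration only, the split criterion just described can be sketched in a few lines of Python. This is a schematic rendering of the idea rather than the actual CART implementation; scoring each candidate split by the squared deviations from the child means is an assumption of the sketch.

```python
# Schematic sketch (not the rpart/CART code) of the split search described above:
# every feature and break-point is scored by the deviation remaining when each
# child node is predicted with its mean output value; the lowest-deviation split wins.
import numpy as np

def cart_style_split(X, y):
    best_dev, best_feature, best_break = np.inf, None, None
    for m in range(X.shape[1]):
        for b in np.unique(X[:, m])[:-1]:        # candidate break-points on feature m
            left = X[:, m] <= b
            dev = (np.sum((y[left] - y[left].mean()) ** 2)
                   + np.sum((y[~left] - y[~left].mean()) ** 2))
            if dev < best_dev:
                best_dev, best_feature, best_break = dev, m, b
    return best_feature, best_break, best_dev    # the node is then split recursively
```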
Despite its simplicity, good interpretability and wide applications (Antipov & Pokryshevskaya, 2012; Bayam, Liebowitz, & Agresti, 2005; Bel, Allard, Laurent, Cheddadi, & Bar-Hen, 2009; Li, Sun, & Wu, 2010; Molinaro, Dudoit, & van der Laan, 2004), the simple rule of predicting with mean values at the terminal leaves often means that prediction performance is compromised (Loh, 2011).

The conditional inference tree (ctree) tackles the problem of recursive partitioning in a statistical framework (Hothorn, Hornik, & Zeileis, 2006). For each node, the association between each independent input feature and the output variable is quantified using a permutation test and multiple testing correction. If the strongest association passes a statistical threshold, a binary split is performed on the corresponding input variable; otherwise the current node is a terminal node. Ctree is shown to avoid the problem of building trees biased towards input variables with many distinct levels of values, while delivering similar prediction performance.

Since almost all tree-based learning models are constructed using recursive partitioning, an efficient yet essentially locally optimal approach, the evtree algorithm implements an evolutionary algorithm for learning globally optimal classification and regression trees (Grubinger, Zeileis, & Pfeiffer, 2014), and is considered an alternative to the conventional methods in that it globally optimises the tree construction. Evtree searches for a tree structure that takes into account both accuracy and complexity, defined as the number of terminal leaves. Due to the exponentially growing size of the problem, evolutionary methods are employed to identify a good-quality feasible solution.

M5', also known as M5P, is considered an improved version of CART (Quinlan, 1992; Wang & Witten, 1997). The tree growing process is the same as that of CART, while several modifications have been introduced in the tree pruning process. After the full size tree is produced, a multiple linear regression model is fitted for each node. A metric of model generalisation is defined in the original paper, taking into account the training error and the numbers of samples and model parameters. The constructed linear regression function for each node is then simplified by removing insignificant input variables using a greedy algorithm in order to achieve a locally maximal model generalisation metric. Tree pruning starts from the bottom of the tree and is implemented for each non-leaf node. If the parent node offers higher model generalisation than the sum of its two child nodes, then the child nodes are pruned away. When predicting new samples, the value computed at the corresponding terminal node is adjusted by taking into account the other predicted values at the intermediate nodes along the path from the terminal node to the root node. The fitting of linear regression functions at the leaf nodes improves the prediction accuracy of the regression tree learning model.

M5' has been further extended into Cubist (RuleQuest, 2016), a commercially available rule-based regression model, which has received increasing popularity recently (Kobayashi, Tsend-Ayush, & Tateishi, 2013; Minasny & McBratney, 2008; Moisen et al., 2006; Peng et al., 2015; Rossel & Webster, 2012). M5' is employed to grow a tree first, which is then collapsed into a smaller set of if-then rules by removing and combining paths from the root to the terminal nodes.
It is noted here that the if-then rules resulting from the Cubist method can be overlapping, i.e. a sample can be assigned to multiple rules, in which case all the predictions are averaged to produce a final value. This ambiguity decreases the interpretability of the rule model.

The Smoothed and Unsmoothed Piecewise-Polynomial Regression Trees (SUPPORT) method is another regression tree learning algorithm whose foundation is based on statistics (Chaudhuri, Huang, Loh, & Yao, 1994). Given a set of samples, SUPPORT fits a multiple linear regression function and computes the deviation of each sample. The samples with positive deviations and those with negative deviations are assigned into two classes. For each input variable, SUPPORT compares the distribution of the two classes of samples along this input variable by applying a two-sample t test. The input variable corresponding to the lowest P value is selected as the splitting variable, and the average of the two class means on this splitting variable is taken as the break-point.

The Generalised, Unbiased, Interaction Detection and Estimation (GUIDE) method adopts a similar philosophy to SUPPORT (Loh, 2002; Loh, He, & Man, 2015). Given a node, the same step of fitting the samples with a linear regression model and separating them into two classes based on the sign of their deviations is employed. For each input variable, its numeric values are binned into a number of intervals before a chi-square test is used to determine its level of significance. The most significant input variable is used for the binary split. In terms of break-point determination, either a greedy search or the median of the two class means on this splitting variable can be used.

More variants of the above regression tree models also exist in the literature, including SECRET (Dobra & Gehrke, 2002), MART (Elish, 2009; Friedman, 2002), SMOTI (Malerba, Esposito, Ceci, & Appice, 2004), MAUVE (Vens & Blockeel, 2006), BART (Chipman, George, & McCulloch, 2010) and SERT (Chen & Hong, 2010), among others.

In the above classic regression tree methodologies, the traditional means of node splitting are dominated by either exhaustively searching for the candidate split corresponding to the maximum variance reduction when predicting the two child nodes with their mean output values (Breiman et al., 1984; Quinlan, 1992; Wang & Witten, 1997), or examining the distribution of sample deviations from fitting one linear regression function to all the samples in the parent node (Chaudhuri et al., 1994; Loh, 2002). However, it is noticed that for those algorithms where terminal leaf nodes are fitted with linear regression functions (Quinlan, 1992; Wang & Witten, 1997), the choice of splitting variable, break-point and regression coefficients is made sequentially, i.e. the splitting variable and break-point are estimated during the tree growing procedure while the regression coefficients for each child node are computed at the pruning step.

A theoretically better node splitting strategy is to simultaneously determine the splitting feature, the position of the break-point and the regression coefficients for each child node. In this case, the quality of a split can be directly calculated as the sum of deviations of all samples in either subset. A straightforward exhaustive search algorithm for this problem can be stated as follows: for each input variable and each break-point, the samples are separated into two subsets and one multiple linear regression is fitted for each subset. After examining all possible splits, the optimal split is chosen as the one corresponding to the minimum sum of deviations.
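A minimal sketch of that exhaustive search is given below for illustration. It fits an ordinary least-squares regression to each candidate subset (the paper minimises absolute deviations instead, which would require a linear programme per subset) and is intended only to make the cost of the enumeration explicit.

```python
# Illustrative sketch of the exhaustive split search: for every feature and every
# break-point, one multiple linear regression is fitted per child subset and the
# split with the smallest total deviation is kept.
import numpy as np

def subset_deviation(X, y):
    # Least-squares fit of a multiple linear regression with intercept; squared
    # deviations are used here for brevity, the paper minimises absolute deviations.
    A = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ coef) ** 2)

def exhaustive_best_split(X, y):
    best = (np.inf, None, None)                    # (deviation, feature, break-point)
    for m in range(X.shape[1]):
        for b in np.unique(X[:, m])[:-1]:          # e.g. 499 break-points per feature
            left = X[:, m] <= b                    # two regressions per candidate split
            dev = (subset_deviation(X[left], y[left])
                   + subset_deviation(X[~left], y[~left]))
            if dev < best[0]:
                best = (dev, m, b)
    return best
```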
The problem with this approach, however, is that as the numbers of samples and input variables grow, the number of multiple linear regression functions that need to be evaluated increases rapidly, requiring excessive computational time. For example, given a regression problem with 500 samples and 10 input variables, and assuming that each sample takes a unique value on each input variable, finding the optimal split for the root node alone requires the construction of 9980 (= 499 × 10 × 2) multiple linear regression functions, and the burden only becomes worse as the tree grows larger.

In this work, we adopt a recently proposed mathematical programming optimisation model (Yang, Liu, Tsoka, & Papageorgiou, 2016), which solves the problem of splitting a node into two child nodes to global optimality in affordable computational time. In our proposed framework, tree leaf nodes are fitted with polynomial functions and recursive partitioning is permitted when the amount of reduction in deviation achieved by node splitting is above a user-specified value, which is also the only tuning parameter in our framework. Since the size of the tree is controlled via the tuning parameter, no pruning procedure is implemented.

The rest of the paper is structured as follows. In Section 2, we describe the main features of the optimisation model adopted from the literature and introduce the framework of our proposed decision tree building process. In Section 3, a number of benchmark regression problems are employed to test the performance of our proposed method. A comprehensive sensitivity analysis is conducted to evaluate how prediction accuracy varies with different values of the tuning parameter, and the prediction accuracy of our proposed method is then compared against a number of decision tree based algorithms and some other state-of-the-art regression methods. Section 4 presents our main conclusions and discusses some future directions.

2. Method

In our previous work (Yang et al., 2016), we proposed a regression method based on piece-wise linear functions, named segmented regression. Segmented regression identifies multiple break-points on a single independent variable and partitions the samples into multiple regions, each of which is fitted with a multiple linear regression function so as to minimise the absolute deviation of the samples. The core element of segmented regression is a mathematical programming optimisation model that, given one single input variable as splitting variable and the number of regions, simultaneously optimises the positions of the break-points and the regression coefficients of one multiple linear regression function for each region.

In this work, we adopt this optimisation model to optimise the binary splitting of nodes. Given a node and a single input variable as splitting variable, the optimisation model is solved to find the single break-point and the regression coefficients for the two child nodes. The model is solved with each input variable in turn serving as splitting variable once, and the input variable giving the minimum absolute deviation is selected for splitting the current parent node. Recursive node splitting terminates when the reduction in deviation drops below a user-specified threshold value.
Below, the overview of the regression tree approach and the detailed mathematical programming model for node partitioning are presented.

2.1. Regression tree approach

As in other regression tree learning algorithms, recursive splitting is used to grow the tree from the root node until a split of a node cannot yield a sufficient reduction in deviation. The pseudocode for building a tree is given below.

Proposed regression tree algorithm
Step 1. Fit a polynomial regression function of order 2 to the root node minimising the absolute deviation, recorded as ERROR_root.
Step 2. Start from the root node as the current node, and let ERROR_current = ERROR_root.
Step 3. For the current node, specify each input variable m in turn as splitting variable (m = m*) and solve the proposed Optimal Piece-wise Linear Regression Analysis model (OPLRA). The resulting deviation is noted as ERROR_split_m.
Step 4. Identify the best split corresponding to the minimum absolute deviation, noted as ERROR_split = min_m ERROR_split_m.
Step 5. If ERROR_current − ERROR_split ≥ β × ERROR_root, the current node is split; otherwise the current node is finalised as a terminal node.
Step 6. Apply Steps 3–5 to each remaining child node in turn.

Given the training samples, the first step of our proposed tree growing strategy is to fit a polynomial regression function of order 2 to the entire set of training samples minimising the absolute deviation, which is noted as ERROR_root. The polynomial regression function can provide higher prediction accuracy; note that when the coefficient of the quadratic term is zero, the obtained regression model reduces to a linear function. The absolute deviation is minimised here due to its simplicity and ease of optimisation. The absolute deviation of the root node, multiplied by a scaling parameter β taking a value between 0 and 1, is specified as the condition for node splitting. In other words, the current node is split into two child nodes only if the optimal split of the node results in a reduction of absolute deviation greater than β × ERROR_root. Then, starting from the root node as the current node, each feature m is specified in turn as splitting feature m* once, while solving model OPLRA to minimise the sum of absolute deviations of the two child nodes. The best split of the current node is identified as the one corresponding to the minimum absolute error. If the best split brings down the absolute deviation of the current node (ERROR_current) by more than β × ERROR_root, then the split takes place; otherwise the current node is finalised as a terminal leaf node. Note that the tuning parameter β determines the size of the developed tree, and an appropriate value of β can avoid overfitting on the training data and achieve good prediction accuracy for testing. The flowchart of the whole procedure is illustrated in Fig. 1.

[Fig. 1. Flowchart of the proposed regression tree approach.]
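To make the growth procedure above concrete, a minimal Python sketch of the loop in Steps 1–6 is given below. It is not the authors' GAMS implementation: the OPLRA solve and the order-2 polynomial fit are abstracted behind placeholder callables (`solve_oplra`, `fit_poly2`), whose assumed return values are described in the comments.

```python
# Hypothetical sketch of Steps 1-6. fit_poly2(X, y) is assumed to return
# (model, absolute_deviation); solve_oplra(X, y, m) is assumed to return an object
# with attributes total_error, break_point, left_model, left_error, right_model
# and right_error for the optimal binary split of the node on feature m.
import numpy as np

class Node:
    def __init__(self, idx, model, error):
        self.idx = idx            # indices of the training samples in this node
        self.model = model        # order-2 polynomial currently assigned to the node
        self.error = error        # absolute deviation of that polynomial on the node
        self.split = None         # (feature, break-point) once the node is split
        self.children = None

def grow_tree(X, y, beta, fit_poly2, solve_oplra):
    root_model, error_root = fit_poly2(X, y)            # Step 1
    root = Node(np.arange(len(y)), root_model, error_root)
    stack = [root]                                      # Step 2
    while stack:
        node = stack.pop()
        # Step 3: solve OPLRA with each feature in turn as splitting variable m*
        results = [solve_oplra(X[node.idx], y[node.idx], m) for m in range(X.shape[1])]
        # Step 4: best split = minimum total absolute deviation over both children
        best = min(results, key=lambda r: r.total_error)
        # Step 5: split only if the reduction exceeds beta * ERROR_root
        if node.error - best.total_error >= beta * error_root:
            m_star = results.index(best)
            go_left = X[node.idx, m_star] <= best.break_point
            node.split = (m_star, best.break_point)
            node.children = (Node(node.idx[go_left], best.left_model, best.left_error),
                             Node(node.idx[~go_left], best.right_model, best.right_error))
            stack.extend(node.children)                 # Step 6: recurse on children
    return root
```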
2.2. Mathematical programming model for node partitioning

For a given current node n and one feature m* for potential partition, the proposed mathematical programming model for the optimal node split, OPLRA, is presented in this section. The indices, sets, parameters and variables associated with the model are listed below. For better separation between parameters and variables, lower case letters denote parameters, while upper case letters denote variables.

Indices
c: child node of the current parent node n; c = l represents the left child node and c = r represents the right child node
m: feature/independent input variable, m = 1, 2, ..., M
m*: the feature where the sample partition takes place
n: the current parent node
s: samples in the data set, s = 1, 2, ..., S

Sets
C_n: set of child nodes of the current parent node n
S_n: set of samples in the current parent node n

Parameters
a_{sm}: numeric value of sample s on feature m
y_s: real output value of sample s
u: a suitably large positive number
ε: a suitably small positive number

Continuous variables
B_c: intercept of the regression function in child node c
D_s: absolute deviation between the predicted and real output for sample s
P_{cs}: predicted output for sample s in child node c
W1_{cm}, W2_{cm}: regression coefficients for feature m in child node c
X_{m*}: break-point on partition feature m*

Binary variables
F_{cs}: 1 if sample s falls into child node c; 0 otherwise

Binary variables F_{cs}, taking a value of either 0 or 1, are introduced to model whether sample s belongs to child node c or not. Modelling of which child node a sample belongs to is achieved with the following constraints:

a_{sm^*} \le X_{m^*} - \varepsilon + u(1 - F_{cs}) \quad \forall s \in S_n,\ c = l,\ m^*   (1)

X_{m^*} + \varepsilon - u(1 - F_{cs}) \le a_{sm^*} \quad \forall s \in S_n,\ c = r,\ m^*   (2)

When sample s is assigned to the left child node (i.e. F_{cs} = 1 when c = l), Eq. (1) becomes a_{sm*} ≤ X_{m*} − ε while Eq. (2) becomes redundant. On the other hand, when sample s is assigned to the right child node (i.e. F_{cs} = 1 when c = r), Eq. (2) becomes a_{sm*} ≥ X_{m*} + ε while Eq. (1) is redundant. The insertion of ε ensures strict separation of the samples into the two child nodes. The following constraint restricts each sample to belong to one and only one child node:

\sum_{c \in C_n} F_{cs} = 1 \quad \forall s \in S_n   (3)

For each child node c, a polynomial function of order 2 is employed to predict the value of the samples (P_{cs}):

P_{cs} = \sum_m a_{sm}^2 W2_{cm} + \sum_m a_{sm} W1_{cm} + B_c \quad \forall s \in S_n,\ c \in C_n   (4)

For any sample s, its training error is equal to the absolute deviation between the real output and the predicted output of the child node c to which it belongs (i.e. F_{cs} = 1), and can be expressed with the following two constraints:

D_s \ge y_s - P_{cs} - u(1 - F_{cs}) \quad \forall s \in S_n,\ c \in C_n   (5)

D_s \ge P_{cs} - y_s - u(1 - F_{cs}) \quad \forall s \in S_n,\ c \in C_n   (6)

The objective function is to minimise the sum of absolute training errors of splitting the current node n into its child nodes:

\min \sum_{s \in S_n} D_s   (7)

The final OPLRA model consists of a linear objective function and several linear constraints, and the presence of both binary and continuous variables defines an MILP problem, which can be solved to global optimality by standard solution algorithms, for example branch and bound. The optimisation model simultaneously optimises the break-point (X_{m*}), the allocation of samples into the two child nodes (F_{cs}) and the regression coefficients (W1_{cm}, W2_{cm} and B_c) to achieve the least absolute deviation. Another advantage of this optimisation model is that there is no need to pre-process the input variables, i.e. input variables do not need to be binned into intervals for the analysis.
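To make the formulation tangible, a hedged sketch of Eqs. (1)–(7) using the open-source PuLP modeller is shown below. The authors' implementation is written in GAMS and solved with CPLEX; here the splitting feature m* is fixed, the big-M constant u and the separation tolerance ε are user-chosen assumptions of the sketch, and only the objective value and break-point are returned (the coefficients W1, W2, B and the assignments F could be read off in the same way).

```python
# Hedged sketch of the OPLRA MILP for one fixed candidate splitting feature m_star.
# a is an (S x M) numpy array of scaled inputs, y the output vector;
# u (big-M) and eps (strict-separation tolerance) are assumptions of this sketch.
import pulp

def solve_oplra_milp(a, y, m_star, u=100.0, eps=1e-4):
    S, M, C = range(len(y)), range(a.shape[1]), ['l', 'r']
    prob = pulp.LpProblem('OPLRA', pulp.LpMinimize)

    X  = pulp.LpVariable('X_break')                            # break-point on m_star
    B  = pulp.LpVariable.dicts('B', C)                         # intercepts
    W1 = pulp.LpVariable.dicts('W1', (C, M))                   # linear coefficients
    W2 = pulp.LpVariable.dicts('W2', (C, M))                   # quadratic coefficients
    D  = pulp.LpVariable.dicts('D', S, lowBound=0)             # absolute deviations
    F  = pulp.LpVariable.dicts('F', (S, C), cat='Binary')      # sample-to-child assignment

    prob += pulp.lpSum(D[s] for s in S)                                    # Eq. (7)
    for s in S:
        a_s = [float(v) for v in a[s, :]]
        prob += a_s[m_star] <= X - eps + u * (1 - F[s]['l'])               # Eq. (1)
        prob += X + eps - u * (1 - F[s]['r']) <= a_s[m_star]               # Eq. (2)
        prob += F[s]['l'] + F[s]['r'] == 1                                 # Eq. (3)
        for c in C:
            P = (pulp.lpSum(a_s[m] ** 2 * W2[c][m] for m in M)            # Eq. (4)
                 + pulp.lpSum(a_s[m] * W1[c][m] for m in M) + B[c])
            prob += D[s] >= float(y[s]) - P - u * (1 - F[s][c])            # Eq. (5)
            prob += D[s] >= P - float(y[s]) - u * (1 - F[s][c])            # Eq. (6)

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return pulp.value(prob.objective), pulp.value(X)
```

In the framework above, a model of this form would be solved once per candidate feature m* of the current node, and the feature giving the smallest objective would define the split (Steps 3–4 of the algorithm).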
2.3. Prediction for new samples

After the regression tree is determined, prediction of new enquiry samples can easily be performed. A new sample is first assigned to one of the terminal leaf nodes, before a prediction is produced using the multivariate function derived for that particular node. If the predicted output value lies outside the interval bounded by the minimum and maximum of the fitted output values of the training samples in that node, it is adjusted to the nearest bound.
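A short sketch of this prediction step is given below. It assumes the `Node` structure of the earlier growth sketch, plus two assumed attributes (`fitted_min`, `fitted_max`) holding the minimum and maximum fitted output values of the training samples in each leaf, and a leaf model exposing a `predict` method.

```python
# Route an enquiry sample to its terminal leaf, evaluate that leaf's polynomial,
# and clip the prediction to the range of fitted training outputs of the leaf.
import numpy as np

def predict_sample(tree, x):
    node = tree
    while node.split is not None:
        feature, break_point = node.split
        node = node.children[0] if x[feature] <= break_point else node.children[1]
    y_hat = node.model.predict(x)              # order-2 polynomial assigned to the leaf
    return float(np.clip(y_hat, node.fitted_min, node.fitted_max))
```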
The proposed regression tree approach, referred to as the Mathematical Programming Tree (MPTree) in this paper, is applied to a number of real world benchmark data sets in the next section to demonstrate its applicability and efficiency.

3. Results and discussion

In this section, we comprehensively evaluate the behaviour of the proposed MPTree using real world benchmark data sets. We first conduct a comprehensive sensitivity analysis for the tuning parameter β in order to identify a robust value that gives consistently good prediction accuracy. After that, a prediction accuracy comparison is performed to evaluate MPTree against several popular regression tree learning algorithms in the literature and some other regression methodologies.

A total of 6 real world regression data sets have been downloaded from the UCI machine learning repository (Lichman, 2013). The first regression problem, Yacht Hydrodynamics, predicts the residuary resistance of sailing yachts at the initial design stage from 6 independent features describing the hull dimensions and velocity of the boat, including the longitudinal position of the centre of buoyancy, prismatic coefficient, length-displacement ratio, beam-draught ratio, length-beam ratio and Froude number. The next example, Concrete Strength (Yeh, 1998), studies how the compressive strength of different concretes is affected by attributes of the concretes; there are 1030 samples with 8 input attributes, namely cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, fine aggregate and age. The Energy Efficiency data sets (Tsanas & Xifara, 2012) are obtained by running a simulation model; there are 768 samples, each corresponding to one building shape, described by 8 features including relative compactness, surface area, wall area, roof area, overall height, orientation, glazing area and glazing area distribution. The aims are to establish the relationship between either the heating or the cooling load requirement of the building and the characteristics of these buildings. The Airfoil data set concerns how different frequencies, chord lengths, angles of attack, free-stream velocities and suction side displacement thicknesses can predict the sound pressure level of an airfoil. The last case study, White Wine Quality (Cortez, Cerdeira, Almeida, Matos, & Reis, 2009), aims to associate expert preference of white wine taste with 11 physicochemical features of the wines, including fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol. The details of these data sets are provided as the supplementary material, and their sizes are summarised in Table 1.

Table 1. Summary of benchmark data sets.

Case study                  Number of samples   Number of features
Yacht hydrodynamics         308                 6
Concrete strength           1030                8
Energy efficiency heating   768                 8
Energy efficiency cooling   768                 8
Airfoil                     1503                5
White wine quality          4898                11

For each regression problem, we employ 5-fold cross validation to estimate the predictive accuracy of the various regression methods. Given a data set, 5-fold cross validation randomly splits the samples into 5 subsets of roughly equal size. One subset is held out as the testing set, while the other 4 subsets of samples are merged to form the training set. MPTree constructs a regression tree on the training set, whose prediction accuracy is estimated using the held-out testing set. The process continues until each subset has been held out once as the testing set. We conduct 10 rounds of 5-fold cross validation by performing different random sample splits, and the mean absolute errors (MAE) of the predictions are averaged over the 50 testing sets as the final error. For each data set, we normalise each independent input variable with the following formula so that the scaled input data take values between 0 and 1:

A_{sm} = \frac{A'_{sm} - \min_s A'_{sm}}{\max_s A'_{sm} - \min_s A'_{sm}} \quad \forall s, m

where A'_{sm} denotes the raw input data.

To assess the relative competitiveness of the proposed MPTree in terms of prediction accuracy, we compare it to a number of popular regression methods in the literature, including CART, ctree, evtree, M5', Cubist, linear regression, SVR, MLP, Kriging, KNN, MARS, segmented regression (Yang et al., 2016) and ALAMO. CART, ctree, evtree and Cubist are implemented in R (R Development Core Team, 2008) using the packages 'rpart', 'party', 'evtree' and 'Cubist', respectively. M5', linear regression, SVR, MLP, Kriging and KNN are implemented in the WEKA machine learning software (Hall et al., 2009). For KNN, the number of nearest neighbours is set to 5, while for the other methods their default settings have been retained. We use the MATLAB toolbox ARESLab for MARS. ALAMO is reproduced using the General Algebraic Modeling System (GAMS) (GAMS Development Corporation, 2014), with basis function forms including polynomials of degrees up to 3, pairwise multinomial terms of equal exponents up to 3, and exponential and logarithmic forms provided for each data set. Segmented regression and the proposed MPTree are also implemented in GAMS. ALAMO, segmented regression and the proposed MPTree are solved using the CPLEX MILP solver, with the optimality gap set to 0. All computational runs were performed on a 64-bit Windows 7 based machine with a 3.20 GHz six-core Intel Xeon W3670 processor and 12.0 GB RAM.

3.1. Sensitivity analysis for β

In this section, we first perform a comprehensive sensitivity analysis on the single tuning parameter β of the proposed MPTree. Recall that in the tree growing procedure, β controls the termination of recursive node splitting: a node is split into two child nodes if the optimal split reduces the absolute training deviation by more than a threshold value, defined as the absolute training deviation of the regression function fitted to the entire set of training samples, ERROR_root, multiplied by the scaling parameter β. The tree grows larger as β decreases. Identifying a suitable value for β is a non-trivial problem, as an excessively high value would terminate the node splitting prematurely without adequately describing the data, while a very small value can produce very large trees that over-fit the training data and generalise poorly to unseen samples. In this work, we test a series of values, namely 0.005, 0.01, 0.015, 0.025, 0.05 and 0.1. The results of the sensitivity analysis are presented in Fig. 2.

[Fig. 2. Sensitivity analysis for β of all data sets.]
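For completeness, a sketch of the evaluation harness used in this sensitivity analysis is given below: the inputs are min-max scaled as above, 10 rounds of 5-fold cross validation are run, and the MAE is averaged over the 50 test folds for every candidate β. It reuses the `grow_tree` and `predict_sample` sketches from Section 2 and scikit-learn's `KFold` purely for the random splits; it is an illustrative reconstruction, not the authors' pipeline.

```python
# Illustrative evaluation loop: min-max scaling, 10 x 5-fold CV, MAE per beta value.
import numpy as np
from sklearn.model_selection import KFold

def scale_minmax(A):
    # A_sm = (A'_sm - min_s A'_sm) / (max_s A'_sm - min_s A'_sm) for every feature m
    return (A - A.min(axis=0)) / (A.max(axis=0) - A.min(axis=0))

def beta_sensitivity(X_raw, y, fit_poly2, solve_oplra,
                     betas=(0.005, 0.01, 0.015, 0.025, 0.05, 0.1), n_rounds=10):
    X = scale_minmax(np.asarray(X_raw, dtype=float))
    y = np.asarray(y, dtype=float)
    mae = {b: [] for b in betas}
    for r in range(n_rounds):                                   # 10 repetitions
        kf = KFold(n_splits=5, shuffle=True, random_state=r)    # random 5-fold split
        for train, test in kf.split(X):
            for b in betas:
                # grow_tree / predict_sample: see the sketches in Section 2
                tree = grow_tree(X[train], y[train], b, fit_poly2, solve_oplra)
                pred = np.array([predict_sample(tree, x) for x in X[test]])
                mae[b].append(np.mean(np.abs(y[test] - pred)))
    return {b: float(np.mean(v)) for b, v in mae.items()}       # average over 50 folds
```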
According to Fig. 2, we can clearly observe that as β is reduced from 0.1 to 0.015, the prediction error drops almost monotonically. This improved prediction performance can be attributed to the fact that a decreased β allows the tree to grow larger and thus better describe the latent pattern in the data. As β is lowered further, the MAE decreases even further for some examples, including Energy Efficiency Heating and Airfoil. For some other examples, including Yacht Hydrodynamics, Concrete Strength, Energy Efficiency Cooling and White Wine Quality, the more complex trees do not predict unseen testing samples well, and the MAE worsens.

It is well known that in data mining, parameter fine tuning is required for a particular method to reach optimal performance on a specific data set. It is therefore of interest here to identify a value of β that gives robust prediction accuracy across the range of tested benchmark examples. In this study, β = 0.015 appears to yield overall robust and accurate predictions, as it usually leads to the lowest or second lowest MAE among all the tested values. Higher values of β are shown to give significantly higher MAE, while smaller values of β sometimes lead to noticeable overfitting, thus compromising the robustness of the performance.

3.2. Performance comparison across different regression methods

After identifying a value (i.e. 0.015) for the only user-specified parameter β of the proposed MPTree, we now compare the prediction performance of MPTree against a number of state-of-the-art regression methods. To ensure an unbiased comparison, β is set to 0.015 throughout all examples studied. For each of the benchmark examples, we compare the MAE achieved by the various competing methods. The detailed prediction accuracies are reported in Table 2, in which the performance of the proposed MPTree method is shown in italic, and the best prediction accuracy, i.e. the smallest MAE, for each data set is given in bold. The proposed MPTree has the best prediction accuracy on all data sets, compared to regression methods based on a wide range of tree and non-tree methodologies.

Table 2. Prediction accuracy comparison across different regression methods, in terms of MAE. The proposed MPTree method is highlighted in italic, and the best prediction accuracy of each data set is given in bold.

Method                 Yacht hydrodynamics  Concrete strength  Energy efficiency heating  Energy efficiency cooling  Airfoil  White wine quality
Tree-based methods
MPTree                 0.58                 3.85               0.35                       0.80                       0.015    0.52
CART                   1.61                 7.22               2.00                       2.38                       0.035    0.60
Ctree                  0.81                 5.99               0.63                       1.40                       0.029    0.58
Evtree                 1.05                 6.44               0.56                       1.59                       0.032    0.59
M5'                    0.96                 4.72               0.69                       1.21                       0.021    0.56
Cubist                 0.60                 4.29               0.35                       0.89                       0.017    0.56
Non-tree-based methods
Linear regression      7.27                 8.31               2.09                       2.27                       0.037    0.59
SVR                    6.45                 8.21               2.04                       2.19                       0.037    0.58
MLP                    0.81                 6.23               0.99                       1.92                       0.035    0.62
Kriging                4.32                 6.22               1.79                       2.04                       0.030    0.58
KNN                    5.30                 7.07               1.94                       2.15                       0.026    0.54
MARS                   1.01                 4.87               0.80                       1.32                       0.035    0.57
Segmented regression   0.71                 4.87               0.81                       1.28                       0.029    0.55
ALAMO                  0.79                 8.04               2.72                       2.76                       0.032    0.64
Only on the Energy Efficiency Heating data set does Cubist achieve the same accuracy as MPTree. Overall, MPTree achieves a 7–60% improvement in MAE compared to each of the other regression models. We have also implemented MPTree with a linear function at each child node; the results still show great competitiveness, with this variant being either the top or the second best method in all examples and achieving the overall best performance among the competing methods, with MAE values of 0.60, 4.16, 0.36, 1.00, 0.014 and 0.55, respectively.

The comparative results are summarised in Fig. 3. This radar chart is plotted to comprehensively visualise the prediction performance of the different methods across all 6 data sets. For each benchmark example studied, we normalise the MAE achieved by all methods in Table 2 to scaled values between 0% and 100%, with 0% and 100% respectively denoting the lowest and the highest MAE. To maintain the readability of the plot, the prediction accuracies of only 7 methods are plotted. It is clearly observed from Fig. 3 that the proposed MPTree forms the smallest area across all data sets, and performs better than the other implemented tree-based learning algorithms, including ctree, evtree, M5' and Cubist, and the non-tree-based models, including MLP and segmented regression. Overall, MPTree demonstrates a clear advantage over its counterparts by achieving the lowest MAE value for each and every tested benchmark example (including against SVR, Kriging and KNN, whose results are not shown in the figure). The proposed MPTree, by simultaneously optimising the position of the break-point and the regression coefficients of each child node, thus represents a significant improvement over the other tree models in the literature.

[Fig. 3. Prediction accuracy (MAE) comparison across different regression methods. For each benchmark example, the original MAE values achieved by the different methods in Table 2 are normalised between 0% and 100%, with 0% representing the lowest MAE and 100% representing the highest MAE.]

In this work, MAE is adopted as the performance metric of the regression models, which might not be suitable for all data sets; other approaches might provide better fits under another performance metric, e.g. mean squared error (MSE), root mean squared error (RMSE) or the Akaike Information Criterion. When we compare the prediction accuracy in terms of MSE for all the tree-based methodologies, Table 3 shows that the post-processed MSE values obtained from the optimal solutions of MPTree are still very competitive with the MSE values of the other methodologies, even though the proposed MPTree aims to minimise MAE. Although the performance of MPTree is not as dominant as it is when considering MAE, MPTree still ranks first on three data sets out of six, and is comparable with Cubist, which performs the best on the other three data sets. These results demonstrate the impact of the performance metric on the prediction performance, and the consideration of other performance metrics in MPTree would be an interesting direction for future research.

Table 3. Prediction accuracy comparison across different tree-based regression methods in terms of MSE. The proposed MPTree method is highlighted in italic, and the best prediction accuracy of each data set is given in bold.

Method   Yacht hydrodynamics  Concrete strength  Energy efficiency heating  Energy efficiency cooling  Airfoil  White wine quality
MPTree   2.03                 43.88              0.26                       2.65                       0.0005   0.59
CART     5.41                 86.24              6.85                       9.40                       0.0020   0.58
Ctree    2.79                 63.72              1.33                       4.37                       0.0014   0.55
Evtree   3.02                 69.12              1.00                       4.44                       0.0015   0.56
M5'      3.08                 40.72              0.95                       3.26                       0.0008   0.53
Cubist   1.07                 37.77              0.27                       2.76                       0.0006   0.51
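The metrics discussed in this section are straightforward to compute; the short sketch below spells out MAE, MSE and the 0–100% rescaling used for the radar chart in Fig. 3. It is illustrative code, not the evaluation or plotting script used for the paper.

```python
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def mse(y_true, y_pred):
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def normalise_for_radar(scores):
    # Rescale the MAE of all methods on one data set to 0-100%, with 0% for the
    # lowest (best) MAE and 100% for the highest (worst) MAE, as in Fig. 3.
    s = np.asarray(scores, dtype=float)
    return 100.0 * (s - s.min()) / (s.max() - s.min())
```

For example, applying `normalise_for_radar` to the Yacht Hydrodynamics column of Table 2 maps MPTree (MAE 0.58) to 0% and linear regression (MAE 7.27) to 100%.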
3.3. Comparison of the actual trees constructed by different regression tree methods

The last section demonstrated that the novel MPTree regression tree learning method offers superior prediction capacity. Compared to certain regression methods whose output models cannot be interpreted, for example kernel-based SVR and MLP, tree learning algorithms are well known for their easy interpretability. The sequence of derived rules can simply be visualised as a tree, making it easily understandable and making it possible to gain some insights into the underlying mechanism of the studied system. The interpretability of a constructed tree model decreases as the tree grows larger. In this section, attention is turned to comparing the number of terminal leaf nodes of the trees constructed by CART, M5' and MPTree.

Taking Energy Efficiency Heating as an example and using all the available samples as the training set, the trees grown by CART, M5' and MPTree are presented in Figs. 4, 5 and 6, respectively, in which the terminal leaf nodes are represented by boxes and the other nodes in the trees are represented by circles. The symbol in each circle represents the feature where the split takes place.

[Fig. 4. Constructed tree by CART on the Energy Efficiency Heating example. Boxes represent the terminal leaf nodes with labels inside, while circles represent other nodes, where the symbol inside refers to the feature where the split takes place. The splitting rules are given on the corresponding paths.]

[Fig. 5. Constructed tree by M5' on the Energy Efficiency Heating example. Boxes represent the terminal leaf nodes with labels inside, while circles represent other nodes, where the symbol inside refers to the feature where the split takes place. The splitting rules are given on the corresponding paths.]

[Fig. 6. Constructed tree by MPTree on the Energy Efficiency Heating example. Boxes represent the terminal leaf nodes with labels inside, while circles represent other nodes, where the symbol inside refers to the feature where the split takes place. The splitting rules are given on the corresponding paths.]

According to Fig. 4, CART has built a simple tree for this 768-sample example. At the top of the tree, CART splits the entire set of samples on feature m1 at a break-point of 0.361 into two child nodes, which are in turn further split on features m7 and m1, respectively. There are a total of 7 terminal leaf nodes (TN1–TN7) and the depth of the tree is 4. From Fig. 5, it is apparent that M5' has constructed a much larger tree than CART. The top part of the M5' tree is almost identical to the tree built by CART, which is not surprising as the two algorithms share great similarity in the tree growing procedure and differ significantly only in the pruning procedure. Overall, the tree grown by M5' has a depth of 8 and 24 terminal leaf nodes (TN1–TN24), which is much harder to understand and interpret. Fig. 6 visualises the actual tree built by our proposed MPTree method. The size of the derived tree is similarly small to that of CART, with 7 terminal leaf nodes (TN1–TN7) and a depth of 3, yet the two trees are quite different, as their root nodes are split on different features: MPTree, optimising the node splitting, picks feature m3 as the partition feature, in contrast to feature m1 selected by CART. Overall, on the Energy Efficiency Heating example, CART and MPTree appear to build trees that are small in size, while M5' outputs a significantly larger tree.

The same analysis has been repeated on the other 5 benchmark data sets, and the results are available in Table 4. The same observation can be made for the other examples: CART and MPTree derive trees with similar numbers of terminal leaf nodes, while M5' sometimes builds trees of comparable size to the other two (i.e. Yacht Hydrodynamics and Concrete Strength) but more often outputs trees several folds larger (i.e. Energy Efficiency Heating, Energy Efficiency Cooling, Airfoil and White Wine Quality).

Table 4. The number of terminal leaf nodes of the constructed trees by different regression tree learning methods.

Method  Yacht hydrodynamics  Concrete strength  Energy efficiency heating  Energy efficiency cooling  Airfoil  White wine quality
CART    5                    13                 7                          4                          18       7
M5'     4                    10                 24                         24                         44       55
MPTree  5                    14                 7                          12                         14       6

4. Concluding remarks

Regression analysis is a data-driven computational tool that aims to predict continuous output variables from a set of independent input variables. In this work, we have proposed a novel regression tree learning algorithm, named MPTree. An optimisation model, OPLRA, recently published in the literature has been adopted to optimise the binary node splitting. Given a specified splitting feature, OPLRA simultaneously determines the break-point position and the coefficients of the polynomial regression function in either child node so as to minimise the residuals. An algorithm is introduced for recursive partitioning to grow the tree.

A total of 6 real-world benchmark data sets have been used to demonstrate the applicability and efficiency of the proposed MPTree. Popular regression learning algorithms have been implemented for comparison, including the tree-based CART, ctree, evtree, M5' and Cubist, and methods based on various other principles, including MARS, MLP, Kriging and segmented regression, among others. Cross validation experiments have been used to estimate the predictive accuracy of the different methods. The results clearly indicate that MPTree consistently offers much improved prediction accuracy over the other competing methods for each of the benchmark data sets. Overall, we show that the proposed MPTree builds regression trees of better quality by optimising the node splitting.

In the near future, we aim to explore a few aspects to refine the MPTree method. The existing regression tree learning algorithms, including the proposed MPTree, perform binary splits recursively to keep the tree growing. Splitting a parent node into multiple child nodes, instead of two, is likely to better explore the structure of the data set. Another potential avenue is to optimise multiple levels of splitting simultaneously: most tree building methods consider splitting only one node at a time, while a look-ahead scheme that also optimises the splitting of grandchild nodes could lead to enhanced prediction performance of the constructed tree.

Acknowledgement

Funding from the UK Engineering and Physical Sciences Research Council (to LY, SL and LGP through the EPSRC Centre for Innovative Manufacturing in Emergent Macromolecular Therapies, EP/I033270/1), the UK Leverhulme Trust (to ST and LGP, RPG-2012-686), the European Union (to ST, HEALTH-F2-2011-261366), and the Centre for Process Systems Engineering (CPSE) at Imperial and University College London (to LY) are gratefully acknowledged.
Supplementary material

Supplementary material associated with this article can be found, in the online version, at 10.1016/j.eswa.2017.02.013.

References

Antipov, E. A., & Pokryshevskaya, E. B. (2012). Mass appraisal of residential apartments: An application of random forest for valuation and a CART-based approach for model diagnostics. Expert Systems with Applications, 39(2), 1772–1778. http://dx.doi.org/10.1016/j.eswa.2011.08.077
Bayam, E., Liebowitz, J., & Agresti, W. (2005). Older drivers and accidents: A meta analysis and data mining application on traffic accident data. Expert Systems with Applications, 29(3), 598–629. http://dx.doi.org/10.1016/j.eswa.2005.04.025
Bel, L., Allard, D., Laurent, J., Cheddadi, R., & Bar-Hen, A. (2009). CART algorithm for spatial data: Application to environmental and ecological data. Computational Statistics & Data Analysis, 53(8), 3082–3093. http://dx.doi.org/10.1016/j.csda.2008.09.012
Breiman, L. (2001). Statistical modeling: The two cultures. Statistical Science, 16(3), 199–231.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Taylor & Francis.
Chaudhuri, P. L., Huang, M., Loh, W., & Yao, R. (1994). Piecewise-polynomial regression trees. Statistica Sinica, 4, 143–167.
Chen, A., & Hong, A. (2010). Sample-efficient regression trees (SERT) for semiconductor yield loss analysis. IEEE Transactions on Semiconductor Manufacturing, 23(3), 358–369. doi: 10.1109/TSM.2010.2048968
Chipman, H. A., George, E. I., & McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1), 266–298. doi: 10.1214/09-AOAS285
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547–553.
Cozad, A., Sahinidis, N. V., & Miller, D. C. (2014). Learning surrogate models for simulation-based optimization. AIChE Journal, 60(6), 2211–2227.
Dobra, A., & Gehrke, J. (2002). SECRET: A scalable linear regression tree algorithm. In Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 481–487). New York, NY, USA: ACM.
Elish, M. O. (2009). Improved estimation of software project effort using multiple additive regression trees. Expert Systems with Applications, 36(7), 10774–10778. http://dx.doi.org/10.1016/j.eswa.2009.02.013
Friedman, J. (2002). Stochastic gradient boosting. Computational Statistics and Data Analysis, 38(4), 367–378. doi: 10.1016/S0167-9473(01)00065-2
Friedman, J. H. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19(1), 1–67.
GAMS Development Corporation (2014). GAMS – A user's guide. Washington, DC, USA.
Grubinger, T., Zeileis, A., & Pfeiffer, K. (2014). evtree: Evolutionary learning of globally optimal classification and regression trees in R. Journal of Statistical Software, 61(1), 1–29.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 10–18.
Hill, T., Marquez, L., O'Connor, M., & Remus, W. (1994). Artificial neural network models for forecasting and decision making. International Journal of Forecasting, 10(1), 5–15.
Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651–674.
Kleijnen, J. P. C. (2015). Regression and Kriging metamodels with their experimental designs in simulation: Review. CentER Discussion Paper Series No. 2015-035.
Kobayashi, T., Tsend-Ayush, J., & Tateishi, R. (2013). A new tree cover percentage map in Eurasia at 500 m resolution using MODIS data. Remote Sensing, 6(1), 209–232.
Korhonen, K. T., & Kangas, A. (1997). Application of nearest neighbour regression for generalizing sample tree information. Scandinavian Journal of Forest Research, 12(1), 97–101.
Li, H., Sun, J., & Wu, J. (2010). Predicting business failure using classification and regression tree: An empirical comparison with popular classical statistical methods and top classification mining methods. Expert Systems with Applications, 37(8), 5895–5904. http://dx.doi.org/10.1016/j.eswa.2010.02.016
Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml
Loh, W. Y. (2002). Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 12(2), 361–386.
Loh, W. Y. (2011). Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1), 14–23.
Loh, W. Y., He, X., & Man, M. (2015). A regression tree approach to identifying subgroups with differential treatment effects. Statistics in Medicine, 34(11), 1818–1833.
Malerba, D., Esposito, F., Ceci, M., & Appice, A. (2004). Top-down induction of model trees with regression and splitting nodes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(5), 612–625.
Minasny, B., & McBratney, A. B. (2008). Regression rules as a tool for predicting soil properties from infrared reflectance spectroscopy. Chemometrics and Intelligent Laboratory Systems, 94(1), 72–79. http://dx.doi.org/10.1016/j.chemolab.2008.06.003
Moisen, G. G., Freeman, E. A., Blackard, J. A., Frescino, T. S., Zimmermann, N. E., & Edwards, T. C., Jr. (2006). Predicting tree species presence and basal area in Utah: A comparison of stochastic gradient boosting, generalized additive models, and tree-based methods. Ecological Modelling, 199(2), 176–187. http://dx.doi.org/10.1016/j.ecolmodel.2006.05.021
Molinaro, A. M., Dudoit, S., & van der Laan, M. J. (2004). Tree-based multivariate regression and density estimation with right-censored data. Journal of Multivariate Analysis, 90(1), 154–177. http://dx.doi.org/10.1016/j.jmva.2004.02.003
Peng, Y., Xiong, X., Adhikari, K., Knadel, M., Grunwald, S., & Greve, M. H. (2015). Modeling soil organic carbon at regional scale by combining multi-spectral images with laboratory spectra. PLoS ONE, 10(11), 1–22. doi: 10.1371/journal.pone.0142295
Quinlan, R. J. (1992). Learning with continuous classes. In 5th Australian joint conference on artificial intelligence (pp. 343–348). Singapore: World Scientific.
R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
Rossel, R. A. V., & Webster, R. (2012). Predicting soil properties from the Australian soil visible near infrared spectroscopic database. European Journal of Soil Science, 63(6), 848–860. doi: 10.1111/j.1365-2389.2012.01495.x
RuleQuest (2016). Data mining with Cubist. https://www.rulequest.com/cubist-info.html
Seber, G., & Lee, A. (2012). Linear regression analysis. Wiley Series in Probability and Statistics. Wiley.
Sen, A., & Srivastava, M. (2012). Regression analysis: Theory, methods, and applications. Springer New York.
Smola, A. J., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14(3), 199–222.
Tsanas, A., & Xifara, A. (2012). Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools. Energy and Buildings, 49, 560–567.
Vens, C., & Blockeel, H. (2006). A simple regression based heuristic for learning model trees. Intelligent Data Analysis, 10(3), 215–236.
Wang, Y., & Witten, I. H. (1997). Induction of model trees for predicting continuous classes. In Poster papers of the 9th European conference on machine learning. Springer.
Wu, X., Kumar, V., Quinlan, R. J., Ghosh, J., Yang, Q., Motoda, H., et al. (2008). Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), 1–37.
Yang, L., Liu, S., Tsoka, S., & Papageorgiou, L. G. (2016). Mathematical programming for piecewise linear regression analysis. Expert Systems with Applications, 44, 156–167.
Yeh, I.-C. (1998). Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research, 28(12), 1797–1808.
Zhang, Y., & Sahinidis, N. V. (2013). Uncertainty quantification in CO2 sequestration using surrogate models from polynomial chaos expansion. Industrial and Engineering Chemistry Research, 52(9), 3121–3132.