A regression tree approach using mathematical programming

Lingjian Yang a, Songsong Liu a,b, Sophia Tsoka c, Lazaros G. Papageorgiou a,*

a Centre for Process Systems Engineering, Department of Chemical Engineering, University College London, Torrington Place, London WC1E 7JE, UK
b School of Management, Swansea University, Bay Campus, Fabian Way, Swansea SA1 8EN, UK
c Department of Informatics, King's College London, Strand, London WC2R 2LS, UK
* Corresponding author. E-mail addresses: lingjian.yang.10@ucl.ac.uk (L. Yang), songsong.liu@swansea.ac.uk (S. Liu), sophia.tsoka@kcl.ac.uk (S. Tsoka), l.papageorgiou@ucl.ac.uk (L.G. Papageorgiou).

Expert Systems With Applications 78 (2017) 347–357. Received 13 July 2016; Revised 4 February 2017; Accepted 5 February 2017; Available online 9 February 2017.

Keywords: Regression analysis; Surrogate model; Regression tree; Mathematical programming; Optimisation

Abstract

Regression analysis is a machine learning approach that aims to accurately predict the value of continuous output variables from certain independent input variables, via automatic estimation of their latent relationship from data. Tree-based regression models are popular in the literature due to their flexibility to model higher order non-linearity and their great interpretability. Conventionally, regression tree models are trained in a two-stage procedure: recursive binary partitioning is employed to produce a tree structure, followed by a pruning process that removes insignificant leaves, with the possibility of assigning multivariate functions to terminal leaves to improve generalisation. This work introduces a novel methodology of node partitioning which, in a single optimisation model, simultaneously performs the two tasks of identifying the break-point of a binary split and assigning multivariate functions to either leaf, thus leading to an efficient regression tree model. Using six real world benchmark problems, we demonstrate that the proposed method consistently outperforms a number of state-of-the-art regression tree models and methods based on other techniques, with an average improvement of 7–60% on the mean absolute error (MAE) of the predictions.

© 2017 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

In machine learning, regression analysis seeks to estimate the relationships between output variables and a set of independent input variables by automatically learning from a number of curated samples (Sen & Srivastava, 2012). The primary goal of applying a regression analysis is usually to obtain precise prediction of the level of the output variables for new samples. Examples of methodologies for regression analysis in the literature include linear regression (Seber & Lee, 2012), automated learning of algebraic models for optimisation (ALAMO) (Cozad, Sahinidis, & Miller, 2014; Zhang & Sahinidis, 2013), support vector regression (SVR) (Smola & Schölkopf, 2004), the multilayer perceptron (MLP) (Hill, Marquez, O'Connor, & Remus, 1994), K-nearest neighbour (KNN) (Korhonen & Kangas, 1997), multivariate adaptive regression splines (MARS) (Friedman, 1991), Kriging (Kleijnen, 2015), and regression trees.
Quite often, one would also like to gain some useful insights into the underlying relationship between the input and output variables, in which case the interpretability of a regression method is also of great interest. Regression trees are a type of machine learning tool that can offer both good prediction accuracy and easy interpretation, and have therefore received extensive attention in the literature. A regression tree uses a tree-like graph or model and is built through an iterative process that splits each node into child nodes by certain rules, unless it is a terminal node that the samples fall into. A regression model is fitted to each terminal node to obtain the predicted values of the output variables of new samples.

The Classification and Regression Tree (CART) is probably the most well known decision tree learning algorithm in the literature (Breiman, Friedman, Olshen, & Stone, 1984). Given a set of samples, CART identifies one input variable and one break-point, before partitioning the samples into two child nodes. Starting from the entire set of available training samples (the root node), recursive binary partitioning is performed for each node until no further split is possible or a certain terminating criterion is satisfied. At each node, the best split is identified by exhaustive search, i.e. all potential splits on each input variable and each break-point are tested, and the one corresponding to the minimum deviation obtained by predicting each of the two child nodes with its mean output value is selected. After the tree growing procedure, an overly large tree is typically constructed, resulting in a lack of model generalisation to unseen samples. A pruning procedure is then employed to sequentially remove the splits that contribute insufficiently to training accuracy. The tree is pruned from the maximal-sized tree all the way back to the root node, resulting in a sequence of candidate trees. Each candidate tree is tested on an independent validation sample set and the one corresponding to the lowest prediction error is selected as the final tree (Breiman, 2001; Wu et al., 2008). Alternatively, the optimal tree structure can be identified via cross validation. After building a tree, an enquiry sample is first assigned to one of the terminal leaves (non-splitting leaf nodes) and then predicted with the mean output value of the samples belonging to that leaf node.
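For illustration only, the split criterion just described can be sketched in a few lines of Python. This is a schematic rendering of the idea rather than the actual CART implementation; scoring each candidate split by the squared deviations from the child means is an assumption of the sketch.

```python
# Schematic sketch (not the rpart/CART code) of the split search described above:
# every feature and break-point is scored by the deviation remaining when each
# child node is predicted with its mean output value; the lowest-deviation split wins.
import numpy as np

def cart_style_split(X, y):
    best_dev, best_feature, best_break = np.inf, None, None
    for m in range(X.shape[1]):
        for b in np.unique(X[:, m])[:-1]:        # candidate break-points on feature m
            left = X[:, m] <= b
            dev = (np.sum((y[left] - y[left].mean()) ** 2)
                   + np.sum((y[~left] - y[~left].mean()) ** 2))
            if dev < best_dev:
                best_dev, best_feature, best_break = dev, m, b
    return best_feature, best_break, best_dev    # the node is then split recursively
```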
Despite its simplicity, good interpretability and wide applications (Antipov & Pokryshevskaya, 2012; Bayam, Liebowitz, & Agresti, 2005; Bel, Allard, Laurent, Cheddadi, & Bar-Hen, 2009; Li, Sun, & Wu, 2010; Molinaro, Dudoit, & van der Laan, 2004), the simple rule of predicting with mean values at the terminal leaves often means that prediction performance is compromised (Loh, 2011).

The conditional inference tree (ctree) tackles the problem of recursive partitioning in a statistical framework (Hothorn, Hornik, & Zeileis, 2006). For each node, the association between each independent input feature and the output variable is quantified using a permutation test and multiple testing correction. If the strongest association passes a statistical threshold, a binary split is performed on the corresponding input variable; otherwise the current node is a terminal node. Ctree is shown to avoid the problem of building trees biased towards input variables with many distinct levels of values, while delivering similar prediction performance.

Since almost all tree-based learning models are constructed using recursive partitioning, an efficient yet essentially locally optimal approach, the evtree algorithm implements an evolutionary algorithm for learning globally optimal classification and regression trees (Grubinger, Zeileis, & Pfeiffer, 2014), and is considered an alternative to the conventional methods in that it globally optimises the tree construction. Evtree searches for a tree structure that takes into account both accuracy and complexity, defined as the number of terminal leaves. Due to the exponentially growing size of the problem, evolutionary methods are employed to identify a good-quality feasible solution.

M5', also known as M5P, is considered an improved version of CART (Quinlan, 1992; Wang & Witten, 1997). The tree growing process is the same as that of CART, while several modifications have been introduced in the tree pruning process. After the full size tree is produced, a multiple linear regression model is fitted for each node. A metric of model generalisation is defined in the original paper, taking into account the training error and the numbers of samples and model parameters. The constructed linear regression function for each node is then simplified by removing insignificant input variables using a greedy algorithm in order to achieve a locally maximal model generalisation metric. Tree pruning starts from the bottom of the tree and is implemented for each non-leaf node. If the parent node offers higher model generalisation than the sum of its two child nodes, then the child nodes are pruned away. When predicting new samples, the value computed at the corresponding terminal node is adjusted by taking into account the other predicted values at the intermediate nodes along the path from the terminal node to the root node. The fitting of linear regression functions at the leaf nodes improves the prediction accuracy of the regression tree learning model.

M5' has been further extended into Cubist (RuleQuest, 2016), a commercially available rule-based regression model, which has received increasing popularity recently (Kobayashi, Tsend-Ayush, & Tateishi, 2013; Minasny & McBratney, 2008; Moisen et al., 2006; Peng et al., 2015; Rossel & Webster, 2012). M5' is employed to grow a tree first, which is then collapsed into a smaller set of if-then rules by removing and combining paths from the root to the terminal nodes.
It is noted here that the if-then rules resulting from the Cubist method can be overlapping, i.e. a sample can be assigned to multiple rules, in which case all the predictions are averaged to produce a final value. This ambiguity decreases the interpretability of the rule model.

The Smoothed and Unsmoothed Piecewise-Polynomial Regression Trees (SUPPORT) method is another regression tree learning algorithm whose foundation is based on statistics (Chaudhuri, Huang, Loh, & Yao, 1994). Given a set of samples, SUPPORT fits a multiple linear regression function and computes the deviation of each sample. The samples with positive deviations and those with negative deviations are assigned into two classes. For each input variable, SUPPORT compares the distribution of the two classes of samples along this input variable by applying a two-sample t test. The input variable corresponding to the lowest P value is selected as the splitting variable, and the average of the two class means on this splitting variable is taken as the break-point.

The Generalised, Unbiased, Interaction Detection and Estimation (GUIDE) method adopts a similar philosophy to SUPPORT (Loh, 2002; Loh, He, & Man, 2015). Given a node, the same step of fitting the samples with a linear regression model and separating them into two classes based on the sign of their deviations is employed. For each input variable, its numeric values are binned into a number of intervals before a chi-square test is used to determine its level of significance. The most significant input variable is used for the binary split. In terms of break-point determination, either a greedy search or the median of the two class means on this splitting variable can be used.

More variants of the above regression tree models also exist in the literature, including SECRET (Dobra & Gehrke, 2002), MART (Elish, 2009; Friedman, 2002), SMOTI (Malerba, Esposito, Ceci, & Appice, 2004), MAUVE (Vens & Blockeel, 2006), BART (Chipman, George, & McCulloch, 2010) and SERT (Chen & Hong, 2010), among others.

In the above classic regression tree methodologies, the traditional means of node splitting are dominated by either exhaustively searching for the candidate split corresponding to the maximum variance reduction when predicting the two child nodes with their mean output values (Breiman et al., 1984; Quinlan, 1992; Wang & Witten, 1997), or examining the distribution of sample deviations from fitting one linear regression function to all the samples in the parent node (Chaudhuri et al., 1994; Loh, 2002). However, it is noticed that for those algorithms where terminal leaf nodes are fitted with linear regression functions (Quinlan, 1992; Wang & Witten, 1997), the choice of splitting variable, break-point and regression coefficients is made sequentially, i.e. the splitting variable and break-point are estimated during the tree growing procedure while the regression coefficients for each child node are computed at the pruning step.

A theoretically better node splitting strategy is to simultaneously determine the splitting feature, the position of the break-point and the regression coefficients for each child node. In this case, the quality of a split can be directly calculated as the sum of deviations of all samples in either subset. A straightforward exhaustive search algorithm for this problem can be stated as follows: for each input variable and each break-point, the samples are separated into two subsets and one multiple linear regression is fitted for each subset. After examining all possible splits, the optimal split is chosen as the one corresponding to the minimum sum of deviations.
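A minimal sketch of that exhaustive search is given below for illustration. It fits an ordinary least-squares regression to each candidate subset (the paper minimises absolute deviations instead, which would require a linear programme per subset) and is intended only to make the cost of the enumeration explicit.

```python
# Illustrative sketch of the exhaustive split search: for every feature and every
# break-point, one multiple linear regression is fitted per child subset and the
# split with the smallest total deviation is kept.
import numpy as np

def subset_deviation(X, y):
    # Least-squares fit of a multiple linear regression with intercept; squared
    # deviations are used here for brevity, the paper minimises absolute deviations.
    A = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ coef) ** 2)

def exhaustive_best_split(X, y):
    best = (np.inf, None, None)                    # (deviation, feature, break-point)
    for m in range(X.shape[1]):
        for b in np.unique(X[:, m])[:-1]:          # e.g. 499 break-points per feature
            left = X[:, m] <= b                    # two regressions per candidate split
            dev = (subset_deviation(X[left], y[left])
                   + subset_deviation(X[~left], y[~left]))
            if dev < best[0]:
                best = (dev, m, b)
    return best
```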
The problem with this approach, however, is that as the numbers of samples and input variables grow, the number of multiple linear regression functions that need to be evaluated increases rapidly, requiring excessive computational time. For example, given a regression problem with 500 samples and 10 input variables, and assuming that each sample takes a unique value on each input variable, finding the optimal split for the root node alone requires the construction of 9980 (= 499 × 10 × 2) multiple linear regression functions, and the burden only becomes worse as the tree grows larger.

In this work, we adopt a recently proposed mathematical programming optimisation model (Yang, Liu, Tsoka, & Papageorgiou, 2016), which solves the problem of splitting a node into two child nodes to global optimality in affordable computational time. In our proposed framework, tree leaf nodes are fitted with polynomial functions and recursive partitioning is permitted when the amount of reduction in deviation achieved by node splitting is above a user-specified value, which is also the only tuning parameter in our framework. Since the size of the tree is controlled via the tuning parameter, no pruning procedure is implemented.

The rest of the paper is structured as follows. In Section 2, we describe the main features of the optimisation model adopted from the literature and introduce the framework of our proposed decision tree building process. In Section 3, a number of benchmark regression problems are employed to test the performance of our proposed method. A comprehensive sensitivity analysis is conducted to evaluate how prediction accuracy varies with different values of the tuning parameter, and the prediction accuracy of our proposed method is then compared against a number of decision tree based algorithms and some other state-of-the-art regression methods. Section 4 presents our main conclusions and discusses some future directions.

2. Method

In our previous work (Yang et al., 2016), we proposed a regression method based on piece-wise linear functions, named segmented regression. Segmented regression identifies multiple break-points on a single independent variable and partitions the samples into multiple regions, each of which is fitted with a multiple linear regression function so as to minimise the absolute deviation of the samples. The core element of segmented regression is a mathematical programming optimisation model that, given one single input variable as splitting variable and the number of regions, simultaneously optimises the positions of the break-points and the regression coefficients of one multiple linear regression function for each region.

In this work, we adopt this optimisation model to optimise the binary splitting of nodes. Given a node and a single input variable as splitting variable, the optimisation model is solved to find the single break-point and the regression coefficients for the two child nodes. The model is solved with each input variable in turn serving as splitting variable once, and the input variable giving the minimum absolute deviation is selected for splitting the current parent node. Recursive node splitting terminates when the reduction in deviation drops below a user-specified threshold value.
Below, the overview of the regression tree approach and the detailed mathematical programming model for node partitioning are presented.

2.1. Regression tree approach

As in other regression tree learning algorithms, recursive splitting is used to grow the tree from the root node until a split of a node cannot yield a sufficient reduction in deviation. The pseudocode for building a tree is given below.

Proposed regression tree algorithm
Step 1. Fit a polynomial regression function of order 2 to the root node minimising the absolute deviation, recorded as ERROR_root.
Step 2. Start from the root node as the current node, and let ERROR_current = ERROR_root.
Step 3. For the current node, specify each input variable m in turn as splitting variable (m = m*) and solve the proposed Optimal Piece-wise Linear Regression Analysis model (OPLRA). The resulting deviation is noted as ERROR_split_m.
Step 4. Identify the best split corresponding to the minimum absolute deviation, noted as ERROR_split = min_m ERROR_split_m.
Step 5. If ERROR_current − ERROR_split ≥ β × ERROR_root, the current node is split; otherwise the current node is finalised as a terminal node.
Step 6. Apply Steps 3–5 to each remaining child node in turn.

Given the training samples, the first step of our proposed tree growing strategy is to fit a polynomial regression function of order 2 to the entire set of training samples minimising the absolute deviation, which is noted as ERROR_root. The polynomial regression function can provide higher prediction accuracy; note that when the coefficient of the quadratic term is zero, the obtained regression model reduces to a linear function. The absolute deviation is minimised here due to its simplicity and ease of optimisation. The absolute deviation of the root node, multiplied by a scaling parameter β taking a value between 0 and 1, is specified as the condition for node splitting. In other words, the current node is split into two child nodes only if the optimal split of the node results in a reduction of absolute deviation greater than β × ERROR_root. Then, starting from the root node as the current node, each feature m is specified in turn as splitting feature m* once, while solving model OPLRA to minimise the sum of absolute deviations of the two child nodes. The best split of the current node is identified as the one corresponding to the minimum absolute error. If the best split brings down the absolute deviation of the current node (ERROR_current) by more than β × ERROR_root, then the split takes place; otherwise the current node is finalised as a terminal leaf node. Note that the tuning parameter β determines the size of the developed tree, and an appropriate value of β can avoid overfitting on the training data and achieve good prediction accuracy for testing. The flowchart of the whole procedure is illustrated in Fig. 1.

[Fig. 1. Flowchart of the proposed regression tree approach.]
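To make the growth procedure above concrete, a minimal Python sketch of the loop in Steps 1–6 is given below. It is not the authors' GAMS implementation: the OPLRA solve and the order-2 polynomial fit are abstracted behind placeholder callables (`solve_oplra`, `fit_poly2`), whose assumed return values are described in the comments.

```python
# Hypothetical sketch of Steps 1-6. fit_poly2(X, y) is assumed to return
# (model, absolute_deviation); solve_oplra(X, y, m) is assumed to return an object
# with attributes total_error, break_point, left_model, left_error, right_model
# and right_error for the optimal binary split of the node on feature m.
import numpy as np

class Node:
    def __init__(self, idx, model, error):
        self.idx = idx            # indices of the training samples in this node
        self.model = model        # order-2 polynomial currently assigned to the node
        self.error = error        # absolute deviation of that polynomial on the node
        self.split = None         # (feature, break-point) once the node is split
        self.children = None

def grow_tree(X, y, beta, fit_poly2, solve_oplra):
    root_model, error_root = fit_poly2(X, y)            # Step 1
    root = Node(np.arange(len(y)), root_model, error_root)
    stack = [root]                                      # Step 2
    while stack:
        node = stack.pop()
        # Step 3: solve OPLRA with each feature in turn as splitting variable m*
        results = [solve_oplra(X[node.idx], y[node.idx], m) for m in range(X.shape[1])]
        # Step 4: best split = minimum total absolute deviation over both children
        best = min(results, key=lambda r: r.total_error)
        # Step 5: split only if the reduction exceeds beta * ERROR_root
        if node.error - best.total_error >= beta * error_root:
            m_star = results.index(best)
            go_left = X[node.idx, m_star] <= best.break_point
            node.split = (m_star, best.break_point)
            node.children = (Node(node.idx[go_left], best.left_model, best.left_error),
                             Node(node.idx[~go_left], best.right_model, best.right_error))
            stack.extend(node.children)                 # Step 6: recurse on children
    return root
```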
2.2. Mathematical programming model for node partitioning

For a given current node n and one feature m* for potential partition, the proposed mathematical programming model for the optimal node split, OPLRA, is presented in this section. The indices, sets, parameters and variables associated with the model are listed below. For better separation between parameters and variables, lower case letters denote parameters, while upper case letters denote variables.

Indices
c: child node of the current parent node n; c = l represents the left child node and c = r represents the right child node
m: feature/independent input variable, m = 1, 2, ..., M
m*: the feature where the sample partition takes place
n: the current parent node
s: samples in the data set, s = 1, 2, ..., S

Sets
C_n: set of child nodes of the current parent node n
S_n: set of samples in the current parent node n

Parameters
a_{sm}: numeric value of sample s on feature m
y_s: real output value of sample s
u: a suitably large positive number
ε: a suitably small positive number

Continuous variables
B_c: intercept of the regression function in child node c
D_s: absolute deviation between the predicted and real output for sample s
P_{cs}: predicted output for sample s in child node c
W1_{cm}, W2_{cm}: regression coefficients for feature m in child node c
X_{m*}: break-point on partition feature m*

Binary variables
F_{cs}: 1 if sample s falls into child node c; 0 otherwise

Binary variables F_{cs}, taking a value of either 0 or 1, are introduced to model whether sample s belongs to child node c or not. Modelling of which child node a sample belongs to is achieved with the following constraints:

a_{sm^*} \le X_{m^*} - \varepsilon + u(1 - F_{cs}) \quad \forall s \in S_n,\ c = l,\ m^*   (1)

X_{m^*} + \varepsilon - u(1 - F_{cs}) \le a_{sm^*} \quad \forall s \in S_n,\ c = r,\ m^*   (2)

When sample s is assigned to the left child node (i.e. F_{cs} = 1 when c = l), Eq. (1) becomes a_{sm*} ≤ X_{m*} − ε while Eq. (2) becomes redundant. On the other hand, when sample s is assigned to the right child node (i.e. F_{cs} = 1 when c = r), Eq. (2) becomes a_{sm*} ≥ X_{m*} + ε while Eq. (1) is redundant. The insertion of ε ensures strict separation of the samples into the two child nodes. The following constraint restricts each sample to belong to one and only one child node:

\sum_{c \in C_n} F_{cs} = 1 \quad \forall s \in S_n   (3)

For each child node c, a polynomial function of order 2 is employed to predict the value of the samples (P_{cs}):

P_{cs} = \sum_m a_{sm}^2 W2_{cm} + \sum_m a_{sm} W1_{cm} + B_c \quad \forall s \in S_n,\ c \in C_n   (4)

For any sample s, its training error is equal to the absolute deviation between the real output and the predicted output of the child node c to which it belongs (i.e. F_{cs} = 1), and can be expressed with the following two constraints:

D_s \ge y_s - P_{cs} - u(1 - F_{cs}) \quad \forall s \in S_n,\ c \in C_n   (5)

D_s \ge P_{cs} - y_s - u(1 - F_{cs}) \quad \forall s \in S_n,\ c \in C_n   (6)

The objective function is to minimise the sum of absolute training errors of splitting the current node n into its child nodes:

\min \sum_{s \in S_n} D_s   (7)

The final OPLRA model consists of a linear objective function and several linear constraints, and the presence of both binary and continuous variables defines an MILP problem, which can be solved to global optimality by standard solution algorithms, for example branch and bound. The optimisation model simultaneously optimises the break-point (X_{m*}), the allocation of samples into the two child nodes (F_{cs}) and the regression coefficients (W1_{cm}, W2_{cm} and B_c) to achieve the least absolute deviation. Another advantage of this optimisation model is that there is no need to pre-process the input variables, i.e. input variables do not need to be binned into intervals for the analysis.
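To make the formulation tangible, a hedged sketch of Eqs. (1)–(7) using the open-source PuLP modeller is shown below. The authors' implementation is written in GAMS and solved with CPLEX; here the splitting feature m* is fixed, the big-M constant u and the separation tolerance ε are user-chosen assumptions of the sketch, and only the objective value and break-point are returned (the coefficients W1, W2, B and the assignments F could be read off in the same way).

```python
# Hedged sketch of the OPLRA MILP for one fixed candidate splitting feature m_star.
# a is an (S x M) numpy array of scaled inputs, y the output vector;
# u (big-M) and eps (strict-separation tolerance) are assumptions of this sketch.
import pulp

def solve_oplra_milp(a, y, m_star, u=100.0, eps=1e-4):
    S, M, C = range(len(y)), range(a.shape[1]), ['l', 'r']
    prob = pulp.LpProblem('OPLRA', pulp.LpMinimize)

    X  = pulp.LpVariable('X_break')                            # break-point on m_star
    B  = pulp.LpVariable.dicts('B', C)                         # intercepts
    W1 = pulp.LpVariable.dicts('W1', (C, M))                   # linear coefficients
    W2 = pulp.LpVariable.dicts('W2', (C, M))                   # quadratic coefficients
    D  = pulp.LpVariable.dicts('D', S, lowBound=0)             # absolute deviations
    F  = pulp.LpVariable.dicts('F', (S, C), cat='Binary')      # sample-to-child assignment

    prob += pulp.lpSum(D[s] for s in S)                                    # Eq. (7)
    for s in S:
        a_s = [float(v) for v in a[s, :]]
        prob += a_s[m_star] <= X - eps + u * (1 - F[s]['l'])               # Eq. (1)
        prob += X + eps - u * (1 - F[s]['r']) <= a_s[m_star]               # Eq. (2)
        prob += F[s]['l'] + F[s]['r'] == 1                                 # Eq. (3)
        for c in C:
            P = (pulp.lpSum(a_s[m] ** 2 * W2[c][m] for m in M)            # Eq. (4)
                 + pulp.lpSum(a_s[m] * W1[c][m] for m in M) + B[c])
            prob += D[s] >= float(y[s]) - P - u * (1 - F[s][c])            # Eq. (5)
            prob += D[s] >= P - float(y[s]) - u * (1 - F[s][c])            # Eq. (6)

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return pulp.value(prob.objective), pulp.value(X)
```

In the framework above, a model of this form would be solved once per candidate feature m* of the current node, and the feature giving the smallest objective would define the split (Steps 3–4 of the algorithm).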
2.3. Prediction for new samples

After the regression tree is determined, prediction of new enquiry samples can easily be performed. A new sample is first assigned to one of the terminal leaf nodes, before a prediction is produced using the multivariate function derived for that particular node. If the predicted output value lies outside the interval bounded by the minimum and maximum of the fitted output values of the training samples in that node, it is adjusted to the nearest bound.
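A short sketch of this prediction step is given below. It assumes the `Node` structure of the earlier growth sketch, plus two assumed attributes (`fitted_min`, `fitted_max`) holding the minimum and maximum fitted output values of the training samples in each leaf, and a leaf model exposing a `predict` method.

```python
# Route an enquiry sample to its terminal leaf, evaluate that leaf's polynomial,
# and clip the prediction to the range of fitted training outputs of the leaf.
import numpy as np

def predict_sample(tree, x):
    node = tree
    while node.split is not None:
        feature, break_point = node.split
        node = node.children[0] if x[feature] <= break_point else node.children[1]
    y_hat = node.model.predict(x)              # order-2 polynomial assigned to the leaf
    return float(np.clip(y_hat, node.fitted_min, node.fitted_max))
```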
The proposed regression tree approach, referred to as the Mathematical Programming Tree (MPTree) in this paper, is applied to a number of real world benchmark data sets in the next section to demonstrate its applicability and efficiency.

3. Results and discussion

In this section, we comprehensively evaluate the behaviour of the proposed MPTree using real world benchmark data sets. We first conduct a comprehensive sensitivity analysis for the tuning parameter β in order to identify a robust value that gives consistently good prediction accuracy. After that, a prediction accuracy comparison is performed to evaluate MPTree against several popular regression tree learning algorithms in the literature and some other regression methodologies.

A total of 6 real world regression data sets have been downloaded from the UCI machine learning repository (Lichman, 2013). The first regression problem, Yacht Hydrodynamics, predicts the residuary resistance of sailing yachts at the initial design stage from 6 independent features describing the hull dimensions and velocity of the boat, including the longitudinal position of the centre of buoyancy, prismatic coefficient, length-displacement ratio, beam-draught ratio, length-beam ratio and Froude number. The next example, Concrete Strength (Yeh, 1998), studies how the compressive strength of different concretes is affected by attributes of the concretes; there are 1030 samples with 8 input attributes, namely cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, fine aggregate and age. The Energy Efficiency data sets (Tsanas & Xifara, 2012) are obtained by running a simulation model; there are 768 samples, each corresponding to one building shape, described by 8 features including relative compactness, surface area, wall area, roof area, overall height, orientation, glazing area and glazing area distribution. The aims are to establish the relationship between either the heating or the cooling load requirement of the building and the characteristics of these buildings. The Airfoil data set concerns how different frequencies, chord lengths, angles of attack, free-stream velocities and suction side displacement thicknesses can predict the sound pressure level of an airfoil. The last case study, White Wine Quality (Cortez, Cerdeira, Almeida, Matos, & Reis, 2009), aims to associate expert preference of white wine taste with 11 physicochemical features of the wines, including fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol. The details of these data sets are provided as the supplementary material, and their sizes are summarised in Table 1.

Table 1. Summary of benchmark data sets.

Case study                  Number of samples   Number of features
Yacht hydrodynamics         308                 6
Concrete strength           1030                8
Energy efficiency heating   768                 8
Energy efficiency cooling   768                 8
Airfoil                     1503                5
White wine quality          4898                11

For each regression problem, we employ 5-fold cross validation to estimate the predictive accuracy of the various regression methods. Given a data set, 5-fold cross validation randomly splits the samples into 5 subsets of roughly equal size. One subset is held out as the testing set, while the other 4 subsets of samples are merged to form the training set. MPTree constructs a regression tree on the training set, whose prediction accuracy is estimated using the held-out testing set. The process continues until each subset has been held out once as the testing set. We conduct 10 rounds of 5-fold cross validation by performing different random sample splits, and the mean absolute errors (MAE) of the predictions are averaged over the 50 testing sets as the final error. For each data set, we normalise each independent input variable with the following formula so that the scaled input data take values between 0 and 1:

A_{sm} = \frac{A'_{sm} - \min_s A'_{sm}}{\max_s A'_{sm} - \min_s A'_{sm}} \quad \forall s, m

where A'_{sm} denotes the raw input data.

To assess the relative competitiveness of the proposed MPTree in terms of prediction accuracy, we compare it to a number of popular regression methods in the literature, including CART, ctree, evtree, M5', Cubist, linear regression, SVR, MLP, Kriging, KNN, MARS, segmented regression (Yang et al., 2016) and ALAMO. CART, ctree, evtree and Cubist are implemented in R (R Development Core Team, 2008) using the packages 'rpart', 'party', 'evtree' and 'Cubist', respectively. M5', linear regression, SVR, MLP, Kriging and KNN are implemented in the WEKA machine learning software (Hall et al., 2009). For KNN, the number of nearest neighbours is set to 5, while for the other methods their default settings have been retained. We use the MATLAB toolbox ARESLab for MARS. ALAMO is reproduced using the General Algebraic Modeling System (GAMS) (GAMS Development Corporation, 2014), with basis function forms including polynomials of degrees up to 3, pairwise multinomial terms of equal exponents up to 3, and exponential and logarithmic forms provided for each data set. Segmented regression and the proposed MPTree are also implemented in GAMS. ALAMO, segmented regression and the proposed MPTree are solved using the CPLEX MILP solver, with the optimality gap set to 0. All computational runs were performed on a 64-bit Windows 7 based machine with a 3.20 GHz six-core Intel Xeon W3670 processor and 12.0 GB RAM.

3.1. Sensitivity analysis for β

In this section, we first perform a comprehensive sensitivity analysis on the single tuning parameter β of the proposed MPTree. Recall that in the tree growing procedure, β controls the termination of recursive node splitting: a node is split into two child nodes if the optimal split reduces the absolute training deviation by more than a threshold value, defined as the absolute training deviation of the regression function fitted to the entire set of training samples, ERROR_root, multiplied by the scaling parameter β. The tree grows larger as β decreases. Identifying a suitable value for β is a non-trivial problem, as an excessively high value would terminate the node splitting prematurely without adequately describing the data, while a very small value can produce very large trees that over-fit the training data and generalise poorly to unseen samples. In this work, we test a series of values, namely 0.005, 0.01, 0.015, 0.025, 0.05 and 0.1. The results of the sensitivity analysis are presented in Fig. 2.

[Fig. 2. Sensitivity analysis for β of all data sets.]
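For completeness, a sketch of the evaluation harness used in this sensitivity analysis is given below: the inputs are min-max scaled as above, 10 rounds of 5-fold cross validation are run, and the MAE is averaged over the 50 test folds for every candidate β. It reuses the `grow_tree` and `predict_sample` sketches from Section 2 and scikit-learn's `KFold` purely for the random splits; it is an illustrative reconstruction, not the authors' pipeline.

```python
# Illustrative evaluation loop: min-max scaling, 10 x 5-fold CV, MAE per beta value.
import numpy as np
from sklearn.model_selection import KFold

def scale_minmax(A):
    # A_sm = (A'_sm - min_s A'_sm) / (max_s A'_sm - min_s A'_sm) for every feature m
    return (A - A.min(axis=0)) / (A.max(axis=0) - A.min(axis=0))

def beta_sensitivity(X_raw, y, fit_poly2, solve_oplra,
                     betas=(0.005, 0.01, 0.015, 0.025, 0.05, 0.1), n_rounds=10):
    X = scale_minmax(np.asarray(X_raw, dtype=float))
    y = np.asarray(y, dtype=float)
    mae = {b: [] for b in betas}
    for r in range(n_rounds):                                   # 10 repetitions
        kf = KFold(n_splits=5, shuffle=True, random_state=r)    # random 5-fold split
        for train, test in kf.split(X):
            for b in betas:
                # grow_tree / predict_sample: see the sketches in Section 2
                tree = grow_tree(X[train], y[train], b, fit_poly2, solve_oplra)
                pred = np.array([predict_sample(tree, x) for x in X[test]])
                mae[b].append(np.mean(np.abs(y[test] - pred)))
    return {b: float(np.mean(v)) for b, v in mae.items()}       # average over 50 folds
```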
According to Fig. 2, we can clearly observe that as β is reduced from 0.1 to 0.015, the prediction error drops almost monotonically. This improved prediction performance can be attributed to the fact that a decreased β allows the tree to grow larger and thus better describe the latent pattern in the data. As β is lowered further, the MAE decreases even further for some examples, including Energy Efficiency Heating and Airfoil. For some other examples, including Yacht Hydrodynamics, Concrete Strength, Energy Efficiency Cooling and White Wine Quality, the more complex trees do not predict unseen testing samples well, and the MAE worsens.

It is well known that in data mining, parameter fine tuning is required for a particular method to reach optimal performance on a specific data set. It is therefore of interest here to identify a value of β that gives robust prediction accuracy across the range of tested benchmark examples. In this study, β = 0.015 appears to yield overall robust and accurate predictions, as it usually leads to the lowest or second lowest MAE among all the tested values. Higher values of β are shown to give significantly higher MAE, while smaller values of β sometimes lead to noticeable overfitting, thus compromising the robustness of the performance.

3.2. Performance comparison across different regression methods

After identifying a value (i.e. 0.015) for the only user-specified parameter β of the proposed MPTree, we now compare the prediction performance of MPTree against a number of state-of-the-art regression methods. To ensure an unbiased comparison, β is set to 0.015 throughout all examples studied. For each of the benchmark examples, we compare the MAE achieved by the various competing methods. The detailed prediction accuracies are reported in Table 2, in which the performance of the proposed MPTree method is shown in italic, and the best prediction accuracy, i.e. the smallest MAE, for each data set is given in bold. The proposed MPTree has the best prediction accuracy on all data sets, compared to regression methods based on a wide range of tree and non-tree methodologies.

Table 2. Prediction accuracy comparison across different regression methods, in terms of MAE. The proposed MPTree method is highlighted in italic, and the best prediction accuracy of each data set is given in bold.

Method                 Yacht hydrodynamics  Concrete strength  Energy efficiency heating  Energy efficiency cooling  Airfoil  White wine quality
Tree-based methods
MPTree                 0.58                 3.85               0.35                       0.80                       0.015    0.52
CART                   1.61                 7.22               2.00                       2.38                       0.035    0.60
Ctree                  0.81                 5.99               0.63                       1.40                       0.029    0.58
Evtree                 1.05                 6.44               0.56                       1.59                       0.032    0.59
M5'                    0.96                 4.72               0.69                       1.21                       0.021    0.56
Cubist                 0.60                 4.29               0.35                       0.89                       0.017    0.56
Non-tree-based methods
Linear regression      7.27                 8.31               2.09                       2.27                       0.037    0.59
SVR                    6.45                 8.21               2.04                       2.19                       0.037    0.58
MLP                    0.81                 6.23               0.99                       1.92                       0.035    0.62
Kriging                4.32                 6.22               1.79                       2.04                       0.030    0.58
KNN                    5.30                 7.07               1.94                       2.15                       0.026    0.54
MARS                   1.01                 4.87               0.80                       1.32                       0.035    0.57
Segmented regression   0.71                 4.87               0.81                       1.28                       0.029    0.55
ALAMO                  0.79                 8.04               2.72                       2.76                       0.032    0.64
Only on the Energy Efficiency Heating data set does Cubist achieve the same accuracy as MPTree. Overall, MPTree achieves a 7–60% improvement in MAE compared to each of the other regression models. We have also implemented MPTree with a linear function at each child node; the results still show great competitiveness, with this variant being either the top or the second best method in all examples and achieving the overall best performance among the competing methods, with MAE values of 0.60, 4.16, 0.36, 1.00, 0.014 and 0.55, respectively.

The comparative results are summarised in Fig. 3. This radar chart is plotted to comprehensively visualise the prediction performance of the different methods across all 6 data sets. For each benchmark example studied, we normalise the MAE achieved by all methods in Table 2 to scaled values between 0% and 100%, with 0% and 100% respectively denoting the lowest and the highest MAE. To maintain the readability of the plot, the prediction accuracies of only 7 methods are plotted. It is clearly observed from Fig. 3 that the proposed MPTree forms the smallest area across all data sets, and performs better than the other implemented tree-based learning algorithms, including ctree, evtree, M5' and Cubist, and the non-tree-based models, including MLP and segmented regression. Overall, MPTree demonstrates a clear advantage over its counterparts by achieving the lowest MAE value for each and every tested benchmark example (including against SVR, Kriging and KNN, whose results are not shown in the figure). The proposed MPTree, by simultaneously optimising the position of the break-point and the regression coefficients of each child node, thus represents a significant improvement over the other tree models in the literature.

[Fig. 3. Prediction accuracy (MAE) comparison across different regression methods. For each benchmark example, the original MAE values achieved by the different methods in Table 2 are normalised between 0% and 100%, with 0% representing the lowest MAE and 100% representing the highest MAE.]

In this work, MAE is adopted as the performance metric of the regression models, which might not be suitable for all data sets; other approaches might provide better fits under another performance metric, e.g. mean squared error (MSE), root mean squared error (RMSE) or the Akaike Information Criterion. When we compare the prediction accuracy in terms of MSE for all the tree-based methodologies, Table 3 shows that the post-processed MSE values obtained from the optimal solutions of MPTree are still very competitive with the MSE values of the other methodologies, even though the proposed MPTree aims to minimise MAE. Although the performance of MPTree is not as dominant as it is when considering MAE, MPTree still ranks first on three data sets out of six, and is comparable with Cubist, which performs the best on the other three data sets. These results demonstrate the impact of the performance metric on the prediction performance, and the consideration of other performance metrics in MPTree would be an interesting direction for future research.

Table 3. Prediction accuracy comparison across different tree-based regression methods in terms of MSE. The proposed MPTree method is highlighted in italic, and the best prediction accuracy of each data set is given in bold.

Method   Yacht hydrodynamics  Concrete strength  Energy efficiency heating  Energy efficiency cooling  Airfoil  White wine quality
MPTree   2.03                 43.88              0.26                       2.65                       0.0005   0.59
CART     5.41                 86.24              6.85                       9.40                       0.0020   0.58
Ctree    2.79                 63.72              1.33                       4.37                       0.0014   0.55
Evtree   3.02                 69.12              1.00                       4.44                       0.0015   0.56
M5'      3.08                 40.72              0.95                       3.26                       0.0008   0.53
Cubist   1.07                 37.77              0.27                       2.76                       0.0006   0.51
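The metrics discussed in this section are straightforward to compute; the short sketch below spells out MAE, MSE and the 0–100% rescaling used for the radar chart in Fig. 3. It is illustrative code, not the evaluation or plotting script used for the paper.

```python
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def mse(y_true, y_pred):
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def normalise_for_radar(scores):
    # Rescale the MAE of all methods on one data set to 0-100%, with 0% for the
    # lowest (best) MAE and 100% for the highest (worst) MAE, as in Fig. 3.
    s = np.asarray(scores, dtype=float)
    return 100.0 * (s - s.min()) / (s.max() - s.min())
```

For example, applying `normalise_for_radar` to the Yacht Hydrodynamics column of Table 2 maps MPTree (MAE 0.58) to 0% and linear regression (MAE 7.27) to 100%.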
3.3. Comparison of the actual trees constructed by different regression tree methods

The last section demonstrated that the novel MPTree regression tree learning method offers superior prediction capacity. Compared to certain regression methods whose output models cannot be interpreted, for example kernel-based SVR and MLP, tree learning algorithms are well known for their easy interpretability. The sequence of derived rules can simply be visualised as a tree, making it easily understandable and making it possible to gain some insights into the underlying mechanism of the studied system. The interpretability of a constructed tree model decreases as the tree grows larger. In this section, attention is turned to comparing the number of terminal leaf nodes of the trees constructed by CART, M5' and MPTree.

Taking Energy Efficiency Heating as an example and using all the available samples as the training set, the trees grown by CART, M5' and MPTree are presented in Figs. 4, 5 and 6, respectively, in which the terminal leaf nodes are represented by boxes and the other nodes in the trees are represented by circles. The symbol in each circle represents the feature where the split takes place.

[Fig. 4. Constructed tree by CART on the Energy Efficiency Heating example. Boxes represent the terminal leaf nodes with labels inside, while circles represent other nodes, where the symbol inside refers to the feature where the split takes place. The splitting rules are given on the corresponding paths.]

[Fig. 5. Constructed tree by M5' on the Energy Efficiency Heating example. Boxes represent the terminal leaf nodes with labels inside, while circles represent other nodes, where the symbol inside refers to the feature where the split takes place. The splitting rules are given on the corresponding paths.]

[Fig. 6. Constructed tree by MPTree on the Energy Efficiency Heating example. Boxes represent the terminal leaf nodes with labels inside, while circles represent other nodes, where the symbol inside refers to the feature where the split takes place. The splitting rules are given on the corresponding paths.]

According to Fig. 4, CART has built a simple tree for this 768-sample example. At the top of the tree, CART splits the entire set of samples on feature m1 at a break-point of 0.361 into two child nodes, which are in turn further split on features m7 and m1, respectively. There are a total of 7 terminal leaf nodes (TN1–TN7) and the depth of the tree is 4. From Fig. 5, it is apparent that M5' has constructed a much larger tree than CART. The top part of the M5' tree is almost identical to the tree built by CART, which is not surprising as the two algorithms share great similarity in the tree growing procedure and differ significantly only in the pruning procedure. Overall, the tree grown by M5' has a depth of 8 and 24 terminal leaf nodes (TN1–TN24), which is much harder to understand and interpret. Fig. 6 visualises the actual tree built by our proposed MPTree method. The size of the derived tree is similarly small to that of CART, with 7 terminal leaf nodes (TN1–TN7) and a depth of 3, yet the two trees are quite different, as their root nodes are split on different features: MPTree, optimising the node splitting, picks feature m3 as the partition feature, in contrast to feature m1 selected by CART. Overall, on the Energy Efficiency Heating example, CART and MPTree appear to build trees that are small in size, while M5' outputs a significantly larger tree.

The same analysis has been repeated on the other 5 benchmark data sets, and the results are available in Table 4. The same observation can be made for the other examples: CART and MPTree derive trees with similar numbers of terminal leaf nodes, while M5' sometimes builds trees of comparable size to the other two (i.e. Yacht Hydrodynamics and Concrete Strength) but more often outputs trees several folds larger (i.e. Energy Efficiency Heating, Energy Efficiency Cooling, Airfoil and White Wine Quality).

Table 4. The number of terminal leaf nodes of the constructed trees by different regression tree learning methods.

Method  Yacht hydrodynamics  Concrete strength  Energy efficiency heating  Energy efficiency cooling  Airfoil  White wine quality
CART    5                    13                 7                          4                          18       7
M5'     4                    10                 24                         24                         44       55
MPTree  5                    14                 7                          12                         14       6

4. Concluding remarks

Regression analysis is a data-driven computational tool that aims to predict continuous output variables from a set of independent input variables. In this work, we have proposed a novel regression tree learning algorithm, named MPTree. An optimisation model, OPLRA, recently published in the literature has been adopted to optimise the binary node splitting. Given a specified splitting feature, OPLRA simultaneously determines the break-point position and the coefficients of the polynomial regression function in either child node so as to minimise the residuals. An algorithm is introduced for recursive partitioning to grow the tree.

A total of 6 real-world benchmark data sets have been used to demonstrate the applicability and efficiency of the proposed MPTree. Popular regression learning algorithms have been implemented for comparison, including the tree-based CART, ctree, evtree, M5' and Cubist, and methods based on various other principles, including MARS, MLP, Kriging and segmented regression, among others. Cross validation experiments have been used to estimate the predictive accuracy of the different methods. The results clearly indicate that MPTree consistently offers much improved prediction accuracy over the other competing methods for each of the benchmark data sets. Overall, we show that the proposed MPTree builds regression trees of better quality by optimising the node splitting.

In the near future, we aim to explore a few aspects to refine the MPTree method. The existing regression tree learning algorithms, including the proposed MPTree, perform binary splits recursively to keep the tree growing. Splitting a parent node into multiple child nodes, instead of two, is likely to better explore the structure of the data set. Another potential avenue is to optimise multiple levels of splitting simultaneously: most tree building methods consider splitting only one node at a time, while a look-ahead scheme that also optimises the splitting of grandchild nodes could lead to enhanced prediction performance of the constructed tree.

Acknowledgement

Funding from the UK Engineering and Physical Sciences Research Council (to LY, SL and LGP through the EPSRC Centre for Innovative Manufacturing in Emergent Macromolecular Therapies, EP/I033270/1), the UK Leverhulme Trust (to ST and LGP, RPG-2012-686), the European Union (to ST, HEALTH-F2-2011-261366), and the Centre for Process Systems Engineering (CPSE) at Imperial and University College London (to LY) are gratefully acknowledged.
Supplementary material

Supplementary material associated with this article can be found, in the online version, at 10.1016/j.eswa.2017.02.013.

References

Antipov, E. A., & Pokryshevskaya, E. B. (2012). Mass appraisal of residential apartments: An application of random forest for valuation and a CART-based approach for model diagnostics. Expert Systems with Applications, 39(2), 1772–1778. http://dx.doi.org/10.1016/j.eswa.2011.08.077
Bayam, E., Liebowitz, J., & Agresti, W. (2005). Older drivers and accidents: A meta analysis and data mining application on traffic accident data. Expert Systems with Applications, 29(3), 598–629. http://dx.doi.org/10.1016/j.eswa.2005.04.025
Bel, L., Allard, D., Laurent, J., Cheddadi, R., & Bar-Hen, A. (2009). CART algorithm for spatial data: Application to environmental and ecological data. Computational Statistics & Data Analysis, 53(8), 3082–3093. http://dx.doi.org/10.1016/j.csda.2008.09.012
Breiman, L. (2001). Statistical modeling: The two cultures. Statistical Science, 16(3), 199–231.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Taylor & Francis.
Chaudhuri, P. L., Huang, M., Loh, W., & Yao, R. (1994). Piecewise-polynomial regression trees. Statistica Sinica, 4, 143–167.
Chen, A., & Hong, A. (2010). Sample-efficient regression trees (SERT) for semiconductor yield loss analysis. IEEE Transactions on Semiconductor Manufacturing, 23(3), 358–369. doi: 10.1109/TSM.2010.2048968
Chipman, H. A., George, E. I., & McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1), 266–298. doi: 10.1214/09-AOAS285
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547–553.
Cozad, A., Sahinidis, N. V., & Miller, D. C. (2014). Learning surrogate models for simulation-based optimization. AIChE Journal, 60(6), 2211–2227.
Dobra, A., & Gehrke, J. (2002). SECRET: A scalable linear regression tree algorithm. In Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 481–487). New York, NY, USA: ACM.
Elish, M. O. (2009). Improved estimation of software project effort using multiple additive regression trees. Expert Systems with Applications, 36(7), 10774–10778. http://dx.doi.org/10.1016/j.eswa.2009.02.013
Friedman, J. (2002). Stochastic gradient boosting. Computational Statistics and Data Analysis, 38(4), 367–378. doi: 10.1016/S0167-9473(01)00065-2
Friedman, J. H. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19(1), 1–67.
GAMS Development Corporation (2014). GAMS – A user's guide. Washington, DC, USA.
Grubinger, T., Zeileis, A., & Pfeiffer, K. (2014). evtree: Evolutionary learning of globally optimal classification and regression trees in R. Journal of Statistical Software, 61(1), 1–29.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 10–18.
Hill, T., Marquez, L., O'Connor, M., & Remus, W. (1994). Artificial neural network models for forecasting and decision making. International Journal of Forecasting, 10(1), 5–15.
Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651–674.
Kleijnen, J. P. C. (2015). Regression and Kriging metamodels with their experimental designs in simulation: Review. CentER Discussion Paper Series No. 2015-035.
Kobayashi, T., Tsend-Ayush, J., & Tateishi, R. (2013). A new tree cover percentage map in Eurasia at 500 m resolution using MODIS data. Remote Sensing, 6(1), 209–232.
Korhonen, K. T., & Kangas, A. (1997). Application of nearest neighbour regression for generalizing sample tree information. Scandinavian Journal of Forest Research, 12(1), 97–101.
Li, H., Sun, J., & Wu, J. (2010). Predicting business failure using classification and regression tree: An empirical comparison with popular classical statistical methods and top classification mining methods. Expert Systems with Applications, 37(8), 5895–5904. http://dx.doi.org/10.1016/j.eswa.2010.02.016
Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml
Loh, W. Y. (2002). Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 12(2), 361–386.
Loh, W. Y. (2011). Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1), 14–23.
Loh, W. Y., He, X., & Man, M. (2015). A regression tree approach to identifying subgroups with differential treatment effects. Statistics in Medicine, 34(11), 1818–1833.
Malerba, D., Esposito, F., Ceci, M., & Appice, A. (2004). Top-down induction of model trees with regression and splitting nodes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(5), 612–625.
Minasny, B., & McBratney, A. B. (2008). Regression rules as a tool for predicting soil properties from infrared reflectance spectroscopy. Chemometrics and Intelligent Laboratory Systems, 94(1), 72–79. http://dx.doi.org/10.1016/j.chemolab.2008.06.003
Moisen, G. G., Freeman, E. A., Blackard, J. A., Frescino, T. S., Zimmermann, N. E., & Edwards, T. C., Jr. (2006). Predicting tree species presence and basal area in Utah: A comparison of stochastic gradient boosting, generalized additive models, and tree-based methods. Ecological Modelling, 199(2), 176–187. http://dx.doi.org/10.1016/j.ecolmodel.2006.05.021
Molinaro, A. M., Dudoit, S., & van der Laan, M. J. (2004). Tree-based multivariate regression and density estimation with right-censored data. Journal of Multivariate Analysis, 90(1), 154–177. http://dx.doi.org/10.1016/j.jmva.2004.02.003
Peng, Y., Xiong, X., Adhikari, K., Knadel, M., Grunwald, S., & Greve, M. H. (2015). Modeling soil organic carbon at regional scale by combining multi-spectral images with laboratory spectra. PLoS ONE, 10(11), 1–22. doi: 10.1371/journal.pone.0142295
Quinlan, R. J. (1992). Learning with continuous classes. In 5th Australian joint conference on artificial intelligence (pp. 343–348). Singapore: World Scientific.
R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
Rossel, R. A. V., & Webster, R. (2012). Predicting soil properties from the Australian soil visible near infrared spectroscopic database. European Journal of Soil Science, 63(6), 848–860. doi: 10.1111/j.1365-2389.2012.01495.x
RuleQuest (2016). Data mining with Cubist. https://www.rulequest.com/cubist-info.html
Seber, G., & Lee, A. (2012). Linear regression analysis. Wiley Series in Probability and Statistics. Wiley.
Sen, A., & Srivastava, M. (2012). Regression analysis: Theory, methods, and applications. Springer New York.
Smola, A. J., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14(3), 199–222.
Tsanas, A., & Xifara, A. (2012). Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools. Energy and Buildings, 49, 560–567.
Vens, C., & Blockeel, H. (2006). A simple regression based heuristic for learning model trees. Intelligent Data Analysis, 10(3), 215–236.
Wang, Y., & Witten, I. H. (1997). Induction of model trees for predicting continuous classes. In Poster papers of the 9th European conference on machine learning. Springer.
Wu, X., Kumar, V., Quinlan, R. J., Ghosh, J., Yang, Q., Motoda, H., et al. (2008). Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), 1–37.
Yang, L., Liu, S., Tsoka, S., & Papageorgiou, L. G. (2016). Mathematical programming for piecewise linear regression analysis. Expert Systems with Applications, 44, 156–167.
Yeh, I.-C. (1998). Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research, 28(12), 1797–1808.
Zhang, Y., & Sahinidis, N. V. (2013). Uncertainty quantification in CO2 sequestration using surrogate models from polynomial chaos expansion. Industrial and Engineering Chemistry Research, 52(9), 3121–3132.