Mathematical Programming for Piecewise Linear Regression Analysis

Lingjian Yang^a, Songsong Liu^a, Sophia Tsoka^b, Lazaros G. Papageorgiou^a,*

a Centre for Process Systems Engineering, Department of Chemical Engineering, University College London, Torrington Place, London WC1E 7JE, UK
b Department of Informatics, School of Natural and Mathematical Sciences, King's College London, Strand, London WC2R 2LS, UK

* Corresponding author: Tel.: +442076792563; Fax: +442073882348. Email addresses: lingjian.yang.10@ucl.ac.uk (Lingjian Yang), s.liu@ucl.ac.uk (Songsong Liu), sophia.tsoka@kcl.ac.uk (Sophia Tsoka), l.papageorgiou@ucl.ac.uk (Lazaros G. Papageorgiou)

Abstract

In data mining, regression analysis is a computational tool that predicts continuous output variables from a number of independent input variables, by approximating their complex inner relationship. A large number of methods have been successfully proposed, based on various methodologies, including linear regression, support vector regression, neural networks, piece-wise regression, etc. For piece-wise regression, the existing methods in the literature are usually restricted to problems of very small scale, due to their inherent non-linear nature. In this work, a more efficient piece-wise regression method is introduced based on a novel integer linear programming formulation. The proposed method partitions one input variable into multiple mutually exclusive segments, and fits one multivariate linear regression function per segment to minimise the total absolute error. Assuming that both the single partition feature and the number of regions are known, a mixed integer linear programming model is proposed to determine simultaneously the locations of the multiple break-points and the regression coefficients of each segment. Furthermore, an efficient heuristic procedure is presented to identify the key partition feature and the final number of break-points. Seven real-world problems covering several application domains have been used to demonstrate the efficiency of our proposed method. It is shown that our proposed piece-wise regression method can be solved to global optimality for datasets of thousands of samples, and that it consistently achieves higher prediction accuracy than a number of state-of-the-art regression methods. Another advantage of the proposed method is that the learned model can be conveniently expressed as a small number of if-then rules that are easily interpretable. Overall, this work proposes an efficient rule-based multivariate regression method based on piece-wise functions that achieves better prediction performance than state-of-the-art approaches. This novel method can benefit expert systems in various applications by automatically acquiring knowledge from databases to improve the quality of the knowledge base.

Keywords: regression analysis, surrogate model, piecewise linear function, mathematical programming, optimisation

1. Introduction

In data mining, regression is a type of analysis that predicts continuous output/response variables from several independent input variables. Given a number of samples, each of which is characterised by certain input and output variables, regression analysis aims to approximate their functional relationship. The estimated functional relationship can then be used to predict the level of the output variable for new query samples.
Generally, regression analysis can be useful under two circumstances: 1) when the process of interest is a black box, i.e. there is limited knowledge of the underlying mechanism of the system. In this case, regression analysis can accurately predict the output variables from the relevant input variables without requiring details of the, however complicated, inner mechanism (Bai et al., 2014; Venkatesh et al., 2014; Cortez et al., 2009; Davis & Ierapetritou, 2008). Quite frequently, the user would also like to gain some valuable insights into the true underlying functional relationship, which means that the interpretability of a regression method is also of importance; 2) when the detailed simulation model relating input variables to output variables, usually via some other intermediate variables, is known, yet is too complex and expensive to be evaluated comprehensively in feasible computational time. In this case, regression analysis is capable of approximating the overall system behaviour with much simpler functions while preserving a desired level of accuracy, and can then be evaluated much more cheaply (Caballero & Grossmann, 2008; Henao & Maravelias, 2011, 2010; Viana et al., 2014; Beck et al., 2012).

Over the past years, regression analysis has been established as a powerful tool in a wide range of applications, including: customer demand forecasting (Levis & Papageorgiou, 2005; Kone & Karwan, 2011), investigation of CO2 capture processes (Zhang & Sahinidis, 2013; Nuchitprasittichai & Cremaschi, 2013), optimisation of moving bed chromatography (Li et al., 2014b), forecasting of CO2 emissions (Pan et al., 2014), prediction of acidity constants for aromatic acids (Ghasemi et al., 2007), prediction of induction of apoptosis by different chemical components (Afantitis et al., 2006) and estimation of thermodynamic properties of ionic liquids (Chen et al., 2014; Wu et al., 2014).

A large number of regression analysis methodologies exist in the literature, including: linear regression, support vector regression (SVR), kriging, radial basis functions (RBF) (Sarimveis et al., 2004), multivariate adaptive regression splines (MARS), multilayer perceptron (MLP), random forest, K-nearest neighbour (KNN) and piecewise regression. We briefly summarise these regression methodologies before presenting our proposed method.

Linear regression

Linear regression is one of the most classic types of regression analysis, which predicts the output variables as linear combinations of the input variables. The regression coefficients of the input variables are usually estimated using least squared error or least absolute error approaches, and the problems can be formulated as either quadratic programming or linear programming problems, which can be solved efficiently. In some cases, when the estimated linear relationship fails to adequately describe the data, a variant of linear regression analysis, called polynomial regression, can be adopted to accommodate non-linearity (Khuri & Mukhopadhyay, 2010). In polynomial regression, higher-degree polynomials of the original independent input variables are added as new input variables into the regression function, before estimating the coefficients of the aggregated regression function.
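To make the linear programming view of least absolute error fitting concrete, the following is a minimal sketch of least-absolute-deviation (LAD) linear regression posed as an LP, assuming NumPy and SciPy. The helper name lad_regression and the toy data are illustrative only; this is not the formulation proposed later in this paper.

```python
# A minimal sketch of least-absolute-error linear regression posed as an LP.
import numpy as np
from scipy.optimize import linprog

def lad_regression(X, y):
    """Minimise sum_s |y_s - (w . x_s + b)| via linear programming."""
    S, M = X.shape
    # Decision vector z = [w (M), b (1), d (S)]; only the errors d are costed.
    c = np.concatenate([np.zeros(M + 1), np.ones(S)])
    ones, eye = np.ones((S, 1)), np.eye(S)
    # d_s >= y_s - (x_s.w + b)   and   d_s >= (x_s.w + b) - y_s
    A_ub = np.vstack([np.hstack([-X, -ones, -eye]),
                      np.hstack([ X,  ones, -eye])])
    b_ub = np.concatenate([-y, y])
    bounds = [(None, None)] * (M + 1) + [(0, None)] * S
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:M], res.x[M]  # coefficients w and intercept b

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3 + rng.normal(scale=0.05, size=50)
w, b = lad_regression(X, y)
```

Each auxiliary variable d_s bounds the absolute residual of sample s from above, so minimising their sum recovers the least absolute error fit; the same device reappears in the piece-wise model of Section 2.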
Polynomial functions of second degree have been used most frequently in the literature due to their robust performance and computational efficiency (Khayet et al., 2008; Minjares-Fuentes et al., 2014). Another popular variant of linear regression is the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1994). In LASSO, the summation of the absolute values of the regression coefficients is added as a penalty term into the objective function. The nature of LASSO encourages some coefficients to become exactly 0, thus performing implicit feature selection (Tibshirani, 2011).

Automated learning of algebraic models for optimisation (ALAMO) (Cozad et al., 2014; Zhang & Sahinidis, 2013) is a mathematical programming-based regression method that proposes low-complexity functions to predict output variables. Given the independent input features, ALAMO starts by defining a large set of potential basis functions, such as polynomial, multinomial, exponential and logarithmic forms of the original input variables. Subsequently, a mixed integer linear programming (MILP) model is solved to select the best subset of T basis functions that optimally fit the data. The value of T is initially set equal to 1 and then iteratively increased until the Akaike information criterion, which measures the generalisation of the constructed model, stops improving (Miller et al., 2014). The integer programming model is capable of capturing the combined effect of different basis functions, which is considered more efficient than traditional step-wise feature selection.

SVR

Support vector machine is a well-established statistical learning algorithm, which fits a hyperplane to the data in hand (Smola & Schölkopf, 2004). SVR minimises two terms in the objective function, one of which is the ε-insensitive loss function, i.e. only sample training error greater than a user-specified threshold, ε, is considered in the loss function. The other term is model complexity, which is expressed as the sum of squared regression coefficients. Controlling model complexity usually ensures model generalisation, i.e. high prediction accuracy on testing samples. Another user-specified trade-off parameter balances the significance of the two terms (Chang & Lin, 2011; Bermolen & Rossi, 2009). One of the most important features that contribute to the competitiveness of SVR is the kernel trick. The kernel trick maps the dataset from the original space to a higher-dimensional inner product space, where a linear regression is equivalent to a non-linear regression function in the original space (Li et al., 2000). A number of kernel functions can be employed, e.g. polynomial functions, radial basis functions and Fourier series (Levis & Papageorgiou, 2005). Formulated as a convex quadratic programming problem, SVR can be solved to global optimality. Despite the simplicity and optimality of SVR, the problem of tuning two parameters, i.e. the training error tolerance ε and the trade-off parameter balancing model complexity and accuracy, and the selection of suitable kernels still considerably affect its prediction accuracy (Lu et al., 2009; Cherkassky & Ma, 2004).

Kriging

Kriging is a spatial interpolation-based regression analysis methodology (Kleijnen & Beers, 2004). Given a query sample, kriging estimates its output as a weighted sum of the outputs of the known nearby samples.
The weights of the samples are computed solely from the data by considering sample closeness and redundancy, instead of being given by an arbitrary decreasing function of distance (Kleijnen, 2009). The interpolation nature of kriging means that the derived interpolant passes through the given training data points, i.e. the error between predicted output and real output is zero for all training samples. Different variants of kriging have been developed in the literature, including the most popular ordinary kriging (Lloyd & Atkinson, 2002; Zhu & Lin, 2010) and universal kriging (Brus & Heuvelink, 2007; Sampson et al., 2013).

MARS

MARS (Friedman, 1991) is another type of regression analysis that accommodates non-linearity and interaction between independent input variables in its functional relationship. Non-linearity is introduced into MARS in the form of the so-called hinge functions, which are expressions with max operators of the form max(0, X − const). If the independent variable X is greater than a constant number const, the hinge function is equal to X − const; otherwise the hinge function equals 0. The hinge functions create knots in the prediction surface of MARS. The functional form of MARS can be a weighted sum of a constant, hinge functions and products of multiple hinge functions, which makes it suitable to model a wide range of non-linearity (Andrés et al., 2011).

The building of MARS usually consists of two steps, a forward addition and a backward deletion step. In the forward addition step, MARS starts from one single intercept term/constant and iteratively adds the pair of hinge functions (i.e. max(0, X − const) and max(0, const − X)) that leads to the largest reduction in training error. Afterwards, a backward deletion step, which removes one by one those hinge functions contributing insignificantly to the model accuracy, is employed to improve the generalisation of the final model (Leathwick et al., 2006; Balshi et al., 2009). The presence of hinge functions also makes MARS a piece-wise regression method.

MLP

Multilayer perceptron is a feedforward artificial neural network, whose structure is inspired by the organisation of biological neural networks (Hill et al., 1994). An MLP typically consists of an input layer of measurable features and an output layer of response variables, sandwiching multiple intermediate layers of neurons. The network is fully interconnected in the sense that the neurons in each layer are connected to all the neurons in the two neighbouring layers (Comrie, 1997; Gevrey et al., 2003). Each neuron in the intermediate layers takes a weighted linear combination of the outputs from all neurons in the previous layer as input, and applies a non-linear transformation function before supplying the output to all neurons of the next layer. The use of non-linear transformation functions, including sigmoid, hyperbolic tangent and logarithmic functions, makes MLP suitable for modelling highly non-linear relationships (Gevrey et al., 2003; Rafiq et al., 2001).

Identifying the optimal configuration of an MLP, i.e. the number of intermediate layers, the number of neurons in each intermediate layer, the type of activation function of each neuron and the weights of the connections between consecutive layers of neurons, is known to be time-consuming and prone to becoming trapped in locally optimal solutions (Paliwal & Kumar, 2009). The large degree of freedom in training an MLP is often blamed for data over-fitting.
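As a concrete illustration of the common fixed-architecture practice discussed at the end of this subsection, the following is a minimal sketch of fitting an MLP regressor, assuming scikit-learn; the layer sizes, activation and other settings are arbitrary illustrative choices, not recommendations.

```python
# A minimal sketch of training a fixed-architecture MLP for regression with
# scikit-learn; hidden_layer_sizes and the other settings are illustrative.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 4))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.05, size=300)

# Two hidden layers of 16 and 8 tanh neurons; the connection weights are
# tuned by gradient-based back-propagation, so each run from a different
# random initialisation may reach a different locally optimal solution.
mlp = MLPRegressor(hidden_layer_sizes=(16, 8), activation="tanh",
                   solver="adam", max_iter=2000, random_state=0)
mlp.fit(X, y)
print(mlp.predict(X[:5]))
```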
Dua (2010) has presented a two-objective mathematical formulation that seeks the best configuration of an MLP by balancing training accuracy and network complexity. More often, the architecture of an MLP is fixed by the user and back-propagation is used to tune only the weights of the connections between neighbouring layers of neurons (Gudise & Venayagamoorthy, 2003; Zhang et al., 2007).

Random forest

Before introducing random forest we first describe the regression tree, which is a decision tree-based prediction model. Starting from the entire set of samples, a regression tree selects one independent input variable among all and performs a binary split into two child sets, under the condition that the two child nodes give increased purity of the data compared with their single parent node. Purity is often defined as the deviation from predicting with the mean value of the output variable. The process of binary splitting is recursively applied to each child node until a terminating criterion is satisfied. The nodes that are not further partitioned are called leaves. After growing a large tree, a pruning process is employed to remove the leaves contributing insignificantly to the purity improvement (Breiman et al., 1984; Loh, 2011). In order to improve model fit, a linear regression model can be fitted for each leaf (Quinlan, 1992).

Random forest is an ensemble learning method of regression trees. In general, random forest (Breiman, 2001; Biau, 2012) builds a forest of multiple regression tree models and aggregates the decisions from all the trees to produce a final prediction. Given a dataset, multiple bootstrap sample sets are first created by random sampling with replacement. Each bootstrap sample set is then learned by a revised regression tree algorithm, which differs from the classic regression tree by randomly selecting a candidate subset of features for each binary split of a node (Genuer et al., 2010). The accuracy of each regression tree can be estimated on the training samples absent from the bootstrap set, and the final prediction can be either a simple average of the predictions from all trees or a weighted average considering the estimated accuracy. It has been demonstrated that random forest achieves much more robust prediction performance than a single regression tree (Breiman, 2001; Fanelli et al., 2011).

KNN

KNN belongs to the category of lazy learning algorithms, due to the fact that prediction is based on the instances without an explicit training phase of constructing models, thus making it one of the simplest regression methods in the literature (Korhonen & Kangas, 1997). Given a query sample, KNN first identifies the K closest instances in the training sample set, where the exact value of K is given a priori. The closeness of samples can be measured by different distance metrics, for example Euclidean and Manhattan distances (Scheuber, 2010; Eronen & Klapuri, 2010). The prediction is then taken as the weighted mean of the outputs of the K nearest neighbours, with the weight often being defined as the inverse of distance (Papadopoulos et al., 2011). Despite its simplicity, KNN usually provides competitive prediction performance against much more sophisticated algorithms.

Previous work on piecewise regression

Piecewise functions have been frequently studied in the literature as well.
In (Toms & Lesperance, 2003), univariate piece-wise linear functions have been used to fit ecological data and identify break-points that represent critical threshold values of a phenomenon. In (Strikholm, 2006), a method based on statistical testing is proposed to estimate the number of break-points for a univariate piece-wise linear function. Malash & El-Khaiary (2010) also apply piece-wise linear regression techniques to univariate experimental adsorption data, where the piece-wise function is determined by solving a non-linear programming model. SegReg (www.waterlog.info/segreg.htm) is free software that permits the estimation of piece-wise regression functions with up to two independent variables. For one independent variable, SegReg considers a series of candidate break-points and for each one fits a linear regression on either side of the break-point. The break-point corresponding to the largest statistical confidence is taken as the final solution. In the case of two independent variables, SegReg first determines the two-region piece-wise regression function between the dependent variable and the most significant input variable, before computing the relation between its residual/deviation and the second input variable.

Both Magnani & Boyd (2009) and Toriello & Vielma (2012) published work on data fitting with a special family of piece-wise regression functions, called max-affine functions. A max-affine function is defined as the maximum of a series of linear functions, i.e. a sample is projected onto all linear functions, and the maximum projected value among all is taken as the final predicted value of the piece-wise function. The use of max-affine functions limits the fitted surface to being convex. In (Magnani & Boyd, 2009) a heuristic method is used to ease the difficulty of directly solving the highly non-linear max-affine functions, while in (Toriello & Vielma, 2012) big-M constraints are used to reformulate the problem into a non-convex mixed integer non-linear programming model. However, computational complexity limits their application to examples of small scale.

More recently, Greene et al. (2015) applied piece-wise regression analysis to predict patients' post-treatment quality of life from the pre-treatment quality-of-life measure, identifying the segments where therapy benefits vary significantly. The analysis is performed using Segmented (Muggeo, 2008), a package written in R (R Development Core Team, 2008). Segmented formulates the problem as a non-linear model and requires the user to specify the segmented input variables, the number of break-points and also an initial guess of each break-point. Starting from those user-supplied initial positions of the break-points, Segmented iteratively searches the neighbourhood of the initial guesses for break-points of better quality using local linearisation. However, it is difficult, if not impossible, to reasonably guess good starting points for real-world multivariate problems with large numbers of samples and input variables, where visual examination cannot be performed. This makes it hard to identify quality solutions. Furthermore, Segmented only allows the input variables being partitioned to have different regression coefficients across different segments, while the other input variables keep the same coefficients over their entire ranges, significantly restricting its flexibility.
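For concreteness, a max-affine prediction of the kind studied by Magnani & Boyd (2009) and Toriello & Vielma (2012) can be sketched as below; the slopes and intercepts are arbitrary illustrative values, not fitted coefficients.

```python
# A minimal sketch of evaluating a max-affine function: the prediction is the
# maximum over K local linear models, which forces the fitted surface to be
# convex. The coefficients below are arbitrary values for illustration.
import numpy as np

a = np.array([[1.0, -0.5],
              [0.2,  0.8],
              [-1.0, 1.5]])        # K x M matrix of slopes
b = np.array([0.0, 1.0, -0.5])     # K intercepts

def max_affine(x):
    """Return max_k (a_k . x + b_k) for a single sample x."""
    return np.max(a @ x + b)

print(max_affine(np.array([0.5, 2.0])))
```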
In both (Xue et al., 2013) and (Li et al., 2014a), piece-wise regression functions were employed to detect vegetation changes; piece-wise linear regression was tackled using fuzzy logic to identify the changes in patterns of vegetation greenness. Cavanaugh et al. (2014) employed piece-wise regression and found that the change in mangrove area over the last 20 years is a piece-wise function of latitude, with the regions above and below a specific threshold latitude value following two different patterns of mangrove growth. Moreover, Matthews et al. (2014) used two-segment piece-wise functions to describe the relationship between species richness and fragment area of islands, with the critical break-point being determined by simply sampling a number of candidate values and selecting the one giving the best model fit. Unfortunately, the above methods are all limited to modelling rather simple relationships between one output variable and one input variable, seriously limiting their usage in more complex problems.

It is clear that the piece-wise regression methods in the previous literature are non-linear and computationally restricted to problems of very small scale. Moreover, they cannot be solved to identify globally optimal solutions. In this work, we propose a novel linear model for piece-wise regression analysis. A single input variable is partitioned to separate the samples into multiple mutually exclusive segments, while each segment is fitted with its own multivariate linear regression function. Assuming that both the partition variable and the number of break-points are known, we propose an optimisation model that optimally estimates the positions of all break-points and the linear regression coefficients of each segment simultaneously, so that the total absolute deviation is minimised. Thanks to the use of binary variables, our proposed mathematical model is linear and can be efficiently solved to global optimality for problems of up to thousands of samples (see the Results and Discussion section). Furthermore, a solution procedure is used to identify the key partition variable and the final number of break-points. Several real-world multivariate benchmark datasets have been used to demonstrate the applicability and efficiency of the proposed method.

The proposed piece-wise regression method can help construct expert systems in various application domains. Expert systems are computer programs designed to make decisions analogous to human experts. As an expert system is typically made up of an inference engine and a knowledge base, the quality and quantity of information in the knowledge base directly affects the usefulness of the constructed expert system. Our proposed piece-wise regression method can be helpful in building expert systems more efficiently via automatic and efficient acquisition of knowledge. More specifically, the proposed piece-wise regression method can extract latent knowledge from the large collection of databases curated by domain experts. The discovered knowledge is represented in the form of identified relationships between input and output variables of interest, which can be combined with expert knowledge to form the final expert system (Alonso et al., 2012). For example, the piece-wise regression method proposed in this work can be used for building prognostic expert systems in medical applications.
When presented with historical data of patients' clinical variables and survival length, piece-wise regression can induce domain knowledge by approximating the complex relationship between clinical variables and survival length. The induced knowledge can then be used to perform prognosis for current patients, imitating the end behaviour of human experts, i.e. medical doctors.

Overall, the key contributions of our work are as follows:

• We propose a novel mixed integer optimisation model for multivariate regression analysis modelling piece-wise linear functions, which partitions a single variable into multiple mutually exclusive regions and fits each one with a distinct multivariate linear function. Given a priori a single input variable as the partition feature and the number of segments, the optimisation model can be solved to determine simultaneously the positions of the multiple break-points and the regression coefficients of each segment.

• Given that neither the feature that should be segmented nor the number of segments is typically known, a heuristic solution procedure is also introduced that automatically identifies the key partition variable and the final number of segments.

• A number of real-world benchmark problems have been employed to demonstrate the applicability and efficiency of the proposed method. In sharp contrast to the existing piece-wise regression methods in the literature, which can only be applied to problems of very small size, our proposed optimisation model can be solved to global optimality for datasets containing up to five thousand samples. Comparison with some popular regression methods based on other methodologies clearly indicates that our proposed method based on piece-wise functions achieves the highest prediction accuracy, and does so consistently. Besides high prediction performance, our proposed regression method has the advantage of being easily understandable and interpretable, as the learned model can be conveniently represented as a small set of rules.

• As a generic data mining method, our proposed regression method can help with the construction of expert and intelligent systems via automatic extraction of knowledge from databases. We have discussed its potential usage in various application domains.

The rest of the paper is structured as follows: in Section 2, we present the mathematical programming model and a heuristic solution procedure. In Section 3, comparative results of our proposed method and some state-of-the-art regression algorithms on benchmark examples are presented and discussed. The last section concludes with our key findings.

2. Method

A novel piecewise linear regression method is proposed in this work. The core idea of the proposed method is to identify a single input feature, and to separate the samples into complementary regions on this feature. A different linear regression function is fitted locally for each region. The sample partition and the calculation of the local regression coefficients are performed simultaneously within the proposed optimisation model so as to achieve the least absolute error.

2.1. A novel regression method

In this section, we first describe a novel mathematical programming model that optimises the locations of the break-points and the regression coefficients of each region so as to achieve minimal training error. Subsequently, a solution procedure is proposed to identify the best partition feature and the number of regions.
The indices, parameters and variables associated with the proposed model are listed below:

Indices
$s$: sample, $s = 1, 2, \ldots, S$
$m$: feature/independent input variable, $m = 1, 2, \ldots, M$
$r$: region, $r = 1, 2, \ldots, R$
$m^*$: the feature on which the sample partition takes place

Parameters
$A_{sm}$: numeric value of sample $s$ on feature $m$
$Y_s$: output value of sample $s$
$U', U''$: arbitrarily large positive numbers

Continuous variables
$W_m^r$: regression coefficient of feature $m$ in region $r$
$B^r$: intercept of the regression function in region $r$
$\mathit{Pred}_s^r$: predicted output of sample $s$ in region $r$
$X_{m^*}^r$: break-point $r$ on partition feature $m^*$
$D_s$: training error between predicted output and real output of sample $s$

Binary variables
$F_s^r$: 1 if sample $s$ falls into region $r$; 0 otherwise

Assuming first that both the partition feature $m^*$ and the number of regions $R$ are given, the $R-1$ break-points are arranged in an ordered way:

$$X_m^{r-1} \le X_m^r \qquad \forall\, m = m^*,\ r = 2, 3, \ldots, R \tag{1}$$

Binary variables $F_s^r$ are introduced to model whether or not sample $s$ belongs to region $r$. Modelling of which sample belongs to which region is achieved with the following constraints:

$$X_m^{r-1} - U'(1 - F_s^r) \le A_{sm} \qquad \forall\, s,\ r = 2, 3, \ldots, R,\ m = m^* \tag{2}$$

$$A_{sm} \le X_m^r + U'(1 - F_s^r) \qquad \forall\, s,\ r = 1, 2, \ldots, R-1,\ m = m^* \tag{3}$$

When sample $s$ belongs to region $r$ (i.e. $F_s^r = 1$), $A_{sm^*}$ falls into the region bounded by the two consecutive break-points $X_{m^*}^{r-1}$ and $X_{m^*}^r$ on feature $m^*$; otherwise the two sets of constraints become redundant. A visualisation of break-points and regions is provided in Figure 1.

Figure 1: Break-points and regions

The following constraints restrict each sample to belong to one and only one region:

$$\sum_r F_s^r = 1 \qquad \forall\, s \tag{4}$$

For sample $s$, its predicted output value in region $r$, $\mathit{Pred}_s^r$, is as below:

$$\mathit{Pred}_s^r = \sum_m A_{sm} W_m^r + B^r \qquad \forall\, s, r \tag{5}$$

For any sample $s$, its training error is equal to the absolute deviation between the real output and the predicted output of the region $r$ to which it belongs (i.e. $F_s^r = 1$):

$$D_s \ge Y_s - \mathit{Pred}_s^r - U''(1 - F_s^r) \qquad \forall\, s, r \tag{6}$$

$$D_s \ge \mathit{Pred}_s^r - Y_s - U''(1 - F_s^r) \qquad \forall\, s, r \tag{7}$$

The objective function is to minimise the sum of the absolute training errors:

$$\min \sum_s D_s \tag{8}$$

The final model, named Optimal Piece-wise Linear Regression Analysis (OPLRA) in this work, consists of a linear objective function and several linear constraints, and the presence of both binary and continuous variables defines an MILP problem, which can be solved to global optimality by standard solution algorithms, for example branch and bound.

A heuristic solution procedure is also employed in this work to identify the partition feature and the number of regions, as described in Figure 2 below. The heuristic procedure starts by solving a linear regression on the entire set of data with least absolute deviation. Subsequently, each input feature in turn serves as partition feature $m^*$ once and the OPLRA model is solved while allowing two regions (i.e. $R = 2$). The feature corresponding to the minimum training error is kept and, if its error represents a percentage reduction of more than β over the global linear regression without data partition, the procedure continues; otherwise it is decided that two-region piecewise linear regression does not provide a desirable improvement upon classic linear regression, and the initially derived linear regression function without sample partition is used for prediction.
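Before continuing with the heuristic, the following is a minimal sketch of the OPLRA model (1)-(8) together with the β-driven procedure described here and in the next paragraphs, written with PuLP and its bundled CBC solver (the paper itself uses GAMS with CPLEX). The function names, the single big-M value U standing in for both U′ and U′′, and the data layout are illustrative assumptions.

```python
# A minimal sketch of OPLRA (equations (1)-(8)) and the beta-driven heuristic.
import pulp

def oplra(A, Y, m_star, R, U=1e5):
    """Solve the MILP for a fixed partition feature m_star and R regions."""
    S, M = len(A), len(A[0])
    prob = pulp.LpProblem("OPLRA", pulp.LpMinimize)
    W = pulp.LpVariable.dicts("W", (range(R), range(M)))
    B = pulp.LpVariable.dicts("B", range(R))
    X = pulp.LpVariable.dicts("X", range(R - 1))            # break-points
    F = pulp.LpVariable.dicts("F", (range(R), range(S)), cat="Binary")
    D = pulp.LpVariable.dicts("D", range(S), lowBound=0)
    prob += pulp.lpSum(D[s] for s in range(S))              # objective (8)
    for r in range(R - 2):                                  # ordering (1)
        prob += X[r] <= X[r + 1]
    for s in range(S):
        prob += pulp.lpSum(F[r][s] for r in range(R)) == 1  # one region (4)
        for r in range(R):
            pred = pulp.lpSum(A[s][m] * W[r][m] for m in range(M)) + B[r]  # (5)
            prob += D[s] >= Y[s] - pred - U * (1 - F[r][s])  # (6)
            prob += D[s] >= pred - Y[s] - U * (1 - F[r][s])  # (7)
            if r > 0:                                        # lower bound (2)
                prob += X[r - 1] - U * (1 - F[r][s]) <= A[s][m_star]
            if r < R - 1:                                    # upper bound (3)
                prob += A[s][m_star] <= X[r] + U * (1 - F[r][s])
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return pulp.value(prob.objective)

def heuristic_fit(A, Y, beta=0.03):
    """Grow regions on the best feature while the error drops by > beta."""
    M = len(A[0])
    best_err = oplra(A, Y, m_star=0, R=1)                   # plain LAD fit
    errs = {m: oplra(A, Y, m, R=2) for m in range(M)}       # try each feature
    m_star = min(errs, key=errs.get)
    R = 2
    while errs[m_star] < (1 - beta) * best_err:             # beta criterion
        best_err, R = errs[m_star], R + 1
        errs[m_star] = oplra(A, Y, m_star, R)
    return m_star, R - 1, best_err
```

With R = 1 the model reduces to plain least-absolute-deviation regression, which is exactly the first step of the heuristic; note that this sketch uses an off-the-shelf branch-and-bound solver, so no local-search initialisation of the break-points is required.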
The parameter β, taking a value between 0 and 1, quantifies the percentage reduction in training error that justifies adding one more region.

Figure 2: Heuristic procedure to identify the partition feature and the number of regions

If two-region piecewise regression is accepted, the corresponding partition feature is retained for further analysis while the number of regions is iteratively increased, until the β training-error reduction criterion is no longer satisfied between iterations.

β is the only user-specified parameter in our proposed regression method, and it requires fine tuning. A small value may cause over-fitting, i.e. too many regions are allowed and each region contains only a small number of samples, which then results in unreliable construction of the local linear regression functions; a value that is excessively large will lead to premature termination of the growing of regions, which then under-fits the data. In the Results and Discussion section, we test a series of values on a number of benchmark datasets and select the optimal value corresponding to the most robust prediction performance.

The constructed piecewise linear regression functions are then used to predict the output values of new samples. A testing sample is first assigned to one of the regions, and the regression coefficients of that region are used to estimate its output value.

2.2. An illustrative example

In order to better illustrate the training of the proposed regression method, a simulation model is taken from the literature. In brief, the illustrative example (Palmer & Realff, 2002) describes the operation of a continuous stirred tank reactor, where a chain reaction A → B → C takes place. An inlet stream containing both reactants A and B enters the reactor and the desirable output is component B. There are 4 independent input variables to the simulation model: the temperature of the reactor (T), the volume of the reactor (V), and the concentrations of A and B in the inlet stream (CAin and CBin). The output to be predicted is the production rate of B (P). The process and associated variables are described in Figure 3.

Figure 3: Illustrative example of a continuous stirred tank reactor

With the Latin hypercube sampling technique (Helton & Davis, 2003) employed to specify a set of data points, we run the simulation model and collect 300 samples. The goal of the regression analysis is to approximate the functional relationship between the output variable P and the input variables T, V, CAin and CBin using piece-wise linear functions. A step-wise description of the training procedure is presented in Table 1 below.

Initially, a linear regression function is fitted to the entire dataset without feature segmentation, which gives an absolute deviation of 1677.78. The second iteration of the method solves 4 independent OPLRA models allowing 2 regions each, respectively specifying T, V, CAin and CBin as the partition feature. The two-region piece-wise linear functions constructed while partitioning on T yield a lower training error (i.e. 1030.63) than the other 3, and are therefore taken as the solution of iteration 2. This represents a significant improvement (i.e. 38.57%) over the initial global linear regression function. From iteration 3 onwards, the partition feature is fixed as T while one more region is allocated at each additional iteration. Iterations 3 and 4 lower the training error to 876.66 and 807.12, respectively. The iterative procedure terminates when the β criterion is not satisfied; for example,
if β = 20%, the iterative procedure terminates at the third iteration and the final regression function has 2 regions; if β = 10%, the final regression function has 3 regions.

Table 1: Piecewise regression functions built at each step of the training procedure

Iteration 1 (1 region, partition feature NONE, training error 1677.78):
  P = 1.0240T + 0.0054CAin + 0.0125CBin + 0.4340V − 333.54

Iteration 2 (2 regions, partition feature T, training error 1030.63, improvement 38.57%):
  P = 0.7413T + 0.0040CAin + 0.0102CBin + 0.3406V − 238.74,  T ≤ 213.21
  P = 1.7156T + 0.0111CAin + 0.0315CBin + 0.7574V − 592.63,  T > 213.21

Iteration 2 (2 regions, partition feature V, training error 1143.49):
  P = 0.5952T + 0.0033CAin + 0.0056CBin + 0.4533V − 194.26,  V ≤ 42.38
  P = 1.4781T + 0.0083CAin + 0.0195CBin + 0.4773V − 48.70,   V > 42.38

Iteration 2 (2 regions, partition feature CAin, training error 1485.65):
  P = 0.8930T + 0.0057CAin + 0.0152CBin + 0.4161V − 293.45,  CAin ≤ 3528.43
  P = 1.4857T + 0.0073CAin + 0.0070CBin + 0.5929V − 489.45,  CAin > 3528.43

Iteration 2 (2 regions, partition feature CBin, training error 1627.73):
  P = 1.0242T + 0.0056CAin + 0.0118CBin + 0.4241V − 333.49,  CBin ≤ 458.21
  P = 1.1105T + 0.0050CAin − 0.1405CBin + 0.5813V − 291.00,  CBin > 458.21

Iteration 3 (3 regions, partition feature T, training error 876.66, improvement 14.94%):
  P = 0.5815T + 0.0030CAin + 0.0097CBin + 0.2654V − 184.45,  T ≤ 303.25
  P = 1.1353T + 0.0062CAin + 0.0176CBin + 0.4579V − 373.68,  303.25 < T ≤ 316.62
  P = 1.8764T + 0.0119CAin + 0.0394CBin + 0.8617V − 654.41,  T > 316.62

Iteration 4 (4 regions, partition feature T, training error 807.12, improvement 7.93%):
  P = 0.5815T + 0.0030CAin + 0.0097CBin + 0.2654V − 184.45,  T ≤ 303.25
  P = 1.2648T + 0.0054CAin + 0.0148CBin + 0.4510V − 409.61,  303.32 < T ≤ 312.21
  P = 1.4872T + 0.0084CAin + 0.0202CBin + 0.6667V − 503.10,  312.21 < T ≤ 320.77
  P = 1.9930T + 0.0128CAin + 0.0360CBin + 0.8871V − 695.65,  T > 320.77
...

Overall, the key features of our proposed piecewise linear regression method are summarised here: 1) our method identifies one key partition feature and separates the samples into multiple complementary regions on it; 2) each region has the flexibility of being fitted by its own linear regression function, with all input features allowed to have different regression coefficients across different regions; 3) there is only one tuning parameter, β; 4) compared with algorithms like kernel-based SVR and MLP, the constructed regression function is easy to understand, as it exhibits linear relationships in the different regions.

It is noted here that the obtained relationship between input and output variables, presented as rules in Table 1, can be used to build an expert system for the above operation. Given the chain reaction A → B → C in the stirred tank reactor (Palmer & Realff, 2002), domain experts perform experiments to create a database of samples for different levels of temperature, reactor volume and concentrations of reactants. Our proposed piece-wise regression method is then applied to automatically extract the rules that predict the production rate from temperature, reactor volume and reactant concentrations. These rules would be difficult to provide directly even for chemical engineering experts, due to the complex nature of the reaction. Since the above extracted rules can calculate a production rate value for any arbitrary values of temperature, tank volume and reactant concentrations, regardless of whether they obey physical laws (they must be positive) or are valid for the reaction of interest, expert knowledge should be incorporated to further refine the rules.
For example, expert knowledge can be used to constrain the applicable temperature range, outside which the liquid phase will vaporise to gas or freeze to solid, making it impossible for the reaction to proceed as normal. The final expert system will allow users to query the likely outcome, as a production rate or no reaction, for any combination of values of temperature, reactor volume and reactant concentrations.

In the next section, a number of real-world regression problems are employed to benchmark the predictive performance of our proposed model.

3. Results and Discussion

A total of 7 real-world datasets have been downloaded from the UCI machine learning repository (http://archive.ics.uci.edu/ml/) (Bache & Lichman, 2013) to test the prediction performance of our proposed method. The first regression problem, Yacht Hydrodynamics, predicts the hydrodynamic performance of sailing yachts from 7 features describing the hull dimensions and the velocity of the boat, for 308 samples. Energy Efficiency (Tsanas & Xifara, 2012) collects data corresponding to 768 building shapes, described by 8 features including wall area, roof area and so on. The aims are to establish the relationships between either heating load or cooling load requirements and the 8 parameters of the building. The third example, Concrete Strength (Yeh, 1998), looks into the relationship between the compressive strength of concrete and 8 input variables, including water concentration and age, with 1030 samples of different concretes. The Airfoil dataset concerns how different airfoil blade designs, wind speeds and angles of attack affect the sound pressure level. The last 2 case studies, Red Wine Quality and White Wine Quality (Cortez et al., 2009), aim to predict experts' preference of red and white wine taste from 11 physicochemical features of the wines. Almost 1600 red wine and 4900 white wine samples have been obtained for analysis.

For each of the 7 benchmark datasets, a 5-fold cross validation is performed to estimate the predictive accuracy of the proposed method. Given a dataset, 5-fold cross validation randomly splits the samples into 5 subsets of equal size. Each subset is in turn held out once while the other 4 subsets of samples are used in the training process to derive the regression function. The holdout set is then used to validate the predictive accuracy of the constructed regression function. We conduct 10 rounds of 5-fold cross validation by performing different random sample splits, and the mean absolute prediction errors (MAE) are averaged over the 50 testing sets as the final error.

For comparison purposes, a number of state-of-the-art regression methods have been implemented, including linear regression, MLP, kriging, SVR, KNN, random forest, MARS, PaceRegression and ALAMO. Linear regression, MLP, kriging, SVR, KNN and PaceRegression are implemented in the WEKA machine learning software (Hall et al., 2009). For KNN, the number of nearest neighbours is selected as 5, while for the other methods their default settings have been retained. Random forest is implemented using Orange (Demšar et al., 2013). We use the MATLAB toolbox ARESlab (Jekabsons, 2011) for MARS. ALAMO is reproduced using the General Algebraic Modeling System (GAMS) (GAMS Development Corporation, 2013), and basis function forms including polynomials of degrees up to 3, pair-wise multinomial terms of equal exponents up to 3, and exponential and logarithmic forms are provided for each dataset.
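The repeated cross-validation protocol described above can be sketched as follows, assuming scikit-learn and any regressor exposing fit/predict; LinearRegression and the toy data here are stand-ins for the methods and datasets actually compared.

```python
# A minimal sketch of 10 rounds of 5-fold cross-validation, with the mean
# absolute error averaged over the resulting 50 test folds.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression  # stand-in for any method

def repeated_cv_mae(model, X, y, rounds=10, folds=5):
    maes = []
    for seed in range(rounds):                      # 10 different random splits
        for tr, te in KFold(folds, shuffle=True, random_state=seed).split(X):
            model.fit(X[tr], y[tr])
            maes.append(mean_absolute_error(y[te], model.predict(X[te])))
    return np.mean(maes)                            # average over 50 test sets

rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)
print(repeated_cv_mae(LinearRegression(), X, y))
```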
Our proposed method is also implemented in GAMS. Both ALAMO and our proposed model are solved using the CPLEX MILP solver, with the optimality gap set to 0. A computational resource limit of 200 seconds is set for each solve of the OPLRA model in our proposed method.

3.1. Sensitivity analysis for β

In this subsection, a sensitivity analysis is performed for the parameter β, which serves as the terminating criterion of the iterative training procedure of our proposed method. Taking a value between 0 and 1, β defines the minimum percentage training error reduction that must be achieved to justify the allocation of an extra region. A range of values has been tested, including: 0.2, 0.15, 0.10, 0.05, 0.03 and 0.01. The results of the sensitivity analysis are provided in Figure 4 below.

Figure 4: Sensitivity analysis of β. The numbers above the points in each plot correspond to the average numbers of final regions.

Figure 4 describes how the mean absolute error changes with β. The numbers attached to the points in each plot are the average numbers of final regions, which always go up as β decreases. For the Yacht Hydrodynamics example, setting β = 0.20 results in just over 4 final regions. Decreasing the β value to 0.15 increases the prediction error slightly with a marginally higher number of regions. Further decreasing β to 0.10 leads to the lowest mean prediction error of 0.648 with an average of 5 regions, before excessively low values of β over-fit the unseen testing samples, yielding much increased prediction errors. For the Energy Efficiency Heating case study, when β = 0.10, 0.15 and 0.20 our proposed regression method constructs piece-wise regression functions with an average of 3 regions, yielding an MAE of 0.907. Smaller values of β lead to about 5 regions, which are shown to predict the testing samples with higher accuracy (MAE around 0.810). For the Energy Efficiency Cooling and Concrete Strength examples, a similar phenomenon can be observed: when β takes overly high values (i.e. 0.20, 0.15), the proposed method terminates prematurely with only 2 regions and relatively high MAE. More regions are allowed by lowering β, which gives higher prediction accuracies. On the Airfoil case study, the proposed method outputs a global multiple linear regression function without data partitions when β = 0.20. As β decreases, more regions are permitted, which predict unseen samples with better accuracy. With regard to the Red Wine Quality dataset, the optimal prediction occurs when β = 0.03. On the last example, White Wine Quality, the 2-region piece-wise regression functions achieved with β = 0.01, 0.03 and 0.05 outperform the global multiple linear regressions obtained for higher values of β.

It can be seen from Figure 4 that values in the range between 0.01 and 0.05 generally lead to smaller prediction errors than higher values of β. For all datasets except Yacht Hydrodynamics, the prediction errors for β = 0.01, 0.03 and 0.05 are evidently smaller than those for β = 0.10, 0.15 and 0.20. Within the range between 0.01 and 0.05, there is no clear optimal value for β, as different values have different effects on the accuracy. We instead seek to identify the most robust value for β, which gives consistently desirable prediction accuracy across a wide range of problems. For each dataset, we normalise the MAE of each β according to the formula:

$$\text{normalised }\mathrm{MAE}_\beta = \frac{\mathrm{MAE}_\beta - \min_{\beta'} \mathrm{MAE}_{\beta'}}{\min_{\beta'} \mathrm{MAE}_{\beta'}}$$
For example, in Yacht Hydrodynamics, the original MAE of β = 0.01 is normalised from 0.7131 to (0.7131 − 0.6481)/0.6481 = 10.0%, where 0.6481 is the lowest MAE, achieved when β = 0.10. The normalised MAE of each β represents its actual deviation from the lowest error, and is averaged over all examples to reflect its overall competitiveness.

Overall, β = 0.03 provides the smallest normalised MAE of 1.7%, which is marginally lower than those of β = 0.01 and β = 0.05, of 1.8% and 2.8% respectively. Even higher values of β correspond to noticeably larger normalised MAEs (5.6%, 9.7% and 12.3% for β = 0.10, 0.15 and 0.20, respectively). The consistently small normalised MAEs while β is between 0.01 and 0.05 show that our proposed regression method is robust with respect to the only user tuning parameter β. Finally, β is set to 0.03 when comparing with the other competing methods in the literature.

3.2. Prediction performance comparison

After identifying a value (i.e. 0.03) for the only tuning parameter β in our proposed regression method, we now compare the accuracy of the proposed method against some popular regression algorithms on the same set of 7 examples. The results of the comparison are presented in Table 2 below.

Table 2: Comparative testing of different regression methods on benchmark datasets (MAE; the lowest prediction error for each dataset is marked with an asterisk)

Method             Hydrodynamics  Energy Heating  Energy Cooling  Concrete  Airfoil  Red Wine  White Wine
Linear regression  7.270          2.089           2.266           8.311     0.037    0.506     0.586
MLP                0.809          0.993           1.924           6.229     0.035    0.581     0.623
Kriging            4.324          1.788           2.044           6.224     0.030    0.496     0.576
SVR                6.445          2.036           2.191           8.212     0.037    0.500     0.585
KNN                5.299          1.937           2.148           7.068     0.026*   0.515     0.537
Random forest      3.516          1.435           1.644           5.309     0.030    0.484     0.519*
MARS               1.011          0.796*          1.324           4.871     0.035    0.502     0.570
PaceRegression     7.233          2.089           2.261           8.298     0.037    0.507     0.586
ALAMO              0.787          2.722           2.765           8.044     0.032    0.594     0.639
Proposed           0.706*         0.810           1.278*          4.870*    0.029    0.481*    0.551

In Table 2, for each tested dataset, the lowest prediction error achieved among all implemented regression methods is marked with an asterisk. On the Hydrodynamics problem, the proposed method provides an MAE of 0.706, which is lower than that of any competing algorithm. ALAMO, MLP and MARS follow closely with MAEs of 0.787, 0.809 and 1.011, respectively. The mean errors of the rest of the methods are between 3 and 8. On Energy Efficiency Heating, MARS emerges as the most accurate algorithm with a mean absolute error of 0.796, which is closely matched by our proposed method and MLP. The mean prediction errors of the other approaches are almost all twice as large as that of MARS. On the Energy Efficiency Cooling dataset, the proposed method, MARS, random forest and MLP are the top 4 performers with MAEs between 1.278 and 1.924. On Concrete Strength, our proposed approach and MARS, with MAEs of 4.870 and 4.871, again emerge as the leading methods ahead of random forest, kriging, MLP and the others. On the Airfoil example, all the competing algorithms achieve similar prediction accuracies, with KNN topping the chart with an MAE of 0.026. The proposed approach is merely 0.003 behind, with kriging and random forest a further 0.001 behind; a mere 0.011 separates the 10 methods. Lastly, on the two Wine Quality examples, our proposed approach is ranked as the 1st and 3rd best method, respectively.
Overall, for 4 out of the 7 datasets, namely Yacht Hydrodynamics, Energy Efficiency Cooling, Concrete Strength and Red Wine Quality, the proposed regression method achieves the lowest prediction errors. For the other 3 tested examples, the proposed method still performs competitively, coming second on Energy Efficiency Heating and Airfoil, and third on White Wine Quality.

As there does not exist a single regression method that always outperforms all others on all datasets, a desirable regression algorithm should demonstrate consistently competitive prediction accuracy. In order to evaluate more comprehensively the relative competitiveness of all the implemented approaches, we employ the following scoring strategy: for each problem, the regression methods are ranked according to their mean prediction error. The best regression method, corresponding to the lowest prediction error, is awarded the maximum score of 10, the second best regression method, corresponding to the second lowest prediction error, is assigned a score of 9, and so on. The scores of each regression approach are averaged over the 7 datasets to represent the overall performance of the method. The higher the score, the better the relative performance of a method. The scores of the different regression approaches used in this work are presented in Figure 5 below.

Figure 5: Scoring of regression methods

According to Figure 5, the proposed method is shown to be the most accurate and robust regression algorithm among all, achieving a score of 9.43 out of a possible 10. Random forest and MARS are second and third in the ranking with scores of 8 and 7.43, followed by kriging, KNN, MLP, SVR, ALAMO, linear regression and PaceRegression in descending order. The advantage of the proposed regression method over the other implemented methods is quite obvious.

Lastly, we look, for each dataset, at the number of regions and the key partition feature determined by our proposed regression method. The results are summarised in Table 3. It is clear that the proposed segmented regression method provides good interpretability, as the numbers of regions are small (usually between 2 and 4 and at most 5). The partition features may reveal important insights into the underlying system, as the output variables change more dramatically across different ranges along these features.

Table 3: Number of regions and partition feature selected by our proposed method

Dataset          Number of regions  Partition feature
Hydrodynamics    5                  Froude number
Energy Heating   3                  Wall area
Energy Cooling   3                  Wall area
Concrete         3                  Age
Airfoil          4                  Frequency
Red Wine         2                  Alcohol
White Wine       2                  Volatile acidity

3.3. Strengths and weaknesses of the proposed piece-wise regression method

No regression method will be the best for all problems. In this section, we give a general illustration of the pros and cons of the proposed OPLRA piece-wise linear regression method, and compare it against some other literature methods. OPLRA piece-wise regression is inherently deterministic, which means the same solution is always guaranteed regardless of the number of runs executed. This is an advantage of OPLRA over stochastic methods, for example MLP, where each execution would typically end up with a different locally optimal solution. On the other hand, OPLRA is intuitive and easy to interpret.
OPLRA approximates the potentially highly non-linear relationship between output and input variables with piece-wise linear algebraic functions, a formalism that is easy to understand, interpret and use for users without sophisticated background knowledge. In contrast, the mechanisms of certain methods like SVR, MLP and kriging lack transparency, as the former two work as black-box techniques and the latter requires detailed knowledge of statistics. The small number of user-specified parameters involved in the training of OPLRA is another remarkable advantage. β is the only tuning parameter in the proposed OPLRA, which produces robust predictive performance with regard to varying values of β, as shown in the sensitivity analysis of Section 3.1. Conversely, the usage of certain regression methods, including SVR, MLP and kriging, requires tuning a large number of parameters, making it a challenging task to identify their optimal values. More importantly, OPLRA piece-wise regression achieves more accurate and robust prediction performance than the other methods. Using a large number of real-world problems, OPLRA has been shown to outperform popular state-of-the-art multivariate regression methods in terms of prediction accuracy, and to do so consistently.

With regard to the shortcomings of our proposed piece-wise regression method, training OPLRA generally consumes more computational resources than the existing methods in the literature. Solving the OPLRA combinatorial optimisation model is indeed a more computationally intensive task than heuristic-based methods, for example regression trees and MARS, or quadratic programming-based SVR. Therefore, we note here that this method is not designed for online applications where computation time is valued more than prediction accuracy and model interpretability. Another limitation of OPLRA is that it permits segmentation of only one input variable, which may not be adequate for datasets where the trend of the output variable changes dramatically in more than one input variable.

4. Concluding Remarks

This work addresses the problem of multivariate regression analysis, where one seeks to estimate the complex relationship between dependent output variables and independent input variables from training samples. The identified relationships can then be used to make predictions for unseen observations. We have proposed a novel piece-wise regression method, which approaches the problem by segmenting one input variable into multiple mutually exclusive regions and simultaneously fitting each one with a distinct multivariate linear function. An optimisation model has been proposed to optimise the locations of the break-points and the regression coefficients of each region, while a heuristic procedure has also been introduced to find the key partition feature and the number of break-points by repeatedly solving the optimisation model until a satisfactory solution is identified.

To demonstrate the applicability and efficiency of the proposed piece-wise regression method, 7 real-world problems have been employed, covering a wide range of application domains. To benchmark the predictive capability of the proposed method, we have also implemented various popular regression methods from the literature for comparison, including support vector regression, artificial neural networks, MARS and K-nearest neighbour.
Computational experiments clearly indicate that our proposed piece-wise regression method achieves consistently high predictive accuracy, leading to the lowest prediction errors for 4 out of the 7 datasets, the second lowest errors for 2 datasets and the third lowest error for the other example. The results confirm our proposed method as a reliable alternative to traditional regression analysis methods. Another remarkable advantage of our proposed method is that the learned model can be conveniently expressed as a set of if-then rules that are compact and easily understandable. From Table 3, it is clear that the numbers of if-then rules identified by our method as the hidden patterns in the large-scale databases (of up to thousands of expert-curated samples) are extremely small (usually 2 to 3 and at most 5). The model interpretability of the proposed piece-wise regression is a desirable advantage over black-box modelling techniques, for example support vector regression and neural networks.

With regard to the research contribution in expert and intelligent systems, the generic machine learning method proposed in this work can be used to construct a large number of automatic decision making or decision support systems for various domain applications. As the quality and coverage of the information contained in the knowledge base critically affect the efficiency of any expert and intelligent system, our proposed machine learning method can serve to acquire knowledge from databases automatically and more efficiently by approximating the relationships between output and input variables as rules. Subsequently, the discovered knowledge can be used to generate forecasts for users' enquiries.

To further improve the efficiency of the piece-wise regression method proposed in this work, the following limitations can be considered for refinement. As the method can only partition a single input variable, one potential improvement is to generalise it to permit segmentation of multiple variables, so as to better capture the non-linearity in datasets. Secondly, as our proposed method can only handle continuous input variables, we plan to improve its applicability by generalising it to deal with categorical input variables having many distinct levels. In addition, the relationship between output and input variables is approximated as linear within each segment in the current method, which may not adequately model the underlying patterns. To overcome this, more complex non-linear basis functions, for example polynomial, exponential and logarithmic forms, can be added to allow more flexibility. Another limitation of our method is the relatively high computational cost, which may restrict its usage in certain online applications, where the learning speed of the method is considered more important than the actual prediction accuracy. To tackle this problem, we can explore more efficient heuristic solution procedures that, by estimating the possible break-point positions and restricting the solution space, converge more quickly to a quality solution.

In terms of practical future applications in expert and intelligent systems, the proposed piece-wise regression method can benefit many via the automatic extraction of knowledge from databases to generate accurate forecasts. As examples, we have identified the following directions as possible avenues worth investigating in the near future.
First, our proposed method can be incorporated into the construction of a decision support expert system that continuously predicts the personalised risk posed by prisoners with mental illness upon release from jail, aiding clinicians in decision making (Constantinou et al., 2015). Other applications that can benefit from our work include intelligent drowsiness monitoring systems and stock price prediction. In drowsiness monitoring, the proposed regression model can be built into intelligent fatigue detection equipment, which records the dynamic physiological signals of drivers or medical staff and continuously predicts their level of fatigue. A warning will be automatically issued when the model predicts the fatigue level of a subject to be above a pre-specified threshold (Chen et al., 2015). In the financial area, our method can help with the construction of an automatic system that forecasts stock prices based on the ever-changing variables quantifying the current performance of a company, including assets, liabilities and income, providing management with data support for better financial decisions (Ballings et al., 2015). Lastly, the proposed method can also find application in the airline industry, where managers and decision makers can benefit from a framework capable of predicting the level of customer satisfaction from various aspects of service, making it possible for them to carefully allocate resources to maximise customer loyalty (Leong et al., 2015).

5. Acknowledgements

Funding from the UK Engineering and Physical Sciences Research Council (to LY, SL and LGP through the EPSRC Centre for Innovative Manufacturing in Emergent Macromolecular Therapies), the UK Leverhulme Trust (to ST and LGP, RPG-2012-686), the European Union (to ST, HEALTH-F2-2011-261366), and the Centre for Process Systems Engineering (CPSE) at Imperial and University College London are gratefully acknowledged.

References

Afantitis, A., Melagraki, G., Sarimveis, H., Koutentis, P. A., Markopoulos, J., & Igglessi-Markopoulou, O. (2006). A novel QSAR model for predicting induction of apoptosis by 4-aryl-4H-chromenes. Bioorganic and Medicinal Chemistry, 14, 6686–6694.

Alonso, F., Martínez, L., Pérez, A., & Valente, J. P. (2012). Cooperation between expert knowledge and data mining discovered knowledge: Lessons learned. Expert Systems with Applications, 39, 7524–7535.

Andrés, J. D., Lorca, P., de Cos Juez, F. J., & Sánchez-Lasheras, F. (2011). Bankruptcy forecasting: A hybrid approach using fuzzy c-means clustering and multivariate adaptive regression splines (MARS). Expert Systems with Applications, 38, 1866–1875.

Bache, K., & Lichman, M. (2013). UCI machine learning repository.

Bai, Y., Wang, P., Li, C., Xie, J., & Wang, Y. (2014). A multi-scale relevance vector regression approach for daily urban water demand forecasting. Journal of Hydrology, 517, 236–245.

Ballings, M., den Poel, D. V., Hespeels, N., & Gryp, R. (2015). Evaluating multiple classifiers for stock price direction prediction. Expert Systems with Applications, 42, 7046–7056.

Balshi, M. S., Mcguire, A. D., Duffy, P., Flannigan, M., Walsh, J., & Melillo, J. (2009). Assessing the response of area burned to changing climate in western boreal North America using a multivariate adaptive regression splines (MARS) approach. Global Change Biology, 15, 578–600.
Beck, J., Friedrich, D., Brandani, S., Guillas, S., & Fraga, E. (2012). Surrogate based optimisation for design of pressure swing adsorption systems. In Proceedings of the 22nd European Symposium on Computer Aided Process Engineering.

Bermolen, P., & Rossi, D. (2009). Support vector regression for link load prediction. Computer Networks, 53, 191–201.

Biau, G. (2012). Analysis of a random forests model. Journal of Machine Learning Research, 13, 1063–1095.

Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Wadsworth.

Brus, D. J., & Heuvelink, G. B. (2007). Optimization of sample patterns for universal kriging of environmental variables. Geoderma, 138, 86–95.

Caballero, J. A., & Grossmann, I. E. (2008). An algorithm for the use of surrogate models in modular flowsheet optimization. AIChE Journal, 54, 2633–2650.

Cavanaugh, K. C., Kellner, J. R., Forde, A. J., Gruner, D. S., Parker, J. D., Rodriguez, W., & Feller, I. C. (2014). Poleward expansion of mangroves is a threshold response to decreased frequency of extreme cold events. Proceedings of the National Academy of Sciences, 111, 723–727.

Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 27:1–27:27.

Chen, L., Zhao, Y., Zhang, J., & Zou, J.-z. (2015). Automatic detection of alertness/drowsiness from physiological signals using wavelet-based nonlinear features and machine learning. Expert Systems with Applications, 42, 7344–7355.

Chen, Q.-L., Wu, K.-J., & He, C.-H. (2014). Thermal conductivity of ionic liquids at atmospheric pressure: Database, analysis, and prediction using a topological index method. Industrial and Engineering Chemistry Research, 53, 7224–7232.

Cherkassky, V., & Ma, Y. (2004). Practical selection of SVM parameters and noise estimation for SVM regression. Neural Networks, 17, 113–126.

Comrie, A. C. (1997). Comparing neural networks and regression models for ozone forecasting. Journal of the Air and Waste Management Association, 47, 653–663.

Constantinou, A. C., Freestone, M., Marsh, W., Fenton, N., & Coid, J. (2015). Risk assessment and risk management of violent reoffending among prisoners. Expert Systems with Applications, 42, 7511–7529.

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47, 547–553.

Cozad, A., Sahinidis, N. V., & Miller, D. C. (2014). Learning surrogate models for simulation-based optimization. AIChE Journal, 60, 2211–2227.

Davis, E., & Ierapetritou, M. (2008). A kriging-based approach to MINLP containing black-box models and noise. Industrial and Engineering Chemistry Research, 47, 6101–6125.

Demšar, J., Curk, T., Erjavec, A., Gorup, Č., Hočevar, T., Milutinovič, M., Možina, M., Polajnar, M., Toplak, M., Starič, A., Štajdohar, M., Umek, L., Žagar, L., Žbontar, J., Žitnik, M., & Zupan, B. (2013). Orange: Data mining toolbox in Python. Journal of Machine Learning Research, 14, 2349–2353.

Dua, V. (2010). A mixed-integer programming approach for optimal configuration of artificial neural networks. Chemical Engineering Research and Design, 88, 55–60.
Eronen, A., & Klapuri, A. (2010). Music tempo estimation with k-NN regression. IEEE Transactions on Audio, Speech, and Language Processing, 18, 50–57.

Fanelli, G., Gall, J., & Van Gool, L. (2011). Real time head pose estimation with random regression forests. In 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 617–624).

Friedman, J. H. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19, 1–67.

GAMS Development Corporation (2013). General Algebraic Modeling System (GAMS) Release 24.2.1. Washington, DC, USA.

Genuer, R., Poggi, J.-M., & Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern Recognition Letters, 31, 2225–2236.

Gevrey, M., Dimopoulos, I., & Lek, S. (2003). Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecological Modelling, 160, 249–264.

Ghasemi, J., Saaidpour, S., & Brown, S. D. (2007). QSPR study for estimation of acidity constants of some aromatic acids derivatives using multiple linear regression (MLR) analysis. Journal of Molecular Structure: THEOCHEM, 805, 27–32.

Greene, M., Rolfson, O., Garellick, G., Gordon, M., & Nemes, S. (2015). Improved statistical analysis of pre- and post-treatment patient-reported outcome measures (PROMs): the applicability of piecewise linear regression splines. Quality of Life Research, 24, 567–573.

Gudise, V., & Venayagamoorthy, G. (2003). Comparison of particle swarm optimization and backpropagation as training algorithms for neural networks. In Proceedings of the 2003 IEEE Swarm Intelligence Symposium (SIS '03) (pp. 110–117).

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations Newsletter, 11, 10–18.

Helton, J., & Davis, F. (2003). Latin hypercube sampling and the propagation of uncertainty in analyses of complex systems. Reliability Engineering and System Safety, 81, 23–69.

Henao, C. A., & Maravelias, C. T. (2010). Surrogate-based process synthesis. In S. Pierucci, & G. B. Ferraris (Eds.), 20th European Symposium on Computer Aided Process Engineering (pp. 1129–1134). Elsevier, volume 28 of Computer Aided Chemical Engineering.

Henao, C. A., & Maravelias, C. T. (2011). Surrogate-based superstructure optimization framework. AIChE Journal, 57, 1216–1232.

Hill, T., Marquez, L., O'Connor, M., & Remus, W. (1994). Artificial neural network models for forecasting and decision making. International Journal of Forecasting, 10, 5–15.

Jekabsons, G. (2011). ARESLab: Adaptive Regression Splines toolbox for Matlab/Octave.

Khayet, M., Cojocaru, C., & Zakrzewska-Trznadel, G. (2008). Response surface modelling and optimization in pervaporation. Journal of Membrane Science, 321, 272–283.

Khuri, A. I., & Mukhopadhyay, S. (2010). Response surface methodology. Wiley Interdisciplinary Reviews: Computational Statistics, 2, 128–149.

Kleijnen, J. P. (2009). Kriging metamodeling in simulation: A review. European Journal of Operational Research, 192, 707–716.

Kleijnen, J. P. C., & Beers, W. C. M. v. (2004). Application-driven sequential designs for simulation experiments: Kriging metamodelling. The Journal of the Operational Research Society, 55, 876–883.
Kone, E. R. S., & Karwan, M. H. (2011). Combining a new data classification technique and regression analysis to predict the cost-to-serve new customers. Computers and Industrial Engineering, 61, 184–197.

Korhonen, K. T., & Kangas, A. (1997). Application of nearest-neighbour regression for generalizing sample tree information. Scandinavian Journal of Forest Research, 12, 97–101.

Leathwick, J., Elith, J., & Hastie, T. (2006). Comparative performance of generalized additive models and multivariate adaptive regression splines for statistical modelling of species distributions. Ecological Modelling, 199, 188–196.

Leong, L.-Y., Hew, T.-S., Lee, V.-H., & Ooi, K.-B. (2015). An SEM-artificial-neural-network analysis of the relationships between SERVPERF, customer satisfaction and loyalty among low-cost and full-service airline. Expert Systems with Applications, 42, 6620–6634.

Levis, A. A., & Papageorgiou, L. G. (2005). Customer demand forecasting via support vector regression analysis. Chemical Engineering Research and Design, 83, 1009–1018.

Li, B., Zhang, L., Yan, Q., & Xue, Y. (2014a). Application of piecewise linear regression in the detection of vegetation greenness trends on the Tibetan Plateau. International Journal of Remote Sensing, 35, 1526–1539.

Li, S., Feng, L., Benner, P., & Seidel-Morgenstern, A. (2014b). Using surrogate models for efficient optimization of simulated moving bed chromatography. Computers and Chemical Engineering, 67, 121–132.

Li, Y., Gong, S., & Liddell, H. (2000). Support vector regression and classification based multi-view face detection and recognition. In Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition (pp. 300–305).

Lloyd, C. D., & Atkinson, P. M. (2002). Deriving DSMs from LiDAR data with kriging. International Journal of Remote Sensing, 23, 2519–2524.

Loh, W.-Y. (2011). Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1, 14–23.

Lu, C.-J., Lee, T.-S., & Chiu, C.-C. (2009). Financial time series forecasting using independent component analysis and support vector regression. Decision Support Systems, 47, 115–125.

Magnani, A., & Boyd, S. (2009). Convex piecewise-linear fitting. Optimization and Engineering, 10, 1–17.

Malash, G. F., & El-Khaiary, M. I. (2010). Piecewise linear regression: A statistical method for the analysis of experimental adsorption data by the intraparticle-diffusion models. Chemical Engineering Journal, 163, 256–263.

Matthews, T. J., Steinbauer, M. J., Tzirkalli, E., Triantis, K. A., & Whittaker, R. J. (2014). Thresholds and the species–area relationship: a synthetic analysis of habitat island datasets. Journal of Biogeography, 41, 1018–1028.

Miller, D. C., Syamlal, M., Mebane, D. S., Storlie, C., Bhattacharyya, D., Sahinidis, N. V., Agarwal, D., Tong, C., Zitney, S. E., Sarkar, A., Sun, X., Sundaresan, S., Ryan, E., Engel, D., & Dale, C. (2014). Carbon capture simulation initiative: A case study in multiscale modeling and new challenges. Annual Review of Chemical and Biomolecular Engineering, 5, 301–323.
Minjares-Fuentes, R., Femenia, A., Garau, M., Meza-Velázquez, J., Simal, S., & Rosselló, C. (2014). Ultrasound-assisted extraction of pectins from grape pomace using citric acid: A response surface methodology approach. Carbohydrate Polymers, 106, 179–189.

Muggeo, V. M. (2008). segmented: an R package to fit regression models with broken-line relationships. R News, 8, 20–25.

Nuchitprasittichai, A., & Cremaschi, S. (2013). An algorithm to determine sample sizes for optimization with artificial neural networks. AIChE Journal, 59, 805–812.

Paliwal, M., & Kumar, U. A. (2009). Neural networks and statistical techniques: A review of applications. Expert Systems with Applications, 36, 2–17.

Palmer, K., & Realff, M. (2002). Metamodeling approach to optimization of steady-state flowsheet simulations: Model generation. Chemical Engineering Research and Design, 80, 760–772.

Pan, J., Kung, P., Bretholt, A., & Lu, J. (2014). Prediction of energy's environmental impact using a three-variable time series model. Expert Systems with Applications, 41, 1031–1040.

Papadopoulos, H., Vovk, V., & Gammerman, A. (2011). Regression conformal prediction with nearest neighbours. Journal of Artificial Intelligence Research, 40, 815–840.

Quinlan, J. R. (1992). Learning with continuous classes. In Proceedings of the Australian Joint Conference on Artificial Intelligence (pp. 343–348). World Scientific.

R Development Core Team (2008). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

Rafiq, M., Bugmann, G., & Easterbrook, D. (2001). Neural network design for engineering applications. Computers and Structures, 79, 1541–1552.

Sampson, P. D., Richards, M., Szpiro, A. A., Bergen, S., Sheppard, L., Larson, T. V., & Kaufman, J. D. (2013). A regionalized national universal kriging model using partial least squares regression for estimating annual PM2.5 concentrations in epidemiology. Atmospheric Environment, 75, 383–392.

Sarimveis, H., Alexandridis, A., Mazarakis, S., & Bafas, G. (2004). A new algorithm for developing dynamic radial basis function neural network models based on genetic algorithms. Computers and Chemical Engineering, 28, 209–217.

Scheuber, M. (2010). Potentials and limits of the k-nearest-neighbour method for regionalising sample-based data in forestry. European Journal of Forest Research, 129, 825–832.

Smola, A., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14, 199–222.

Strikholm, B. (2006). Determining the number of breaks in a piecewise linear regression model. Working Paper Series in Economics and Finance 648, Stockholm School of Economics.

Tibshirani, R. (1994). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
Tibshirani, R. (2011). Regression shrinkage and selection via the lasso: a retrospective. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73, 273–282.

Toms, J. D., & Lesperance, M. L. (2003). Piecewise regression: A tool for identifying ecological thresholds. Ecology, 84, 2034–2041.

Toriello, A., & Vielma, J. P. (2012). Fitting piecewise linear continuous functions. European Journal of Operational Research, 219, 86–95.

Tsanas, A., & Xifara, A. (2012). Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools. Energy and Buildings, 49, 560–567.

Venkatesh, K., Ravi, V., Prinzie, A., & den Poel, D. V. (2014). Cash demand forecasting in ATMs by clustering and neural networks. European Journal of Operational Research, 232, 383–392.

Viana, F. A. C., Simpson, T. W., Balabanov, V., & Toropov, V. (2014). Metamodeling in multidisciplinary design optimization: How far have we really come? AIAA Journal, 52, 670–690.

Wu, K.-J., Chen, Q.-L., & He, C.-H. (2014). Speed of sound of ionic liquids: Database, estimation, and its application for thermal conductivity prediction. AIChE Journal, 60, 1120–1131.

Xue, Y., Liu, S., Zhang, L., & Hu, Y. (2013). Integrating fuzzy logic with piecewise linear regression for detecting vegetation greenness change in the Yukon River basin, Alaska. International Journal of Remote Sensing, 34, 4242–4263.

Yeh, I.-C. (1998). Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research, 28, 1797–1808.

Zhang, J.-R., Zhang, J., Lok, T.-M., & Lyu, M. R. (2007). A hybrid particle swarm optimization–back-propagation algorithm for feedforward neural network training. Applied Mathematics and Computation, 185, 1026–1037.

Zhang, Y., & Sahinidis, N. V. (2013). Uncertainty quantification in CO2 sequestration using surrogate models from polynomial chaos expansion. Industrial and Engineering Chemistry Research, 52, 3121–3132.

Zhu, Q., & Lin, H. (2010). Comparing ordinary kriging and regression kriging for soil properties in contrasting landscapes. Pedosphere, 20, 594–606.