Mathematical Programming for Piecewise Linear Regression Analysis

Lingjian Yang^a, Songsong Liu^a, Sophia Tsoka^b, Lazaros G. Papageorgiou^a,*

a Centre for Process Systems Engineering, Department of Chemical Engineering, University College London, Torrington Place, London WC1E 7JE, UK
b Department of Informatics, School of Natural and Mathematical Sciences, King's College London, Strand, London WC2R 2LS, UK

* Corresponding author: Tel.: +442076792563; Fax: +442073882348. Email addresses: lingjian.yang.10@ucl.ac.uk (Lingjian Yang), s.liu@ucl.ac.uk (Songsong Liu), sophia.tsoka@kcl.ac.uk (Sophia Tsoka), l.papageorgiou@ucl.ac.uk (Lazaros G. Papageorgiou)

Abstract

In data mining, regression analysis is a computational tool that predicts continuous output variables from a number of independent input variables, by approximating their complex inner relationship. A large number of methods have been successfully proposed, based on various methodologies, including linear regression, support vector regression, neural networks, piece-wise regression, etc. For piece-wise regression, the existing methods in the literature are usually restricted to problems of very small scale, due to their inherent non-linear nature. In this work, a more efficient piece-wise regression method is introduced based on a novel integer linear programming formulation. The proposed method partitions one input variable into multiple mutually exclusive segments, and fits one multivariate linear regression function per segment to minimise the total absolute error. Assuming that both the single partition feature and the number of regions are known, a mixed integer linear programming model is proposed to determine simultaneously the locations of the multiple break-points and the regression coefficients of each segment. Furthermore, an efficient heuristic procedure is presented to identify the key partition feature and the final number of break-points. Seven real-world problems covering several application domains have been used to demonstrate the efficiency of our proposed method. It is shown that our proposed piece-wise regression method can be solved to global optimality for datasets of thousands of samples, and that it consistently achieves higher prediction accuracy than a number of state-of-the-art regression methods. Another advantage of the proposed method is that the learned model can be conveniently expressed as a small number of if-then rules that are easily interpretable. Overall, this work proposes an efficient rule-based multivariate regression method based on piece-wise functions that achieves better prediction performance than state-of-the-art approaches. This novel method can benefit expert systems in various applications by automatically acquiring knowledge from databases to improve the quality of the knowledge base.

Keywords: regression analysis, surrogate model, piecewise linear function, mathematical programming, optimisation

1. Introduction

In data mining, regression is a type of analysis that predicts continuous output/response variables from several independent input variables. Given a number of samples, each of which is characterised by certain input and output variables, regression analysis aims to approximate their functional relationship. The estimated functional relationship can then be used to predict the level of the output variable for new query samples.
Generally, regression analysis can be useful under two circumstances: 1) when the process of interest is a black box, i.e. there is limited knowledge of the underlying mechanism of the system. In this case, regression analysis can accurately predict the output variables from the relevant input variables without requiring details of the, however complicated, inner mechanism (Bai et al., 2014; Venkatesh et al., 2014; Cortez et al., 2009; Davis & Ierapetritou, 2008). Quite frequently, the user would also like to gain some valuable insights into the true underlying functional relationship, which means that the interpretability of a regression method is also of importance; 2) when the detailed simulation model relating input variables to output variables, usually via some other intermediate variables, is known, yet is too complex and expensive to be evaluated comprehensively in feasible computational time. In this case, regression analysis is capable of approximating the overall system behaviour with much simpler functions while preserving a desired level of accuracy, and can then be evaluated much more cheaply (Caballero & Grossmann, 2008; Henao & Maravelias, 2011, 2010; Viana et al., 2014; Beck et al., 2012).

Over the past years, regression analysis has been established as a powerful tool in a wide range of applications, including: customer demand forecasting (Levis & Papageorgiou, 2005; Kone & Karwan, 2011), investigation of CO2 capture processes (Zhang & Sahinidis, 2013; Nuchitprasittichai & Cremaschi, 2013), optimisation of moving bed chromatography (Li et al., 2014b), forecasting of CO2 emissions (Pan et al., 2014), prediction of acidity constants for aromatic acids (Ghasemi et al., 2007), prediction of induction of apoptosis by different chemical components (Afantitis et al., 2006) and estimation of thermodynamic properties of ionic liquids (Chen et al., 2014; Wu et al., 2014).

A large number of regression analysis methodologies exist in the literature, including: linear regression, support vector regression (SVR), kriging, radial basis functions (RBF) (Sarimveis et al., 2004), multivariate adaptive regression splines (MARS), multilayer perceptron (MLP), random forest, K-nearest neighbour (KNN) and piecewise regression. We briefly summarise these regression methodologies before presenting our proposed method.

Linear regression

Linear regression is one of the most classic types of regression analysis, which predicts the output variables as linear combinations of the input variables. The regression coefficients of the input variables are usually estimated using least squared error or least absolute error approaches, and the problems can be formulated as either quadratic programming or linear programming problems, which can be solved efficiently. In some cases, when the estimated linear relationship fails to adequately describe the data, a variant of linear regression analysis, called polynomial regression, can be adopted to accommodate non-linearity (Khuri & Mukhopadhyay, 2010). In polynomial regression, higher-degree polynomials of the original independent input variables are added as new input variables into the regression function, before estimating the coefficients of the aggregated regression function.
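To make the linear programming view of least absolute error fitting concrete, the following is a minimal sketch of least-absolute-deviation (LAD) linear regression posed as an LP, assuming NumPy and SciPy. The helper name lad_regression and the toy data are illustrative only; this is not the formulation proposed later in this paper.

```python
# A minimal sketch of least-absolute-error linear regression posed as an LP.
import numpy as np
from scipy.optimize import linprog

def lad_regression(X, y):
    """Minimise sum_s |y_s - (w . x_s + b)| via linear programming."""
    S, M = X.shape
    # Decision vector z = [w (M), b (1), d (S)]; only the errors d are costed.
    c = np.concatenate([np.zeros(M + 1), np.ones(S)])
    ones, eye = np.ones((S, 1)), np.eye(S)
    # d_s >= y_s - (x_s.w + b)   and   d_s >= (x_s.w + b) - y_s
    A_ub = np.vstack([np.hstack([-X, -ones, -eye]),
                      np.hstack([ X,  ones, -eye])])
    b_ub = np.concatenate([-y, y])
    bounds = [(None, None)] * (M + 1) + [(0, None)] * S
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:M], res.x[M]  # coefficients w and intercept b

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3 + rng.normal(scale=0.05, size=50)
w, b = lad_regression(X, y)
```

Each auxiliary variable d_s bounds the absolute residual of sample s from above, so minimising their sum recovers the least absolute error fit; the same device reappears in the piece-wise model of Section 2.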
Polynomial functions of second degree have been used most frequently in the literature due to their robust performance and computational efficiency (Khayet et al., 2008; Minjares-Fuentes et al., 2014). Another popular variant of linear regression is the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1994). In LASSO, the summation of the absolute values of the regression coefficients is added as a penalty term into the objective function. The nature of LASSO encourages some coefficients to become exactly 0, thus performing implicit feature selection (Tibshirani, 2011).

Automated learning of algebraic models for optimisation (ALAMO) (Cozad et al., 2014; Zhang & Sahinidis, 2013) is a mathematical programming-based regression method that proposes low-complexity functions to predict output variables. Given the independent input features, ALAMO starts by defining a large set of potential basis functions, such as polynomial, multinomial, exponential and logarithmic forms of the original input variables. Subsequently, a mixed integer linear programming (MILP) model is solved to select the best subset of T basis functions that optimally fit the data. The value of T is initially set equal to 1 and then iteratively increased until the Akaike information criterion, which measures the generalisation of the constructed model, stops improving (Miller et al., 2014). The integer programming model is capable of capturing the combined effect of different basis functions, which is considered more efficient than traditional step-wise feature selection.

SVR

Support vector machine is a well-established statistical learning algorithm, which fits a hyperplane to the data in hand (Smola & Schölkopf, 2004). SVR minimises two terms in the objective function, one of which is the ε-insensitive loss function, i.e. only sample training error greater than a user-specified threshold, ε, is considered in the loss function. The other term is model complexity, which is expressed as the sum of squared regression coefficients. Controlling model complexity usually ensures model generalisation, i.e. high prediction accuracy on testing samples. Another user-specified trade-off parameter balances the significance of the two terms (Chang & Lin, 2011; Bermolen & Rossi, 2009). One of the most important features that contribute to the competitiveness of SVR is the kernel trick. The kernel trick maps the dataset from the original space to a higher-dimensional inner product space, where a linear regression is equivalent to a non-linear regression function in the original space (Li et al., 2000). A number of kernel functions can be employed, e.g. polynomial functions, radial basis functions and Fourier series (Levis & Papageorgiou, 2005). Formulated as a convex quadratic programming problem, SVR can be solved to global optimality. Despite the simplicity and optimality of SVR, the problem of tuning two parameters, i.e. the training error tolerance ε and the trade-off parameter balancing model complexity and accuracy, and the selection of suitable kernels still considerably affect its prediction accuracy (Lu et al., 2009; Cherkassky & Ma, 2004).

Kriging

Kriging is a spatial interpolation-based regression analysis methodology (Kleijnen & Beers, 2004). Given a query sample, kriging estimates its output as a weighted sum of the outputs of the known nearby samples.
The weights of the samples are computed solely from the data by considering sample closeness and redundancy, instead of being given by an arbitrary decreasing function of distance (Kleijnen, 2009). The interpolation nature of kriging means that the derived interpolant passes through the given training data points, i.e. the error between predicted output and real output is zero for all training samples. Different variants of kriging have been developed in the literature, including the most popular ordinary kriging (Lloyd & Atkinson, 2002; Zhu & Lin, 2010) and universal kriging (Brus & Heuvelink, 2007; Sampson et al., 2013).

MARS

MARS (Friedman, 1991) is another type of regression analysis that accommodates non-linearity and interaction between independent input variables in its functional relationship. Non-linearity is introduced into MARS in the form of the so-called hinge functions, which are expressions with max operators of the form max(0, X − const). If the independent variable X is greater than a constant number const, the hinge function is equal to X − const; otherwise the hinge function equals 0. The hinge functions create knots in the prediction surface of MARS. The functional form of MARS can be a weighted sum of a constant, hinge functions and products of multiple hinge functions, which makes it suitable to model a wide range of non-linearity (Andrés et al., 2011).

The building of MARS usually consists of two steps, a forward addition and a backward deletion step. In the forward addition step, MARS starts from one single intercept term/constant and iteratively adds the pair of hinge functions (i.e. max(0, X − const) and max(0, const − X)) that leads to the largest reduction in training error. Afterwards, a backward deletion step, which removes one by one those hinge functions contributing insignificantly to the model accuracy, is employed to improve the generalisation of the final model (Leathwick et al., 2006; Balshi et al., 2009). The presence of hinge functions also makes MARS a piece-wise regression method.

MLP

Multilayer perceptron is a feedforward artificial neural network, whose structure is inspired by the organisation of biological neural networks (Hill et al., 1994). An MLP typically consists of an input layer of measurable features and an output layer of response variables, sandwiching multiple intermediate layers of neurons. The network is fully interconnected in the sense that the neurons in each layer are connected to all the neurons in the two neighbouring layers (Comrie, 1997; Gevrey et al., 2003). Each neuron in the intermediate layers takes a weighted linear combination of the outputs from all neurons in the previous layer as input, and applies a non-linear transformation function before supplying the output to all neurons of the next layer. The use of non-linear transformation functions, including sigmoid, hyperbolic tangent and logarithmic functions, makes MLP suitable for modelling highly non-linear relationships (Gevrey et al., 2003; Rafiq et al., 2001).

Identifying the optimal configuration of an MLP, i.e. the number of intermediate layers, the number of neurons in each intermediate layer, the type of activation function of each neuron and the weights of the connections between consecutive layers of neurons, is known to be time-consuming and prone to becoming trapped in locally optimal solutions (Paliwal & Kumar, 2009). The large degree of freedom in training an MLP is often blamed for data over-fitting.
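As a concrete illustration of the common fixed-architecture practice discussed at the end of this subsection, the following is a minimal sketch of fitting an MLP regressor, assuming scikit-learn; the layer sizes, activation and other settings are arbitrary illustrative choices, not recommendations.

```python
# A minimal sketch of training a fixed-architecture MLP for regression with
# scikit-learn; hidden_layer_sizes and the other settings are illustrative.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 4))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.05, size=300)

# Two hidden layers of 16 and 8 tanh neurons; the connection weights are
# tuned by gradient-based back-propagation, so each run from a different
# random initialisation may reach a different locally optimal solution.
mlp = MLPRegressor(hidden_layer_sizes=(16, 8), activation="tanh",
                   solver="adam", max_iter=2000, random_state=0)
mlp.fit(X, y)
print(mlp.predict(X[:5]))
```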
Dua (2010) has presented a two-objective mathematical formulation that seeks the best configuration of an MLP by balancing training accuracy and network complexity. More often, the architecture of an MLP is fixed by the user and back-propagation is used to tune only the weights of the connections between neighbouring layers of neurons (Gudise & Venayagamoorthy, 2003; Zhang et al., 2007).

Random forest

Before introducing random forest we first describe the regression tree, which is a decision tree-based prediction model. Starting from the entire set of samples, a regression tree selects one independent input variable among all and performs a binary split into two child sets, under the condition that the two child nodes give increased purity of the data compared with their single parent node. Purity is often defined as the deviation from predicting with the mean value of the output variable. The process of binary splitting is recursively applied to each child node until a terminating criterion is satisfied. The nodes that are not further partitioned are called leaves. After growing a large tree, a pruning process is employed to remove the leaves contributing insignificantly to the purity improvement (Breiman et al., 1984; Loh, 2011). In order to improve model fit, a linear regression model can be fitted for each leaf (Quinlan, 1992).

Random forest is an ensemble learning method of regression trees. In general, random forest (Breiman, 2001; Biau, 2012) builds a forest of multiple regression tree models and aggregates the decisions from all the trees to produce a final prediction. Given a dataset, multiple bootstrap sample sets are first created by random sampling with replacement. Each bootstrap sample set is then learned by a revised regression tree algorithm, which differs from the classic regression tree by randomly selecting a candidate subset of features for each binary split of a node (Genuer et al., 2010). The accuracy of each regression tree can be estimated on the training samples absent from the bootstrap set, and the final prediction can be either a simple average of the predictions from all trees or a weighted average considering the estimated accuracy. It has been demonstrated that random forest achieves much more robust prediction performance than a single regression tree (Breiman, 2001; Fanelli et al., 2011).

KNN

KNN belongs to the category of lazy learning algorithms, due to the fact that prediction is based on the instances without an explicit training phase of constructing models, thus making it one of the simplest regression methods in the literature (Korhonen & Kangas, 1997). Given a query sample, KNN first identifies the K closest instances in the training sample set, where the exact value of K is given a priori. The closeness of samples can be measured by different distance metrics, for example Euclidean and Manhattan distances (Scheuber, 2010; Eronen & Klapuri, 2010). The prediction is then taken as the weighted mean of the outputs of the K nearest neighbours, with the weight often being defined as the inverse of distance (Papadopoulos et al., 2011). Despite its simplicity, KNN usually provides competitive prediction performance against much more sophisticated algorithms.

Previous work on piecewise regression

Piecewise functions have been frequently studied in the literature as well.
In (Toms & Lesperance, 2003), univariate piece-wise linear functions have been used to fit ecological data and identify break-points that represent critical threshold values of a phenomenon. In (Strikholm, 2006), a method based on statistical testing is proposed to estimate the number of break-points for a univariate piece-wise linear function. Malash & El-Khaiary (2010) also apply piece-wise linear regression techniques to univariate experimental adsorption data, where the piece-wise function is determined by solving a non-linear programming model. SegReg (www.waterlog.info/segreg.htm) is free software that permits the estimation of piece-wise regression functions with up to two independent variables. For one independent variable, SegReg considers a series of candidate break-points and for each one fits a linear regression on either side of the break-point. The break-point corresponding to the largest statistical confidence is taken as the final solution. In the case of two independent variables, SegReg first determines the two-region piece-wise regression function between the dependent variable and the most significant input variable, before computing the relation between its residual/deviation and the second input variable.

Both Magnani & Boyd (2009) and Toriello & Vielma (2012) published work on data fitting with a special family of piece-wise regression functions, called max-affine functions. A max-affine function is defined as the maximum of a series of linear functions, i.e. a sample is projected onto all linear functions, and the maximum projected value among all is taken as the final predicted value of the piece-wise function. The use of max-affine functions limits the fitted surface to being convex. In (Magnani & Boyd, 2009) a heuristic method is used to ease the difficulty of directly solving the highly non-linear max-affine functions, while in (Toriello & Vielma, 2012) big-M constraints are used to reformulate the problem into a non-convex mixed integer non-linear programming model. However, computational complexity limits their application to examples of small scale.

More recently, Greene et al. (2015) applied piece-wise regression analysis to predict patients' post-treatment quality of life from the pre-treatment quality-of-life measure, identifying the segments where therapy benefits vary significantly. The analysis is performed using Segmented (Muggeo, 2008), a package written in R (R Development Core Team, 2008). Segmented formulates the problem as a non-linear model and requires the user to specify the segmented input variables, the number of break-points and also an initial guess of each break-point. Starting from those user-supplied initial positions of the break-points, Segmented iteratively searches the neighbourhood of the initial guesses for break-points of better quality using local linearisation. However, it is difficult, if not impossible, to reasonably guess good starting points for real-world multivariate problems with large numbers of samples and input variables, where visual examination cannot be performed. This makes it hard to identify quality solutions. Furthermore, Segmented only allows the input variables being partitioned to have different regression coefficients across different segments, while the other input variables keep the same coefficients over their entire ranges, significantly restricting its flexibility.
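For concreteness, a max-affine prediction of the kind studied by Magnani & Boyd (2009) and Toriello & Vielma (2012) can be sketched as below; the slopes and intercepts are arbitrary illustrative values, not fitted coefficients.

```python
# A minimal sketch of evaluating a max-affine function: the prediction is the
# maximum over K local linear models, which forces the fitted surface to be
# convex. The coefficients below are arbitrary values for illustration.
import numpy as np

a = np.array([[1.0, -0.5],
              [0.2,  0.8],
              [-1.0, 1.5]])        # K x M matrix of slopes
b = np.array([0.0, 1.0, -0.5])     # K intercepts

def max_affine(x):
    """Return max_k (a_k . x + b_k) for a single sample x."""
    return np.max(a @ x + b)

print(max_affine(np.array([0.5, 2.0])))
```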
In both (Xue et al., 2013) and (Li et al., 2014a), piece-wise regression functions were employed to detect vegetation changes; piece-wise linear regression was tackled using fuzzy logic to identify the changes in patterns of vegetation greenness. Cavanaugh et al. (2014) employed piece-wise regression and found that the change in mangrove area over the last 20 years is a piece-wise function of latitude, with the regions above and below a specific threshold latitude value following two different patterns of mangrove growth. Moreover, Matthews et al. (2014) used two-segment piece-wise functions to describe the relationship between species richness and fragment area of islands, with the critical break-point being determined by simply sampling a number of candidate values and selecting the one giving the best model fit. Unfortunately, the above methods are all limited to modelling rather simple relationships between one output variable and one input variable, seriously limiting their usage in more complex problems.

It is clear that the piece-wise regression methods in the previous literature are non-linear and computationally restricted to problems of very small scale. Moreover, they cannot be solved to identify globally optimal solutions. In this work, we propose a novel linear model for piece-wise regression analysis. A single input variable is partitioned to separate the samples into multiple mutually exclusive segments, while each segment is fitted with its own multivariate linear regression function. Assuming that both the partition variable and the number of break-points are known, we propose an optimisation model that optimally estimates the positions of all break-points and the linear regression coefficients of each segment simultaneously, so that the total absolute deviation is minimised. Thanks to the use of binary variables, our proposed mathematical model is linear and can be efficiently solved to global optimality for problems of up to thousands of samples (see the Results and Discussion section). Furthermore, a solution procedure is used to identify the key partition variable and the final number of break-points. Several real-world multivariate benchmark datasets have been used to demonstrate the applicability and efficiency of the proposed method.

The proposed piece-wise regression method can help construct expert systems in various application domains. Expert systems are computer programs designed to make decisions analogous to human experts. As an expert system is typically made up of an inference engine and a knowledge base, the quality and quantity of information in the knowledge base directly affects the usefulness of the constructed expert system. Our proposed piece-wise regression method can be helpful in building expert systems more efficiently via automatic and efficient acquisition of knowledge. More specifically, the proposed piece-wise regression method can extract latent knowledge from the large collection of databases curated by domain experts. The discovered knowledge is represented in the form of identified relationships between input and output variables of interest, which can be combined with expert knowledge to form the final expert system (Alonso et al., 2012). For example, the piece-wise regression method proposed in this work can be used for building prognostic expert systems in medical applications.
When presented with historical data of patients' clinical variables and survival length, piece-wise regression can induce domain knowledge by approximating the complex relationship between clinical variables and survival length. The induced knowledge can then be used to perform prognosis for current patients, imitating the end behaviour of human experts, i.e. medical doctors.

Overall, the key contributions of our work are as follows:

• We propose a novel mixed integer optimisation model for multivariate regression analysis modelling piece-wise linear functions, which partitions a single variable into multiple mutually exclusive regions and fits each one with a distinct multivariate linear function. Given a priori a single input variable as the partition feature and the number of segments, the optimisation model can be solved to determine simultaneously the positions of the multiple break-points and the regression coefficients of each segment.

• Given that neither the feature that should be segmented nor the number of segments is typically known, a heuristic solution procedure is also introduced that automatically identifies the key partition variable and the final number of segments.

• A number of real-world benchmark problems have been employed to demonstrate the applicability and efficiency of the proposed method. In sharp contrast to the existing piece-wise regression methods in the literature, which can only be applied to problems of very small size, our proposed optimisation model can be solved to global optimality for datasets containing up to five thousand samples. Comparison with some popular regression methods based on other methodologies clearly indicates that our proposed method based on piece-wise functions achieves the highest prediction accuracy, and does so consistently. Besides high prediction performance, our proposed regression method has the advantage of being easily understandable and interpretable, as the learned model can be conveniently represented as a small set of rules.

• As a generic data mining method, our proposed regression method can help with the construction of expert and intelligent systems via automatic extraction of knowledge from databases. We have discussed its potential usage in various application domains.

The rest of the paper is structured as follows: in Section 2, we present the mathematical programming model and a heuristic solution procedure. In Section 3, comparative results of our proposed method and some state-of-the-art regression algorithms on benchmark examples are presented and discussed. The last section concludes with our key findings.

2. Method

A novel piecewise linear regression method is proposed in this work. The core idea of the proposed method is to identify a single input feature, and to separate the samples into complementary regions on this feature. A different linear regression function is fitted locally for each region. The sample partition and the calculation of the local regression coefficients are performed simultaneously within the proposed optimisation model so as to achieve the least absolute error.

2.1. A novel regression method

In this section, we first describe a novel mathematical programming model that optimises the locations of the break-points and the regression coefficients of each region so as to achieve minimal training error. Subsequently, a solution procedure is proposed to identify the best partition feature and the number of regions.
The indices, parameters and variables associated with the proposed model are listed below:

Indices
$s$: sample, $s = 1, 2, \ldots, S$
$m$: feature/independent input variable, $m = 1, 2, \ldots, M$
$r$: region, $r = 1, 2, \ldots, R$
$m^*$: the feature on which the sample partition takes place

Parameters
$A_{sm}$: numeric value of sample $s$ on feature $m$
$Y_s$: output value of sample $s$
$U', U''$: arbitrarily large positive numbers

Continuous variables
$W_m^r$: regression coefficient of feature $m$ in region $r$
$B^r$: intercept of the regression function in region $r$
$\mathit{Pred}_s^r$: predicted output of sample $s$ in region $r$
$X_{m^*}^r$: break-point $r$ on partition feature $m^*$
$D_s$: training error between predicted output and real output of sample $s$

Binary variables
$F_s^r$: 1 if sample $s$ falls into region $r$; 0 otherwise

Assuming first that both the partition feature $m^*$ and the number of regions $R$ are given, the $R-1$ break-points are arranged in an ordered way:

$$X_m^{r-1} \le X_m^r \qquad \forall\, m = m^*,\ r = 2, 3, \ldots, R \tag{1}$$

Binary variables $F_s^r$ are introduced to model whether or not sample $s$ belongs to region $r$. Modelling of which sample belongs to which region is achieved with the following constraints:

$$X_m^{r-1} - U'(1 - F_s^r) \le A_{sm} \qquad \forall\, s,\ r = 2, 3, \ldots, R,\ m = m^* \tag{2}$$

$$A_{sm} \le X_m^r + U'(1 - F_s^r) \qquad \forall\, s,\ r = 1, 2, \ldots, R-1,\ m = m^* \tag{3}$$

When sample $s$ belongs to region $r$ (i.e. $F_s^r = 1$), $A_{sm^*}$ falls into the region bounded by the two consecutive break-points $X_{m^*}^{r-1}$ and $X_{m^*}^r$ on feature $m^*$; otherwise the two sets of constraints become redundant. A visualisation of break-points and regions is provided in Figure 1.

Figure 1: Break-points and regions

The following constraints restrict each sample to belong to one and only one region:

$$\sum_r F_s^r = 1 \qquad \forall\, s \tag{4}$$

For sample $s$, its predicted output value in region $r$, $\mathit{Pred}_s^r$, is as below:

$$\mathit{Pred}_s^r = \sum_m A_{sm} W_m^r + B^r \qquad \forall\, s, r \tag{5}$$

For any sample $s$, its training error is equal to the absolute deviation between the real output and the predicted output of the region $r$ to which it belongs (i.e. $F_s^r = 1$):

$$D_s \ge Y_s - \mathit{Pred}_s^r - U''(1 - F_s^r) \qquad \forall\, s, r \tag{6}$$

$$D_s \ge \mathit{Pred}_s^r - Y_s - U''(1 - F_s^r) \qquad \forall\, s, r \tag{7}$$

The objective function is to minimise the sum of the absolute training errors:

$$\min \sum_s D_s \tag{8}$$

The final model, named Optimal Piece-wise Linear Regression Analysis (OPLRA) in this work, consists of a linear objective function and several linear constraints, and the presence of both binary and continuous variables defines an MILP problem, which can be solved to global optimality by standard solution algorithms, for example branch and bound.

A heuristic solution procedure is also employed in this work to identify the partition feature and the number of regions, as described in Figure 2 below. The heuristic procedure starts by solving a linear regression on the entire set of data with least absolute deviation. Subsequently, each input feature in turn serves as partition feature $m^*$ once and the OPLRA model is solved while allowing two regions (i.e. $R = 2$). The feature corresponding to the minimum training error is kept and, if its error represents a percentage reduction of more than β over the global linear regression without data partition, the procedure continues; otherwise it is decided that two-region piecewise linear regression does not provide a desirable improvement upon classic linear regression, and the initially derived linear regression function without sample partition is used for prediction.
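Before continuing with the heuristic, the following is a minimal sketch of the OPLRA model (1)-(8) together with the β-driven procedure described here and in the next paragraphs, written with PuLP and its bundled CBC solver (the paper itself uses GAMS with CPLEX). The function names, the single big-M value U standing in for both U′ and U′′, and the data layout are illustrative assumptions.

```python
# A minimal sketch of OPLRA (equations (1)-(8)) and the beta-driven heuristic.
import pulp

def oplra(A, Y, m_star, R, U=1e5):
    """Solve the MILP for a fixed partition feature m_star and R regions."""
    S, M = len(A), len(A[0])
    prob = pulp.LpProblem("OPLRA", pulp.LpMinimize)
    W = pulp.LpVariable.dicts("W", (range(R), range(M)))
    B = pulp.LpVariable.dicts("B", range(R))
    X = pulp.LpVariable.dicts("X", range(R - 1))            # break-points
    F = pulp.LpVariable.dicts("F", (range(R), range(S)), cat="Binary")
    D = pulp.LpVariable.dicts("D", range(S), lowBound=0)
    prob += pulp.lpSum(D[s] for s in range(S))              # objective (8)
    for r in range(R - 2):                                  # ordering (1)
        prob += X[r] <= X[r + 1]
    for s in range(S):
        prob += pulp.lpSum(F[r][s] for r in range(R)) == 1  # one region (4)
        for r in range(R):
            pred = pulp.lpSum(A[s][m] * W[r][m] for m in range(M)) + B[r]  # (5)
            prob += D[s] >= Y[s] - pred - U * (1 - F[r][s])  # (6)
            prob += D[s] >= pred - Y[s] - U * (1 - F[r][s])  # (7)
            if r > 0:                                        # lower bound (2)
                prob += X[r - 1] - U * (1 - F[r][s]) <= A[s][m_star]
            if r < R - 1:                                    # upper bound (3)
                prob += A[s][m_star] <= X[r] + U * (1 - F[r][s])
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return pulp.value(prob.objective)

def heuristic_fit(A, Y, beta=0.03):
    """Grow regions on the best feature while the error drops by > beta."""
    M = len(A[0])
    best_err = oplra(A, Y, m_star=0, R=1)                   # plain LAD fit
    errs = {m: oplra(A, Y, m, R=2) for m in range(M)}       # try each feature
    m_star = min(errs, key=errs.get)
    R = 2
    while errs[m_star] < (1 - beta) * best_err:             # beta criterion
        best_err, R = errs[m_star], R + 1
        errs[m_star] = oplra(A, Y, m_star, R)
    return m_star, R - 1, best_err
```

With R = 1 the model reduces to plain least-absolute-deviation regression, which is exactly the first step of the heuristic; note that this sketch uses an off-the-shelf branch-and-bound solver, so no local-search initialisation of the break-points is required.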
The parameter β, taking a value between 0 and 1, quantifies the percentage reduction in training error that justifies adding one more region.

Figure 2: Heuristic procedure to identify the partition feature and the number of regions

If two-region piecewise regression is accepted, the corresponding partition feature is retained for further analysis while the number of regions is iteratively increased, until the β training-error reduction criterion is no longer satisfied between iterations.

β is the only user-specified parameter in our proposed regression method, and it requires fine tuning. A small value may cause over-fitting, i.e. too many regions are allowed and each region contains only a small number of samples, which then results in unreliable construction of the local linear regression functions; a value that is excessively large will lead to premature termination of the growing of regions, which then under-fits the data. In the Results and Discussion section, we test a series of values on a number of benchmark datasets and select the optimal value corresponding to the most robust prediction performance.

The constructed piecewise linear regression functions are then used to predict the output values of new samples. A testing sample is first assigned to one of the regions, and the regression coefficients of that region are used to estimate its output value.

2.2. An illustrative example

In order to better illustrate the training of the proposed regression method, a simulation model is taken from the literature. In brief, the illustrative example (Palmer & Realff, 2002) describes the operation of a continuous stirred tank reactor, where a chain reaction A → B → C takes place. An inlet stream containing both reactants A and B enters the reactor and the desirable output is component B. There are 4 independent input variables to the simulation model: the temperature of the reactor (T), the volume of the reactor (V), and the concentrations of A and B in the inlet stream (CAin and CBin). The output to be predicted is the production rate of B (P). The process and associated variables are described in Figure 3.

Figure 3: Illustrative example of a continuous stirred tank reactor

With the Latin hypercube sampling technique (Helton & Davis, 2003) employed to specify a set of data points, we run the simulation model and collect 300 samples. The goal of the regression analysis is to approximate the functional relationship between the output variable P and the input variables T, V, CAin and CBin using piece-wise linear functions. A step-wise description of the training procedure is presented in Table 1 below.

Initially, a linear regression function is fitted to the entire dataset without feature segmentation, which gives an absolute deviation of 1677.78. The second iteration of the method solves 4 independent OPLRA models allowing 2 regions each, respectively specifying T, V, CAin and CBin as the partition feature. The two-region piece-wise linear functions constructed while partitioning on T yield a lower training error (i.e. 1030.63) than the other 3, and are therefore taken as the solution of iteration 2. This represents a significant improvement (i.e. 38.57%) over the initial global linear regression function. From iteration 3 onwards, the partition feature is fixed as T while one more region is allocated at each additional iteration. Iterations 3 and 4 lower the training error to 876.66 and 807.12, respectively. The iterative procedure terminates when the β criterion is not satisfied; for example,
if β = 20%, the iterative procedure terminates at the third iteration and the final regression function has 2 regions; if β = 10%, the final regression function has 3 regions.

Table 1: Piecewise regression functions built at each step of the training procedure

Iteration 1 (1 region, partition feature NONE, training error 1677.78):
  P = 1.0240T + 0.0054CAin + 0.0125CBin + 0.4340V − 333.54

Iteration 2 (2 regions, partition feature T, training error 1030.63, improvement 38.57%):
  P = 0.7413T + 0.0040CAin + 0.0102CBin + 0.3406V − 238.74,  T ≤ 213.21
  P = 1.7156T + 0.0111CAin + 0.0315CBin + 0.7574V − 592.63,  T > 213.21

Iteration 2 (2 regions, partition feature V, training error 1143.49):
  P = 0.5952T + 0.0033CAin + 0.0056CBin + 0.4533V − 194.26,  V ≤ 42.38
  P = 1.4781T + 0.0083CAin + 0.0195CBin + 0.4773V − 48.70,   V > 42.38

Iteration 2 (2 regions, partition feature CAin, training error 1485.65):
  P = 0.8930T + 0.0057CAin + 0.0152CBin + 0.4161V − 293.45,  CAin ≤ 3528.43
  P = 1.4857T + 0.0073CAin + 0.0070CBin + 0.5929V − 489.45,  CAin > 3528.43

Iteration 2 (2 regions, partition feature CBin, training error 1627.73):
  P = 1.0242T + 0.0056CAin + 0.0118CBin + 0.4241V − 333.49,  CBin ≤ 458.21
  P = 1.1105T + 0.0050CAin − 0.1405CBin + 0.5813V − 291.00,  CBin > 458.21

Iteration 3 (3 regions, partition feature T, training error 876.66, improvement 14.94%):
  P = 0.5815T + 0.0030CAin + 0.0097CBin + 0.2654V − 184.45,  T ≤ 303.25
  P = 1.1353T + 0.0062CAin + 0.0176CBin + 0.4579V − 373.68,  303.25 < T ≤ 316.62
  P = 1.8764T + 0.0119CAin + 0.0394CBin + 0.8617V − 654.41,  T > 316.62

Iteration 4 (4 regions, partition feature T, training error 807.12, improvement 7.93%):
  P = 0.5815T + 0.0030CAin + 0.0097CBin + 0.2654V − 184.45,  T ≤ 303.25
  P = 1.2648T + 0.0054CAin + 0.0148CBin + 0.4510V − 409.61,  303.32 < T ≤ 312.21
  P = 1.4872T + 0.0084CAin + 0.0202CBin + 0.6667V − 503.10,  312.21 < T ≤ 320.77
  P = 1.9930T + 0.0128CAin + 0.0360CBin + 0.8871V − 695.65,  T > 320.77
...

Overall, the key features of our proposed piecewise linear regression method are summarised here: 1) our method identifies one key partition feature and separates the samples into multiple complementary regions on it; 2) each region has the flexibility of being fitted by its own linear regression function, with all input features allowed to have different regression coefficients across different regions; 3) there is only one tuning parameter, β; 4) compared with algorithms like kernel-based SVR and MLP, the constructed regression function is easy to understand, as it exhibits linear relationships in the different regions.

It is noted here that the obtained relationship between input and output variables, presented as rules in Table 1, can be used to build an expert system for the above operation. Given the chain reaction A → B → C in the stirred tank reactor (Palmer & Realff, 2002), domain experts perform experiments to create a database of samples for different levels of temperature, reactor volume and concentrations of reactants. Our proposed piece-wise regression method is then applied to automatically extract the rules that predict the production rate from temperature, reactor volume and reactant concentrations. These rules would be difficult to provide directly even for chemical engineering experts, due to the complex nature of the reaction. Since the above extracted rules can calculate a production rate value for any arbitrary values of temperature, tank volume and reactant concentrations, regardless of whether they obey physical laws (they must be positive) or are valid for the reaction of interest, expert knowledge should be incorporated to further refine the rules.
For example, expert knowledge can be used to constrain the applicable temperature range, outside which the liquid phase will vaporise to gas or freeze to solid, making it impossible for the reaction to proceed as normal. The final expert system will allow users to query the likely outcome, as a production rate or no reaction, for any combination of values of temperature, reactor volume and reactant concentrations.

In the next section, a number of real-world regression problems are employed to benchmark the predictive performance of our proposed model.

3. Results and Discussion

A total of 7 real-world datasets have been downloaded from the UCI machine learning repository (http://archive.ics.uci.edu/ml/) (Bache & Lichman, 2013) to test the prediction performance of our proposed method. The first regression problem, Yacht Hydrodynamics, predicts the hydrodynamic performance of sailing yachts from 7 features describing the hull dimensions and the velocity of the boat, for 308 samples. Energy Efficiency (Tsanas & Xifara, 2012) collects data corresponding to 768 building shapes, described by 8 features including wall area, roof area and so on. The aims are to establish the relationships between either heating load or cooling load requirements and the 8 parameters of the building. The third example, Concrete Strength (Yeh, 1998), looks into the relationship between the compressive strength of concrete and 8 input variables, including water concentration and age, with 1030 samples of different concretes. The Airfoil dataset concerns how different airfoil blade designs, wind speeds and angles of attack affect the sound pressure level. The last 2 case studies, Red Wine Quality and White Wine Quality (Cortez et al., 2009), aim to predict experts' preference of red and white wine taste from 11 physicochemical features of the wines. Almost 1600 red wine and 4900 white wine samples have been obtained for analysis.

For each of the 7 benchmark datasets, a 5-fold cross validation is performed to estimate the predictive accuracy of the proposed method. Given a dataset, 5-fold cross validation randomly splits the samples into 5 subsets of equal size. Each subset is in turn held out once while the other 4 subsets of samples are used in the training process to derive the regression function. The holdout set is then used to validate the predictive accuracy of the constructed regression function. We conduct 10 rounds of 5-fold cross validation by performing different random sample splits, and the mean absolute prediction errors (MAE) are averaged over the 50 testing sets as the final error.

For comparison purposes, a number of state-of-the-art regression methods have been implemented, including linear regression, MLP, kriging, SVR, KNN, random forest, MARS, PaceRegression and ALAMO. Linear regression, MLP, kriging, SVR, KNN and PaceRegression are implemented in the WEKA machine learning software (Hall et al., 2009). For KNN, the number of nearest neighbours is selected as 5, while for the other methods their default settings have been retained. Random forest is implemented using Orange (Demšar et al., 2013). We use the MATLAB toolbox ARESlab (Jekabsons, 2011) for MARS. ALAMO is reproduced using the General Algebraic Modeling System (GAMS) (GAMS Development Corporation, 2013), and basis function forms including polynomials of degrees up to 3, pair-wise multinomial terms of equal exponents up to 3, and exponential and logarithmic forms are provided for each dataset.
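The repeated cross-validation protocol described above can be sketched as follows, assuming scikit-learn and any regressor exposing fit/predict; LinearRegression and the toy data here are stand-ins for the methods and datasets actually compared.

```python
# A minimal sketch of 10 rounds of 5-fold cross-validation, with the mean
# absolute error averaged over the resulting 50 test folds.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression  # stand-in for any method

def repeated_cv_mae(model, X, y, rounds=10, folds=5):
    maes = []
    for seed in range(rounds):                      # 10 different random splits
        for tr, te in KFold(folds, shuffle=True, random_state=seed).split(X):
            model.fit(X[tr], y[tr])
            maes.append(mean_absolute_error(y[te], model.predict(X[te])))
    return np.mean(maes)                            # average over 50 test sets

rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)
print(repeated_cv_mae(LinearRegression(), X, y))
```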
Our proposed method is also implemented in GAMS. Both ALAMO and our proposed model are solved using the CPLEX MILP solver, with the optimality gap set to 0. A computational resource limit of 200 seconds is set for each solve of the OPLRA model in our proposed method.

3.1. Sensitivity analysis for β

In this subsection, a sensitivity analysis is performed for the parameter β, which serves as the terminating criterion of the iterative training procedure of our proposed method. Taking a value between 0 and 1, β defines the minimum percentage training error reduction that must be achieved to justify the allocation of an extra region. A range of values has been tested, including: 0.2, 0.15, 0.10, 0.05, 0.03 and 0.01. The results of the sensitivity analysis are provided in Figure 4 below.

Figure 4: Sensitivity analysis of β. The numbers above the points in each plot correspond to the average numbers of final regions.

Figure 4 describes how the mean absolute error changes with β. The numbers attached to the points in each plot are the average numbers of final regions, which always go up as β decreases. For the Yacht Hydrodynamics example, setting β = 0.20 results in just over 4 final regions. Decreasing the β value to 0.15 increases the prediction error slightly with a marginally higher number of regions. Further decreasing β to 0.10 leads to the lowest mean prediction error of 0.648 with an average of 5 regions, before excessively low values of β over-fit the unseen testing samples, yielding much increased prediction errors. For the Energy Efficiency Heating case study, when β = 0.10, 0.15 and 0.20 our proposed regression method constructs piece-wise regression functions with an average of 3 regions, yielding an MAE of 0.907. Smaller values of β lead to about 5 regions, which are shown to predict the testing samples with higher accuracy (MAE around 0.810). For the Energy Efficiency Cooling and Concrete Strength examples, a similar phenomenon can be observed: when β takes overly high values (i.e. 0.20, 0.15), the proposed method terminates prematurely with only 2 regions and relatively high MAE. More regions are allowed by lowering β, which gives higher prediction accuracies. On the Airfoil case study, the proposed method outputs a global multiple linear regression function without data partitions when β = 0.20. As β decreases, more regions are permitted, which predict unseen samples with better accuracy. With regard to the Red Wine Quality dataset, the optimal prediction occurs when β = 0.03. On the last example, White Wine Quality, the 2-region piece-wise regression functions achieved with β = 0.01, 0.03 and 0.05 outperform the global multiple linear regressions obtained for higher values of β.

It can be seen from Figure 4 that values in the range between 0.01 and 0.05 generally lead to smaller prediction errors than higher values of β. For all datasets except Yacht Hydrodynamics, the prediction errors for β = 0.01, 0.03 and 0.05 are evidently smaller than those for β = 0.10, 0.15 and 0.20. Within the range between 0.01 and 0.05, there is no clear optimal value for β, as different values have different effects on the accuracy. We instead seek to identify the most robust value for β, which gives consistently desirable prediction accuracy across a wide range of problems. For each dataset, we normalise the MAE of each β according to the formula:

$$\text{normalised }\mathrm{MAE}_\beta = \frac{\mathrm{MAE}_\beta - \min_{\beta'} \mathrm{MAE}_{\beta'}}{\min_{\beta'} \mathrm{MAE}_{\beta'}}$$
For example, in Yacht Hydrodynamics, the original MAE of β = 0.01 is normalised from 0.7131 to (0.7131 − 0.6481)/0.6481 = 10.0%, where 0.6481 is the lowest MAE, achieved when β = 0.10. The normalised MAE of each β represents its actual deviation from the lowest error, and is averaged over all examples to reflect its overall competitiveness.

Overall, β = 0.03 provides the smallest normalised MAE of 1.7%, which is marginally lower than those of β = 0.01 and β = 0.05, of 1.8% and 2.8% respectively. Even higher values of β correspond to noticeably larger normalised MAEs (5.6%, 9.7% and 12.3% for β = 0.10, 0.15 and 0.20, respectively). The consistently small normalised MAEs while β is between 0.01 and 0.05 show that our proposed regression method is robust with respect to the only user tuning parameter β. Finally, β is set to 0.03 when comparing with the other competing methods in the literature.

3.2. Prediction performance comparison

After identifying a value (i.e. 0.03) for the only tuning parameter β in our proposed regression method, we now compare the accuracy of the proposed method against some popular regression algorithms on the same set of 7 examples. The results of the comparison are presented in Table 2 below.

Table 2: Comparative testing of different regression methods on benchmark datasets (MAE; the lowest prediction error for each dataset is marked with an asterisk)

Method             Hydrodynamics  Energy Heating  Energy Cooling  Concrete  Airfoil  Red Wine  White Wine
Linear regression  7.270          2.089           2.266           8.311     0.037    0.506     0.586
MLP                0.809          0.993           1.924           6.229     0.035    0.581     0.623
Kriging            4.324          1.788           2.044           6.224     0.030    0.496     0.576
SVR                6.445          2.036           2.191           8.212     0.037    0.500     0.585
KNN                5.299          1.937           2.148           7.068     0.026*   0.515     0.537
Random forest      3.516          1.435           1.644           5.309     0.030    0.484     0.519*
MARS               1.011          0.796*          1.324           4.871     0.035    0.502     0.570
PaceRegression     7.233          2.089           2.261           8.298     0.037    0.507     0.586
ALAMO              0.787          2.722           2.765           8.044     0.032    0.594     0.639
Proposed           0.706*         0.810           1.278*          4.870*    0.029    0.481*    0.551

In Table 2, for each tested dataset, the lowest prediction error achieved among all implemented regression methods is marked with an asterisk. On the Hydrodynamics problem, the proposed method provides an MAE of 0.706, which is lower than that of any competing algorithm. ALAMO, MLP and MARS follow closely with MAEs of 0.787, 0.809 and 1.011, respectively. The mean errors of the rest of the methods are between 3 and 8. On Energy Efficiency Heating, MARS emerges as the most accurate algorithm with a mean absolute error of 0.796, which is closely matched by our proposed method and MLP. The mean prediction errors of the other approaches are almost all twice as large as that of MARS. On the Energy Efficiency Cooling dataset, the proposed method, MARS, random forest and MLP are the top 4 performers with MAEs between 1.278 and 1.924. On Concrete Strength, our proposed approach and MARS, with MAEs of 4.870 and 4.871, again emerge as the leading methods ahead of random forest, kriging, MLP and the others. On the Airfoil example, all the competing algorithms achieve similar prediction accuracies, with KNN topping the chart with an MAE of 0.026. The proposed approach is merely 0.003 behind, with kriging and random forest a further 0.001 behind; a mere 0.011 separates the 10 methods. Lastly, on the two Wine Quality examples, our proposed approach is ranked as the 1st and 3rd best method, respectively.
Overall, for 4 out of the 7 datasets, namely Yacht Hydrodynamics, Energy Efficiency Cooling, Concrete Strength and Red Wine Quality, the proposed regression method achieves the lowest prediction errors. For the other 3 tested examples, the proposed method still performs competitively, coming second on Energy Efficiency Heating and Airfoil, and third on White Wine Quality.

As there does not exist a single regression method that always outperforms all others on all datasets, a desirable regression algorithm should demonstrate consistently competitive prediction accuracy. In order to evaluate more comprehensively the relative competitiveness of all the implemented approaches, we employ the following scoring strategy: for each problem, the regression methods are ranked according to their mean prediction error. The best regression method, corresponding to the lowest prediction error, is awarded the maximum score of 10, the second best regression method, corresponding to the second lowest prediction error, is assigned a score of 9, and so on. The scores of each regression approach are averaged over the 7 datasets to represent the overall performance of the method. The higher the score, the better the relative performance of a method. The scores of the different regression approaches used in this work are presented in Figure 5 below.

Figure 5: Scoring of regression methods

According to Figure 5, the proposed method is shown to be the most accurate and robust regression algorithm among all, achieving a score of 9.43 out of a possible 10. Random forest and MARS are second and third in the ranking with scores of 8 and 7.43, followed by kriging, KNN, MLP, SVR, ALAMO, linear regression and PaceRegression in descending order. The advantage of the proposed regression method over the other implemented methods is quite obvious.

Lastly, we look, for each dataset, at the number of regions and the key partition feature determined by our proposed regression method. The results are summarised in Table 3. It is clear that the proposed segmented regression method provides good interpretability, as the numbers of regions are small (usually between 2 and 4 and at most 5). The partition features may reveal important insights into the underlying system, as the output variables change more dramatically across different ranges along these features.

Table 3: Number of regions and partition feature selected by our proposed method

Dataset          Number of regions  Partition feature
Hydrodynamics    5                  Froude number
Energy Heating   3                  Wall area
Energy Cooling   3                  Wall area
Concrete         3                  Age
Airfoil          4                  Frequency
Red Wine         2                  Alcohol
White Wine       2                  Volatile acidity

3.3. Strengths and weaknesses of the proposed piece-wise regression method

No regression method will be the best for all problems. In this section, we give a general illustration of the pros and cons of the proposed OPLRA piece-wise linear regression method, and compare it against some other literature methods. OPLRA piece-wise regression is inherently deterministic, which means the same solution is always guaranteed regardless of the number of runs executed. This is an advantage of OPLRA over stochastic methods, for example MLP, where each execution would typically end up with a different locally optimal solution. On the other hand, OPLRA is intuitive and easy to interpret.
OPLRA approximates the potentially highly non-linear relationship between output and input variables with piece-wise linear algebraic functions, a formalism that is easy to understand, interpret and use for users without sophisticated background knowledge. In contrast, the mechanisms of certain methods like SVR, MLP and kriging lack transparency, as the former two work as black-box techniques and the latter requires detailed knowledge of statistics. The small number of user-specified parameters involved in the training of OPLRA is another remarkable advantage. β is the only tuning parameter in the proposed OPLRA, which produces robust predictive performance with regard to varying values of β, as shown in the sensitivity analysis of Section 3.1. Conversely, the usage of certain regression methods, including SVR, MLP and kriging, requires tuning a large number of parameters, making it a challenging task to identify their optimal values. More importantly, OPLRA piece-wise regression achieves more accurate and robust prediction performance than the other methods. Using a large number of real-world problems, OPLRA has been shown to outperform popular state-of-the-art multivariate regression methods in terms of prediction accuracy, and to do so consistently.

With regard to the shortcomings of our proposed piece-wise regression method, training OPLRA generally consumes more computational resources than the existing methods in the literature. Solving the OPLRA combinatorial optimisation model is indeed a more computationally intensive task than heuristic-based methods, for example regression trees and MARS, or quadratic programming-based SVR. Therefore, we note here that this method is not designed for online applications where computation time is valued more than prediction accuracy and model interpretability. Another limitation of OPLRA is that it permits segmentation of only one input variable, which may not be adequate for datasets where the trend of the output variable changes dramatically in more than one input variable.

4. Concluding Remarks

This work addresses the problem of multivariate regression analysis, where one seeks to estimate the complex relationship between dependent output variables and independent input variables from training samples. The identified relationships can then be used to make predictions for unseen observations. We have proposed a novel piece-wise regression method, which approaches the problem by segmenting one input variable into multiple mutually exclusive regions and simultaneously fitting each one with a distinct multivariate linear function. An optimisation model has been proposed to optimise the locations of the break-points and the regression coefficients of each region, while a heuristic procedure has also been introduced to find the key partition feature and the number of break-points by repeatedly solving the optimisation model until a satisfactory solution is identified.

To demonstrate the applicability and efficiency of the proposed piece-wise regression method, 7 real-world problems have been employed, covering a wide range of application domains. To benchmark the predictive capability of the proposed method, we have also implemented various popular regression methods from the literature for comparison, including support vector regression, artificial neural networks, MARS and K-nearest neighbour.
Computational experiments clearly indicate that our proposed piece-wise regression method achieves consistently high predictive accuracy, leading to the lowest prediction errors for 4 out of the 7 datasets, the second lowest errors for 2 datasets and the third lowest error for the other example. The results confirm our proposed method as a reliable alternative to traditional regression analysis methods. Another remarkable advantage of our proposed method is that the learned model can be conveniently expressed as a set of if-then rules that are compact and easily understandable. From Table 3, it is clear that the numbers of if-then rules identified by our method as the hidden patterns in the large-scale databases (of up to thousands of expert-curated samples) are extremely small (usually 2 to 3 and at most 5). The model interpretability of the proposed piece-wise regression is a desirable advantage over black-box modelling techniques, for example support vector regression and neural networks.

With regard to the research contribution in expert and intelligent systems, the generic machine learning method proposed in this work can be used to construct a large number of automatic decision making or decision support systems for various domain applications. As the quality and coverage of the information contained in the knowledge base critically affect the efficiency of any expert and intelligent system, our proposed machine learning method can serve to acquire knowledge from databases automatically and more efficiently by approximating the relationships between output and input variables as rules. Subsequently, the discovered knowledge can be used to generate forecasts for users' enquiries.

To further improve the efficiency of the piece-wise regression method proposed in this work, the following limitations can be considered for refinement. As the method can only partition a single input variable, one potential improvement is to generalise it to permit segmentation of multiple variables, so as to better capture the non-linearity in datasets. Secondly, as our proposed method can only handle continuous input variables, we plan to improve its applicability by generalising it to deal with categorical input variables having many distinct levels. In addition, the relationship between output and input variables is approximated as linear within each segment in the current method, which may not adequately model the underlying patterns. To overcome this, more complex non-linear basis functions, for example polynomial, exponential and logarithmic forms, can be added to allow more flexibility. Another limitation of our method is the relatively high computational cost, which may restrict its usage in certain online applications, where the learning speed of the method is considered more important than the actual prediction accuracy. To tackle this problem, we can explore more efficient heuristic solution procedures that, by estimating the possible break-point positions and restricting the solution space, converge more quickly to a quality solution.

In terms of practical future applications in expert and intelligent systems, the proposed piece-wise regression method can benefit many via the automatic extraction of knowledge from databases to generate accurate forecasts. As examples, we have identified the following directions as possible avenues worth investigating in the near future.
First, our proposed method can be incorporated into the construction of a decision support expert system that continuously predicts the personalised risk posed by prisoners with mental illness upon release from jail, aiding clinicians in decision making (Constantinou et al., 2015). Other applications that can benefit from our work include intelligent drowsiness monitoring systems and stock price prediction. In drowsiness monitoring, the proposed regression model can be built into intelligent fatigue detection equipment, which records the dynamic physiological signals of drivers or medical staff and continuously predicts their level of fatigue. A warning will be automatically issued when the model predicts the fatigue level of a subject to be above a pre-specified threshold (Chen et al., 2015). In the financial area, our method can help with the construction of an automatic system that forecasts stock prices based on the ever-changing variables quantifying the current performance of a company, including assets, liabilities and income, providing management with data support for better financial decisions (Ballings et al., 2015). Lastly, the proposed method can also find application in the airline industry, where managers and decision makers can benefit from a framework capable of predicting the level of customer satisfaction from various aspects of service, making it possible for them to carefully allocate resources to maximise customer loyalty (Leong et al., 2015).

5. Acknowledgements

Funding from the UK Engineering and Physical Sciences Research Council (to LY, SL and LGP through the EPSRC Centre for Innovative Manufacturing in Emergent Macromolecular Therapies), the UK Leverhulme Trust (to ST and LGP, RPG-2012-686), the European Union (to ST, HEALTH-F2-2011-261366), and the Centre for Process Systems Engineering (CPSE) at Imperial and University College London are gratefully acknowledged.

References

Afantitis, A., Melagraki, G., Sarimveis, H., Koutentis, P. A., Markopoulos, J., & Igglessi-Markopoulou, O. (2006). A novel QSAR model for predicting induction of apoptosis by 4-aryl-4H-chromenes. Bioorganic and Medicinal Chemistry, 14, 6686–6694.

Alonso, F., Martínez, L., Pérez, A., & Valente, J. P. (2012). Cooperation between expert knowledge and data mining discovered knowledge: Lessons learned. Expert Systems with Applications, 39, 7524–7535.

Andrés, J. D., Lorca, P., de Cos Juez, F. J., & Sánchez-Lasheras, F. (2011). Bankruptcy forecasting: A hybrid approach using fuzzy c-means clustering and multivariate adaptive regression splines (MARS). Expert Systems with Applications, 38, 1866–1875.

Bache, K., & Lichman, M. (2013). UCI machine learning repository.

Bai, Y., Wang, P., Li, C., Xie, J., & Wang, Y. (2014). A multi-scale relevance vector regression approach for daily urban water demand forecasting. Journal of Hydrology, 517, 236–245.

Ballings, M., den Poel, D. V., Hespeels, N., & Gryp, R. (2015). Evaluating multiple classifiers for stock price direction prediction. Expert Systems with Applications, 42, 7046–7056.

Balshi, M. S., Mcguire, A. D., Duffy, P., Flannigan, M., Walsh, J., & Melillo, J. (2009). Assessing the response of area burned to changing climate in western boreal North America using a multivariate adaptive regression splines (MARS) approach. Global Change Biology, 15, 578–600.
Beck, J., Friedrich, D., Brandani, S., Guillas, S., & Fraga, E. (2012). Surrogate based optimisation for design of pressure swing adsorption systems. In Proceedings of the 22nd European Symposium on Computer Aided Process Engineering.

Bermolen, P., & Rossi, D. (2009). Support vector regression for link load prediction. Computer Networks, 53, 191–201.

Biau, G. (2012). Analysis of a random forests model. Journal of Machine Learning Research, 13, 1063–1095.

Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Wadsworth.

Brus, D. J., & Heuvelink, G. B. (2007). Optimization of sample patterns for universal kriging of environmental variables. Geoderma, 138, 86–95.

Caballero, J. A., & Grossmann, I. E. (2008). An algorithm for the use of surrogate models in modular flowsheet optimization. AIChE Journal, 54, 2633–2650.

Cavanaugh, K. C., Kellner, J. R., Forde, A. J., Gruner, D. S., Parker, J. D., Rodriguez, W., & Feller, I. C. (2014). Poleward expansion of mangroves is a threshold response to decreased frequency of extreme cold events. Proceedings of the National Academy of Sciences, 111, 723–727.

Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 27:1–27:27.

Chen, L., Zhao, Y., Zhang, J., & Zou, J.-z. (2015). Automatic detection of alertness/drowsiness from physiological signals using wavelet-based nonlinear features and machine learning. Expert Systems with Applications, 42, 7344–7355.

Chen, Q.-L., Wu, K.-J., & He, C.-H. (2014). Thermal conductivity of ionic liquids at atmospheric pressure: Database, analysis, and prediction using a topological index method. Industrial and Engineering Chemistry Research, 53, 7224–7232.

Cherkassky, V., & Ma, Y. (2004). Practical selection of SVM parameters and noise estimation for SVM regression. Neural Networks, 17, 113–126.

Comrie, A. C. (1997). Comparing neural networks and regression models for ozone forecasting. Journal of the Air and Waste Management Association, 47, 653–663.

Constantinou, A. C., Freestone, M., Marsh, W., Fenton, N., & Coid, J. (2015). Risk assessment and risk management of violent reoffending among prisoners. Expert Systems with Applications, 42, 7511–7529.

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47, 547–553.

Cozad, A., Sahinidis, N. V., & Miller, D. C. (2014). Learning surrogate models for simulation-based optimization. AIChE Journal, 60, 2211–2227.

Davis, E., & Ierapetritou, M. (2008). A kriging-based approach to MINLP containing black-box models and noise. Industrial and Engineering Chemistry Research, 47, 6101–6125.

Demšar, J., Curk, T., Erjavec, A., Gorup, Č., Hočevar, T., Milutinovič, M., Možina, M., Polajnar, M., Toplak, M., Starič, A., Štajdohar, M., Umek, L., Žagar, L., Žbontar, J., Žitnik, M., & Zupan, B. (2013). Orange: Data mining toolbox in Python. Journal of Machine Learning Research, 14, 2349–2353.

Dua, V. (2010). A mixed-integer programming approach for optimal configuration of artificial neural networks. Chemical Engineering Research and Design, 88, 55–60.
Eronen, A., & Klapuri, A. (2010). Music tempo estimation with k-NN regression. IEEE Transactions on Audio, Speech, and Language Processing, 18, 50–57.

Fanelli, G., Gall, J., & Van Gool, L. (2011). Real time head pose estimation with random regression forests. In 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 617–624).

Friedman, J. H. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19, 1–67.

GAMS Development Corporation (2013). General Algebraic Modeling System (GAMS) Release 24.2.1. Washington, DC, USA.

Genuer, R., Poggi, J.-M., & Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern Recognition Letters, 31, 2225–2236.

Gevrey, M., Dimopoulos, I., & Lek, S. (2003). Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecological Modelling, 160, 249–264.

Ghasemi, J., Saaidpour, S., & Brown, S. D. (2007). QSPR study for estimation of acidity constants of some aromatic acids derivatives using multiple linear regression (MLR) analysis. Journal of Molecular Structure: THEOCHEM, 805, 27–32.

Greene, M., Rolfson, O., Garellick, G., Gordon, M., & Nemes, S. (2015). Improved statistical analysis of pre- and post-treatment patient-reported outcome measures (PROMs): the applicability of piecewise linear regression splines. Quality of Life Research, 24, 567–573.

Gudise, V., & Venayagamoorthy, G. (2003). Comparison of particle swarm optimization and backpropagation as training algorithms for neural networks. In Proceedings of the 2003 IEEE Swarm Intelligence Symposium (SIS '03) (pp. 110–117).

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations Newsletter, 11, 10–18.

Helton, J., & Davis, F. (2003). Latin hypercube sampling and the propagation of uncertainty in analyses of complex systems. Reliability Engineering and System Safety, 81, 23–69.

Henao, C. A., & Maravelias, C. T. (2010). Surrogate-based process synthesis. In S. Pierucci, & G. B. Ferraris (Eds.), 20th European Symposium on Computer Aided Process Engineering (pp. 1129–1134). Elsevier, volume 28 of Computer Aided Chemical Engineering.

Henao, C. A., & Maravelias, C. T. (2011). Surrogate-based superstructure optimization framework. AIChE Journal, 57, 1216–1232.

Hill, T., Marquez, L., O'Connor, M., & Remus, W. (1994). Artificial neural network models for forecasting and decision making. International Journal of Forecasting, 10, 5–15.

Jekabsons, G. (2011). ARESLab: Adaptive Regression Splines toolbox for Matlab/Octave.

Khayet, M., Cojocaru, C., & Zakrzewska-Trznadel, G. (2008). Response surface modelling and optimization in pervaporation. Journal of Membrane Science, 321, 272–283.

Khuri, A. I., & Mukhopadhyay, S. (2010). Response surface methodology. Wiley Interdisciplinary Reviews: Computational Statistics, 2, 128–149.

Kleijnen, J. P. (2009). Kriging metamodeling in simulation: A review. European Journal of Operational Research, 192, 707–716.

Kleijnen, J. P. C., & Beers, W. C. M. v. (2004). Application-driven sequential designs for simulation experiments: Kriging metamodelling. The Journal of the Operational Research Society, 55, 876–883.
Kone, E. R. S., & Karwan, M. H. (2011). Combining a new data classification technique and regression analysis to predict the cost-to-serve new customers. Computers and Industrial Engineering, 61, 184–197.

Korhonen, K. T., & Kangas, A. (1997). Application of nearest-neighbour regression for generalizing sample tree information. Scandinavian Journal of Forest Research, 12, 97–101.

Leathwick, J., Elith, J., & Hastie, T. (2006). Comparative performance of generalized additive models and multivariate adaptive regression splines for statistical modelling of species distributions. Ecological Modelling, 199, 188–196.

Leong, L.-Y., Hew, T.-S., Lee, V.-H., & Ooi, K.-B. (2015). An SEM-artificial-neural-network analysis of the relationships between SERVPERF, customer satisfaction and loyalty among low-cost and full-service airline. Expert Systems with Applications, 42, 6620–6634.

Levis, A. A., & Papageorgiou, L. G. (2005). Customer demand forecasting via support vector regression analysis. Chemical Engineering Research and Design, 83, 1009–1018.

Li, B., Zhang, L., Yan, Q., & Xue, Y. (2014a). Application of piecewise linear regression in the detection of vegetation greenness trends on the Tibetan Plateau. International Journal of Remote Sensing, 35, 1526–1539.

Li, S., Feng, L., Benner, P., & Seidel-Morgenstern, A. (2014b). Using surrogate models for efficient optimization of simulated moving bed chromatography. Computers and Chemical Engineering, 67, 121–132.

Li, Y., Gong, S., & Liddell, H. (2000). Support vector regression and classification based multi-view face detection and recognition. In Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition (pp. 300–305).

Lloyd, C. D., & Atkinson, P. M. (2002). Deriving DSMs from LiDAR data with kriging. International Journal of Remote Sensing, 23, 2519–2524.

Loh, W.-Y. (2011). Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1, 14–23.

Lu, C.-J., Lee, T.-S., & Chiu, C.-C. (2009). Financial time series forecasting using independent component analysis and support vector regression. Decision Support Systems, 47, 115–125.

Magnani, A., & Boyd, S. (2009). Convex piecewise-linear fitting. Optimization and Engineering, 10, 1–17.

Malash, G. F., & El-Khaiary, M. I. (2010). Piecewise linear regression: A statistical method for the analysis of experimental adsorption data by the intraparticle-diffusion models. Chemical Engineering Journal, 163, 256–263.

Matthews, T. J., Steinbauer, M. J., Tzirkalli, E., Triantis, K. A., & Whittaker, R. J. (2014). Thresholds and the species–area relationship: a synthetic analysis of habitat island datasets. Journal of Biogeography, 41, 1018–1028.

Miller, D. C., Syamlal, M., Mebane, D. S., Storlie, C., Bhattacharyya, D., Sahinidis, N. V., Agarwal, D., Tong, C., Zitney, S. E., Sarkar, A., Sun, X., Sundaresan, S., Ryan, E., Engel, D., & Dale, C. (2014). Carbon capture simulation initiative: A case study in multiscale modeling and new challenges. Annual Review of Chemical and Biomolecular Engineering, 5, 301–323.
Minjares-Fuentes, R., Femenia, A., Garau, M., Meza-Velázquez, J., Simal, S., & Rosselló, C. (2014). Ultrasound-assisted extraction of pectins from grape pomace using citric acid: A response surface methodology approach. Carbohydrate Polymers, 106, 179–189.

Muggeo, V. M. (2008). segmented: an R package to fit regression models with broken-line relationships. R News, 8, 20–25.

Nuchitprasittichai, A., & Cremaschi, S. (2013). An algorithm to determine sample sizes for optimization with artificial neural networks. AIChE Journal, 59, 805–812.

Paliwal, M., & Kumar, U. A. (2009). Neural networks and statistical techniques: A review of applications. Expert Systems with Applications, 36, 2–17.

Palmer, K., & Realff, M. (2002). Metamodeling approach to optimization of steady-state flowsheet simulations: Model generation. Chemical Engineering Research and Design, 80, 760–772.

Pan, J., Kung, P., Bretholt, A., & Lu, J. (2014). Prediction of energy's environmental impact using a three-variable time series model. Expert Systems with Applications, 41, 1031–1040.

Papadopoulos, H., Vovk, V., & Gammerman, A. (2011). Regression conformal prediction with nearest neighbours. Journal of Artificial Intelligence Research, 40, 815–840.

Quinlan, J. R. (1992). Learning with continuous classes. In Proceedings of the Australian Joint Conference on Artificial Intelligence (pp. 343–348). World Scientific.

R Development Core Team (2008). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

Rafiq, M., Bugmann, G., & Easterbrook, D. (2001). Neural network design for engineering applications. Computers and Structures, 79, 1541–1552.

Sampson, P. D., Richards, M., Szpiro, A. A., Bergen, S., Sheppard, L., Larson, T. V., & Kaufman, J. D. (2013). A regionalized national universal kriging model using partial least squares regression for estimating annual PM2.5 concentrations in epidemiology. Atmospheric Environment, 75, 383–392.

Sarimveis, H., Alexandridis, A., Mazarakis, S., & Bafas, G. (2004). A new algorithm for developing dynamic radial basis function neural network models based on genetic algorithms. Computers and Chemical Engineering, 28, 209–217.

Scheuber, M. (2010). Potentials and limits of the k-nearest-neighbour method for regionalising sample-based data in forestry. European Journal of Forest Research, 129, 825–832.

Smola, A., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14, 199–222.

Strikholm, B. (2006). Determining the number of breaks in a piecewise linear regression model. Working Paper Series in Economics and Finance 648, Stockholm School of Economics.

Tibshirani, R. (1994). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
Tibshirani, R. (2011). Regression shrinkage and selection via the lasso: a retrospective. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73, 273–282.

Toms, J. D., & Lesperance, M. L. (2003). Piecewise regression: A tool for identifying ecological thresholds. Ecology, 84, 2034–2041.

Toriello, A., & Vielma, J. P. (2012). Fitting piecewise linear continuous functions. European Journal of Operational Research, 219, 86–95.

Tsanas, A., & Xifara, A. (2012). Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools. Energy and Buildings, 49, 560–567.

Venkatesh, K., Ravi, V., Prinzie, A., & den Poel, D. V. (2014). Cash demand forecasting in ATMs by clustering and neural networks. European Journal of Operational Research, 232, 383–392.

Viana, F. A. C., Simpson, T. W., Balabanov, V., & Toropov, V. (2014). Metamodeling in multidisciplinary design optimization: How far have we really come? AIAA Journal, 52, 670–690.

Wu, K.-J., Chen, Q.-L., & He, C.-H. (2014). Speed of sound of ionic liquids: Database, estimation, and its application for thermal conductivity prediction. AIChE Journal, 60, 1120–1131.

Xue, Y., Liu, S., Zhang, L., & Hu, Y. (2013). Integrating fuzzy logic with piecewise linear regression for detecting vegetation greenness change in the Yukon River basin, Alaska. International Journal of Remote Sensing, 34, 4242–4263.

Yeh, I.-C. (1998). Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research, 28, 1797–1808.

Zhang, J.-R., Zhang, J., Lok, T.-M., & Lyu, M. R. (2007). A hybrid particle swarm optimization–back-propagation algorithm for feedforward neural network training. Applied Mathematics and Computation, 185, 1026–1037.

Zhang, Y., & Sahinidis, N. V. (2013). Uncertainty quantification in CO2 sequestration using surrogate models from polynomial chaos expansion. Industrial and Engineering Chemistry Research, 52, 3121–3132.

Zhu, Q., & Lin, H. (2010). Comparing ordinary kriging and regression kriging for soil properties in contrasting landscapes. Pedosphere, 20, 594–606.