A reduced data set method for support vector regression

Horng-Lin Shieh, Cheng-Chien Kuo
Department of Electrical Engineering, Saint John's University, Taipei, Taiwan

Keywords: Support vector regression; Outlier; Fuzzy clustering; Robust fuzzy c-means

Abstract

Support vector regression (SVR) has been very successful in pattern recognition, text categorization, and function approximation. The theory of SVR is based on the idea of structural risk minimization. In real application systems, data domains often suffer from noise and outliers. When noise and/or outliers exist in the sampling data, the SVR may try to fit those improper data, and the resulting model may suffer from overfitting. In addition, the memory required to store the kernel matrix of SVR grows as O(N^2), where N is the number of training data, so for a large training data set the kernel matrix cannot be kept in memory. In this paper, a reduced support vector regression is proposed for nonlinear function approximation problems with noise and outliers. The core idea of this approach is to adopt fuzzy clustering and a robust fuzzy c-means (RFCM) algorithm to reduce the computational time of SVR and greatly mitigate the influence of data noise and outliers.

1. Introduction

The theory of support vector machines (SVM), developed by Vapnik (1995), is gaining in popularity due to its many attractive features. The SVM is based on the idea of structural risk minimization (SRM) and has been shown to be superior to the traditional empirical risk minimization (ERM) principle employed by conventional neural networks (Gunn, 1998). SVM has been successfully applied to a number of applications, such as classification, time series prediction, pattern recognition, and regression (Burges, 1998; Jair, Xiaoou, Wen, & Kang, 2008; Kamruzzaman & Begg, 2006; Kumar, Kulkarni, Jayaraman, & Kulkarni, 2004; Lijuan, 2003; Wong & Hsu, 2006; Zhou, Zhang, & Jiao, 2002). In many intelligent systems, SVM has been shown to provide higher performance than traditional learning machines, and has thus been adopted as a tool for solving classification problems (Lin & Wang, 2002). Over the past few years, many researchers in the neural network and machine learning fields have devoted themselves to research on SVM (Wang & Xu, 2004).

The SVM is systematic and properly motivated by statistical learning theory (Vapnik, 1998). Training an SVM involves optimizing a convex cost function, so the learning process reaches a global minimum (Campbell, 2002). In addition, SVM can handle large input spaces and can automatically identify a small subset of informative points, namely the support vectors (Gustavo et al., 2004). The SVM can also be applied to regression problems through the introduction of an alternative loss function (Gunn, 1998); such approaches are often called support vector regression (SVR).

SVM maps the input data into a high-dimensional feature space and searches for a separating hyperplane that maximizes the margin between two classes. SVM adopts quadratic programming (QP) to maximize the margin, so the computing task becomes very challenging when the number of data points exceeds a few thousand (Hu & Song, 2004). For example, in Fig. 1 there are 500 sampling data generated from a sine wave with Gaussian noise N(0, 0.447).
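For concreteness, the following is a minimal sketch of how a data set like that of Fig. 1 can be generated, assuming NumPy. The input range, the random seed, and the reading of 0.447 as the noise standard deviation are assumptions; the text only specifies the sample size and the distribution N(0, 0.447).

```python
import numpy as np

# Illustrative reconstruction of the Fig. 1 data: 500 samples of a sine
# wave corrupted by zero-mean Gaussian noise. The input range [0, 2*pi],
# the seed, and reading 0.447 as the standard deviation are assumptions.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 500)
y = np.sin(x) + rng.normal(loc=0.0, scale=0.447, size=x.shape)
```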
The SVR algorithm is adopted to construct this function. The entries of the kernel matrix of SVR are floating-point numbers, and each floating-point number requires 4 bytes of storage, so the total memory required is 500 x 500 x 4 = 1,000,000 bytes. The SVR algorithm was performed on a Pentium 4, 1.8 GHz, with 128 MB of memory running Windows XP, and the total execution time of the simulation was 21,941 s (more than 6 h). Such an execution time is very long, and the memory requirement is very large, for real scientific applications.

Osuna, Freund, and Girosi (1997) proposed a generalized decomposition strategy for the standard SVM, in which the original QP problem is replaced by a series of smaller sub-problems that are proved to converge to a global optimum. However, it is well known that the decomposition process relies heavily on the selection of a good working set of the data, which normally starts with a random subset (Hu & Song, 2004). Lee and Huang (2007) proposed to restrict the number of support vectors by solving reduced support vector machines (RSVM). The main characteristic of this method is to reduce the kernel matrix from l x l to l x m, where m is the size of a randomly selected subset of training data that are considered as candidates for support vectors. The smaller matrix can easily be stored in memory, and optimization algorithms, such as the Newton method, can then be applied (Lin & Lin, 2003). However, as shown by Lin and Lin (2003), numerical experiments indicate that the accuracy of RSVM is usually lower than that of SVM.

Because the support vectors describe the distributional features of all the data, removing trivial data from the whole training set will, according to the characteristics of SVM, not greatly affect the outcome, but will effectively speed up the training process (Wang & Xu, 2004). A reduced set method based on the measurement of similarities between samples was developed by Wang and Xu (2004). In that method, samples that are similar to some data point are discarded under a pre-established similarity threshold; in other words, these samples are so similar to that particular data point that their influence on the prediction function can be ignored. With this method, a large number of training vectors are discarded, so a faster SVM training can be obtained without compromising the generalization capability of SVM (a minimal sketch of this idea is given below). However, like the K-means clustering algorithm, the disadvantage of this algorithm is that the number of clusters must be predetermined, yet in some real applications there is no information with which to predefine the number of clusters.
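As an illustration only, the following sketch conveys the general flavor of such a similarity-based reduction: samples lying within a pre-established distance threshold of an already retained sample are discarded. This is a simplified stand-in, not the exact procedure of Wang and Xu (2004); the function name and the Euclidean similarity measure are assumptions.

```python
import numpy as np

def reduce_by_similarity(X, y, threshold):
    # Greedy filter: keep a sample only if it is farther than `threshold`
    # (Euclidean distance) from every sample retained so far. Samples that
    # are too similar to a retained point are discarded, on the premise
    # that their influence on the prediction function can be ignored.
    keep = []
    for i in range(len(X)):
        if all(np.linalg.norm(X[i] - X[j]) > threshold for j in keep):
            keep.append(i)
    return X[keep], y[keep]

# e.g., X_red, y_red = reduce_by_similarity(x.reshape(-1, 1), y, threshold=0.05)
```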
In real applications, data is bound to contain noise and outliers, and algorithms used in engineering and scientific applications must be robust in order to process such data. In system modeling, when noise and/or outliers exist in the sampling data, the system model may try to fit those improper data, and the result may suffer from overfitting (Chung, Su, & Hsiao, 2000; Shieh, Yang, Chang, & Jeng, 2009). SVR has been shown to perform excellently with both the epsilon-insensitive and Huber's robust loss functions when matched to the correct type of noise in an application of time series prediction (Mukherjee, Osuna, & Girosi, 1997). However, in this SVR approach, outliers may be taken as support vectors, and the inclusion of outliers among the support vectors may lead to serious overfitting (Chung, 2000). In this paper, in order to overcome the above problems, a robust fuzzy clustering method is proposed to greatly mitigate the influence of noise and outliers in the sampling data, and the SVR method is then used to construct the system models. Three experiments are presented, and their results show that the proposed approach achieves better performance and shorter execution time than the original SVR method in various kinds of data domains with noise and outliers.

2. Support vector regression

The model of learning from examples can be considered within a general statistical framework of minimizing the expected loss using sampling data. Suppose there are n independent identically distributed (i.i.d.) data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where x_i ∈ R^d, y_i ∈ R, i = 1, 2, ..., n, drawn according to an unknown probability distribution function P(x, y) = P(x)P(y|x). Given a set of functions f(x, α), α ∈ Λ, where Λ is a parameter set, the goal of the learning process is to choose a function f(x, α_0) that best captures the relationship between the input and output pairs. Consider a measure of the loss L(y, f(x, α_0)) between the output y of the sampling data for a given input x and the response f(x, α_0) provided by the learning machine. In order to obtain f(x, α_0), one has to minimize the expected risk functional

R(\alpha) = \int L(y, f(x, \alpha)) \, dP(x, y).    (1)

A common choice for the loss function is the L2-norm, i.e., L(e) = e^2 = (y - f(x, α_0))^2. However, because P(x, y) is unknown, R(α) cannot be evaluated directly from Eq. (1). In general, the expected risk functional is replaced by the empirical risk functional

R_{emp}(\alpha) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i, \alpha)).    (2)

No probability distribution appears in Eq. (2). However, in real application systems, data domains often suffer from noise and outliers. When noise and/or outliers exist in the sampling data, Eq. (2) may try to fit those improper data, and the resulting model may suffer from overfitting.

Let the sampling data be represented as {(x_i, y_i) | x_i ∈ R^d, y_i ∈ R}, i = 1, 2, ..., n. In the SVR method, the regression function is approximated by the following function:

f = \sum_{i=1}^{n} w_i \varphi(x_i) + b,    (3)

where {φ(x_i)}_{i=1}^{n} are the features of the inputs, and {w_i}_{i=1}^{n} and b are coefficients. The coefficients are estimated by minimizing the regularized risk function (Wang & Xu, 2004)

R(C) = C \frac{1}{n} \sum_{i=1}^{n} L(y, f) + \frac{1}{2} \|w\|^2,    (4)

where L(y, f) adopts the epsilon-insensitive loss function, defined as

L(y, f) = \begin{cases} |y - f| - \varepsilon, & |y - f| \ge \varepsilon, \\ 0, & \text{otherwise}, \end{cases}    (5)

and ε ≥ 0 is a predefined parameter. In Eq. (4), the second term, (1/2)||w||^2, is used as a measurement of the flatness of function (3), and C is a regularization constant determining the tradeoff between the training error and the model flatness.
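The epsilon-insensitive loss of Eq. (5) and the regularized risk of Eq. (4) translate directly into code; the following is a minimal NumPy sketch (the function names are illustrative, not from the paper).

```python
import numpy as np

def eps_insensitive_loss(y, f, eps):
    # Eq. (5): errors within the eps-tube are ignored; larger errors are
    # penalized linearly by |y - f| - eps.
    return np.maximum(np.abs(y - f) - eps, 0.0)

def regularized_risk(y, f, w, eps, C):
    # Eq. (4): C * (1/n) * sum of eps-insensitive losses plus the
    # flatness term 0.5 * ||w||^2.
    return C * np.mean(eps_insensitive_loss(y, f, eps)) + 0.5 * np.dot(w, w)
```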
SVR introduces slack variables ξ_i, ξ_i^* and leads Eq. (4) to the following constrained form (Wang & Xu, 2004):

minimize   R(w, \xi, \xi^*) = C \sum_{i=1}^{n} (\xi_i + \xi_i^*) + \frac{1}{2} \|w\|^2,    (6)

subject to
w \varphi(x_i) + b - y_i \le \varepsilon + \xi_i^*,
y_i - w \varphi(x_i) - b \le \varepsilon + \xi_i,    (7)
\xi_i, \xi_i^* \ge 0,

where ξ, ξ^* are slack variables representing the upper and lower constraints on the outputs of the system. Thus, function (3) takes the explicit form

f(x, \alpha_i, \alpha_i^*) = \sum_{i=1}^{n} w_i \varphi(x_i) + b = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) \varphi(x_i)^T \varphi(x) + b.    (8)

[Fig. 1. Sine wave with noise N(0, 0.447).]
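As a usage illustration, the epsilon-insensitive SVR of Eqs. (6)-(8) can be fitted to data such as that of Fig. 1 with any standard solver; the sketch below uses scikit-learn's SVR with an RBF kernel purely as one readily available implementation. The kernel choice and the values of C, epsilon, and gamma are assumptions, not parameters taken from the paper.

```python
import numpy as np
from sklearn.svm import SVR

# Noisy sine-wave data as in Fig. 1 (seed and input range are assumptions).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 500).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.447, size=len(x))

# epsilon-insensitive SVR with an RBF kernel; C, epsilon, and gamma are
# illustrative values only.
model = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=1.0)
model.fit(x, y)
y_hat = model.predict(x)
print("number of support vectors:", len(model.support_))
```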