A reduced data set method for support vector regression

Horng-Lin Shieh, Cheng-Chien Kuo
Department of Electrical Engineering, Saint John's University, Taipei, Taiwan

Keywords: Support vector regression; Outlier; Fuzzy clustering; Robust fuzzy c-means

Abstract

Support vector regression (SVR) has been very successful in pattern recognition, text categorization, and function approximation. The theory of SVR is based on the idea of structural risk minimization. In real application systems, data domains often suffer from noise and outliers. When noise and/or outliers exist in the sampling data, the SVR may try to fit those improper data, and the resulting model may suffer from overfitting. In addition, the memory required to store the kernel matrix of SVR grows as O(N^2), where N is the number of training data, so for a large training data set the kernel matrix cannot be kept in memory. In this paper, a reduced support vector regression is proposed for nonlinear function approximation problems with noise and outliers. The core idea of this approach is to adopt fuzzy clustering and a robust fuzzy c-means (RFCM) algorithm to reduce the computational time of SVR and greatly mitigate the influence of data noise and outliers.

1. Introduction

The theory of support vector machines (SVM), developed by Vapnik (1995), is gaining in popularity due to its many attractive features. The SVM is based on the idea of structural risk minimization (SRM) and has been shown to be superior to the traditional empirical risk minimization (ERM) principle employed by conventional neural networks (Gunn, 1998). SVM has been successfully applied to a number of applications, such as classification, time series prediction, pattern recognition, and regression (Burges, 1998; Jair, Xiaoou, Wen, & Kang, 2008; Kamruzzaman & Begg, 2006; Kumar, Kulkarni, Jayaraman, & Kulkarni, 2004; Lijuan, 2003; Wong & Hsu, 2006; Zhou, Zhang, & Jiao, 2002). In many intelligent systems, SVM has been shown to provide higher performance than traditional learning machines, and has thus been adopted as a tool for solving classification problems (Lin & Wang, 2002). Over the past few years, many researchers in the neural network and machine learning fields have devoted themselves to research on SVM (Wang & Xu, 2004).

The SVM is systematic and properly motivated by statistical learning theory (Vapnik, 1998). Training an SVM involves optimizing a convex cost function, so the learning process reaches a global minimum (Campbell, 2002). In addition, SVM can handle large input spaces and can automatically identify a small subset of informative points, namely the support vectors (Gustavo et al., 2004). The SVM can also be applied to regression problems through the introduction of an alternative loss function (Gunn, 1998); such approaches are often called support vector regression (SVR).

SVM maps the input data into a high-dimensional feature space and searches for a separating hyperplane that maximizes the margin between two classes. SVM adopts quadratic programming (QP) to maximize the margin, so the computing task becomes very challenging when the number of data points exceeds a few thousand (Hu & Song, 2004). For example, in Fig. 1 there are 500 sampling data generated from a sine wave with Gaussian noise N(0, 0.447).
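For concreteness, the following is a minimal sketch of how a data set like that of Fig. 1 can be generated, assuming NumPy. The input range, the random seed, and the reading of 0.447 as the noise standard deviation are assumptions; the text only specifies the sample size and the distribution N(0, 0.447).

```python
import numpy as np

# Illustrative reconstruction of the Fig. 1 data: 500 samples of a sine
# wave corrupted by zero-mean Gaussian noise. The input range [0, 2*pi],
# the seed, and reading 0.447 as the standard deviation are assumptions.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 500)
y = np.sin(x) + rng.normal(loc=0.0, scale=0.447, size=x.shape)
```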
The SVR algorithm is adopted to construct this function. The entries of the kernel matrix of SVR are floating-point numbers, and each floating-point number requires 4 bytes of storage, so the total memory required is 500 x 500 x 4 = 1,000,000 bytes. The SVR algorithm was performed on a Pentium 4, 1.8 GHz, with 128 MB of memory running Windows XP, and the total execution time of the simulation was 21,941 s (more than 6 h). Such an execution time is very long, and the memory requirement is very large, for real scientific applications.

Osuna, Freund, and Girosi (1997) proposed a generalized decomposition strategy for the standard SVM, in which the original QP problem is replaced by a series of smaller sub-problems that are proved to converge to a global optimum. However, it is well known that the decomposition process relies heavily on the selection of a good working set of the data, which normally starts with a random subset (Hu & Song, 2004). Lee and Huang (2007) proposed to restrict the number of support vectors by solving reduced support vector machines (RSVM). The main characteristic of this method is to reduce the kernel matrix from l x l to l x m, where m is the size of a randomly selected subset of training data that are considered as candidates for support vectors. The smaller matrix can easily be stored in memory, and optimization algorithms, such as the Newton method, can then be applied (Lin & Lin, 2003). However, as shown by Lin and Lin (2003), numerical experiments indicate that the accuracy of RSVM is usually lower than that of SVM.

Because the support vectors describe the distributional features of all the data, removing trivial data from the whole training set will, according to the characteristics of SVM, not greatly affect the outcome, but will effectively speed up the training process (Wang & Xu, 2004). A reduced set method based on the measurement of similarities between samples was developed by Wang and Xu (2004). In that method, samples that are similar to some data point are discarded under a pre-established similarity threshold; in other words, these samples are so similar to that particular data point that their influence on the prediction function can be ignored. With this method, a large number of training vectors are discarded, so a faster SVM training can be obtained without compromising the generalization capability of SVM (a minimal sketch of this idea is given below). However, like the K-means clustering algorithm, the disadvantage of this algorithm is that the number of clusters must be predetermined, yet in some real applications there is no information with which to predefine the number of clusters.
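As an illustration only, the following sketch conveys the general flavor of such a similarity-based reduction: samples lying within a pre-established distance threshold of an already retained sample are discarded. This is a simplified stand-in, not the exact procedure of Wang and Xu (2004); the function name and the Euclidean similarity measure are assumptions.

```python
import numpy as np

def reduce_by_similarity(X, y, threshold):
    # Greedy filter: keep a sample only if it is farther than `threshold`
    # (Euclidean distance) from every sample retained so far. Samples that
    # are too similar to a retained point are discarded, on the premise
    # that their influence on the prediction function can be ignored.
    keep = []
    for i in range(len(X)):
        if all(np.linalg.norm(X[i] - X[j]) > threshold for j in keep):
            keep.append(i)
    return X[keep], y[keep]

# e.g., X_red, y_red = reduce_by_similarity(x.reshape(-1, 1), y, threshold=0.05)
```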
In real applications, data is bound to contain noise and outliers, and algorithms used in engineering and scientific applications must be robust in order to process such data. In system modeling, when noise and/or outliers exist in the sampling data, the system model may try to fit those improper data, and the result may suffer from overfitting (Chung, Su, & Hsiao, 2000; Shieh, Yang, Chang, & Jeng, 2009). SVR has been shown to perform excellently with both the epsilon-insensitive and Huber's robust loss functions when matched to the correct type of noise in an application of time series prediction (Mukherjee, Osuna, & Girosi, 1997). However, in this SVR approach, outliers may be taken as support vectors, and the inclusion of outliers among the support vectors may lead to serious overfitting (Chung, 2000). In this paper, in order to overcome the above problems, a robust fuzzy clustering method is proposed to greatly mitigate the influence of noise and outliers in the sampling data, and the SVR method is then used to construct the system models. Three experiments are presented, and their results show that the proposed approach achieves better performance and shorter execution time than the original SVR method in various kinds of data domains with noise and outliers.

2. Support vector regression

The model of learning from examples can be considered within a general statistical framework of minimizing the expected loss using sampling data. Suppose there are n independent identically distributed (i.i.d.) data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where x_i ∈ R^d, y_i ∈ R, i = 1, 2, ..., n, drawn according to an unknown probability distribution function P(x, y) = P(x)P(y|x). Given a set of functions f(x, α), α ∈ Λ, where Λ is a parameter set, the goal of the learning process is to choose a function f(x, α_0) that best captures the relationship between the input and output pairs. Consider a measure of the loss L(y, f(x, α_0)) between the output y of the sampling data for a given input x and the response f(x, α_0) provided by the learning machine. In order to obtain f(x, α_0), one has to minimize the expected risk functional

R(\alpha) = \int L(y, f(x, \alpha)) \, dP(x, y).    (1)

A common choice for the loss function is the L2-norm, i.e., L(e) = e^2 = (y - f(x, α_0))^2. However, because P(x, y) is unknown, R(α) cannot be evaluated directly from Eq. (1). In general, the expected risk functional is replaced by the empirical risk functional

R_{emp}(\alpha) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i, \alpha)).    (2)

No probability distribution appears in Eq. (2). However, in real application systems, data domains often suffer from noise and outliers. When noise and/or outliers exist in the sampling data, Eq. (2) may try to fit those improper data, and the resulting model may suffer from overfitting.

Let the sampling data be represented as {(x_i, y_i) | x_i ∈ R^d, y_i ∈ R}, i = 1, 2, ..., n. In the SVR method, the regression function is approximated by the following function:

f = \sum_{i=1}^{n} w_i \varphi(x_i) + b,    (3)

where {φ(x_i)}_{i=1}^{n} are the features of the inputs, and {w_i}_{i=1}^{n} and b are coefficients. The coefficients are estimated by minimizing the regularized risk function (Wang & Xu, 2004)

R(C) = C \frac{1}{n} \sum_{i=1}^{n} L(y, f) + \frac{1}{2} \|w\|^2,    (4)

where L(y, f) adopts the epsilon-insensitive loss function, defined as

L(y, f) = \begin{cases} |y - f| - \varepsilon, & |y - f| \ge \varepsilon, \\ 0, & \text{otherwise}, \end{cases}    (5)

and ε ≥ 0 is a predefined parameter. In Eq. (4), the second term, (1/2)||w||^2, is used as a measurement of the flatness of function (3), and C is a regularization constant determining the tradeoff between the training error and the model flatness.
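The epsilon-insensitive loss of Eq. (5) and the regularized risk of Eq. (4) translate directly into code; the following is a minimal NumPy sketch (the function names are illustrative, not from the paper).

```python
import numpy as np

def eps_insensitive_loss(y, f, eps):
    # Eq. (5): errors within the eps-tube are ignored; larger errors are
    # penalized linearly by |y - f| - eps.
    return np.maximum(np.abs(y - f) - eps, 0.0)

def regularized_risk(y, f, w, eps, C):
    # Eq. (4): C * (1/n) * sum of eps-insensitive losses plus the
    # flatness term 0.5 * ||w||^2.
    return C * np.mean(eps_insensitive_loss(y, f, eps)) + 0.5 * np.dot(w, w)
```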
SVR introduces slack variables ξ_i, ξ_i^* and leads Eq. (4) to the following constrained form (Wang & Xu, 2004):

minimize   R(w, \xi, \xi^*) = C \sum_{i=1}^{n} (\xi_i + \xi_i^*) + \frac{1}{2} \|w\|^2,    (6)

subject to
w \varphi(x_i) + b - y_i \le \varepsilon + \xi_i^*,
y_i - w \varphi(x_i) - b \le \varepsilon + \xi_i,    (7)
\xi_i, \xi_i^* \ge 0,

where ξ, ξ^* are slack variables representing the upper and lower constraints on the outputs of the system. Thus, function (3) takes the explicit form

f(x, \alpha_i, \alpha_i^*) = \sum_{i=1}^{n} w_i \varphi(x_i) + b = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) \varphi(x_i)^T \varphi(x) + b.    (8)

[Fig. 1. Sine wave with noise N(0, 0.447).]
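As a usage illustration, the epsilon-insensitive SVR of Eqs. (6)-(8) can be fitted to data such as that of Fig. 1 with any standard solver; the sketch below uses scikit-learn's SVR with an RBF kernel purely as one readily available implementation. The kernel choice and the values of C, epsilon, and gamma are assumptions, not parameters taken from the paper.

```python
import numpy as np
from sklearn.svm import SVR

# Noisy sine-wave data as in Fig. 1 (seed and input range are assumptions).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 500).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.447, size=len(x))

# epsilon-insensitive SVR with an RBF kernel; C, epsilon, and gamma are
# illustrative values only.
model = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=1.0)
model.fit(x, y)
y_hat = model.predict(x)
print("number of support vectors:", len(model.support_))
```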