
Optimization Algorithm with Kernel PCA to Support Vector Machines for Time Series Prediction

Qisong Chen

College of Computer Science and Technology, Guizhou University, Guiyang, China Email: [email protected]

Xiaowei Chen and Yun Wu

College of Computer Science and Technology, Guizhou University, Guiyang, China Email: [email protected], [email protected]

Abstract—As an effective tool in pattern recognition and machine learning, the support vector machine (SVM) has been widely adopted. In developing a successful SVM classifier, eliminating noise and extracting features are very important. This paper proposes applying kernel Principal Component Analysis (KPCA) to SVM for feature extraction. A particle swarm optimization (PSO) algorithm is then adopted to optimize the SVM parameters. The resulting time series analysis model integrates the advantages of the wavelet transform, PSO, KPCA and SVM. Compared with other predictors, this model has greater generalization ability and higher accuracy.

Index Terms—KPCA, SVM, wavelet transform, PSO, time series, prediction

I. INTRODUCTION

In pattern recognition, and especially in machine learning, successful time series prediction is an important practical problem. Two of the key difficulties in time series prediction are noise and non-stationarity. Milidiu et al. [1][2][3] generalized the mixture-of-experts (ME) architecture into a two-stage architecture to handle non-stationarity in the data.

Statistical Learning Theory (SLT), introduced by Vapnik in 1995, focuses on machine learning from small sample sets [4][5][6], and the support vector machine (SVM) is a classification and regression tool based on this theory. Built on the structural risk minimization principle, SVM is resistant to over-fitting and has been shown to avoid limitations of artificial neural networks (ANN). Recently SVM has been very successful in pattern recognition, function estimation, and modeling of nonlinear dynamic systems [5].

Unlike traditional methods, which implement the Empirical Risk Minimization Principle, SVM implements the Structural Risk Minimization Principle, which seeks to minimize an upper bound of the generalization error rather than the training error. This results in better generalization performance for SVM than for traditional methods.

In developing an SVM classifier, the first important step is feature selection and extraction. All available features can be used as the inputs of SVM, but irrelevant or correlated features can degrade its generalization performance.

The purpose of this paper is to propose the application of KPCA to SVM for feature extraction. The new features are then used as the inputs of SVM to solve the classification problem. On time series prediction data, simulations show that SVM with KPCA feature extraction performs better than SVM without feature extraction.

The rest of this paper is organized as follows. Section II briefly introduces some basic theory. Section III presents the PSO-SVM prediction model for time series. Section IV gives the experimental results. Section V gives the application and analysis, followed by the conclusions in the last section.

II. BACKGROUND INFORMATION

A. Discrete Wavelet Transform

Many transform techniques have been proposed for signal compression [7][8]. Among these methods, the wavelet transform has shown promising results in various areas [9][10]. The discrete wavelet transform (DWT) represents the signal by coefficients that are partly localized in time and frequency; the representation is non-redundant and invertible [11]. Processing the coefficients may allow them to be represented with fewer bits than are needed for the original signal. The coefficients of the DWT are the projections of the signal onto a set of basis functions generated by translation and dilation of a prototype function, called the mother wavelet. The choice of mother wavelet determines the signal representation. Many algorithms based on the DWT have been proposed for signal compression.

380 JOURNAL OF COMPUTERS, VOL. 5, NO. 3, MARCH 2010

© 2010 ACADEMY PUBLISHER doi:10.4304/jcp.5.3.380-387

Suppose the function $\psi(t) \in L^2(R)$ and its Fourier transform $\hat{\psi}(\omega)$ satisfy the admissibility condition

$$\int_R \frac{|\hat{\psi}(\omega)|^2}{|\omega|}\, d\omega < \infty \tag{1}$$

Then $\psi(t)$ can be called a mother wavelet. By dilations and translations of the mother wavelet, the following family of wavelet functions is obtained:

$$\psi_{a,d}(t) = \frac{1}{\sqrt{|a|}}\, \psi\left(\frac{t-d}{a}\right), \quad a \neq 0,\ d \in R \tag{2}$$

where $a$ is the dilation factor and $d$ is the translation factor. Letting $a = 2^j$ and $d = k2^j$ gives the discrete wavelet transform (DWT):

$$\psi_{j,k}(t) = 2^{-j/2}\, \psi(2^{-j}t - k) \tag{3}$$

where $k$ is the shift parameter and $j$ is the resolution level. The larger the value of $j$, the lower the frequency. According to (3), the reconstruction of $f(t)$ can be written as

$$f(t) = \sum_k c_{j,k}\,\varphi_{j,k}(t) + \sum_j \sum_k d_{j,k}\,\psi_{j,k}(t) = a_j(t) + \sum_j d_j(t) \tag{4}$$

where $a_j$ and $d_j$ are the approximation and detail parts of the original signal, respectively.
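The split of a signal into one approximation part and several detail parts, and its exact reconstruction, can be illustrated with a minimal sketch. It substitutes the Haar wavelet for brevity (the experiments in this paper use db3), and the function names `haar_step`, `haar_decompose` and `haar_reconstruct` are hypothetical.

```python
import math

def haar_step(signal):
    """One DWT level with the Haar wavelet: (approximation, detail) coefficients."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_decompose(signal, levels):
    """Return [a_J, d_J, ..., d_1]; len(signal) must be divisible by 2**levels."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)          # finest level first
    return [approx] + details[::-1]     # coarsest detail right after a_J

def haar_reconstruct(coeffs):
    """Invert haar_decompose by merging approximation and detail level by level."""
    s = 1.0 / math.sqrt(2.0)
    approx = coeffs[0]
    for detail in coeffs[1:]:
        merged = []
        for a, d in zip(approx, detail):
            merged.extend([(a + d) * s, (a - d) * s])
        approx = merged
    return approx
```

With `levels=3`, an 8-sample signal yields one approximation coefficient and detail bands of length 1, 2 and 4, and `haar_reconstruct` recovers the input up to rounding error.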

B. Kernel PCA Non-linear Feature Extraction

Let the input data set be denoted by the matrix $X_{M \times m}$ (usually $M < m$). $X$ consists of centered data $\{x_i\}$ $(i = 1, \ldots, m)$, where $x_i \in R^m$ and $\sum_{i=1}^{m} x_i = 0$. PCA [12] transforms each vector $x_i$ into a new vector $S_i$ by

$$S_i = U' x_i \tag{5}$$

Traditional PCA is a linear algorithm and fails to extract non-linear structure from the input data set. The general idea of KPCA [12] is first to map the original input vector $x_i$ into a high dimensional feature space $F$ via a non-linear function $\Phi$, and then to perform linear PCA on $\Phi(x_i)$, whose dimension is assumed to be larger than $m$. The mapping is represented as follows:

$$\Phi: x \rightarrow \Phi(x) \in F \tag{6}$$

where $\Phi(x_i)$ is a sample in $F$ and $\sum_{i=1}^{m} \Phi(x_i) = 0$. The covariance matrix $C$ of the samples in $F$ is

$$C = \frac{1}{m} \sum_{i=1}^{m} \Phi(x_i)\,\Phi(x_i)' \tag{7}$$

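The corresponding eigenvalue problem is

$$\lambda_i V_i = C V_i, \quad i = 1, 2, \ldots, n \tag{8}$$

Each eigenvector $V_i$ can be represented as

$$V_i = \sum_{j=1}^{m} \alpha_i(j)\, \Phi(x_j) \tag{9}$$

where $\alpha_i(j)$ is the $j$-th component of the coefficient vector $\alpha_i$. Hence, we take $K_{i,j} = \langle \Phi(x_i) \cdot \Phi(x_j) \rangle$ as the kernel function, for all $i, j = 1, 2, \ldots, m$.

Equation (8) can then be transformed into the eigenvalue problem (10):

$$\lambda_i \alpha_i = K \alpha_i, \quad i = 1, 2, \ldots, m \tag{10}$$

Finally, based on the estimated $\alpha_i$, the $i$-th principal component of a sample $x$ is calculated by

$$S(i) = u_i^T \cdot \Phi(x) = \sum_{j=1}^{m} \alpha_i(j)\, K(x_j, x), \quad i = 1, 2, \ldots, m \tag{11}$$

where $u_i$ denotes the $i$-th eigenvector, written $V_i$ in (9), and $K(x_j, x) = \langle \Phi(x_j) \cdot \Phi(x) \rangle$.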

Real-world sample data do not always satisfy $\sum_{i=1}^{m} \Phi(x_i) = 0$, in which case the $K$ in (11) is replaced by

$$\tilde{K} = K - l_m \cdot K - K \cdot l_m + l_m \cdot K \cdot l_m$$

where $l_m$ is an $m \times m$ matrix whose entries all equal $1/m$.

In kernel methods, the inner product of feature vectors is replaced by a nonlinear kernel function, through which the feature vectors are mapped nonlinearly to a high dimensional space in an implicit manner. With appropriate kernel functions, kernel methods show better classification performance than linear machines and standard multivariate analysis.
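The KPCA procedure of this subsection can be sketched in a few lines of NumPy: build the kernel matrix, center it in feature space, solve the eigenvalue problem on the centered matrix, and project the training samples. This is an illustrative sketch rather than the authors' implementation; the function names, the RBF kernel choice and the normalisation of the coefficient vectors are assumptions.

```python
import numpy as np

def rbf_kernel_matrix(X, sigma2):
    """K[i, j] = exp(-||x_i - x_j||^2 / sigma2) for the rows x_i of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-np.maximum(d2, 0.0) / sigma2)

def kernel_pca(X, sigma2, n_components):
    """Rows of X are samples. Returns (scores, contribution_ratio)."""
    m = X.shape[0]
    K = rbf_kernel_matrix(X, sigma2)
    l_m = np.full((m, m), 1.0 / m)              # every entry equals 1/m
    K_c = K - l_m @ K - K @ l_m + l_m @ K @ l_m  # centering in feature space
    eigvals, eigvecs = np.linalg.eigh(K_c)       # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    positive_total = eigvals[eigvals > 0].sum()
    top_vals = eigvals[:n_components]            # assumed strictly positive
    # Assumption: scale each coefficient vector so that the corresponding
    # feature-space eigenvector has unit norm.
    alphas = eigvecs[:, :n_components] / np.sqrt(top_vals)
    scores = K_c @ alphas                        # projections of the training samples
    return scores, top_vals / positive_total
```

The second return value is each component's share of the total positive eigenvalue mass, which is one way to compute an accumulative contribution ratio via `np.cumsum`.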

C. Principle of LS-SVM for Regression

Support Vector Machines (SVMs) are among the most effective tools for tackling classification and regression problems with complex, nonlinear data distributions [4][5]. SVM has also recently been applied successfully to fuzzy methodology [13]. Training an SVM is a convex optimization problem; local minima are thus avoided, and the problem can be solved by polynomial-complexity Quadratic Programming (QP) methods.

Moreover, kernel-based representations allow SVMs to handle arbitrary distributions that may not be linearly separable in the data space. The ability to handle very large data sets is an important feature; in such cases, SVM training algorithms [14][15] typically adopt an iterative selection strategy: first, limited subsets of (supposedly) critical patterns are identified and undergo partial optimization; then the local solution thus obtained is projected onto the whole data set to verify consistency and global optimality. This process iterates until convergence.

In the following, LS-SVM regression is briefly introduced. Consider a training set of $N$ data points $\{x_i, y_i\}_{i=1}^{N}$ with input data $x_i \in R^n$ and output data $y_i \in R$. In feature space, LS-SVM models take the form

$$y(x) = w^T \varphi(x) + b \tag{12}$$


where the nonlinear mapping $\varphi(\cdot)$ maps the input data into a higher dimensional feature space. In LS-SVM for function estimation, the following optimization problem is formulated:

$$\min J(w, e) = \frac{1}{2} w^T w + \frac{1}{2} \gamma \sum_{i=1}^{N} e_i^2 \tag{13}$$

subject to the equality constraints

$$y_i = w^T \varphi(x_i) + b + e_i, \quad i = 1, 2, \ldots, N \tag{14}$$

The solution is obtained after constructing the Lagrangian with Lagrange multipliers $a_i$:

$$L(w, b, e, a) = \frac{1}{2} w^T w + \frac{1}{2} \gamma \sum_{i=1}^{N} e_i^2 - \sum_{i=1}^{N} a_i \left[ w^T \varphi(x_i) + b + e_i - y_i \right] \tag{15}$$

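After optimizing (15) and eliminating $e_i$ and $w$, the solution is given by the following set of linear equations:

$$\begin{bmatrix} 0 & I^T \\ I & \varphi(x_i)^T \varphi(x_j) + \gamma^{-1} I_N \end{bmatrix} \begin{bmatrix} b \\ A \end{bmatrix} = \begin{bmatrix} 0 \\ Y \end{bmatrix} \tag{16}$$

where $Y = [y_1, \ldots, y_N]^T$, $I = [1, \ldots, 1]^T$, $A = [a_1, \ldots, a_N]^T$, $I_N$ is the $N \times N$ identity matrix, and Mercer's condition (17) has been applied:

$$K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j), \quad i, j = 1, 2, \ldots, N \tag{17}$$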

This finally results in the following LS-SVM model for function regression:

$$y(x) = \sum_{i=1}^{N} a_i K(x, x_i) + b \tag{18}$$

where $a_i$ and $b$ are the solutions of the linear system, and the kernel $K(\cdot, \cdot)$ represents the inner product in the high dimensional feature space that is nonlinearly mapped from the input space $x$. The LS-SVM approximates the nonlinear function using (18). In this research, the radial basis function (RBF) is used as the kernel function, because RBF kernels tend to give good performance under general smoothness assumptions. Consequently, they are especially useful if no additional knowledge of the data is available.

$$K(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / \sigma^2\right) \tag{19}$$

where $\sigma$ is a positive real constant. Note that with RBF kernels there are only two additional tuning parameters: $\gamma$ in (13) and $\sigma$ in (19). LS-SVM is similar to a neural network in structure. Its output is a linear combination of intermediate nodes that correspond to the support vectors, and its training algorithm solves a system of linear equations rather than the nonlinear optimization used in neural networks, which easily gets trapped in local minima.
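Because LS-SVM training reduces to one linear system, it can be sketched compactly with NumPy. The sketch below assembles the bordered system (a zero corner, a border of ones, and the kernel matrix plus a ridge term of 1/gamma on the diagonal), solves it for the bias and the multipliers, and predicts as a kernel expansion over the training points. The function names are hypothetical, and this is not the LS-SVM software used in the paper.

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    """Pairwise RBF kernel exp(-||a - b||^2 / sigma2) between rows of A and B."""
    d2 = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * (A @ B.T))
    return np.exp(-np.maximum(d2, 0.0) / sigma2)

def lssvm_fit(X, y, gamma, sigma2):
    """Solve the LS-SVM linear system; returns (multipliers a, bias b)."""
    n = X.shape[0]
    system = np.zeros((n + 1, n + 1))
    system[0, 1:] = 1.0                      # first row: sum of multipliers is 0
    system[1:, 0] = 1.0                      # bias column
    system[1:, 1:] = rbf_kernel(X, X, sigma2) + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    solution = np.linalg.solve(system, rhs)
    return solution[1:], solution[0]

def lssvm_predict(X_train, a, b, X_new, sigma2):
    """Kernel expansion over the training points plus the bias."""
    return rbf_kernel(X_new, X_train, sigma2) @ a + b
```

A larger `gamma` penalises training errors more heavily, so the fit follows the training targets more closely; `sigma2` sets the width of the kernel.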

III. PREDICTION MODEL

A. Data Preprocessing

The objective of our model, which is also the aim of the competition, is to forecast the power load for 3 days. Before the research began, the following data were available: (1) electricity load demand recorded every half hour, from 2005 to 2006; (2) average daily temperature, from 2003 to 2006; (3) dates of holidays, from 2005 to 2007.

These years contain some public holidays. Holidays can produce erroneous results, so in the preprocessing stage the load on each holiday is replaced by the average of the load values of the previous day and the next day:

$$P(h) = \frac{P(h-1) + P(h+1)}{2} \tag{20}$$

where $h$ denotes a holiday. Weekends are not counted as holidays.
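The holiday averaging step can be written as a short helper. This is a minimal sketch that assumes one load value per day and leaves a holiday on the first or last day unchanged, since it has only one neighbour; the name `smooth_holidays` is hypothetical.

```python
def smooth_holidays(daily_load, holiday_indices):
    """Replace each holiday's load by the mean of the previous and next day."""
    smoothed = list(daily_load)
    for h in holiday_indices:
        # Assumption: boundary days are skipped because a neighbour is missing.
        if 0 < h < len(daily_load) - 1:
            smoothed[h] = (daily_load[h - 1] + daily_load[h + 1]) / 2.0
    return smoothed
```

The neighbours are read from the original series, so two adjacent holidays do not influence each other's replacement value.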

B. Feature Selection by DWT-KPCA

In order to build an accurate and fast predictor, a lower dimensional representation must be obtained from the raw time series data. Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA) and Independent Component Analysis (ICA) have been used for this purpose.

Traditional PCA [12] is a linear feature extraction algorithm; it can only extract linear features, whereas real-world data are not all linear, so its scope of application is necessarily somewhat limited. Kernel PCA, a non-linear extension of linear PCA, maps the input data space to a high dimensional feature space and computes the eigenvalues and eigenvectors of the kernel matrix. The projection of an input onto an eigenvector is computed as a linear combination of kernel functions, which reduces the computation.

The db3 wavelet is selected as the wavelet function, with a decomposition level of 3. The decomposed series is used by the LS-SVM predictor to predict the power demand of the next 3 days.

The decomposition level is hard to choose. If the level is too low, the prediction error increases; as the level increases, the prediction improves greatly, but beyond level three there is only a tiny improvement. Therefore db3 is chosen as the mother wavelet and the decomposition level is three. The original series and the decomposed series are shown in Fig. 1.

C. Fast Training Methods for SVM

It is well known, however, that SVM suffers from excessive computation and long training times on large-scale time series samples. The prevailing methods for speeding up SVM are subset selection techniques and preprocessing methods. Generally speaking, subset selection speeds up SVM training by dividing the original QP (quadratic programming) problem into small pieces, thus reducing the size of each QP problem. Three efficient subset selection techniques are decomposition methods [16], chunking methods [17] and SMO (sequential minimal optimization) [18][19]. Although these methods do accelerate training, it remains time-consuming because all of the training data must still be learned.

Figure 1. The original series and the decomposed series.

D. Feature Forecasting Model Based on LS-SVM

The parameters of SVM have a great effect on its generalization performance; an appropriate parameter combination yields high generalization performance. In the past decade, the genetic algorithm (GA) has been considered an excellent technique for solving combinatorial optimization problems [20].

In 1995, Particle Swarm Optimization (PSO) [21] was proposed by Kennedy and Eberhart as an alternative to GA, and it has gained a lot of attention in various optimal control system applications. It is a stochastic optimization method based on swarm intelligence. The fundamental idea is that the optimal solution can be found through cooperation and information sharing among individuals in the swarm.

The input vectors and output vectors for the LS-SVM predictor are then obtained. In this paper, RBF is chosen as the kernel function. We use the LS-SVM software, which includes an implementation for solving (11).

For the algorithm, the kernel function is the RBF function, and PSO selects $\sigma$ = 0.005 and $\gamma$ = 1000.
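A basic PSO loop of the kind described above can be sketched as follows. Each particle keeps a velocity, its personal best position, and access to the swarm's best position; the two bests are the shared information that steers the search. The function name, the default coefficients (`w`, `c1`, `c2`), the swarm size and the bound clamping are illustrative assumptions, not the settings used in this paper. In practice `objective` would be the validation error of an LS-SVM as a function of its two parameters.

```python
import random

def pso_minimize(objective, bounds, n_particles=20, n_iters=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimise objective(position) inside box bounds [(low, high), ...]."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for k in range(dim):
                r1, r2 = rng.random(), rng.random()
                # inertia + pull toward personal best + pull toward swarm best
                vel[i][k] = (w * vel[i][k]
                             + c1 * r1 * (pbest[i][k] - pos[i][k])
                             + c2 * r2 * (gbest[k] - pos[i][k]))
                lo, hi = bounds[k]
                pos[i][k] = min(max(pos[i][k] + vel[i][k], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

The fixed `seed` makes runs repeatable, which helps when comparing parameter settings.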

IV. EXPERIMENTAL RESULTS

Load forecasting, especially short-term forecasting of electricity demand, has always been a very important issue in the economic and reliable planning and operation of power systems. Load forecasting over several days is categorized as short-term load prediction and plays an important role in the day-to-day activities of every power producing company [22]. Short-term electricity demand, which is in essence a chaotic time series, is one of the key factors for a power system. Strengthening the forecasting and control system brings high social and economic benefits.

The aim of the task is to use known values of the time series up to the point $x = t$ to predict the value at some point in the future, $x = t + \tau$. The forecasting method creates a mapping from $d$ points of the time series spaced $\tau$ apart, that is, $(x(t - (d-1)\tau), \ldots, x(t - \tau), x(t))$, to a forecast future value $x(t + \tau)$.
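The delay-coordinate mapping described above can be built from a series with a short helper. This is a minimal sketch; the name `delay_embed` is hypothetical, and the series is assumed to be a plain Python sequence indexed by time step.

```python
def delay_embed(series, d, tau):
    """Build (inputs, targets): d past points spaced tau apart -> value tau ahead."""
    inputs, targets = [], []
    first_t = (d - 1) * tau                  # earliest t with a full history
    for t in range(first_t, len(series) - tau):
        # oldest point first: x(t-(d-1)tau), ..., x(t-tau), x(t)
        inputs.append([series[t - k * tau] for k in range(d - 1, -1, -1)])
        targets.append(series[t + tau])
    return inputs, targets
```

For example, with `series = list(range(10))`, `d = 3` and `tau = 2`, the first pair is `[0, 2, 4]` with target `6`.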

Fig. 2 shows how the model is built.

Figure 2. The architecture of the model for time series forecasting. [Diagram: $x_k$ → KPCA → $S_k$ → similarity analysis → $H$ → LS-SVM → $Y$; label: phase space reconstruction]


A. Similarity Analysis to Reduce the Dimension of the Embedding Phase Space

The kernel principal components $S_k$ in feature space can be computed as in Section II.B. For convenience they are denoted in this section by $H^1, H^2, \ldots, H^l$, with $H^i = (H_1^i, H_2^i, \ldots, H_l^i)^T$ for $i = 1, 2, \ldots, l$, where $H_j^i$ is the $i$-th principal component of the $j$-th sample. The first $q$ principal components are chosen such that their accumulative contribution ratio is large enough, and they form the reconstructed phase space. The training sample pairs for chaotic time series modeling can then be formed as in (21):

$$\tilde{X} = \begin{bmatrix} H_1^1 & H_1^2 & \cdots & H_1^q \\ H_2^1 & H_2^2 & \cdots & H_2^q \\ \vdots & \vdots & & \vdots \\ H_l^1 & H_l^2 & \cdots & H_l^q \end{bmatrix}, \quad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_l \end{bmatrix} \tag{21}$$

Modeling a time series on the basis of KPCA phase space reconstruction means finding the hidden function $\tilde{f}$ between the input $\tilde{X}$ and the output $Y$ such that $y_i = \tilde{f}(x_i)$.

The KPCA-based phase space reconstruction above chooses the first $q$ principal components successively according to their accumulative contribution ratio alone (the ratio must be large enough that they represent most of the information in the original variables), without considering the similarity between each chosen principal component $H^i$ $(1 \le i \le q)$ and the output variable $Y$. This article analyses the similarity between principal components and the output variable on the basis of KPCA. Set a threshold $\theta$ and compute the similarity coefficient between each principal component $H^i$ $(1 \le i \le q)$ and the output $Y$:

$$\rho_i = \frac{Cov(H^i, Y)}{\sqrt{Cov(H^i, H^i) \cdot Cov(Y, Y)}} \tag{22}$$

where $Cov(H^i, Y)$ is the covariance of the vectors $H^i$ and $Y$. The principal components $H^i$ $(1 \le i \le q)$ whose similarity coefficient satisfies $\rho_i \ge \theta$ are chosen to form the reconstructed phase space $H$.
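The threshold-based selection can be sketched in plain Python. The sketch assumes the similarity coefficient is the covariance of a component with the output, normalised by the square root of the product of their variances (the usual correlation coefficient); the function names are hypothetical.

```python
import math

def covariance(u, v):
    """Sample covariance of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / (n - 1)

def similarity(component, y):
    """Normalised covariance between one principal component and the output."""
    return covariance(component, y) / math.sqrt(
        covariance(component, component) * covariance(y, y))

def select_components(components, y, theta):
    """Indices of the components whose similarity to y is at least theta."""
    return [i for i, h in enumerate(components) if similarity(h, y) >= theta]
```

Each entry of `components` holds one principal component's values across all samples. A component that is uncorrelated with the output gets a coefficient near zero and is dropped for any positive `theta`.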

This completes the process of predicting a chaotic time series. The detailed steps of the algorithm are as follows:

Step 1. For a chaotic time series $x_k$, KPCA is applied to determine the embedding dimension from the accumulative contribution ratio. The principal components $S_k$ are obtained, whose dimension is lower than that of $x_k$.

Step 2. $S_k$ is used as the input of the similarity analysis, and an appropriate threshold $\theta$ is selected according to the results. This determines the dimension of the final embedding phase space.

Step 3. In the reconstructed phase space, the LS-SVM model is built, trained and validated on the partitioned data set to determine the parameters $\sigma^2$ and $\gamma$ of the LS-SVM with RBF kernel function. The LS-SVM that produces the smallest error on the validation data set is chosen for chaotic time series forecasting.

B. Experimental Results

In many practical pattern classification tasks, researchers are confronted with a very high dimensional input space and want to find the combination of the original input features that contributes most to the classification. Unlike, e.g., Gaussian Processes in regression, Support Vector Machines (SVMs) do not offer automated internal relevance detection, and hence algorithms for feature selection play an important role.

For any classification task whose performance may deteriorate if the embedded hyper-parameters are not well chosen, picking the best parameter values is a non-trivial problem. The kernel parameters and the tradeoff parameter of SVM are hyper-parameters whose optimal values must be found so as to minimize the expected test error, which requires adapting multiple parameter values at the same time.

The SVM algorithm usually depends on several parameters. One of them, denoted $\gamma$, controls the tradeoff between margin maximization and error minimization. The other parameters appear in the kernel function and are called kernel parameters. The influence of these parameters on the performance of the SVM is assessed through the number of support vectors and the margin. The fewer the support vectors, the faster the test data can be evaluated; the larger the margin, the better the generalization of the SVM and the smaller the verification error on the test data.

In order to verify this algorithm, the prediction errors, correct prediction steps and time consumed by the PSO-SVM predictor were compared with those of three neural network predictors: a linear neural network predictor, an RBF neural network predictor and a BP neural network predictor. The training data sets of the three neural network predictors are the same as that of the PSO-SVM predictor. Table I shows the detailed comparison of the PSO-SVM predictor with the neural network predictors. The prediction results of the PSO-SVM are shown in Fig. 3. Table II shows the prediction performance and the computing time for training and testing data for SVM using all features and using extracted features. From the table we can see that the training time and testing time decrease when extracted features are used, and that extracted features give better accuracy than all features without degrading real-time performance.


V. APPLICATION AND ANALYSIS

As a time series in essence, the short-term power load is very significant for the reliable and economic running of the electric network. With accurate load prediction, short-term load forecasting makes it possible to plan the starting and stopping of generators and to keep the electric network running safely and stably. With the development of the electricity market, more and more attention has been paid to load forecasting.

To illustrate the effectiveness of KPCA-SVM combined with PSO for short-term power load forecasting, the method is used as a power load predictor, and its prediction errors, correct prediction steps and time consumed are compared with those of current predictors: a linear neural network predictor, an RBF neural network predictor, a BP neural network predictor and an SVM predictor. The training data sets of the three neural network predictors are the same as that of the KPCA-SVM predictor.

The db3 wavelet is selected as the wavelet function, with a decomposition level of 3 for the original series. The decomposed series is used by the KPCA-SVM predictor to predict the power demand of the next 3 days. The decomposition level is hard to choose. If the level is too low, the prediction error increases; as the level increases, the prediction improves greatly, but beyond level three there is only a tiny improvement. Therefore db3 is chosen as the mother wavelet and the decomposition level is three. The original series and the decomposed series are shown in Fig. 1.

Figure 3. The forecast results of the KPCA-SVM with PSO prediction model in 3 days.

TABLE I. COMPARISON RESULTS OF KPCA-SVM PREDICTION WITH NEURAL NETWORK PREDICTION

Predictor    Average relative error (%)    Correct prediction steps    Time consumed (s)
KPCA-SVM     0.0197                        95                          21.3
Linear NN    0.0226                        72                          82.1
RBF NN       0.0215                        39                          153.2
BP NN        0.1349                        35                          324.8

TABLE II. COMPARISON OF EXPERIMENTAL RESULTS BY SVM AND KPCA-SVM

Method       Accuracy (%)    Train (s)    Test (s)
SVM          93.7            0.89         2.023
KPCA-SVM     94.62           0.84         1.912


The results of power system load forecasting are influenced by factors such as day type, week type, daily weather type and daily temperature. In this paper, these factors are included in one feature vector together with the day's load value. 1200 samples are used to reconstruct the phase space using KPCA. Through several trials, $\sigma^2$ = 75 is selected, and the number of principal components is 28 at an accumulative contribution ratio of 0.95. The embedding dimension of the phase space is then reduced to 15 through similarity analysis with $\theta$ = 0.90. The embedding phase space is reconstructed with the parameters $d$ = 45 and $\tau$ = 4 in the experiment. From the power load time series $x(t)$, we extracted 1200 input-output data pairs. The first 500 pairs are used as the training data set, the next 200 pairs as the validation data set for finding the optimal parameters of LS-SVM, and the remaining 500 pairs as the testing data set for testing the predictive power of the model. The prediction performance is evaluated using the RMSE and the NMSE:

$$NMSE = \frac{1}{\delta^2 n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{25}$$

$$\delta^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2 \tag{26}$$

where $n$ is the total number of data points in the test set, and $y_i$, $\hat{y}_i$ and $\bar{y}$ are the actual value, the predicted value and the mean of the actual values, respectively.
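The two error measures can be computed with a few lines of plain Python. The NMSE below divides the mean squared error by the sample variance of the actual values (with an n - 1 denominator); the RMSE is not written out in the text, so its standard definition (square root of the mean squared error) is assumed here. The function names are hypothetical.

```python
import math

def nmse(actual, predicted):
    """Mean squared error normalised by the sample variance of the actual values."""
    n = len(actual)
    mean = sum(actual) / n
    delta2 = sum((y - mean) ** 2 for y in actual) / (n - 1)
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / (delta2 * n)

def rmse(actual, predicted):
    """Assumed standard definition: square root of the mean squared error."""
    n = len(actual)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n)
```

A predictor that always outputs the mean of the actual values scores an NMSE of (n - 1)/n, so values well below 1 indicate that the model explains most of the variance.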

The parameters of PSO are given by $a = 0.7$, $w = 0.5$, $c_1 = c_2 = 2$, $x_i \in [-100, 100]$ and $v_{max} = 30$; the momentum operator is 0.48.

When applying LS-SVM to modeling, the first thing to consider is which kernel function to use. As the dynamics of chaotic time series are strongly nonlinear, it is intuitively expected that nonlinear kernel functions achieve better performance than the linear kernel. In this investigation the RBF kernel is used, since RBF kernel functions tend to give good performance under general smoothness assumptions. The second thing to consider is which values of the parameters $(\sigma^2, \gamma)$ to use. As there is no structured way to choose the optimal parameters of LS-SVM, the values that produce the best result on the validation set are used. Several trials show that $\sigma^2$ and $\gamma$ play an important role in the generalization performance of LS-SVM, and $\sigma^2$ and $\gamma$ are fixed at 0.15 and 25, respectively, for the following experiments.

VI. CONCLUSIONS

In this paper, LS-SVM with KPCA and PSO is used as a novel method for short-term power load forecasting. KPCA is adopted to extract features of the chaotic time series, reflecting its nonlinear characteristics. The embedding dimension of the phase space of the time series is reduced according to the degree of similarity between the principal components and the model output. The PSO algorithm is then employed to optimize the parameters of LS-SVM. The application results show that PSO-LS-SVM with KPCA is effective for short-term power load forecasting. This method has higher prediction accuracy and more correct prediction steps, and consumes less time, than the ANN and PSO-SVM methods. The comparison with ANN and PSO-SVM indicates that this method not only improves the accuracy of short-term load forecasting and the practicability of the system, but can also be implemented in software. LS-SVM with KPCA and PSO is therefore a reliable method for short-term power demand forecasting.

ACKNOWLEDGMENT

This project is supported by the Education Foundation of Guizhou Province (No. QJK[2007]010). The authors are grateful to the anonymous reviewers, who made constructive comments.

REFERENCES

[1] R.A. Jacobs, M.I. Jordan, S.J. Nowlan, G.E. Hinton, “Adaptive Mixtures of Local Experts”, Neural Computation, 3, pp. 79~87, 1991.

[2] M.I. Jordan, R.A. Jacobs, “Hierarchical Mixtures of Experts and the EM Algorithm”, Neural Computation, 6, pp. 181~214, 1994.

[3] R.L. Milidiu, R.J. Machado, R.P. Rentera, “Time-Series Forecasting through Wavelets Transformation and a Mixture of Expert Models”, Neurocomputing, 28, pp. 145~146, 1999.

[4] Vapnik V, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1999.

[5] Vapnik V. “An Overview of Statistical Learning Theory”. IEEE Trans. on Neural Networks, Vol. 10(5), pp. 988~999, 1999.

[6] Zhang Xuegong. “Introduction to Statistical Learning Theory and Support Vector Machines”. Acta Automatica Sinica, Vol. 26(1), pp. 32~41, 2000.

[7] M. S. Jalaleddine, C. G. Hutchens, R. D. Strattan, and W. A. Coberly, “ECG data compression techniques—A unified approach,” IEEE Trans. Biomed. Eng., vol. 37, no. 4, pp. 329~343, 1990.

[8] J. Yloestalo, “Data compression methods for EEG,” Technol. Health Care, 1999, vol. 7, pp. 285~300.

[9] M. Hilton, “Wavelet and wavelet packet compression of electrocardiograms,” IEEE Trans. Biomed. Eng., vol. 44, no. 5, pp. 394~402, 1997.

[10] P. Wellig, Z. Chang, M. Semling, and G. S. Moschytz, “Electromyogram data compression using single-tree and modified zero-tree wavelet encoding,” in Proc. IEEE-EMBS, pp. 1303~1306, 1998.

[11] S. Mallat, “A theory for multiresolution signal decomposition: The wavelet representation,” IEEE Trans. Pattern Anal. Machine Intell. vol. 11, no. 7, pp. 674~693, 1989.

[12] Cao L J, Chua K S, Chong W K, “A comparison of PCA, KPCA, and ICA for dimensionality reduction in support vector machine”, Neurocomputing 55 (1-2), pp. 321~336, 2003.


[13] Chen, Y., Wang, J.Z. “Support vector learning for fuzzy rule-based classification systems”, IEEE Trans. Fuzzy Systems 11, pp. 716~728, 2003.

[14] Joachims, T. Making large-Scale SVM Learning Practical. In: Scholkopf, B., Burges, C.J.C., Smola, A.J. (eds.) Advances in Kernel Methods - Support Vector Learning, MIT Press, Cambridge, 1999.

[15] Keerthi, S., Shevade, S., Bhattacharyya, C., Murthy, K.: “Improvements to Platt’s SMO algorithm for SVM classifier design”, Neural Computation 13, pp. 637~649, 2001.

[16] Lin, Chih-Jen. “On the Convergence of the Decomposition Method for Support Vector Machines”, IEEE Transactions on Neural Networks, Vol. 12, pp. 1288~1298, 2001.

[17] Rychetsky, M., Ortmann, S., Ullmann, M., Glesner, M.: “Accelerated Training of Support Vector Machines”, IJCNN '99, International Joint Conference on Neural Networks, Vol. 2, pp. 998~1003, 1999.

[18] Platt, J. Fast Training of SVMs Using Sequential Minimal Optimization. Advances in Kernel Methods - Support Vector Learning, MIT press, Cambridge, MA, pp. 185~208, 1999.

[19] Takahashi, N., Nishi, T. “Rigorous Proof of Termination of SMO Algorithm for Support Vector Machines”, IEEE Transactions on Neural Networks, Vol. 16, pp. 774~776, 2005.

[20] G.N. Xuan, R.W. Cheng, Genetic algorithms and engineering optimization, Beijing: Tsinghua University Press, 2004. (in Chinese)

[21] J. Kennedy and R. C. Eberhart, "Particle swarm optimization", Proceedings of the 1995 IEEE International Conference on Neural Networks, vol. 4, pp. 1942~1948, 1995.

[22] Qi-Song Chen, Xin Zhang, Shi-Huan Xiong, and Xiao-Wei Chen, "Short-term power load forecasting with least squares support vector machines and wavelet transform", 2008 International Conference on Machine Learning and Cybernetics, vol. 3, pp. 1425~1429, 2008.

Qisong Chen was born in Guiyang, China, in 1974. He received his B.S. degree in Automation in 1996 from Guizhou University of Technology and his M.S. degree in Computer Application in 2002 from Zhejiang University. Currently, he is a doctoral candidate at Guizhou University. His research interests include blind signal processing and pattern recognition.

Xiaowei Chen was born in Guiyang, China, in 1945. Currently, he is a professor at Guizhou University. His research interests include computer graphics and pattern recognition.

Yun Wu was born in Jindezheng, China, in 1973. He received his M.S. degree in Computer Application in 2006 from Guizhou University. Currently, he is a doctoral candidate at Guizhou University. His research interests include signal processing and audio and video processing.

JOURNAL OF COMPUTERS, VOL. 5, NO. 3, MARCH 2010 387

© 2010 ACADEMY PUBLISHER