Smooth Support Vector Machines for Classification and Regression
Yuh-Jye Lee
National Taiwan University of Science and Technology
International Summer Workshop on the Economics, Financial and Managerial Applications of Computational Intelligence
August 16~20, 2004
Fundamental Problems in Data Mining
Supervised learning:
Classification problems
Regression problems
Feature selection
Too many features could degrade generalization performance (curse of dimensionality)
Unsupervised learning:
Clustering algorithms
Association rules
Binary Classification Problem
(A Fundamental Problem in Data Mining)
Find a decision function (rule) to discriminate between two categories of data.
Supervised learning in Machine Learning
Discriminant Analysis in Statistics
Methods: Fisher Linear Discriminant, Decision Tree, Neural Network, k-NN, Support Vector Machines, etc.
Successful applications: Marketing, Bioinformatics, Fraud detection
Bankruptcy Prediction
Binary classification of firms: solvent vs. bankrupt
The data are financial indicators from middle-market capitalization firms in Benelux. Of a total of 422 firms, 74 went bankrupt and 348 were solvent. The variables used in the model as explanatory inputs are 40 financial indicators such as liquidity, profitability and solvency measurements.
T. Van Gestel, B. Baesens, J. A. K. Suykens, M. Espinoza, D. Baestaens, J. Vanthienen and B. De Moor, "Bankruptcy Prediction with Least Squares Support Vector Machine Classifiers", International Conference on Computational Intelligence for Financial Engineering, 2003.
Binary Classification Problem
Linearly Separable Case
[Figure: the two classes A+ (solvent) and A− (bankrupt) separated by the bounding planes x'w + b = +1 and x'w + b = −1, with the separating plane x'w + b = 0 between them.]
Why Use Support Vector Machines?
Powerful tools for Data Mining
The SVM classifier is an optimally defined surface
SVMs have a good geometric interpretation
SVMs can be generated very efficiently
Can be extended from the linear to the nonlinear case:
Typically nonlinear in the input space
Linear in a higher dimensional "feature" space
Implicitly defined by a kernel function
Have a sound theoretical foundation
Based on Statistical Learning Theory
Support Vector Machines
Maximizing the Margin between Bounding Planes
[Figure: classes A+ and A− separated by the bounding planes x'w + b = +1 and x'w + b = −1; w is the normal to the planes and the margin between them is 2 / ||w||_2.]
Algebra of the Classification Problem
Linearly Separable Case
Given l points in the n-dimensional real space R^n, represented by an l × n matrix A.
Membership of each point A_i in the class A− or A+ is specified by an l × l diagonal matrix D: D_ii = −1 if A_i ∈ A− and D_ii = +1 if A_i ∈ A+.
Separate A− and A+ by two bounding planes such that:
A_i w + b ≥ +1, for D_ii = +1
A_i w + b ≤ −1, for D_ii = −1
More succinctly: D(Aw + eb) ≥ e, where e = [1, 1, ..., 1]' ∈ R^l.
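To make the matrix notation concrete, here is a tiny NumPy sketch; the four points, their labels, and the candidate plane (w, b) are made up for illustration.

```python
import numpy as np

# four illustrative points in R^2, two per class (made-up data)
A = np.array([[ 2.0,  2.0],
              [ 3.0,  1.0],      # rows belonging to A+
              [-1.0, -2.0],
              [-2.0,  0.0]])     # rows belonging to A-
d = np.array([1.0, 1.0, -1.0, -1.0])   # diagonal of D: +1 for A+, -1 for A-
D = np.diag(d)
e = np.ones(4)

# a candidate separating plane x'w + b = 0
w, b = np.array([1.0, 1.0]), 0.0

print(D @ (A @ w + e * b) >= e)   # does D(Aw + eb) >= e hold for every point?
```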
Support Vector Classification
(Linearly Separable Case)
Let S = {(x^1, y_1), (x^2, y_2), ..., (x^l, y_l)} be a linearly separable training sample, represented by the matrices
A = [(x^1)'; (x^2)'; ...; (x^l)'] ∈ R^{l×n},  D = diag(y_1, ..., y_l) ∈ R^{l×l}
Robust Linear Programming
Preliminary Approach to SVM
(LP)  min_{w,b,ξ}  e'ξ
      s.t.  D(Aw + eb) + ξ ≥ e,  ξ ≥ 0
where ξ is a nonnegative slack (error) vector.
The term e'ξ, the 1-norm measure of the error vector, is called the training error.
For the linearly separable case, at the solution of (LP): ξ = 0.
[Figure: two point clouds (x's and o's) separated by a plane; points on the wrong side of their bounding plane carry positive slacks ξ_i, ξ_j.]
Two Different Measures of Training Error
2-Norm Soft Margin:
min_{(w,b,ξ) ∈ R^{n+1+l}}  (1/2)||w||_2^2 + (C/2)||ξ||_2^2
s.t.  D(Aw + eb) + ξ ≥ e
1-Norm Soft Margin:
min_{(w,b,ξ) ∈ R^{n+1+l}}  (1/2)||w||_2^2 + C e'ξ
s.t.  D(Aw + eb) + ξ ≥ e,  ξ ≥ 0
The margin is maximized by minimizing the reciprocal of the margin, ||w||_2 / 2.
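As a concrete illustration of the 1-norm soft margin problem above, here is a minimal sketch using cvxpy; the toy data, the value of C, and the use of cvxpy itself are illustrative assumptions, not part of the original slides.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
# made-up, slightly overlapping two-class data
A = np.vstack([rng.normal( 1.0, 1.0, (20, 2)),
               rng.normal(-1.0, 1.0, (20, 2))])
d = np.concatenate([np.ones(20), -np.ones(20)])   # class labels = diagonal of D
C = 1.0

w  = cp.Variable(2)
b  = cp.Variable()
xi = cp.Variable(40, nonneg=True)                 # slack vector, xi >= 0

# min (1/2)||w||^2 + C e'xi   s.t.  D(Aw + eb) + xi >= e
constraints = [cp.multiply(d, A @ w + b) + xi >= 1]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)),
                     constraints)
problem.solve()
print(w.value, b.value)
```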
1-Norm Soft Margin Dual Formulation
The Lagrangian for the 1-norm soft margin:
L(w, b, ξ, α, r) = (1/2)w'w + C e'ξ + α'[e − D(Aw + eb) − ξ] − r'ξ
where α ≥ 0 and r ≥ 0.
Setting the partial derivatives with respect to the primal variables to zero:
∂L/∂w = w − A'Dα = 0
∂L/∂b = e'Dα = 0
∂L/∂ξ = Ce − α − r = 0
Dual Maximization Problem for 1-Norm Soft Margin
Dual:
max_{α ∈ R^l}  e'α − (1/2)α'DAA'Dα
s.t.  e'Dα = 0,  0 ≤ α ≤ Ce
The corresponding KKT complementarity conditions:
0 ≤ α ⊥ D(Aw + eb) + ξ − e ≥ 0
0 ≤ ξ ⊥ α − Ce ≤ 0
Slack Variables for the 1-Norm Soft Margin
At the solution, w* = A'Dα*.
Nonzero slack can only occur when α*_i = C.
The points for which 0 < α*_i < C lie on the bounding planes; this will help us find b*.
The contribution of an outlier to the decision rule is at most C.
The trade-off between accuracy and regularization is directly controlled by C.
Tuning Procedure
[Figure: tuning procedure illustration (overfitting).]
The final value of the parameter is the one with the maximum testing set correctness!
SVM as an Unconstrained Minimization Problem
(QP)  min_{w,b,ξ}  (C/2)||ξ||_2^2 + (1/2)(||w||_2^2 + b^2)
      s.t.  D(Aw + eb) + ξ ≥ e,  ξ ≥ 0
At the solution of (QP): ξ = (e − D(Aw + eb))_+, where (·)_+ = max{· , 0}.
Hence (QP) is equivalent to the nonsmooth SVM:
min_{w,b}  (C/2)||(e − D(Aw + eb))_+||_2^2 + (1/2)(||w||_2^2 + b^2)
This changes (QP) into an unconstrained mathematical program and reduces the n+1+m variables to n+1 variables.
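A minimal NumPy sketch of this unconstrained, nonsmooth objective; the function name and argument layout are my own, the formula is the one above.

```python
import numpy as np

def nonsmooth_svm_objective(w, b, A, d, C):
    """(C/2) ||(e - D(Aw + eb))_+||_2^2 + (1/2)(||w||_2^2 + b^2), with d = diag(D)."""
    slack = np.maximum(1.0 - d * (A @ w + b), 0.0)   # xi = (e - D(Aw + eb))_+
    return 0.5 * C * np.sum(slack**2) + 0.5 * (w @ w + b**2)
```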
Smooth the Plus Function: Integrate
Step function x_*  →  Sigmoid function 1 / (1 + e^{−βx})
Plus function x_+  →  p-function p(x, β)
Integrating the sigmoid function gives the smooth p-function:
p(x, β) := x + (1/β) log(1 + e^{−βx})
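A short sketch of the p-function and how close it stays to the plus function; the grid of test points and the choice β = 5 are only for illustration.

```python
import numpy as np

def p(x, beta=5.0):
    # smooth plus function: p(x, beta) = x + (1/beta) * log(1 + exp(-beta * x))
    return x + np.logaddexp(0.0, -beta * x) / beta

x = np.linspace(-2.0, 2.0, 9)
print(np.maximum(x, 0.0))    # the plus function (x)_+
print(p(x, beta=5.0))        # smooth approximation; the gap is at most log(2)/beta, at x = 0
```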
SSVM: Smooth Support Vector Machine
Replacing the plus function (·)_+ in the nonsmooth SVM by the smooth p(· , β) gives our SSVM:
min_{(w,b) ∈ R^{n+1}}  (C/2)||p(e − D(Aw + eb), β)||_2^2 + (1/2)(||w||_2^2 + b^2)
Here p(· , β) is an accurate smooth approximation of (·)_+, obtained by integrating the sigmoid function of neural networks (the sigmoid is a smoothed step function).
The solution of SSVM converges to the solution of the nonsmooth SVM as β goes to infinity. (Typically, β = 5.)
Newton-Armijo Method: Quadratic Approximation of SSVM
The sequence {(w^i, b^i)}, generated by solving a quadratic approximation of SSVM, converges to the unique solution (w*, b*) of SSVM at a quadratic rate.
At each iteration we solve a linear system of n+1 equations in n+1 variables.
Complexity depends on the dimension of the input space.
Converges in 6 to 8 iterations.
A stepsize might need to be selected.
Newton-Armijo Algorithm
Start with any (w^0, b^0) ∈ R^{n+1}. Having (w^i, b^i), stop if ∇Φ_β(w^i, b^i) = 0, else:
(i) Newton direction: solve ∇²Φ_β(w^i, b^i) d^i = −∇Φ_β(w^i, b^i)
(ii) Armijo stepsize: (w^{i+1}, b^{i+1}) = (w^i, b^i) + λ_i d^i, with λ_i ∈ {1, 1/2, 1/4, ...} chosen so that the objective decreases sufficiently.
The iterates converge globally and quadratically to the unique solution in a finite number of steps.
Why use a stepsize?
Without it, the pure Newton iteration may not converge to the optimal solution.
Example: f(x) = −(1/6)x^6 + (1/4)x^4 + 2x^2, with quadratic approximation
g(x) = f(x^i) + f'(x^i)(x − x^i) + (1/2)f''(x^i)(x − x^i)^2
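For concreteness, here is one way the Newton-Armijo iteration for the SSVM objective could be sketched in NumPy. The gradient and Hessian formulas follow from the smooth objective above; the stopping tolerance, the Armijo constant 1e-4, and the dense linear solve are illustrative choices, not the authors' implementation.

```python
import numpy as np

def p(x, beta):
    # smooth plus function p(x, beta) = x + (1/beta) * log(1 + exp(-beta * x))
    return x + np.logaddexp(0.0, -beta * x) / beta

def ssvm_newton_armijo(A, d, C=1.0, beta=5.0, tol=1e-6, max_iter=50):
    """Minimize (C/2)||p(e - D(Aw + eb), beta)||^2 + (1/2)(||w||^2 + b^2); d = diag(D)."""
    m, n = A.shape
    w, b = np.zeros(n), 0.0

    def objective(w_, b_):
        r = 1.0 - d * (A @ w_ + b_)                      # e - D(Aw + eb)
        return 0.5 * C * np.sum(p(r, beta)**2) + 0.5 * (w_ @ w_ + b_**2)

    for _ in range(max_iter):
        r  = 1.0 - d * (A @ w + b)
        pr = p(r, beta)                                  # smoothed slacks
        s  = 0.5 * (1.0 + np.tanh(0.5 * beta * r))       # p'(r) = sigmoid(beta * r)
        g  = -C * d * (pr * s)                           # dPhi/d(A_i w + b), per point
        grad = np.concatenate([A.T @ g + w, [g.sum() + b]])
        if np.linalg.norm(grad) < tol:
            break
        # Hessian = B' diag(h) B + I, with B_i = d_i [A_i, 1]
        h = C * (s**2 + beta * pr * s * (1.0 - s))
        B = np.hstack([A, np.ones((m, 1))]) * d[:, None]
        H = B.T @ (B * h[:, None]) + np.eye(n + 1)
        step = np.linalg.solve(H, -grad)                 # (i) Newton direction
        # (ii) Armijo stepsize: halve until sufficient decrease
        f0, slope, lam = objective(w, b), grad @ step, 1.0
        while objective(w + lam * step[:n], b + lam * step[n]) > f0 + 1e-4 * lam * slope:
            lam *= 0.5
        w, b = w + lam * step[:n], b + lam * step[n]
    return w, b
```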
Comparisons of SSVM with other SVMs
89.17128.15
86.1042.41
89.633.69
Ionosphere 351 x 34
61.834.91
66.233.72
68.181.03
WPBC(60 months)110 x 22
82.0212.50
71.086.25
83.472.32
WPBC(24 months)155 x 32
77.071138.0
74.47286.59
78.121.54
Pima Indians768 x 8
69.86124.23
64.0319.94
70.331.05
BUPA Liver345 x 6
72.1267.55
84.5518.71
86.131.63
Cleveland Heart297 x 13
mâ nDataset Size SSVM SVMíí á
íí2
2SVMíí á
íí1
Tenfold test set correctness % (best in Red)CPU time in seconds
QPLPLinear Eqns.
Two-spiral Dataset
(94 White Dots & 94 Red Dots)
[Figure: the two interleaved spirals.]
[Figure: the nonlinear map φ : X → F taking input points x to feature-space points φ(x).]
Dual Representation of SVM
(Key of Kernel Methods: w = A'Dα* = Σ_{i=1}^{l} y_i α*_i A_i'.  Remember: A_i' = x^i)
The hypothesis is determined by (α*, b*):
h(x) = sgn(⟨x, A'Dα*⟩ + b*)
     = sgn(Σ_{i=1}^{l} y_i α*_i ⟨x^i, x⟩ + b*)
     = sgn(Σ_{α*_i > 0} y_i α*_i ⟨x^i, x⟩ + b*)
Linear Machine in Feature Space
Let φ : X → F be a nonlinear map from the input space to some feature space.
The classifier will be of the form (primal):
f(x) = (Σ_{i=1}^{?} w_i φ_i(x)) + b
Make it into the dual form:
f(x) = (Σ_{i=1}^{l} α_i y_i ⟨φ(x^i) · φ(x)⟩) + b
Kernel: Represent the Inner Product in Feature Space
Definition: A kernel is a function K : X × X → R such that for all x, z ∈ X
K(x, z) = ⟨φ(x) · φ(z)⟩
where φ : X → F.
The classifier will become:
f(x) = (Σ_{i=1}^{l} α_i y_i K(x^i, x)) + b
A Simple Example of Kernel
Polynomial kernel of degree 2: K(x, z) = ⟨x, z⟩^2
Let x = [x_1; x_2], z = [z_1; z_2] ∈ R^2 and define the nonlinear map φ : R^2 → R^3 by
φ(x) = [x_1^2; x_2^2; √2 x_1 x_2].
Then ⟨φ(x), φ(z)⟩ = ⟨x, z⟩^2 = K(x, z).
There are many other nonlinear maps ψ(x) that satisfy the relation ⟨ψ(x), ψ(z)⟩ = ⟨x, z⟩^2 = K(x, z).
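A quick numerical check of this identity; the two vectors are arbitrary examples.

```python
import numpy as np

def phi(x):
    # explicit degree-2 feature map for x in R^2
    return np.array([x[0]**2, x[1]**2, np.sqrt(2.0) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)))   # inner product in feature space
print(np.dot(x, z) ** 2)        # kernel value K(x, z) = <x, z>^2
# both print 1.0
```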
Power of the Kernel Technique
Consider a nonlinear map φ : R^n → R^p that consists of distinct features of all the monomials of degree d.
Then p = C(n + d − 1, d).
For example: n = 10, d = 10 gives p = 92378.
Is it necessary to compute φ(x) explicitly? We only need to know ⟨φ(x), φ(z)⟩!
This can be achieved with K(x, z) = ⟨x, z⟩^d.
More Examples of Kernel
K(A, B) : R^{m×n} × R^{n×l} → R^{m×l}
A ∈ R^{m×n}, a ∈ R^m, μ ∈ R, d is an integer:
Polynomial kernel: (AA' + μ a a')^d_• (componentwise power); the linear kernel AA' is the case μ = 0, d = 1
Gaussian (radial basis) kernel: K(A, A')_ij = e^{−μ ||A_i − A_j||_2^2}, i, j = 1, ..., m
The ij-entry of K(A, A') represents the "similarity" of data points A_i and A_j.
The value of the kernel function represents the inner product of two training points in feature space.
Kernel functions merge two steps (a small sketch follows below):
1. map input data from input space to feature space (which might be infinite dimensional)
2. do the inner product in the feature space
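A small sketch of the Gaussian kernel matrix between two point sets; the function name and vectorization are my own, the formula is the one given above.

```python
import numpy as np

def gaussian_kernel(A, B, mu):
    """K(A, B')_{ij} = exp(-mu * ||A_i - B_j||_2^2) for row-wise data matrices A, B."""
    sq_dist = (np.sum(A**2, axis=1)[:, None]
               + np.sum(B**2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    return np.exp(-mu * sq_dist)

A = np.random.default_rng(0).normal(size=(5, 3))
K = gaussian_kernel(A, A, mu=0.5)   # 5 x 5 "similarity" matrix, ones on the diagonal
print(np.diag(K))
```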
Kernel TechniqueBased on Mercer’s Condition (1909)
Nonlinear SVM Motivation
Linear SVM (linear separator: x'w + b = 0):
(QP)  min_{w,b,ξ}  (C/2)||ξ||_2^2 + (1/2)(||w||_2^2 + b^2)
      s.t.  D(Aw + eb) + ξ ≥ e,  ξ ≥ 0
By QP "duality", w = A'Dα. Maximizing the margin in the "dual space" gives:
min_{α,b,ξ}  (C/2)||ξ||_2^2 + (1/2)(||α||_2^2 + b^2)
s.t.  D(AA'Dα + eb) + ξ ≥ e,  ξ ≥ 0
Smoothing as before gives the dual SSVM, with separator x'A'Dα + b = 0:
min_{α,b}  (C/2)||p(e − D(AA'Dα + eb), β)||_2^2 + (1/2)(||α||_2^2 + b^2)
Nonlinear Smooth SVM
Replace AA' by a nonlinear kernel K(A, A'):
min_{α,b}  (C/2)||p(e − D(K(A, A')Dα + eb), β)||_2^2 + (1/2)(||α||_2^2 + b^2)
Use the Newton-Armijo algorithm to solve the problem.
Each iteration solves m+1 linear equations in m+1 variables.
The nonlinear classifier depends on the entire dataset.
Nonlinear classifier: K(x', A')Dα + b = 0
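Evaluating the nonlinear classifier at a new point x only needs the kernel row K(x', A'); a sketch assuming a Gaussian kernel and an already-trained (α, b).

```python
import numpy as np

def nonlinear_classify(x, A, d, alpha, b, mu):
    """sign( K(x', A') D alpha + b ) with a Gaussian kernel; alpha, b assumed trained."""
    k_row = np.exp(-mu * np.sum((A - x)**2, axis=1))   # K(x', A'), shape (m,)
    return np.sign(k_row @ (d * alpha) + b)
```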
Difficulties with Nonlinear SVM for Large Problems
The nonlinear kernel K(A, A') ∈ R^{l×l} is fully dense.
Runs out of memory while storing the kernel matrix: O(l^2) entries need to be generated and stored.
Long CPU time to compute the dense kernel matrix.
Computational complexity depends on the number of examples: the complexity of nonlinear SSVM is about O((l+1)^3).
The separating surface depends on almost the entire dataset, so the entire dataset must be stored even after solving the problem.
Solving the SVM with a Massive Dataset
Standard optimization techniques require that the data be held in memory, which limits the SVM to datasets of a few thousand points.
Solution I: SMO (Sequential Minimal Optimization)
Solve the sub-optimization problem defined by the working set (size = 2)
Increase the objective function iteratively
Solution II: RSVM (Reduced Support Vector Machine)
Support Vector Regression
(Linear Case: f(x) = x'w + b)
Given the training set: S = {(x^i, y_i) | x^i ∈ R^n, y_i ∈ R, i = 1, ..., l}
Find a linear function f(x) = x'w + b, where (w, b) is determined by solving a minimization problem that guarantees the smallest overall error made by f(x) = x'w + b.
Motivated by SVM:
||w||_2 should be as small as possible
Some tiny errors should be discarded
ε-Insensitive Loss Function
The ε-insensitive loss made by the estimation function f at the data point (x^i, y_i) is
|y_i − f(x^i)|_ε = max{0, |y_i − f(x^i)| − ε}
In general, |ξ|_ε = max{0, |ξ| − ε}, which is 0 if |ξ| ≤ ε and |ξ| − ε otherwise.
If ξ ∈ R^n then |ξ|_ε ∈ R^n is defined componentwise:
(|ξ|_ε)_i = |ξ_i|_ε,  i = 1, ..., n
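The componentwise ε-insensitive loss is one line of NumPy; the names and test values are illustrative.

```python
import numpy as np

def eps_insensitive(residual, eps):
    # |r|_eps = max(0, |r| - eps), applied componentwise
    return np.maximum(np.abs(residual) - eps, 0.0)

print(eps_insensitive(np.array([-2.0, -0.5, 0.0, 0.3, 1.5]), eps=0.5))
# -> [1.5 0.  0.  0.  1. ]
```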
ε-Insensitive Linear Regression
[Figure: data points around the regression line f(x) = x'w + b and its ε-tube; only points outside the tube, e.g. with y_j − f(x^j) > ε or f(x^k) − y_k > ε, contribute to the loss.]
Find (w, b) with the smallest overall error.
ε-Insensitive Support Vector Regression Model
Motivated by SVM:
||w||_2 should be as small as possible
Some tiny errors should be discarded
min_{(w,b,ξ) ∈ R^{n+1+m}}  (1/2)||w||_2^2 + C e'|ξ|_ε
where |ξ|_ε ∈ R^m and (|ξ|_ε)_i = max(0, |A_i w + b − y_i| − ε)
Reformulated ε-SVR as a Constrained Minimization Problem
min_{(w,b,ξ,ξ*) ∈ R^{n+1+2m}}  (1/2)w'w + C e'(ξ + ξ*)
subject to
y − Aw − eb ≤ eε + ξ
Aw + eb − y ≤ eε + ξ*
ξ, ξ* ≥ 0
This is a minimization problem with n+1+2m variables and 2m constraints.
It enlarges the problem size and the computational complexity of solving the problem.
Five Widely Used Loss Functions
SV Regression by Minimizing the Quadratic ε-Insensitive Loss
We minimize ||(w, b)||_2^2 at the same time.
Occam's razor: the simplest is the best.
We have the following (nonsmooth) problem:
min_{(w,b,ξ) ∈ R^{n+1+l}}  (1/2)(||w||_2^2 + b^2) + (C/2)|| |ξ|_ε ||_2^2
where (|ξ|_ε)_i = |y_i − (w'x^i + b)|_ε
This formulation has the strong convexity property.
ε-Insensitive Loss Function
|x|_ε = (x − ε)_+ + (−x − ε)_+
Quadratic ε-Insensitive Loss Function
|x|_ε^2 = ((x − ε)_+ + (−x − ε)_+)^2 = (x − ε)_+^2 + (−x − ε)_+^2
since (x − ε)_+ · (−x − ε)_+ = 0
Use the p_ε^2-function to Replace the Quadratic ε-Insensitive Function
p_ε^2(x, β) = (p(x − ε, β))^2 + (p(−x − ε, β))^2
where p(x, β) = x + (1/β) log(1 + e^{−βx})
[Figure: the p-function p(x, 10) for x ∈ [−3, 3], and |x|_ε^2 compared with p_ε^2(x, β) for ε = 1, β = 5.]
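A sketch of p_ε^2 compared against the exact quadratic ε-insensitive loss; the test points and the values ε = 1, β = 5 mirror the figure above but are otherwise arbitrary.

```python
import numpy as np

def p(x, beta):
    # smooth plus function
    return x + np.logaddexp(0.0, -beta * x) / beta

def p2_eps(x, eps, beta):
    # smooth approximation of the quadratic eps-insensitive loss |x|_eps^2
    return p(x - eps, beta)**2 + p(-x - eps, beta)**2

x = np.linspace(-3.0, 3.0, 7)
exact = np.maximum(np.abs(x) - 1.0, 0.0)**2    # |x|_eps^2 with eps = 1
print(exact)
print(p2_eps(x, eps=1.0, beta=5.0))            # close to the exact loss for beta = 5
```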
ε-Insensitive Smooth Support Vector Regression (ε-SSVR)
min_{(w,b) ∈ R^{n+1}}  Φ_{ε,β}(w, b) := (1/2)(w'w + b^2) + (C/2) Σ_{i=1}^{m} p_ε^2(A_i w + b − y_i, β)
which smooths the nonsmooth problem
min_{(w,b) ∈ R^{n+1}}  (1/2)(w'w + b^2) + (C/2) Σ_{i=1}^{m} |A_i w + b − y_i|_ε^2
This is a strongly convex minimization problem without any constraints.
The objective function is twice differentiable, so we can use a fast Newton-Armijo method to solve the problem.
Nonlinear ε-SVR
Based on the duality theorem and the KKT optimality conditions: w = A'α, α ∈ R^m
y ≈ Aw + eb
y ≈ AA'α + eb
In the nonlinear case: y ≈ K(A, A')α + eb
Nonlinear SVR
Let A ∈ R^{m×n}, B ∈ R^{n×l} and K(A, B) : R^{m×n} × R^{n×l} ⟹ R^{m×l}, so that K(A_i, A') ∈ R^{1×m}.
min_{(α,b) ∈ R^{m+1}}  (1/2)||α||_2^2 + C Σ_{i=1}^{m} |K(A_i, A')α + b − y_i|_ε
Nonlinear regression function: f(x) = K(x', A')α + b
Nonlinear Smooth Support Vector ε-Insensitive Regression
min_{(α,b) ∈ R^{m+1}}  (1/2)(α'α + b^2) + (C/2) Σ_{i=1}^{m} p_ε^2(K(A_i, A')α + b − y_i, β)
which smooths the nonsmooth problem
min_{(α,b) ∈ R^{m+1}}  (1/2)(α'α + b^2) + (C/2) Σ_{i=1}^{m} |K(A_i, A')α + b − y_i|_ε^2
Numerical Results
Training set and testing set (slice method)
A Gaussian kernel is used to generate the nonlinear ε-SSVR in all experiments.
The reduced kernel technique is utilized when the training dataset is bigger than 1000 points.
Error measure: 2-norm relative error ||y − ŷ||_2 / ||y||_2, where y denotes the observations and ŷ the predicted values.
101 Data Points in R × R
Nonlinear SSVR with kernel exp(−μ||x^i − x^j||_2^2)
f(x) = 0.5 sinc(10πx) + noise
Noise: mean = 0, σ = 0.04
x ∈ [−1, 1], 101 points
Parameters: ν = 50, μ = 5, ε = 0.02
Training time: 0.3 sec.
First Artificial Dataset
f(x) = 0.5 sinc(30πx) + random noise with mean = 0, standard deviation 0.04
ε-SSVR: training time 0.016 sec., error 0.059
LIBSVM: training time 0.015 sec., error 0.068
[Figure: the original function and the two estimated functions.]
481 Data Points in R² × R
Noise: mean = 0, σ = 0.4
Parameters: ν = 50, μ = 1, ε = 0.5
Training time: 9.61 sec.
Mean Absolute Error (MAE) at the 49×49 mesh points: 0.1761
[Figure: estimated function vs. original function.]
Using the Reduced Kernel: K(A, Ā') ∈ R^{28900×300}, where Ā is the reduced subset of 300 training points
Parameters: C = 10000, μ = 1, ε = 0.2
Training time: 22.58 sec.
MAE at the 49×49 mesh points: 0.0513
[Figure: estimated function vs. original function.]
Real Datasets
Linear ε-SSVR: Tenfold Numerical Results
Nonlinear ε-SSVR: Tenfold Numerical Results (1/2)
Nonlinear ε-SSVR: Tenfold Numerical Results (2/2)
Conclusion
We introduced SVMs for classification and regression. SVMs can be extended from the linear to the nonlinear case by using the kernel trick.
We applied the smoothing technique to SVMs to propose smooth SVMs for classification and regression.
We also described the Newton-Armijo algorithm for solving smooth SVMs, which has been shown to converge globally and quadratically, in a finite number of steps, to the solution.
The numerical results show the effectiveness and correctness of linear and nonlinear smooth SVMs.
Our smooth SVM formulation only needs to solve a system of linear equations iteratively instead of solving a convex quadratic program.