Smooth Support Vector Machines for Classification and Regression
Yuh-Jye Lee
National Taiwan University of Science and Technology
International Summer Workshop on the Economics, Financial and Managerial Applications of Computational Intelligence
August 16~20, 2004
Fundamental Problems in Data Mining
Supervised learning:
Classification problems
Regression problems
Feature selection
Too many features could degrade generalization performance (curse of dimensionality)
Unsupervised learning:
Clustering algorithms
Association rules
Binary Classification Problem
(A Fundamental Problem in Data Mining)
Find a decision function (rule) to discriminate between two categories of data.
Supervised learning in Machine Learning
Discriminant Analysis in Statistics
Methods: Fisher Linear Discriminant, Decision Tree, Neural Network, k-NN, Support Vector Machines, etc.
Successful applications: Marketing, Bioinformatics, Fraud detection
Bankruptcy Prediction
Binary classification of firms: solvent vs. bankrupt
The data are financial indicators from middle-market capitalization firms in Benelux. Of a total of 422 firms, 74 went bankrupt and 348 were solvent. The variables used in the model as explanatory inputs are 40 financial indicators such as liquidity, profitability and solvency measurements.
T. Van Gestel, B. Baesens, J. A. K. Suykens, M. Espinoza, D. Baestaens, J. Vanthienen and B. De Moor, "Bankruptcy Prediction with Least Squares Support Vector Machine Classifiers", International Conference on Computational Intelligence for Financial Engineering, 2003.
Binary Classification Problem
Linearly Separable Case
[Figure: the two classes A+ (solvent) and A− (bankrupt) separated by the bounding planes x'w + b = +1 and x'w + b = −1, with the separating plane x'w + b = 0 between them.]
Why Use Support Vector Machines?
Powerful tools for Data Mining
The SVM classifier is an optimally defined surface
SVMs have a good geometric interpretation
SVMs can be generated very efficiently
Can be extended from the linear to the nonlinear case:
Typically nonlinear in the input space
Linear in a higher dimensional "feature" space
Implicitly defined by a kernel function
Have a sound theoretical foundation
Based on Statistical Learning Theory
Support Vector Machines
Maximizing the Margin between Bounding Planes
[Figure: classes A+ and A− separated by the bounding planes x'w + b = +1 and x'w + b = −1; w is the normal to the planes and the margin between them is 2 / ||w||_2.]
Algebra of the Classification Problem
Linearly Separable Case
Given l points in the n-dimensional real space R^n, represented by an l × n matrix A.
Membership of each point A_i in the class A− or A+ is specified by an l × l diagonal matrix D: D_ii = −1 if A_i ∈ A− and D_ii = +1 if A_i ∈ A+.
Separate A− and A+ by two bounding planes such that:
A_i w + b ≥ +1, for D_ii = +1
A_i w + b ≤ −1, for D_ii = −1
More succinctly: D(Aw + eb) ≥ e, where e = [1, 1, ..., 1]' ∈ R^l.
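To make the matrix notation concrete, here is a tiny NumPy sketch; the four points, their labels, and the candidate plane (w, b) are made up for illustration.

```python
import numpy as np

# four illustrative points in R^2, two per class (made-up data)
A = np.array([[ 2.0,  2.0],
              [ 3.0,  1.0],      # rows belonging to A+
              [-1.0, -2.0],
              [-2.0,  0.0]])     # rows belonging to A-
d = np.array([1.0, 1.0, -1.0, -1.0])   # diagonal of D: +1 for A+, -1 for A-
D = np.diag(d)
e = np.ones(4)

# a candidate separating plane x'w + b = 0
w, b = np.array([1.0, 1.0]), 0.0

print(D @ (A @ w + e * b) >= e)   # does D(Aw + eb) >= e hold for every point?
```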
Support Vector Classification
(Linearly Separable Case)
Let S = {(x^1, y_1), (x^2, y_2), ..., (x^l, y_l)} be a linearly separable training sample, represented by the matrices
A = [(x^1)'; (x^2)'; ...; (x^l)'] ∈ R^{l×n},  D = diag(y_1, ..., y_l) ∈ R^{l×l}
Robust Linear Programming
Preliminary Approach to SVM
(LP)  min_{w,b,ξ}  e'ξ
      s.t.  D(Aw + eb) + ξ ≥ e,  ξ ≥ 0
where ξ is a nonnegative slack (error) vector.
The term e'ξ, the 1-norm measure of the error vector, is called the training error.
For the linearly separable case, at the solution of (LP): ξ = 0.
[Figure: two point clouds (x's and o's) separated by a plane; points on the wrong side of their bounding plane carry positive slacks ξ_i, ξ_j.]
Two Different Measures of Training Error
2-Norm Soft Margin:
min_{(w,b,ξ) ∈ R^{n+1+l}}  (1/2)||w||_2^2 + (C/2)||ξ||_2^2
s.t.  D(Aw + eb) + ξ ≥ e
1-Norm Soft Margin:
min_{(w,b,ξ) ∈ R^{n+1+l}}  (1/2)||w||_2^2 + C e'ξ
s.t.  D(Aw + eb) + ξ ≥ e,  ξ ≥ 0
The margin is maximized by minimizing the reciprocal of the margin, ||w||_2 / 2.
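As a concrete illustration of the 1-norm soft margin problem above, here is a minimal sketch using cvxpy; the toy data, the value of C, and the use of cvxpy itself are illustrative assumptions, not part of the original slides.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
# made-up, slightly overlapping two-class data
A = np.vstack([rng.normal( 1.0, 1.0, (20, 2)),
               rng.normal(-1.0, 1.0, (20, 2))])
d = np.concatenate([np.ones(20), -np.ones(20)])   # class labels = diagonal of D
C = 1.0

w  = cp.Variable(2)
b  = cp.Variable()
xi = cp.Variable(40, nonneg=True)                 # slack vector, xi >= 0

# min (1/2)||w||^2 + C e'xi   s.t.  D(Aw + eb) + xi >= e
constraints = [cp.multiply(d, A @ w + b) + xi >= 1]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)),
                     constraints)
problem.solve()
print(w.value, b.value)
```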
1-Norm Soft Margin Dual Formulation
The Lagrangian for the 1-norm soft margin:
L(w, b, ξ, α, r) = (1/2)w'w + C e'ξ + α'[e − D(Aw + eb) − ξ] − r'ξ
where α ≥ 0 and r ≥ 0.
Setting the partial derivatives with respect to the primal variables to zero:
∂L/∂w = w − A'Dα = 0
∂L/∂b = e'Dα = 0
∂L/∂ξ = Ce − α − r = 0
Dual Maximization Problem for 1-Norm Soft Margin
Dual:
max_{α ∈ R^l}  e'α − (1/2)α'DAA'Dα
s.t.  e'Dα = 0,  0 ≤ α ≤ Ce
The corresponding KKT complementarity conditions:
0 ≤ α ⊥ D(Aw + eb) + ξ − e ≥ 0
0 ≤ ξ ⊥ α − Ce ≤ 0
Slack Variables for the 1-Norm Soft Margin
At the solution, w* = A'Dα*.
Nonzero slack can only occur when α*_i = C.
The points for which 0 < α*_i < C lie on the bounding planes; this will help us find b*.
The contribution of an outlier to the decision rule is at most C.
The trade-off between accuracy and regularization is directly controlled by C.
Tuning Procedure
[Figure: tuning procedure illustration (overfitting).]
The final value of the parameter is the one with the maximum testing set correctness!
SVM as an Unconstrained Minimization Problem
(QP)  min_{w,b,ξ}  (C/2)||ξ||_2^2 + (1/2)(||w||_2^2 + b^2)
      s.t.  D(Aw + eb) + ξ ≥ e,  ξ ≥ 0
At the solution of (QP): ξ = (e − D(Aw + eb))_+, where (·)_+ = max{· , 0}.
Hence (QP) is equivalent to the nonsmooth SVM:
min_{w,b}  (C/2)||(e − D(Aw + eb))_+||_2^2 + (1/2)(||w||_2^2 + b^2)
This changes (QP) into an unconstrained mathematical program and reduces the n+1+m variables to n+1 variables.
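A minimal NumPy sketch of this unconstrained, nonsmooth objective; the function name and argument layout are my own, the formula is the one above.

```python
import numpy as np

def nonsmooth_svm_objective(w, b, A, d, C):
    """(C/2) ||(e - D(Aw + eb))_+||_2^2 + (1/2)(||w||_2^2 + b^2), with d = diag(D)."""
    slack = np.maximum(1.0 - d * (A @ w + b), 0.0)   # xi = (e - D(Aw + eb))_+
    return 0.5 * C * np.sum(slack**2) + 0.5 * (w @ w + b**2)
```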
Smooth the Plus Function: Integrate
Step function x_*  →  Sigmoid function 1 / (1 + e^{−βx})
Plus function x_+  →  p-function p(x, β)
Integrating the sigmoid function gives the smooth p-function:
p(x, β) := x + (1/β) log(1 + e^{−βx})
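A short sketch of the p-function and how close it stays to the plus function; the grid of test points and the choice β = 5 are only for illustration.

```python
import numpy as np

def p(x, beta=5.0):
    # smooth plus function: p(x, beta) = x + (1/beta) * log(1 + exp(-beta * x))
    return x + np.logaddexp(0.0, -beta * x) / beta

x = np.linspace(-2.0, 2.0, 9)
print(np.maximum(x, 0.0))    # the plus function (x)_+
print(p(x, beta=5.0))        # smooth approximation; the gap is at most log(2)/beta, at x = 0
```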
SSVM: Smooth Support Vector Machine
Replacing the plus function (·)_+ in the nonsmooth SVM by the smooth p(· , β) gives our SSVM:
min_{(w,b) ∈ R^{n+1}}  (C/2)||p(e − D(Aw + eb), β)||_2^2 + (1/2)(||w||_2^2 + b^2)
Here p(· , β) is an accurate smooth approximation of (·)_+, obtained by integrating the sigmoid function of neural networks (the sigmoid is a smoothed step function).
The solution of SSVM converges to the solution of the nonsmooth SVM as β goes to infinity. (Typically, β = 5.)
Newton-Armijo Method: Quadratic Approximation of SSVM
The sequence {(w^i, b^i)}, generated by solving a quadratic approximation of SSVM, converges to the unique solution (w*, b*) of SSVM at a quadratic rate.
At each iteration we solve a linear system of n+1 equations in n+1 variables.
Complexity depends on the dimension of the input space.
Converges in 6 to 8 iterations.
A stepsize might need to be selected.
Newton-Armijo Algorithm
Start with any (w^0, b^0) ∈ R^{n+1}. Having (w^i, b^i), stop if ∇Φ_β(w^i, b^i) = 0, else:
(i) Newton direction: solve ∇²Φ_β(w^i, b^i) d^i = −∇Φ_β(w^i, b^i)
(ii) Armijo stepsize: (w^{i+1}, b^{i+1}) = (w^i, b^i) + λ_i d^i, with λ_i ∈ {1, 1/2, 1/4, ...} chosen so that the objective decreases sufficiently.
The iterates converge globally and quadratically to the unique solution in a finite number of steps.
Why use a stepsize?
Without it, the pure Newton iteration may not converge to the optimal solution.
Example: f(x) = −(1/6)x^6 + (1/4)x^4 + 2x^2, with quadratic approximation
g(x) = f(x^i) + f'(x^i)(x − x^i) + (1/2)f''(x^i)(x − x^i)^2
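For concreteness, here is one way the Newton-Armijo iteration for the SSVM objective could be sketched in NumPy. The gradient and Hessian formulas follow from the smooth objective above; the stopping tolerance, the Armijo constant 1e-4, and the dense linear solve are illustrative choices, not the authors' implementation.

```python
import numpy as np

def p(x, beta):
    # smooth plus function p(x, beta) = x + (1/beta) * log(1 + exp(-beta * x))
    return x + np.logaddexp(0.0, -beta * x) / beta

def ssvm_newton_armijo(A, d, C=1.0, beta=5.0, tol=1e-6, max_iter=50):
    """Minimize (C/2)||p(e - D(Aw + eb), beta)||^2 + (1/2)(||w||^2 + b^2); d = diag(D)."""
    m, n = A.shape
    w, b = np.zeros(n), 0.0

    def objective(w_, b_):
        r = 1.0 - d * (A @ w_ + b_)                      # e - D(Aw + eb)
        return 0.5 * C * np.sum(p(r, beta)**2) + 0.5 * (w_ @ w_ + b_**2)

    for _ in range(max_iter):
        r  = 1.0 - d * (A @ w + b)
        pr = p(r, beta)                                  # smoothed slacks
        s  = 0.5 * (1.0 + np.tanh(0.5 * beta * r))       # p'(r) = sigmoid(beta * r)
        g  = -C * d * (pr * s)                           # dPhi/d(A_i w + b), per point
        grad = np.concatenate([A.T @ g + w, [g.sum() + b]])
        if np.linalg.norm(grad) < tol:
            break
        # Hessian = B' diag(h) B + I, with B_i = d_i [A_i, 1]
        h = C * (s**2 + beta * pr * s * (1.0 - s))
        B = np.hstack([A, np.ones((m, 1))]) * d[:, None]
        H = B.T @ (B * h[:, None]) + np.eye(n + 1)
        step = np.linalg.solve(H, -grad)                 # (i) Newton direction
        # (ii) Armijo stepsize: halve until sufficient decrease
        f0, slope, lam = objective(w, b), grad @ step, 1.0
        while objective(w + lam * step[:n], b + lam * step[n]) > f0 + 1e-4 * lam * slope:
            lam *= 0.5
        w, b = w + lam * step[:n], b + lam * step[n]
    return w, b
```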
Comparisons of SSVM with other SVMs
89.17128.15
86.1042.41
89.633.69
Ionosphere 351 x 34
61.834.91
66.233.72
68.181.03
WPBC(60 months)110 x 22
82.0212.50
71.086.25
83.472.32
WPBC(24 months)155 x 32
77.071138.0
74.47286.59
78.121.54
Pima Indians768 x 8
69.86124.23
64.0319.94
70.331.05
BUPA Liver345 x 6
72.1267.55
84.5518.71
86.131.63
Cleveland Heart297 x 13
mâ nDataset Size SSVM SVMíí á
íí2
2SVMíí á
íí1
Tenfold test set correctness % (best in Red)CPU time in seconds
QPLPLinear Eqns.
Two-spiral Dataset
(94 White Dots & 94 Red Dots)
[Figure: the two interleaved spirals.]
[Figure: the nonlinear map φ : X → F taking input points x to feature-space points φ(x).]
Dual Representation of SVM
(Key of Kernel Methods: w = A'Dα* = Σ_{i=1}^{l} y_i α*_i A_i'.  Remember: A_i' = x^i)
The hypothesis is determined by (α*, b*):
h(x) = sgn(⟨x, A'Dα*⟩ + b*)
     = sgn(Σ_{i=1}^{l} y_i α*_i ⟨x^i, x⟩ + b*)
     = sgn(Σ_{α*_i > 0} y_i α*_i ⟨x^i, x⟩ + b*)
Linear Machine in Feature Space
Let φ : X → F be a nonlinear map from the input space to some feature space.
The classifier will be of the form (primal):
f(x) = (Σ_{i=1}^{?} w_i φ_i(x)) + b
Make it into the dual form:
f(x) = (Σ_{i=1}^{l} α_i y_i ⟨φ(x^i) · φ(x)⟩) + b
Kernel: Represent the Inner Product in Feature Space
Definition: A kernel is a function K : X × X → R such that for all x, z ∈ X
K(x, z) = ⟨φ(x) · φ(z)⟩
where φ : X → F.
The classifier will become:
f(x) = (Σ_{i=1}^{l} α_i y_i K(x^i, x)) + b
A Simple Example of Kernel
Polynomial kernel of degree 2: K(x, z) = ⟨x, z⟩^2
Let x = [x_1; x_2], z = [z_1; z_2] ∈ R^2 and define the nonlinear map φ : R^2 → R^3 by
φ(x) = [x_1^2; x_2^2; √2 x_1 x_2].
Then ⟨φ(x), φ(z)⟩ = ⟨x, z⟩^2 = K(x, z).
There are many other nonlinear maps ψ(x) that satisfy the relation ⟨ψ(x), ψ(z)⟩ = ⟨x, z⟩^2 = K(x, z).
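A quick numerical check of this identity; the two vectors are arbitrary examples.

```python
import numpy as np

def phi(x):
    # explicit degree-2 feature map for x in R^2
    return np.array([x[0]**2, x[1]**2, np.sqrt(2.0) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)))   # inner product in feature space
print(np.dot(x, z) ** 2)        # kernel value K(x, z) = <x, z>^2
# both print 1.0
```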
Power of the Kernel Technique
Consider a nonlinear map φ : R^n → R^p that consists of distinct features of all the monomials of degree d.
Then p = C(n + d − 1, d).
For example: n = 10, d = 10 gives p = 92378.
Is it necessary to compute φ(x) explicitly? We only need to know ⟨φ(x), φ(z)⟩!
This can be achieved with K(x, z) = ⟨x, z⟩^d.
More Examples of Kernel
K(A, B) : R^{m×n} × R^{n×l} → R^{m×l}
A ∈ R^{m×n}, a ∈ R^m, μ ∈ R, d is an integer:
Polynomial kernel: (AA' + μ a a')^d_• (componentwise power); the linear kernel AA' is the case μ = 0, d = 1
Gaussian (radial basis) kernel: K(A, A')_ij = e^{−μ ||A_i − A_j||_2^2}, i, j = 1, ..., m
The ij-entry of K(A, A') represents the "similarity" of data points A_i and A_j.
The value of the kernel function represents the inner product of two training points in feature space.
Kernel functions merge two steps (a small sketch follows below):
1. map input data from input space to feature space (which might be infinite dimensional)
2. do the inner product in the feature space
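A small sketch of the Gaussian kernel matrix between two point sets; the function name and vectorization are my own, the formula is the one given above.

```python
import numpy as np

def gaussian_kernel(A, B, mu):
    """K(A, B')_{ij} = exp(-mu * ||A_i - B_j||_2^2) for row-wise data matrices A, B."""
    sq_dist = (np.sum(A**2, axis=1)[:, None]
               + np.sum(B**2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    return np.exp(-mu * sq_dist)

A = np.random.default_rng(0).normal(size=(5, 3))
K = gaussian_kernel(A, A, mu=0.5)   # 5 x 5 "similarity" matrix, ones on the diagonal
print(np.diag(K))
```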
Kernel TechniqueBased on Mercer’s Condition (1909)
Nonlinear SVM Motivation
Linear SVM (linear separator: x'w + b = 0):
(QP)  min_{w,b,ξ}  (C/2)||ξ||_2^2 + (1/2)(||w||_2^2 + b^2)
      s.t.  D(Aw + eb) + ξ ≥ e,  ξ ≥ 0
By QP "duality", w = A'Dα. Maximizing the margin in the "dual space" gives:
min_{α,b,ξ}  (C/2)||ξ||_2^2 + (1/2)(||α||_2^2 + b^2)
s.t.  D(AA'Dα + eb) + ξ ≥ e,  ξ ≥ 0
Smoothing as before gives the dual SSVM, with separator x'A'Dα + b = 0:
min_{α,b}  (C/2)||p(e − D(AA'Dα + eb), β)||_2^2 + (1/2)(||α||_2^2 + b^2)
Nonlinear Smooth SVM
Replace AA' by a nonlinear kernel K(A, A'):
min_{α,b}  (C/2)||p(e − D(K(A, A')Dα + eb), β)||_2^2 + (1/2)(||α||_2^2 + b^2)
Use the Newton-Armijo algorithm to solve the problem.
Each iteration solves m+1 linear equations in m+1 variables.
The nonlinear classifier depends on the entire dataset.
Nonlinear classifier: K(x', A')Dα + b = 0
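Evaluating the nonlinear classifier at a new point x only needs the kernel row K(x', A'); a sketch assuming a Gaussian kernel and an already-trained (α, b).

```python
import numpy as np

def nonlinear_classify(x, A, d, alpha, b, mu):
    """sign( K(x', A') D alpha + b ) with a Gaussian kernel; alpha, b assumed trained."""
    k_row = np.exp(-mu * np.sum((A - x)**2, axis=1))   # K(x', A'), shape (m,)
    return np.sign(k_row @ (d * alpha) + b)
```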
Difficulties with Nonlinear SVM for Large Problems
The nonlinear kernel K(A, A') ∈ R^{l×l} is fully dense.
Runs out of memory while storing the kernel matrix: O(l^2) entries need to be generated and stored.
Long CPU time to compute the dense kernel matrix.
Computational complexity depends on the number of examples: the complexity of nonlinear SSVM is about O((l+1)^3).
The separating surface depends on almost the entire dataset, so the entire dataset must be stored even after solving the problem.
Solving the SVM with a Massive Dataset
Standard optimization techniques require that the data be held in memory, which limits the SVM to datasets of a few thousand points.
Solution I: SMO (Sequential Minimal Optimization)
Solve the sub-optimization problem defined by the working set (size = 2)
Increase the objective function iteratively
Solution II: RSVM (Reduced Support Vector Machine)
Support Vector Regression
(Linear Case: f(x) = x'w + b)
Given the training set: S = {(x^i, y_i) | x^i ∈ R^n, y_i ∈ R, i = 1, ..., l}
Find a linear function f(x) = x'w + b, where (w, b) is determined by solving a minimization problem that guarantees the smallest overall error made by f(x) = x'w + b.
Motivated by SVM:
||w||_2 should be as small as possible
Some tiny errors should be discarded
ε-Insensitive Loss Function
The ε-insensitive loss made by the estimation function f at the data point (x^i, y_i) is
|y_i − f(x^i)|_ε = max{0, |y_i − f(x^i)| − ε}
In general, |ξ|_ε = max{0, |ξ| − ε}, which is 0 if |ξ| ≤ ε and |ξ| − ε otherwise.
If ξ ∈ R^n then |ξ|_ε ∈ R^n is defined componentwise:
(|ξ|_ε)_i = |ξ_i|_ε,  i = 1, ..., n
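The componentwise ε-insensitive loss is one line of NumPy; the names and test values are illustrative.

```python
import numpy as np

def eps_insensitive(residual, eps):
    # |r|_eps = max(0, |r| - eps), applied componentwise
    return np.maximum(np.abs(residual) - eps, 0.0)

print(eps_insensitive(np.array([-2.0, -0.5, 0.0, 0.3, 1.5]), eps=0.5))
# -> [1.5 0.  0.  0.  1. ]
```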
ε-Insensitive Linear Regression
[Figure: data points around the regression line f(x) = x'w + b and its ε-tube; only points outside the tube, e.g. with y_j − f(x^j) > ε or f(x^k) − y_k > ε, contribute to the loss.]
Find (w, b) with the smallest overall error.
ε-Insensitive Support Vector Regression Model
Motivated by SVM:
||w||_2 should be as small as possible
Some tiny errors should be discarded
min_{(w,b,ξ) ∈ R^{n+1+m}}  (1/2)||w||_2^2 + C e'|ξ|_ε
where |ξ|_ε ∈ R^m and (|ξ|_ε)_i = max(0, |A_i w + b − y_i| − ε)
Reformulated ε-SVR as a Constrained Minimization Problem
min_{(w,b,ξ,ξ*) ∈ R^{n+1+2m}}  (1/2)w'w + C e'(ξ + ξ*)
subject to
y − Aw − eb ≤ eε + ξ
Aw + eb − y ≤ eε + ξ*
ξ, ξ* ≥ 0
This is a minimization problem with n+1+2m variables and 2m constraints.
It enlarges the problem size and the computational complexity of solving the problem.
Five Widely Used Loss Functions
SV Regression by Minimizing the Quadratic ε-Insensitive Loss
We minimize ||(w, b)||_2^2 at the same time.
Occam's razor: the simplest is the best.
We have the following (nonsmooth) problem:
min_{(w,b,ξ) ∈ R^{n+1+l}}  (1/2)(||w||_2^2 + b^2) + (C/2)|| |ξ|_ε ||_2^2
where (|ξ|_ε)_i = |y_i − (w'x^i + b)|_ε
This formulation has the strong convexity property.
ε-Insensitive Loss Function
|x|_ε = (x − ε)_+ + (−x − ε)_+
Quadratic ε-Insensitive Loss Function
|x|_ε^2 = ((x − ε)_+ + (−x − ε)_+)^2 = (x − ε)_+^2 + (−x − ε)_+^2
since (x − ε)_+ · (−x − ε)_+ = 0
Use the p_ε^2-function to Replace the Quadratic ε-Insensitive Function
p_ε^2(x, β) = (p(x − ε, β))^2 + (p(−x − ε, β))^2
where p(x, β) = x + (1/β) log(1 + e^{−βx})
[Figure: the p-function p(x, 10) for x ∈ [−3, 3], and |x|_ε^2 compared with p_ε^2(x, β) for ε = 1, β = 5.]
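A sketch of p_ε^2 compared against the exact quadratic ε-insensitive loss; the test points and the values ε = 1, β = 5 mirror the figure above but are otherwise arbitrary.

```python
import numpy as np

def p(x, beta):
    # smooth plus function
    return x + np.logaddexp(0.0, -beta * x) / beta

def p2_eps(x, eps, beta):
    # smooth approximation of the quadratic eps-insensitive loss |x|_eps^2
    return p(x - eps, beta)**2 + p(-x - eps, beta)**2

x = np.linspace(-3.0, 3.0, 7)
exact = np.maximum(np.abs(x) - 1.0, 0.0)**2    # |x|_eps^2 with eps = 1
print(exact)
print(p2_eps(x, eps=1.0, beta=5.0))            # close to the exact loss for beta = 5
```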
ε-Insensitive Smooth Support Vector Regression (ε-SSVR)
min_{(w,b) ∈ R^{n+1}}  Φ_{ε,β}(w, b) := (1/2)(w'w + b^2) + (C/2) Σ_{i=1}^{m} p_ε^2(A_i w + b − y_i, β)
which smooths the nonsmooth problem
min_{(w,b) ∈ R^{n+1}}  (1/2)(w'w + b^2) + (C/2) Σ_{i=1}^{m} |A_i w + b − y_i|_ε^2
This is a strongly convex minimization problem without any constraints.
The objective function is twice differentiable, so we can use a fast Newton-Armijo method to solve the problem.
Nonlinear ε-SVR
Based on the duality theorem and the KKT optimality conditions: w = A'α, α ∈ R^m
y ≈ Aw + eb
y ≈ AA'α + eb
In the nonlinear case: y ≈ K(A, A')α + eb
Nonlinear SVR
Let A ∈ R^{m×n}, B ∈ R^{n×l} and K(A, B) : R^{m×n} × R^{n×l} ⟹ R^{m×l}, so that K(A_i, A') ∈ R^{1×m}.
min_{(α,b) ∈ R^{m+1}}  (1/2)||α||_2^2 + C Σ_{i=1}^{m} |K(A_i, A')α + b − y_i|_ε
Nonlinear regression function: f(x) = K(x', A')α + b
Nonlinear Smooth Support Vector ε-Insensitive Regression
min_{(α,b) ∈ R^{m+1}}  (1/2)(α'α + b^2) + (C/2) Σ_{i=1}^{m} p_ε^2(K(A_i, A')α + b − y_i, β)
which smooths the nonsmooth problem
min_{(α,b) ∈ R^{m+1}}  (1/2)(α'α + b^2) + (C/2) Σ_{i=1}^{m} |K(A_i, A')α + b − y_i|_ε^2
Numerical Results
Training set and testing set (slice method)
A Gaussian kernel is used to generate the nonlinear ε-SSVR in all experiments.
The reduced kernel technique is utilized when the training dataset is bigger than 1000 points.
Error measure: 2-norm relative error ||y − ŷ||_2 / ||y||_2, where y denotes the observations and ŷ the predicted values.
101 Data Points in R × R
Nonlinear SSVR with kernel exp(−μ||x^i − x^j||_2^2)
f(x) = 0.5 sinc(10πx) + noise
Noise: mean = 0, σ = 0.04
x ∈ [−1, 1], 101 points
Parameters: ν = 50, μ = 5, ε = 0.02
Training time: 0.3 sec.
First Artificial Dataset
f(x) = 0.5 sinc(30πx) + random noise with mean = 0, standard deviation 0.04
ε-SSVR: training time 0.016 sec., error 0.059
LIBSVM: training time 0.015 sec., error 0.068
[Figure: the original function and the two estimated functions.]
481 Data Points in R² × R
Noise: mean = 0, σ = 0.4
Parameters: ν = 50, μ = 1, ε = 0.5
Training time: 9.61 sec.
Mean Absolute Error (MAE) at the 49×49 mesh points: 0.1761
[Figure: estimated function vs. original function.]
Using the Reduced Kernel: K(A, Ā') ∈ R^{28900×300}, where Ā is the reduced subset of 300 training points
Parameters: C = 10000, μ = 1, ε = 0.2
Training time: 22.58 sec.
MAE at the 49×49 mesh points: 0.0513
[Figure: estimated function vs. original function.]
Real Datasets
Linear ε-SSVR: Tenfold Numerical Results
Nonlinear ε-SSVR: Tenfold Numerical Results (1/2)
Nonlinear ε-SSVR: Tenfold Numerical Results (2/2)
Conclusion
We introduced SVMs for classification and regression. SVMs can be extended from the linear to the nonlinear case by using the kernel trick.
We applied the smoothing technique to SVMs to propose smooth SVMs for classification and regression.
We also described the Newton-Armijo algorithm for solving smooth SVMs, which has been shown to converge globally and quadratically, in a finite number of steps, to the solution.
The numerical results show the effectiveness and correctness of linear and nonlinear smooth SVMs.
Our smooth SVM formulation only needs to solve a system of linear equations iteratively instead of solving a convex quadratic program.