
Page 1: SVM   and SVR as   Convex Optimization  Techniques

SVM and SVR as Convex Optimization Techniques

Mohammed Nasser
Department of Statistics
Rajshahi University, Rajshahi 6205

Page 2: SVM   and SVR as   Convex Optimization  Techniques

Acknowledgement

Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University

Kenji Fukumizu, Institute of Statistical Mathematics, ROIS; Department of Statistical Science, Graduate University for Advanced Studies

Georgi Nalbantov, Econometric Institute, School of Economics, Erasmus University Rotterdam

Page 3: SVM   and SVR as   Convex Optimization  Techniques

Contents

Glimpses of Historical Development

Optimal Separating Hyperplane

Soft Margin Support Vector Machine

Support Vector Regression

Convex Optimization

Use of Lagrange and Duality Theory

Example

Conclusion

Page 4: SVM   and SVR as   Convex Optimization  Techniques

Early History

In 1900 Karl Pearson published his famous article on goodness of fit, later judged one of the twelve most important scientific articles of the twentieth century.

In 1902 Jacques Hadamard pointed out that mathematical models of physical phenomena should have the following properties:

A solution exists.
The solution is unique.
The solution depends continuously on the data, in some reasonable topology.

(A well-posed problem.)

Page 5: SVM   and SVR as   Convex Optimization  Techniques

Early History

In 1940 Fréchet, a PhD student of Hadamard, strongly criticized the mean and standard deviation as measures of location and scale respectively, but he expressed his belief in the further development of statistics without proposing any alternative.

During the sixties and seventies Tukey, Huber and Hampel tried to develop Robust Statistics in order to remove the ill-posedness of classical statistics.

Robustness means insensitivity to minor changes in both model and sample, high tolerance to major changes, and good performance at the assumed model.

The onslaught of Data Mining and the problems of non-linearity and non-vectorial data have made robust statistics somewhat less attractive.

Let us see what kernel methods (KM) offer…

Page 6: SVM   and SVR as   Convex Optimization  Techniques

Recent History

Support Vector Machines (SVM), introduced at COLT-92 (Conference on Learning Theory), have been greatly developed since then.

Result: a class of algorithms for pattern recognition (kernel machines).
Now: a large and diverse community, drawn from machine learning, optimization, statistics, neural networks, functional analysis, etc.
Centralized website: www.kernel-machines.org
First textbook (2000): see www.support-vector.net
Now (2012): at least twenty books of different tastes are available in the international market.
The book "The Elements of Statistical Learning" (2001) by Hastie, Tibshirani and Friedman went into a second edition within seven years.

Page 7: SVM   and SVR as   Convex Optimization  Techniques

Kernel Methods: Heuristic View

What is the common characteristic (structure) among the following statistical methods?

1. Principal components analysis
2. (Ridge) regression
3. Fisher discriminant analysis
4. Canonical correlation analysis
5. Singular value decomposition
6. Independent component analysis

We consider linear combinations of the input vector: f(x) = wᵀx.

We make use of the concepts of length and dot product available in Euclidean space.

Page 8: SVM   and SVR as   Convex Optimization  Techniques

Kernel Methods: Heuristic View

• Linear learning typically has nice properties
– Unique optimal solutions, fast learning algorithms
– Better statistical analysis

• But one big problem
– Insufficient capacity
That means, in many data sets it fails to detect nonlinear relationships among the variables.

• The other demerit
– Cannot handle non-vectorial data

Page 9: SVM   and SVR as   Convex Optimization  Techniques

Kernel Methods

Classical method → Kernel version
PCA → KPCA
CCA → KCCA
FLDA → KFLDA
ICA → KICA
Regression → SVR
Classification → SVM

More:
Test of independence
Test of equality of distributions
Outlier detection
Data depth function

Page 10: SVM   and SVR as   Convex Optimization  Techniques

Kernel Methods: Heuristic View

In classical multivariate analysis we consider linear combinations of the input vector:

f(x) = wᵀx

We make use of the concepts of length and dot product/inner product available in Euclidean/non-Euclidean space.

In modern multivariate analysis we consider linear combinations of the feature vector:

f(x) = wᵀΦ(x) = Σᵢ αᵢ ⟨Φ(xᵢ), Φ(x)⟩ = Σᵢ αᵢ k(xᵢ, x)   (sum over i = 1, …, n)

We make use of the concepts of length and dot product available in the feature space, through the kernel k.
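As a small added illustration (not on the slide), the snippet below checks numerically that an explicit feature map Φ and its kernel k give the same inner products, using the degree-2 homogeneous polynomial kernel k(x, z) = (xᵀz)² as an assumed example.

import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel on R^2."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def k(x, z):
    """Kernel form: k(x, z) = (x . z)^2, computed without building features."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Both routes give the same inner product, so any linear method that only
# needs dot products can work implicitly in the feature space.
print(np.dot(phi(x), phi(z)))  # 1.0
print(k(x, z))                 # 1.0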

Page 11: SVM   and SVR as   Convex Optimization  Techniques

Some Review of College Geometry

(Figure: the line y + x − 1 = 0 splits the plane into the regions y + x − 1 > 0 and y + x − 1 < 0; the vector (1,1) is normal, at 90°, to the line. Multiplying through by k, ky + kx − k = 0, gives the same line but has a different effect on the two signed regions.)

Page 12: SVM   and SVR as   Convex Optimization  Techniques

Some Review of College Geometry: In General Form

(Figure: the hyperplane wᵀx + b = 0 splits the space into the regions wᵀx + b > 0 and wᵀx + b < 0; the vector w is normal, at 90°, to the hyperplane. Scaling, kwᵀx + kb = 0, gives the same hyperplane but has a different effect on the two signed regions.)

Page 13: SVM   and SVR as   Convex Optimization  Techniques

Some Review of College Geometry: In General Form

(Figure: the hyperplane wᵀx + b = 0 with normal vector w; the panels show the effect of a change in w and the effect of a change in b.)

Page 14: SVM   and SVR as   Convex Optimization  Techniques

Linear Kernel

k(x, z) = ⟨x, z⟩. Its RKHS is the space of linear functions f(x) = ⟨w, x⟩, and it can be shown that ‖f‖ = ‖w‖.

Page 15: SVM   and SVR as   Convex Optimization  Techniques

Linearly Separable Classes

Page 16: SVM   and SVR as   Convex Optimization  Techniques

Linear Classifiers

f(x, w, b) = sign(wᵀx + b)

(Figure: input x is fed to f, which outputs the estimated class y_est; one marker denotes +1, the other −1. The line wᵀx + b = 0 separates the regions wᵀx + b > 0 and wᵀx + b < 0.)

How would you classify this data?

Page 17: SVM and SVR as Convex Optimization Techniques

Linear Classifiers

f(x, w, b) = sign(wᵀx + b)

How would you classify this data?

Page 18: SVM and SVR as Convex Optimization Techniques

Linear Classifiers

f(x, w, b) = sign(wᵀx + b)

How would you classify this data?

Page 19: SVM and SVR as Convex Optimization Techniques

Linear Classifiers

f(x, w, b) = sign(wᵀx + b)

Any of these would be fine..

..but which is the best?

Page 20: SVM and SVR as Convex Optimization Techniques

Linear Classifiers

f(x, w, b) = sign(wᵀx + b)

How would you classify this data?

Misclassified to +1 class

Page 21: SVM   and SVR as   Convex Optimization  Techniques

Classifier Margin

f(x, w, b) = sign(wᵀx + b)

Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a datapoint.

Page 22: SVM   and SVR as   Convex Optimization  Techniques

Geometric margin versus functional margin

Pages 23–36: (figure-only slides; no text captured.)
Page 37: SVM   and SVR as   Convex Optimization  Techniques

Maximum Margin

f(x, w, b) = sign(wᵀx + b)

The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM: Linear SVM).

Support vectors are those datapoints that the margin pushes up against.

1. Maximizing the margin is good according to intuition and PAC theory.

2. It implies that only support vectors are important; other training examples are ignorable.

3. Empirically it works very, very well.

Page 38: SVM   and SVR as   Convex Optimization  Techniques

Linear SVM Mathematically

Our goal:

1) Correctly classify all training data:
wᵀxᵢ + b ≥ +1 if yᵢ = +1
wᵀxᵢ + b ≤ −1 if yᵢ = −1
i.e. yᵢ(wᵀxᵢ + b) ≥ 1 for all i

2) Maximize the margin M = 2/‖w‖, which is the same as minimizing ½ wᵀw.
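The step from "maximize the margin" to "minimize ½ wᵀw" deserves one line of justification; the short derivation below is an addition (not on the slide), written in standard notation.

% Distance from the hyperplane w^T x + b = 0 to each of the hyperplanes w^T x + b = +1 and -1 is 1/||w||.
\[
  \text{margin} \;=\; \frac{1}{\lVert w\rVert} + \frac{1}{\lVert w\rVert} \;=\; \frac{2}{\lVert w\rVert},
  \qquad
  \max_{w,b}\ \frac{2}{\lVert w\rVert}
  \;\Longleftrightarrow\;
  \min_{w,b}\ \tfrac12\, w^{\mathsf T} w
  \;\;\text{s.t.}\;\; y_i\,(w^{\mathsf T} x_i + b) \ge 1 .
\]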

Page 39: SVM   and SVR as   Convex Optimization  Techniques

Linear SVM Mathematically

Minimize Φ(w) = ½ wᵀw

subject to yᵢ(wᵀxᵢ + b) ≥ 1 for all i

We can formulate this as a Quadratic Optimization Problem and solve for w and b:

a strictly convex quadratic objective with linear inequality constraints.
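As an added sketch (not from the slides) of this quadratic program in code, the snippet below hands the hard-margin problem to a generic convex solver; cvxpy and the toy data are assumptions.

import cvxpy as cp
import numpy as np

# Toy linearly separable data: two clusters in R^2.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()

# Minimize (1/2) w^T w  subject to  y_i (w^T x_i + b) >= 1 for all i.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
prob = cp.Problem(objective, constraints)
prob.solve()

print("w =", w.value, "b =", b.value, "margin =", 2 / np.linalg.norm(w.value))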

Page 40: SVM   and SVR as   Convex Optimization  Techniques
Page 41: SVM   and SVR as   Convex Optimization  Techniques

Soft Margin Classification

Slack variables ξᵢ can be added to allow misclassification of difficult or noisy examples.

(Figure: hyperplanes wᵀx + b = 1, wᵀx + b = 0 and wᵀx + b = −1, with slack variables such as ξ₂, ξ₇, ξ₁₁ for points on the wrong side.)

What should our quadratic optimization criterion be?

Minimize ½ wᵀw + C Σₖ ξₖ   (sum over k = 1, …, R)

Page 42: SVM   and SVR as   Convex Optimization  Techniques

Hard Margin vs. Soft Margin

The old formulation:
Find w and b such that Φ(w) = ½ wᵀw is minimized, and for all {(xᵢ, yᵢ)}: yᵢ(wᵀxᵢ + b) ≥ 1.

The new formulation incorporating slack variables:
Find w and b such that Φ(w) = ½ wᵀw + C Σ ξᵢ is minimized, and for all {(xᵢ, yᵢ)}: yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0 for all i.

Parameter C can be viewed as a way to control overfitting.
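A minimal added sketch (not from the slides) of the soft-margin formulation in practice: scikit-learn's LinearSVC exposes the same C parameter, and varying it shows the trade-off described above; the data and C values are assumptions.

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian clusters, so some slack is unavoidable.
X = np.vstack([rng.normal(loc=1.0, size=(50, 2)),
               rng.normal(loc=-1.0, size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

# Small C tolerates more slack (wider margin); large C penalizes slack heavily.
for C in (0.01, 1.0, 100.0):
    clf = LinearSVC(C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_)
    print(f"C={C:7.2f}  margin width={margin:.3f}  train accuracy={clf.score(X, y):.2f}")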

Page 43: SVM   and SVR as   Convex Optimization  Techniques

Linear Support Vector Regression

• Marketing problem

Given variables:
– person's age
– income group
– season
– holiday duration
– location
– number of children
– etc. (12 variables)

Predict: the level of holiday Expenditures.

(Figure: scatter plot of Expenditures against Age.)

Page 44: SVM   and SVR as   Convex Optimization  Techniques

Linear Support Vector Regression

(Figure: three Expenditures-vs-Age fits:
"Suspiciously smart case" (overfitting),
"Lazy case" (underfitting),
"Compromise case", SVR (good generalizability).)

Page 45: SVM   and SVR as   Convex Optimization  Techniques

Linear Support Vector Regression

• The epsilon-insensitive loss function

(Figure: penalty plotted against error; the penalty is zero for errors inside the ε-tube and grows linearly, at 45°, outside it.)

Page 46: SVM   and SVR as   Convex Optimization  Techniques

Linear Support Vector Regression

• The thinner the "tube", the more complex the model.

(Figure: three Expenditures-vs-Age fits: the "Lazy case" (underfitting) has the biggest tube area, the "Compromise case", SVR (good generalizability), a middle-sized area, and the "Suspiciously smart case" (overfitting) a small area; the "support vectors" are marked.)

Page 47: SVM   and SVR as   Convex Optimization  Techniques

Non-linear Support Vector Regression

• Map the data into a higher-dimensional space:

(Figure: Expenditures vs. Age scatter.)

Page 48: SVM   and SVR as   Convex Optimization  Techniques

Non-linear Support Vector Regression

• Map the data into a higher-dimensional space:

(Figure: Expenditures vs. Age scatter with a non-linear fit.)

Page 49: SVM   and SVR as   Convex Optimization  Techniques

Non-linear Support Vector Regression

• Finding the value of a new point:

(Figure: Expenditures vs. Age.)

Page 50: SVM   and SVR as   Convex Optimization  Techniques

Linear SVR: Derivation

• Given training data {(xᵢ, yᵢ)}, i = 1, …, l

• Find w and b such that f(x) = wᵀx + b optimally describes the data:   (1)

(Figure: Expenditures vs. Age.)

Page 51: SVM   and SVR as   Convex Optimization  Techniques

First Formulation

(2)

Page 52: SVM   and SVR as   Convex Optimization  Techniques
Page 53: SVM   and SVR as   Convex Optimization  Techniques

Regularized Error Function

In linear regression, we minimize the error function:

½ Σᵢ (f(xᵢ) − yᵢ)² + (λ/2) ‖w‖²   (sum over i = 1, …, l)

Replace the quadratic error function by the ε-insensitive error function:

C Σᵢ E_ε(f(xᵢ) − yᵢ) + ½ ‖w‖²

An example of an ε-insensitive error function:

E_ε(z) = 0 if |z| ≤ ε, and |z| − ε otherwise.
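A small added sketch (not on the slide) of this ε-insensitive error function next to the quadratic one; numpy and the sample residuals are assumptions.

import numpy as np

def eps_insensitive(z, eps=0.15):
    """E_eps(z): zero inside the tube |z| <= eps, linear (|z| - eps) outside it."""
    return np.maximum(np.abs(z) - eps, 0.0)

def quadratic(z):
    """Ordinary squared error used in regularized linear regression."""
    return 0.5 * z ** 2

residuals = np.array([-0.5, -0.1, 0.0, 0.1, 0.5])
print(eps_insensitive(residuals))  # [0.35 0.   0.   0.   0.35]
print(quadratic(residuals))        # [0.125 0.005 0.    0.005 0.125]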

Page 54: SVM   and SVR as   Convex Optimization  Techniques

Linear SVR: Derivation

Meaning of equation 3

Page 55: SVM   and SVR as   Convex Optimization  Techniques

Linear SVR: Derivation

Trade-off: "tube" complexity vs. sum of errors (Case I and Case II).

Page 56: SVM   and SVR as   Convex Optimization  Techniques

Linear SVR: Derivation

• The role of C: it balances "tube" complexity against the sum of errors (Case I and Case II).

(Figure: two fits of the same data, one with C small and one with C big.)

Page 57: SVM   and SVR as   Convex Optimization  Techniques

Linear SVR: Derivation

• Subject to:

Page 58: SVM   and SVR as   Convex Optimization  Techniques

Review of Convex Optimization

Page 59: SVM   and SVR as   Convex Optimization  Techniques

Review of Convex Optimization

Page 60: SVM   and SVR as   Convex Optimization  Techniques

Review of Convex Optimization


Page 61: SVM   and SVR as   Convex Optimization  Techniques

Review of Convex Optimization: Weak Duality

Page 62: SVM   and SVR as   Convex Optimization  Techniques

Review of Convex Optimization: Strong Duality

Page 63: SVM   and SVR as   Convex Optimization  Techniques

Review of Convex Optimization

Page 64: SVM   and SVR as   Convex Optimization  Techniques

Review of Convex Optimization

Page 65: SVM   and SVR as   Convex Optimization  Techniques

Review of Convex Optimization

Page 66: SVM   and SVR as   Convex Optimization  Techniques

Review of Convex Optimization
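The duality review slides above are figure-only in this transcript. As an added numerical sketch (an assumption, not the slides' own example), the tiny convex QP below shows weak and strong duality at work using cvxpy; the problem and solver are illustrative choices.

import cvxpy as cp

# Primal: minimize (1/2) x^2  subject to  x >= 1   (a 1-D convex QP).
x = cp.Variable()
constraint = [x >= 1]
primal = cp.Problem(cp.Minimize(0.5 * cp.square(x)), constraint)
primal.solve()

lam = constraint[0].dual_value          # Lagrange multiplier recovered by the solver
dual_value = -0.5 * lam**2 + lam        # g(lambda) = min_x {1/2 x^2 - lambda (x - 1)}

# Strong duality: for this convex problem the two optimal values coincide;
# weak duality guarantees dual_value <= primal.value in any case.
print("primal optimum:", primal.value)  # 0.5
print("dual optimum:  ", dual_value)    # 0.5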

Page 67: SVM   and SVR as   Convex Optimization  Techniques

Support Vector Regression: Lagrangian

Minimize:

L = C Σₙ (ξₙ + ξₙ*) + ½ ‖w‖²
    − Σₙ (μₙ ξₙ + μₙ* ξₙ*)
    − Σₙ αₙ (ε + ξₙ − yₙ + ⟨w, xₙ⟩ + b)
    − Σₙ αₙ* (ε + ξₙ* + yₙ − ⟨w, xₙ⟩ − b)

(all sums over n = 1, …, l)

Setting the derivatives to zero:

∂L/∂w = 0  ⇒  w = Σₙ (αₙ − αₙ*) xₙ
∂L/∂b = 0  ⇒  Σₙ (αₙ − αₙ*) = 0
∂L/∂ξₙ = 0  ⇒  C − αₙ − μₙ = 0
∂L/∂ξₙ* = 0  ⇒  C − αₙ* − μₙ* = 0

so that

f(x) = ⟨w, x⟩ = Σₙ (αₙ − αₙ*) ⟨xₙ, x⟩

Dual variables: αₙ, αₙ*, μₙ, μₙ* ≥ 0.

Page 68: SVM   and SVR as   Convex Optimization  Techniques

Dual Form of Lagrangian

Maximize:

W(α, α*) = −½ Σₙ Σₘ (αₙ − αₙ*)(αₘ − αₘ*) ⟨xₙ, xₘ⟩ − ε Σₙ (αₙ + αₙ*) + Σₙ yₙ (αₙ − αₙ*)

subject to

0 ≤ αₙ ≤ C
0 ≤ αₙ* ≤ C
Σₙ (αₙ − αₙ*) = 0

Prediction can be made using:

f(x) = Σₙ (αₙ − αₙ*) ⟨xₙ, x⟩ + b

???
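As an added illustration (not part of the slides), the prediction formula above can be cross-checked against scikit-learn's SVR with a linear kernel, whose dual_coef_ attribute stores αₙ − αₙ* for the support vectors; the toy data below are an assumption.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=(60, 1))
y = 2.5 + 0.3 * X.ravel() + rng.normal(scale=0.1, size=60)

model = SVR(kernel="linear", C=1.0, epsilon=0.15).fit(X, y)

# Rebuild f(x) = sum_n (alpha_n - alpha_n*) <x_n, x> + b from the dual solution.
x_new = np.array([[2.0]])
coef = model.dual_coef_.ravel()            # (alpha_n - alpha_n*) for support vectors
sv = model.support_vectors_
f_manual = coef @ (sv @ x_new.T).ravel() + model.intercept_[0]

print(f_manual, model.predict(x_new)[0])   # the two values should agree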

Page 69: SVM   and SVR as   Convex Optimization  Techniques

How to Determine b?

The Karush-Kuhn-Tucker (KKT) conditions imply (at the optimal solution):

αₙ (ε + ξₙ − yₙ + ⟨w, xₙ⟩ + b) = 0
αₙ* (ε + ξₙ* + yₙ − ⟨w, xₙ⟩ − b) = 0
(C − αₙ) ξₙ = 0
(C − αₙ*) ξₙ* = 0

Support vectors are points that lie on the boundary of or outside the tube.

These equations imply many important things.
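One such implication, sketched below as an addition (not from the slides): for any point with 0 < αₙ < C the conditions force ξₙ = 0, so b follows from ε − yₙ + ⟨w, xₙ⟩ + b = 0. The helper is hypothetical and assumes a dual solution (alpha, alpha_star) and weight vector w are already available, with at least one such margin support vector present.

import numpy as np

def intercept_from_kkt(X, y, w, alpha, C, eps):
    """Recover b from a margin support vector (0 < alpha_n < C implies xi_n = 0)."""
    n = np.argmax((alpha > 1e-8) & (alpha < C - 1e-8))  # first such index
    # KKT: eps + 0 - y_n + <w, x_n> + b = 0  =>  b = y_n - <w, x_n> - eps
    return y[n] - X[n] @ w - eps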

Page 70: SVM   and SVR as   Convex Optimization  Techniques

Important Interpretations

αᵢ αᵢ* = 0, i.e. αᵢ and αᵢ* can never both be nonzero (why??)

If αₙ ∈ (0, C), then ξₙ = 0 and yₙ − ⟨w, xₙ⟩ − b = ε;
if αₙ* ∈ (0, C), then ξₙ* = 0 and ⟨w, xₙ⟩ + b − yₙ = ε.

If 0 < αᵢ then αᵢ* = 0, and if 0 < αᵢ* then αᵢ = 0.

Page 71: SVM   and SVR as   Convex Optimization  Techniques

Support Vectors: The Sparsity of the SV Expansion

αᵢ = 0 whenever ε − yᵢ + f(xᵢ) > 0, and αᵢ* = 0 whenever ε + yᵢ − f(xᵢ) > 0.

Hence αᵢ and αᵢ* vanish for every point strictly inside the ε-tube; only the points on or outside the tube (the support vectors) appear in the expansion of w.

Page 72: SVM   and SVR as   Convex Optimization  Techniques

Dual Form of Lagrangian (Nonlinear Case)

Maximize:

W(α, α*) = −½ Σₙ Σₘ (αₙ − αₙ*)(αₘ − αₘ*) k(xₙ, xₘ) − ε Σₙ (αₙ + αₙ*) + Σₙ yₙ (αₙ − αₙ*)

subject to

0 ≤ αₙ ≤ C
0 ≤ αₙ* ≤ C
Σᵢ (αᵢ − αᵢ*) = 0

Prediction can be made using:

f(x) = Σₙ (αₙ − αₙ*) k(xₙ, x) + b
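A brief added sketch (not from the slides): an RBF-kernel SVR in scikit-learn computes exactly this kind of kernel expansion over its support vectors; the data and parameter values are assumptions.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 2 * np.pi, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# RBF kernel k(x, z) = exp(-gamma ||x - z||^2); the prediction is
# f(x) = sum_n (alpha_n - alpha_n*) k(x_n, x) + b over the support vectors.
model = SVR(kernel="rbf", C=1.0, epsilon=0.15, gamma=1.0).fit(X, y)
print("support vectors:", len(model.support_), "of", len(X))
print("fit quality (R^2):", round(model.score(X, y), 3))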

Page 73: SVM   and SVR as   Convex Optimization  Techniques

Non-linear SVR: Derivation

Subject to:

Page 74: SVM   and SVR as   Convex Optimization  Techniques

Non-linear SVR: Derivation

Subject to:

The saddle point of L has to be found:

minimize with respect to the primal variables (w, b, ξₙ, ξₙ*),

maximize with respect to the dual variables (αₙ, αₙ*, μₙ, μₙ*).

Page 75: SVM   and SVR as   Convex Optimization  Techniques

Non-linear SVR: Derivation

...

Page 76: SVM   and SVR as   Convex Optimization  Techniques

Strengths and Weaknesses of SVR

• Strengths of SVR:
– No local minima
– It scales relatively well to high-dimensional data
– The trade-off between model complexity and error can be controlled explicitly via C and epsilon
– Overfitting is avoided (for any fixed C and epsilon)
– Robustness of the results
– The "curse of dimensionality" is avoided
– "Huber (1964) demonstrated that the best cost function over the worst model over any pdf of y given x is the linear cost function. Therefore, if the pdf p(y|x) is unknown the best cost function is the linear penalization over the errors" (Perez-Cruz et al., 2003)

• Weaknesses of SVR:
– What is the best trade-off parameter C and the best epsilon?
– What is a good transformation of the original space?

Page 77: SVM   and SVR as   Convex Optimization  Techniques

Experiments and Results

• The vacation problem (again)
• Given training data of input-output pairs, where the output is "Expenditures" and the inputs are "Age", "Duration" of holiday, "Income group", "Number of children", etc.
• Predict the expenditures of a new observation on the basis of its inputs.
• The training set consists of 600 observations, and the test set of 108 observations.

Page 78: SVM   and SVR as   Convex Optimization  Techniques

Experiments and Results

• The SVR function:

Subject to:

• To find the unknown parameters of the SVR function, solve the (dual) optimization problem given earlier.

• How to choose the kernel k, C, epsilon and the kernel parameter?

k = RBF kernel: find C, epsilon and gamma from a cross-validation procedure.

Page 79: SVM   and SVR as   Convex Optimization  Techniques

Experiments and Results

• Do 5-fold cross-validation to find C and gamma for several fixed values of epsilon.

(Figure: cross-validation MSE plotted over C and gamma for epsilon = 0.15; CV_MSE values range from about 0.059 to 0.061.)
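As an added sketch (not from the slides), such a search could be run with scikit-learn's GridSearchCV; the parameter grids, the placeholder data and the scoring choice below are assumptions.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.uniform(size=(600, 12))                    # 12 input variables, as in the slides
y = X[:, 0] * 2 + rng.normal(scale=0.2, size=600)  # placeholder target

# 5-fold CV over C and gamma, with epsilon held fixed (here 0.15).
grid = GridSearchCV(
    SVR(kernel="rbf", epsilon=0.15),
    param_grid={"C": [1, 5, 10, 15], "gamma": [0.005, 0.01, 0.02]},
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, -grid.best_score_)  # best (C, gamma) and its CV MSE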

Page 80: SVM   and SVR as   Convex Optimization  Techniques

Experiments and Results

• The effect of changes in epsilon: as it increases, the functional relationship gets flatter in the higher-dimensional space, but also in the original space.

(Figure: Expenditure vs. observation number on the training set, with the epsilon-insensitive tube drawn for epsilon = 0.45 and for epsilon = 0.15.)

Page 81: SVM   and SVR as   Convex Optimization  Techniques

Experiments and Results

• Performance on the test set

(Figure: Expenditures vs. observation number on the holiday test set; the SVR solution with epsilon = 0.15 achieves MSE = 0.04, while the OLS solution achieves MSE = 0.23.)
