
Page 1: SVM   and SVR as   Convex Optimization  Techniques

SVM and SVR as Convex Optimization Techniques

Mohammed Nasser
Department of Statistics
Rajshahi University, Rajshahi 6205

Page 2: SVM   and SVR as   Convex Optimization  Techniques

Acknowledgement

Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University

Kenji Fukumizu, Institute of Statistical Mathematics, ROIS; Department of Statistical Science, Graduate University for Advanced Studies

Georgi Nalbantov, Econometric Institute, School of Economics, Erasmus University Rotterdam

Page 3: SVM   and SVR as   Convex Optimization  Techniques

Contents

Glimpses of Historical Development

Optimal Separating Hyperplane

Soft Margin Support Vector Machine

Support Vector Regression

Convex Optimization

Use of Lagrange and Duality Theory

Example

Conclusion

Page 4: SVM   and SVR as   Convex Optimization  Techniques

Early History

In 1900 Karl Pearson published his famous article on goodness of fit, later judged one of the twelve most important scientific articles of the twentieth century.

In 1902 Jacques Hadamard pointed out that mathematical models of physical phenomena should have the following properties:

A solution exists.
The solution is unique.
The solution depends continuously on the data, in some reasonable topology.

(A well-posed problem.)

Page 5: SVM   and SVR as   Convex Optimization  Techniques

Early History

In 1940 Fréchet, a PhD student of Hadamard, strongly criticized the mean and standard deviation as measures of location and scale respectively, but he expressed his belief in the further development of statistics without proposing any alternative.

During the sixties and seventies Tukey, Huber and Hampel tried to develop Robust Statistics in order to remove the ill-posedness of classical statistics.

Robustness means insensitivity to minor changes in both model and sample, high tolerance to major changes, and good performance at the assumed model.

The onslaught of Data Mining and the problems of non-linearity and non-vectorial data have made robust statistics somewhat less attractive.

Let us see what kernel methods (KM) offer…

Page 6: SVM   and SVR as   Convex Optimization  Techniques

Recent History

Support Vector Machines (SVM), introduced at COLT-92 (Conference on Learning Theory), have been greatly developed since then.

Result: a class of algorithms for pattern recognition (kernel machines).
Now: a large and diverse community, drawn from machine learning, optimization, statistics, neural networks, functional analysis, etc.
Centralized website: www.kernel-machines.org
First textbook (2000): see www.support-vector.net
Now (2012): at least twenty books of different tastes are available in the international market.
The book "The Elements of Statistical Learning" (2001) by Hastie, Tibshirani and Friedman went into a second edition within seven years.

Page 7: SVM   and SVR as   Convex Optimization  Techniques

Kernel Methods: Heuristic View

What is the common characteristic (structure) among the following statistical methods?

1. Principal components analysis
2. (Ridge) regression
3. Fisher discriminant analysis
4. Canonical correlation analysis
5. Singular value decomposition
6. Independent component analysis

We consider linear combinations of the input vector: f(x) = wᵀx.

We make use of the concepts of length and dot product available in Euclidean space.

Page 8: SVM   and SVR as   Convex Optimization  Techniques

Kernel Methods: Heuristic View

• Linear learning typically has nice properties
– Unique optimal solutions, fast learning algorithms
– Better statistical analysis

• But one big problem
– Insufficient capacity
That means, in many data sets it fails to detect nonlinear relationships among the variables.

• The other demerit
– Cannot handle non-vectorial data

Page 9: SVM   and SVR as   Convex Optimization  Techniques

Kernel Methods

Classical method → Kernel version
PCA → KPCA
CCA → KCCA
FLDA → KFLDA
ICA → KICA
Regression → SVR
Classification → SVM

More:
Test of independence
Test of equality of distributions
Outlier detection
Data depth function

Page 10: SVM   and SVR as   Convex Optimization  Techniques

Kernel Methods: Heuristic View

In classical multivariate analysis we consider linear combinations of the input vector:

f(x) = wᵀx

We make use of the concepts of length and dot product/inner product available in Euclidean/non-Euclidean space.

In modern multivariate analysis we consider linear combinations of the feature vector:

f(x) = wᵀΦ(x) = Σᵢ αᵢ ⟨Φ(xᵢ), Φ(x)⟩ = Σᵢ αᵢ k(xᵢ, x)   (sum over i = 1, …, n)

We make use of the concepts of length and dot product available in the feature space, through the kernel k.
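As a small added illustration (not on the slide), the snippet below checks numerically that an explicit feature map Φ and its kernel k give the same inner products, using the degree-2 homogeneous polynomial kernel k(x, z) = (xᵀz)² as an assumed example.

import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel on R^2."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def k(x, z):
    """Kernel form: k(x, z) = (x . z)^2, computed without building features."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Both routes give the same inner product, so any linear method that only
# needs dot products can work implicitly in the feature space.
print(np.dot(phi(x), phi(z)))  # 1.0
print(k(x, z))                 # 1.0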

Page 11: SVM   and SVR as   Convex Optimization  Techniques

Some Review of College Geometry

(Figure: the line y + x − 1 = 0 splits the plane into the regions y + x − 1 > 0 and y + x − 1 < 0; the vector (1,1) is normal, at 90°, to the line. Multiplying through by k, ky + kx − k = 0, gives the same line but has a different effect on the two signed regions.)

Page 12: SVM   and SVR as   Convex Optimization  Techniques

Some Review of College Geometry: In General Form

(Figure: the hyperplane wᵀx + b = 0 splits the space into the regions wᵀx + b > 0 and wᵀx + b < 0; the vector w is normal, at 90°, to the hyperplane. Scaling, kwᵀx + kb = 0, gives the same hyperplane but has a different effect on the two signed regions.)

Page 13: SVM   and SVR as   Convex Optimization  Techniques

Some Review of College Geometry: In General Form

(Figure: the hyperplane wᵀx + b = 0 with normal vector w; the panels show the effect of a change in w and the effect of a change in b.)

Page 14: SVM   and SVR as   Convex Optimization  Techniques

Linear Kernel

k(x, z) = ⟨x, z⟩. Its RKHS is the space of linear functions f(x) = ⟨w, x⟩, and it can be shown that ‖f‖ = ‖w‖.

Page 15: SVM   and SVR as   Convex Optimization  Techniques

Linearly Separable Classes

Page 16: SVM   and SVR as   Convex Optimization  Techniques

Linear Classifiers

f(x, w, b) = sign(wᵀx + b)

(Figure: input x is fed to f, which outputs the estimated class y_est; one marker denotes +1, the other −1. The line wᵀx + b = 0 separates the regions wᵀx + b > 0 and wᵀx + b < 0.)

How would you classify this data?

Page 17: SVM and SVR as Convex Optimization Techniques

Linear Classifiers

f(x, w, b) = sign(wᵀx + b)

How would you classify this data?

Page 18: SVM and SVR as Convex Optimization Techniques

Linear Classifiers

f(x, w, b) = sign(wᵀx + b)

How would you classify this data?

Page 19: SVM and SVR as Convex Optimization Techniques

Linear Classifiers

f(x, w, b) = sign(wᵀx + b)

Any of these would be fine..

..but which is the best?

Page 20: SVM and SVR as Convex Optimization Techniques

Linear Classifiers

f(x, w, b) = sign(wᵀx + b)

How would you classify this data?

Misclassified to +1 class

Page 21: SVM   and SVR as   Convex Optimization  Techniques

Classifier Margin

f(x, w, b) = sign(wᵀx + b)

Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a datapoint.

Page 22: SVM   and SVR as   Convex Optimization  Techniques

Geometric margin versus functional margin

Pages 23–36: (figure-only slides; no text captured.)
Page 37: SVM   and SVR as   Convex Optimization  Techniques

Maximum Margin

f(x, w, b) = sign(wᵀx + b)

The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM: Linear SVM).

Support vectors are those datapoints that the margin pushes up against.

1. Maximizing the margin is good according to intuition and PAC theory.

2. It implies that only support vectors are important; other training examples are ignorable.

3. Empirically it works very, very well.

Page 38: SVM   and SVR as   Convex Optimization  Techniques

Linear SVM Mathematically

Our goal:

1) Correctly classify all training data:
wᵀxᵢ + b ≥ +1 if yᵢ = +1
wᵀxᵢ + b ≤ −1 if yᵢ = −1
i.e. yᵢ(wᵀxᵢ + b) ≥ 1 for all i

2) Maximize the margin M = 2/‖w‖, which is the same as minimizing ½ wᵀw.
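The step from "maximize the margin" to "minimize ½ wᵀw" deserves one line of justification; the short derivation below is an addition (not on the slide), written in standard notation.

% Distance from the hyperplane w^T x + b = 0 to each of the hyperplanes w^T x + b = +1 and -1 is 1/||w||.
\[
  \text{margin} \;=\; \frac{1}{\lVert w\rVert} + \frac{1}{\lVert w\rVert} \;=\; \frac{2}{\lVert w\rVert},
  \qquad
  \max_{w,b}\ \frac{2}{\lVert w\rVert}
  \;\Longleftrightarrow\;
  \min_{w,b}\ \tfrac12\, w^{\mathsf T} w
  \;\;\text{s.t.}\;\; y_i\,(w^{\mathsf T} x_i + b) \ge 1 .
\]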

Page 39: SVM   and SVR as   Convex Optimization  Techniques

Linear SVM Mathematically

Minimize Φ(w) = ½ wᵀw

subject to yᵢ(wᵀxᵢ + b) ≥ 1 for all i

We can formulate this as a Quadratic Optimization Problem and solve for w and b:

a strictly convex quadratic objective with linear inequality constraints.
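As an added sketch (not from the slides) of this quadratic program in code, the snippet below hands the hard-margin problem to a generic convex solver; cvxpy and the toy data are assumptions.

import cvxpy as cp
import numpy as np

# Toy linearly separable data: two clusters in R^2.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()

# Minimize (1/2) w^T w  subject to  y_i (w^T x_i + b) >= 1 for all i.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
prob = cp.Problem(objective, constraints)
prob.solve()

print("w =", w.value, "b =", b.value, "margin =", 2 / np.linalg.norm(w.value))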

Page 40: SVM   and SVR as   Convex Optimization  Techniques
Page 41: SVM   and SVR as   Convex Optimization  Techniques

Soft Margin Classification

Slack variables ξᵢ can be added to allow misclassification of difficult or noisy examples.

(Figure: hyperplanes wᵀx + b = 1, wᵀx + b = 0 and wᵀx + b = −1, with slack variables such as ξ₂, ξ₇, ξ₁₁ for points on the wrong side.)

What should our quadratic optimization criterion be?

Minimize ½ wᵀw + C Σₖ ξₖ   (sum over k = 1, …, R)

Page 42: SVM   and SVR as   Convex Optimization  Techniques

Hard Margin vs. Soft Margin

The old formulation:
Find w and b such that Φ(w) = ½ wᵀw is minimized, and for all {(xᵢ, yᵢ)}: yᵢ(wᵀxᵢ + b) ≥ 1.

The new formulation incorporating slack variables:
Find w and b such that Φ(w) = ½ wᵀw + C Σ ξᵢ is minimized, and for all {(xᵢ, yᵢ)}: yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0 for all i.

Parameter C can be viewed as a way to control overfitting.
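A minimal added sketch (not from the slides) of the soft-margin formulation in practice: scikit-learn's LinearSVC exposes the same C parameter, and varying it shows the trade-off described above; the data and C values are assumptions.

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian clusters, so some slack is unavoidable.
X = np.vstack([rng.normal(loc=1.0, size=(50, 2)),
               rng.normal(loc=-1.0, size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

# Small C tolerates more slack (wider margin); large C penalizes slack heavily.
for C in (0.01, 1.0, 100.0):
    clf = LinearSVC(C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_)
    print(f"C={C:7.2f}  margin width={margin:.3f}  train accuracy={clf.score(X, y):.2f}")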

Page 43: SVM   and SVR as   Convex Optimization  Techniques

Linear Support Vector Regression

• Marketing problem

Given variables:
– person's age
– income group
– season
– holiday duration
– location
– number of children
– etc. (12 variables)

Predict: the level of holiday Expenditures.

(Figure: scatter plot of Expenditures against Age.)

Page 44: SVM   and SVR as   Convex Optimization  Techniques

Linear Support Vector Regression

(Figure: three Expenditures-vs-Age fits:
"Suspiciously smart case" (overfitting),
"Lazy case" (underfitting),
"Compromise case", SVR (good generalizability).)

Page 45: SVM   and SVR as   Convex Optimization  Techniques

Linear Support Vector Regression

• The epsilon-insensitive loss function

(Figure: penalty plotted against error; the penalty is zero for errors inside the ε-tube and grows linearly, at 45°, outside it.)

Page 46: SVM   and SVR as   Convex Optimization  Techniques

Linear Support Vector Regression

• The thinner the "tube", the more complex the model.

(Figure: three Expenditures-vs-Age fits: the "Lazy case" (underfitting) has the biggest tube area, the "Compromise case", SVR (good generalizability), a middle-sized area, and the "Suspiciously smart case" (overfitting) a small area; the "support vectors" are marked.)

Page 47: SVM   and SVR as   Convex Optimization  Techniques

Non-linear Support Vector Regression

• Map the data into a higher-dimensional space:

(Figure: Expenditures vs. Age scatter.)

Page 48: SVM   and SVR as   Convex Optimization  Techniques

Non-linear Support Vector Regression

• Map the data into a higher-dimensional space:

(Figure: Expenditures vs. Age scatter with a non-linear fit.)

Page 49: SVM   and SVR as   Convex Optimization  Techniques

Non-linear Support Vector Regression

• Finding the value of a new point:

(Figure: Expenditures vs. Age.)

Page 50: SVM   and SVR as   Convex Optimization  Techniques

Linear SVR: Derivation

• Given training data {(xᵢ, yᵢ)}, i = 1, …, l

• Find w and b such that f(x) = wᵀx + b optimally describes the data:   (1)

(Figure: Expenditures vs. Age.)

Page 51: SVM   and SVR as   Convex Optimization  Techniques

First Formulation

(2)

Page 52: SVM   and SVR as   Convex Optimization  Techniques
Page 53: SVM   and SVR as   Convex Optimization  Techniques

Regularized Error Function

In linear regression, we minimize the error function:

½ Σᵢ (f(xᵢ) − yᵢ)² + (λ/2) ‖w‖²   (sum over i = 1, …, l)

Replace the quadratic error function by the ε-insensitive error function:

C Σᵢ E_ε(f(xᵢ) − yᵢ) + ½ ‖w‖²

An example of an ε-insensitive error function:

E_ε(z) = 0 if |z| ≤ ε, and |z| − ε otherwise.
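A small added sketch (not on the slide) of this ε-insensitive error function next to the quadratic one; numpy and the sample residuals are assumptions.

import numpy as np

def eps_insensitive(z, eps=0.15):
    """E_eps(z): zero inside the tube |z| <= eps, linear (|z| - eps) outside it."""
    return np.maximum(np.abs(z) - eps, 0.0)

def quadratic(z):
    """Ordinary squared error used in regularized linear regression."""
    return 0.5 * z ** 2

residuals = np.array([-0.5, -0.1, 0.0, 0.1, 0.5])
print(eps_insensitive(residuals))  # [0.35 0.   0.   0.   0.35]
print(quadratic(residuals))        # [0.125 0.005 0.    0.005 0.125]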

Page 54: SVM   and SVR as   Convex Optimization  Techniques

Linear SVR: Derivation

Meaning of equation 3

Page 55: SVM   and SVR as   Convex Optimization  Techniques

Linear SVR: Derivation

Trade-off: "tube" complexity vs. sum of errors (Case I and Case II).

Page 56: SVM   and SVR as   Convex Optimization  Techniques

Linear SVR: Derivation

• The role of C: it balances "tube" complexity against the sum of errors (Case I and Case II).

(Figure: two fits of the same data, one with C small and one with C big.)

Page 57: SVM   and SVR as   Convex Optimization  Techniques

Linear SVR: Derivation

• Subject to:

Page 58: SVM   and SVR as   Convex Optimization  Techniques

Review of Convex Optimization

Page 59: SVM   and SVR as   Convex Optimization  Techniques

Review of Convex Optimization

Page 60: SVM   and SVR as   Convex Optimization  Techniques

Review of Convex Optimization


Page 61: SVM   and SVR as   Convex Optimization  Techniques

Review of Convex Optimization: Weak Duality

Page 62: SVM   and SVR as   Convex Optimization  Techniques

Review of Convex Optimization: Strong Duality

Page 63: SVM   and SVR as   Convex Optimization  Techniques

Review of Convex Optimization

Page 64: SVM   and SVR as   Convex Optimization  Techniques

Review of Convex Optimization

Page 65: SVM   and SVR as   Convex Optimization  Techniques

Review of Convex Optimization

Page 66: SVM   and SVR as   Convex Optimization  Techniques

Review of Convex Optimization
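The duality review slides above are figure-only in this transcript. As an added numerical sketch (an assumption, not the slides' own example), the tiny convex QP below shows weak and strong duality at work using cvxpy; the problem and solver are illustrative choices.

import cvxpy as cp

# Primal: minimize (1/2) x^2  subject to  x >= 1   (a 1-D convex QP).
x = cp.Variable()
constraint = [x >= 1]
primal = cp.Problem(cp.Minimize(0.5 * cp.square(x)), constraint)
primal.solve()

lam = constraint[0].dual_value          # Lagrange multiplier recovered by the solver
dual_value = -0.5 * lam**2 + lam        # g(lambda) = min_x {1/2 x^2 - lambda (x - 1)}

# Strong duality: for this convex problem the two optimal values coincide;
# weak duality guarantees dual_value <= primal.value in any case.
print("primal optimum:", primal.value)  # 0.5
print("dual optimum:  ", dual_value)    # 0.5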

Page 67: SVM   and SVR as   Convex Optimization  Techniques

Support Vector Regression: Lagrangian

Minimize:

L = C Σₙ (ξₙ + ξₙ*) + ½ ‖w‖²
    − Σₙ (μₙ ξₙ + μₙ* ξₙ*)
    − Σₙ αₙ (ε + ξₙ − yₙ + ⟨w, xₙ⟩ + b)
    − Σₙ αₙ* (ε + ξₙ* + yₙ − ⟨w, xₙ⟩ − b)

(all sums over n = 1, …, l)

Setting the derivatives to zero:

∂L/∂w = 0  ⇒  w = Σₙ (αₙ − αₙ*) xₙ
∂L/∂b = 0  ⇒  Σₙ (αₙ − αₙ*) = 0
∂L/∂ξₙ = 0  ⇒  C − αₙ − μₙ = 0
∂L/∂ξₙ* = 0  ⇒  C − αₙ* − μₙ* = 0

so that

f(x) = ⟨w, x⟩ = Σₙ (αₙ − αₙ*) ⟨xₙ, x⟩

Dual variables: αₙ, αₙ*, μₙ, μₙ* ≥ 0.

Page 68: SVM   and SVR as   Convex Optimization  Techniques

Dual Form of Lagrangian

Maximize:

W(α, α*) = −½ Σₙ Σₘ (αₙ − αₙ*)(αₘ − αₘ*) ⟨xₙ, xₘ⟩ − ε Σₙ (αₙ + αₙ*) + Σₙ yₙ (αₙ − αₙ*)

subject to

0 ≤ αₙ ≤ C
0 ≤ αₙ* ≤ C
Σₙ (αₙ − αₙ*) = 0

Prediction can be made using:

f(x) = Σₙ (αₙ − αₙ*) ⟨xₙ, x⟩ + b

???
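As an added illustration (not part of the slides), the prediction formula above can be cross-checked against scikit-learn's SVR with a linear kernel, whose dual_coef_ attribute stores αₙ − αₙ* for the support vectors; the toy data below are an assumption.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=(60, 1))
y = 2.5 + 0.3 * X.ravel() + rng.normal(scale=0.1, size=60)

model = SVR(kernel="linear", C=1.0, epsilon=0.15).fit(X, y)

# Rebuild f(x) = sum_n (alpha_n - alpha_n*) <x_n, x> + b from the dual solution.
x_new = np.array([[2.0]])
coef = model.dual_coef_.ravel()            # (alpha_n - alpha_n*) for support vectors
sv = model.support_vectors_
f_manual = coef @ (sv @ x_new.T).ravel() + model.intercept_[0]

print(f_manual, model.predict(x_new)[0])   # the two values should agree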

Page 69: SVM   and SVR as   Convex Optimization  Techniques

How to Determine b?

The Karush-Kuhn-Tucker (KKT) conditions imply (at the optimal solution):

αₙ (ε + ξₙ − yₙ + ⟨w, xₙ⟩ + b) = 0
αₙ* (ε + ξₙ* + yₙ − ⟨w, xₙ⟩ − b) = 0
(C − αₙ) ξₙ = 0
(C − αₙ*) ξₙ* = 0

Support vectors are points that lie on the boundary of or outside the tube.

These equations imply many important things.
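One such implication, sketched below as an addition (not from the slides): for any point with 0 < αₙ < C the conditions force ξₙ = 0, so b follows from ε − yₙ + ⟨w, xₙ⟩ + b = 0. The helper is hypothetical and assumes a dual solution (alpha, alpha_star) and weight vector w are already available, with at least one such margin support vector present.

import numpy as np

def intercept_from_kkt(X, y, w, alpha, C, eps):
    """Recover b from a margin support vector (0 < alpha_n < C implies xi_n = 0)."""
    n = np.argmax((alpha > 1e-8) & (alpha < C - 1e-8))  # first such index
    # KKT: eps + 0 - y_n + <w, x_n> + b = 0  =>  b = y_n - <w, x_n> - eps
    return y[n] - X[n] @ w - eps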

Page 70: SVM   and SVR as   Convex Optimization  Techniques

Important Interpretations

αᵢ αᵢ* = 0, i.e. αᵢ and αᵢ* can never both be nonzero (why??)

If αₙ ∈ (0, C), then ξₙ = 0 and yₙ − ⟨w, xₙ⟩ − b = ε;
if αₙ* ∈ (0, C), then ξₙ* = 0 and ⟨w, xₙ⟩ + b − yₙ = ε.

If 0 < αᵢ then αᵢ* = 0, and if 0 < αᵢ* then αᵢ = 0.

Page 71: SVM   and SVR as   Convex Optimization  Techniques

Support Vectors: The Sparsity of the SV Expansion

αᵢ = 0 whenever ε − yᵢ + f(xᵢ) > 0, and αᵢ* = 0 whenever ε + yᵢ − f(xᵢ) > 0.

Hence αᵢ and αᵢ* vanish for every point strictly inside the ε-tube; only the points on or outside the tube (the support vectors) appear in the expansion of w.

Page 72: SVM   and SVR as   Convex Optimization  Techniques

Dual Form of Lagrangian (Nonlinear Case)

Maximize:

W(α, α*) = −½ Σₙ Σₘ (αₙ − αₙ*)(αₘ − αₘ*) k(xₙ, xₘ) − ε Σₙ (αₙ + αₙ*) + Σₙ yₙ (αₙ − αₙ*)

subject to

0 ≤ αₙ ≤ C
0 ≤ αₙ* ≤ C
Σᵢ (αᵢ − αᵢ*) = 0

Prediction can be made using:

f(x) = Σₙ (αₙ − αₙ*) k(xₙ, x) + b
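A brief added sketch (not from the slides): an RBF-kernel SVR in scikit-learn computes exactly this kind of kernel expansion over its support vectors; the data and parameter values are assumptions.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 2 * np.pi, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# RBF kernel k(x, z) = exp(-gamma ||x - z||^2); the prediction is
# f(x) = sum_n (alpha_n - alpha_n*) k(x_n, x) + b over the support vectors.
model = SVR(kernel="rbf", C=1.0, epsilon=0.15, gamma=1.0).fit(X, y)
print("support vectors:", len(model.support_), "of", len(X))
print("fit quality (R^2):", round(model.score(X, y), 3))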

Page 73: SVM   and SVR as   Convex Optimization  Techniques

Non-linear SVR: Derivation

Subject to:

Page 74: SVM   and SVR as   Convex Optimization  Techniques

Non-linear SVR: Derivation

Subject to:

The saddle point of L has to be found:

minimize with respect to the primal variables (w, b, ξₙ, ξₙ*),

maximize with respect to the dual variables (αₙ, αₙ*, μₙ, μₙ*).

Page 75: SVM   and SVR as   Convex Optimization  Techniques

Non-linear SVR: Derivation

...

Page 76: SVM   and SVR as   Convex Optimization  Techniques

Strengths and Weaknesses of SVR

• Strengths of SVR:
– No local minima
– It scales relatively well to high-dimensional data
– The trade-off between model complexity and error can be controlled explicitly via C and epsilon
– Overfitting is avoided (for any fixed C and epsilon)
– Robustness of the results
– The "curse of dimensionality" is avoided
– "Huber (1964) demonstrated that the best cost function over the worst model over any pdf of y given x is the linear cost function. Therefore, if the pdf p(y|x) is unknown the best cost function is the linear penalization over the errors" (Perez-Cruz et al., 2003)

• Weaknesses of SVR:
– What is the best trade-off parameter C and the best epsilon?
– What is a good transformation of the original space?

Page 77: SVM   and SVR as   Convex Optimization  Techniques

Experiments and Results

• The vacation problem (again)
• Given training data of input-output pairs, where the output is "Expenditures" and the inputs are "Age", "Duration" of holiday, "Income group", "Number of children", etc.
• Predict the expenditures of a new observation on the basis of its inputs.
• The training set consists of 600 observations, and the test set of 108 observations.

Page 78: SVM   and SVR as   Convex Optimization  Techniques

Experiments and Results

• The SVR function:

Subject to:

• To find the unknown parameters of the SVR function, solve the (dual) optimization problem given earlier.

• How to choose the kernel k, C, epsilon and the kernel parameter?

k = RBF kernel: find C, epsilon and gamma from a cross-validation procedure.

Page 79: SVM   and SVR as   Convex Optimization  Techniques

Experiments and Results

• Do 5-fold cross-validation to find C and gamma for several fixed values of epsilon.

(Figure: cross-validation MSE plotted over C and gamma for epsilon = 0.15; CV_MSE values range from about 0.059 to 0.061.)
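As an added sketch (not from the slides), such a search could be run with scikit-learn's GridSearchCV; the parameter grids, the placeholder data and the scoring choice below are assumptions.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.uniform(size=(600, 12))                    # 12 input variables, as in the slides
y = X[:, 0] * 2 + rng.normal(scale=0.2, size=600)  # placeholder target

# 5-fold CV over C and gamma, with epsilon held fixed (here 0.15).
grid = GridSearchCV(
    SVR(kernel="rbf", epsilon=0.15),
    param_grid={"C": [1, 5, 10, 15], "gamma": [0.005, 0.01, 0.02]},
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, -grid.best_score_)  # best (C, gamma) and its CV MSE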

Page 80: SVM   and SVR as   Convex Optimization  Techniques

Experiments and Results

• The effect of changes in epsilon: as it increases, the functional relationship gets flatter in the higher-dimensional space, but also in the original space.

(Figure: Expenditure vs. observation number on the training set, with the epsilon-insensitive tube drawn for epsilon = 0.45 and for epsilon = 0.15.)

Page 81: SVM   and SVR as   Convex Optimization  Techniques

Experiments and Results

• Performance on the test set

(Figure: Expenditures vs. observation number on the holiday test set; the SVR solution with epsilon = 0.15 achieves MSE = 0.04, while the OLS solution achieves MSE = 0.23.)
