Basis Expansion and Nonlinear SVM
Kai Yu
Linear Classifiers
• Helpful for learning more general cases, e.g., nonlinear models
$$f(x) = w^\top x + b, \qquad z(x) = \mathrm{sign}(f(x))$$
Nonlinear Classifiers via Basis Expansion
• Nonlinear basis functions: $h(x) = [h_1(x), h_2(x), \ldots, h_m(x)]$
• $f(x) = w^\top x + b$ is the special case where $h(x) = x$
• This framework covers many classification models, including SVMs.
$$f(x) = w^\top h(x) + b, \qquad z(x) = \mathrm{sign}(f(x))$$
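To make this concrete, here is a minimal numpy sketch (my addition, not from the lecture) showing how a fixed basis expansion $h(x) = [x, x^2]$ lets an ordinary linear training procedure separate 1-D data that no single threshold on $x$ can handle. The data and the regularized least-squares fit are illustrative choices.

```python
import numpy as np

# 1-D data that no single threshold on x can separate:
# positives in the middle, negatives at both extremes.
x = np.array([-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0])
y = np.array([-1, -1, -1, 1, 1, 1, -1, -1, -1])

# Basis expansion h(x) = [x, x^2]: a linear function of h(x)
# is a quadratic function of x, so it can carve out the middle.
H = np.column_stack([x, x ** 2])
A = np.column_stack([H, np.ones_like(x)])    # append a bias column

# Regularized least squares stands in for any linear trainer.
lam = 1e-3
w = np.linalg.solve(A.T @ A + lam * np.eye(3), A.T @ y)
print(np.sign(A @ w) == y)                   # all True after expansion
```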
Outline
§ Representation theorem
§ Kernel trick
§ Understand regularization
§ Nonlinear logistic regression
§ General basis expansion functions
§ Summary
Review the QP for linear SVMs
§ After a lot of “stuff”, we obtain the Lagrange dual (slide: Trevor Hastie, Stanford University, January 2003)
$$L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{i'=1}^{N} \alpha_i \alpha_{i'} y_i y_{i'} x_i^\top x_{i'}$$
which we maximize subject to constraints (involving B as well).
§ The solution is expressed in terms of the fitted Lagrange multipliers $\hat{\alpha}_i$:
$$w = \sum_{i=1}^{N} \hat{\alpha}_i y_i x_i$$
Some fraction of the $\hat{\alpha}_i$ are exactly zero (from the KKT conditions); the $x_i$ for which $\hat{\alpha}_i > 0$ are called the support points $S$, and
$$\hat{f}(x) = x^\top \hat{w} + \hat{b} = \sum_{i \in S} \hat{\alpha}_i y_i x^\top x_i + \hat{b}$$
§ In other words, the solution $w$ lies in $\mathrm{span}(x_1, x_2, \ldots, x_N)$.
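As a numeric sanity check (my addition, not from the slides), scikit-learn's SVC exposes exactly these quantities: dual_coef_ holds $\hat{\alpha}_i y_i$ for the support points, so $w = \sum_i \hat{\alpha}_i y_i x_i$ can be reassembled and compared with the primal coef_.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs in R^2.
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support points only, so
# w = sum_i alpha_i y_i x_i is dual_coef_ @ support_vectors_.
w = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w, clf.coef_))   # True: w lies in span(x_1, ..., x_N)
```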
A more general result – the RKHS representation theorem (Wahba, 1971)
§ In its simplest form: if $L(w^\top x, y)$ is convex w.r.t. $w$, the solution of
$$\min_w \sum_{i=1}^{N} L(w^\top x_i, y_i) + \lambda \|w\|^2$$
has the form
$$w = \sum_{i=1}^{N} \alpha_i x_i$$
§ Proof sketch: write $w = w_\parallel + w_\perp$ with $w_\perp$ orthogonal to $\mathrm{span}(x_1, \ldots, x_N)$; $w_\perp$ leaves every $w^\top x_i$ unchanged but only increases $\|w\|^2$, so the optimum has $w_\perp = 0$.
§ Note: the conclusion is general, not only for SVMs.
For general basis expansion functions
The solution of
$$\min_w \sum_{i=1}^{N} L\big(w^\top h(x_i), y_i\big) + \lambda \|w\|^2$$
has the form
$$w = \sum_{i=1}^{N} \alpha_i h(x_i)$$
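A quick way to see this theorem in action (an illustrative sketch, using squared loss so the solution is closed-form): solve the regularized problem directly in the expanded feature space, then verify that the same $w$ is recovered as a combination of the $h(x_i)$. The basis expansion used here is a hypothetical choice.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, m = 50, 3, 6
X = rng.normal(size=(N, d))
y = rng.normal(size=N)

# A hypothetical basis expansion h(x) in R^6: raw coords plus squares.
H = np.hstack([X, X ** 2])                       # rows are h(x_i)

lam = 0.5
# Primal ridge solution: w = (H^T H + lam I)^{-1} H^T y
w_primal = np.linalg.solve(H.T @ H + lam * np.eye(m), H.T @ y)

# Representer form: w = sum_i alpha_i h(x_i), where for squared loss
# alpha = (H H^T + lam I)^{-1} y in closed form.
alpha = np.linalg.solve(H @ H.T + lam * np.eye(N), y)
w_span = H.T @ alpha
print(np.allclose(w_primal, w_span))             # True
```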
Outline
§ Representation theorem
§ Kernel trick
§ Understand regularization
§ Nonlinear logistic regression
§ General basis expansion functions
§ Summary
Kernel
§ Define the Mercer kernel as
$$k(x_i, x_j) = h(x_i)^\top h(x_j)$$
Kernel trick
§ Apply the representation theorem
$$w = \sum_{i=1}^{N} \alpha_i h(x_i)$$
§ Then
$$f(x) = \sum_{i=1}^{N} \alpha_i k(x_i, x), \qquad \|w\|^2 = \sum_{i,j=1}^{N} \alpha_i \alpha_j k(x_i, x_j) = \alpha^\top K \alpha$$
§ and we have
$$\min_\alpha \sum_{i=1}^{N} L\Big(\sum_{j=1}^{N} \alpha_j k(x_j, x_i),\; y_i\Big) + \lambda\, \alpha^\top K \alpha$$
Primal and Kernel formulations
§ Primal formulation:
$$\min_w \sum_{i=1}^{N} L\big(w^\top h(x_i),\; y_i\big) + \lambda \|w\|^2$$
§ Kernel formulation, with $k(x_i, x_j) = h(x_i)^\top h(x_j)$:
$$\min_\alpha \sum_{i=1}^{N} L\Big(\sum_{j=1}^{N} \alpha_j k(x_j, x_i),\; y_i\Big) + \lambda\, \alpha^\top K \alpha$$
§ Given a kernel, we don’t even need h(x)! …really?
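Here is a small sketch of that claim, with the loss swapped for squared loss to keep a closed form (kernel ridge regression); the rbf_kernel helper and the sin target are illustrative assumptions. The RBF feature map h(x) is infinite-dimensional, yet training and prediction only ever touch kernel values.

```python
import numpy as np

def rbf_kernel(A, B, c=1.0):
    # k(x, x') = exp(-||x - x'||^2 / c); h(x) here is infinite-dimensional,
    # but we never need it -- only the kernel values.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / c)

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, (40, 1))
y = np.sin(X[:, 0])                      # a nonlinear target

lam = 1e-2
K = rbf_kernel(X, X)
# Squared loss makes the kernelized objective solvable in closed form:
# alpha = (K + lam I)^{-1} y, and f(x) = sum_i alpha_i k(x_i, x).
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

X_test = np.array([[0.5], [1.5]])
f_test = rbf_kernel(X_test, X) @ alpha
print(f_test, np.sin(X_test[:, 0]))      # predictions track sin(x)
```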
Popular kernels
§ k(x, x′) is a symmetric, positive (semi-)definite function
§ Examples (slide: Trevor Hastie, Stanford University, January 2003):
dth-degree polynomial: $K(x, x') = (1 + \langle x, x' \rangle)^d$
Radial basis: $K(x, x') = \exp(-\|x - x'\|^2 / c)$
§ Example: 2nd-degree polynomial in $\mathbb{R}^2$:
$$K(x, x') = (1 + \langle x, x' \rangle)^2 = (1 + x_1 x'_1 + x_2 x'_2)^2 = 1 + 2x_1x'_1 + 2x_2x'_2 + (x_1x'_1)^2 + (x_2x'_2)^2 + 2x_1x'_1x_2x'_2$$
Then $M = 6$, and if we choose $h_1(x) = 1$, $h_2(x) = \sqrt{2}\,x_1$, $h_3(x) = \sqrt{2}\,x_2$, $h_4(x) = x_1^2$, $h_5(x) = x_2^2$, and $h_6(x) = \sqrt{2}\,x_1x_2$, then $K(x, x') = \langle h(x), h(x') \rangle$.
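The slide's expansion is easy to verify numerically. In the sketch below, h and K are hypothetical helpers implementing the six features and the 2nd-degree polynomial kernel above.

```python
import numpy as np

def h(x):
    # The M = 6 explicit features from the slide (hypothetical helper).
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def K(x, xp):
    # 2nd-degree polynomial kernel in R^2.
    return (1.0 + x @ xp) ** 2

rng = np.random.default_rng(3)
x, xp = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(K(x, xp), h(x) @ h(xp)))   # True
```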
Non-linear feature mapping
§ Some datasets are linearly separable as given
§ But what if the dataset is just too hard?
§ How about mapping the data to a higher-dimensional space, e.g., x → (x, x²)?
[Figure: 1-D data, inseparable on the x axis, becomes linearly separable after adding the x² coordinate.]
Nonlinear feature mapping
§ General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
h: x → h(x)
Outline
§ Representation theorem
§ Kernel trick
§ Understand regularization
§ Nonlinear logistic regression
§ General basis expansion functions
§ Summary
Various equivalent formulations
§ Parametric form
$$\min_w \sum_{i=1}^{N} L\big(w^\top h(x_i),\; y_i\big) + \lambda \|w\|^2$$
§ Dual form
$$\min_\alpha \sum_{i=1}^{N} L\Big(\sum_{j=1}^{N} \alpha_j k(x_j, x_i),\; y_i\Big) + \lambda\, \alpha^\top K \alpha$$
§ Nonparametric form
$$\min_f \sum_{i=1}^{N} L\big(f(x_i),\; y_i\big) + \lambda \|f\|^2_{\mathcal{H}_k}$$
Regularization induced by the kernel (or basis functions) – telling what kind of f(x) is preferred
§ A desired kernel is a smoothing operator: smoother eigenfunctions $\phi_i$ tend to have larger eigenvalues $\gamma_i$
§ What does this mean?
Aside: RKHS (slide: Trevor Hastie, Stanford University, January 2003)
Function space $\mathcal{H}_K$ generated by a positive (semi-)definite function $K(x, x')$. Eigen expansion:
$$K(x, y) = \sum_{i=1}^{\infty} \gamma_i \phi_i(x) \phi_i(y)$$
with $\gamma_i \ge 0$, $\sum_{i=1}^{\infty} \gamma_i^2 < \infty$. $f \in \mathcal{H}_K$ if
$$f(x) = \sum_{i=1}^{\infty} c_i \phi_i(x), \qquad c_i = \int \phi_i(t) f(t)\, dt$$
$$\|f\|^2_{\mathcal{H}_K} \stackrel{\mathrm{def}}{=} \sum_{i=1}^{\infty} c_i^2 / \gamma_i < \infty$$
The squared norm $J(f) = \|f\|^2_{\mathcal{H}_K}$ is viewed as a roughness penalty.
Understand regularization
§ If we push down this regularization term $\|f\|^2_{\mathcal{H}_K} = \sum_i c_i^2 / \gamma_i$ …
§ In f(x), minor components $\phi_i(x)$ with smaller $\gamma_i$ are penalized more heavily → principal components are preferred in f(x)!
§ A desired kernel is a smoothing operator, i.e., its principal components are smoother functions → the regularization encourages f(x) to be smooth!
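One way to see this concretely (an illustrative sketch, with the eigenvectors of an RBF kernel matrix on a grid standing in for the kernel's eigenfunctions): the leading eigenvectors are smooth, the trailing ones are wiggly, and the same weight on a small-$\gamma$ component pays a far larger penalty $c_i^2 / \gamma_i$.

```python
import numpy as np

# Kernel matrix of an RBF kernel on a 1-D grid (jitter for stability).
t = np.linspace(0, 1, 100)
K = np.exp(-(t[:, None] - t[None, :]) ** 2 / 0.05) + 1e-10 * np.eye(100)

# Eigenvectors phi_i (columns) stand in for the eigenfunctions,
# eigenvalues gamma_i for the kernel's spectrum; sort descending.
gamma, Phi = np.linalg.eigh(K)
gamma, Phi = gamma[::-1], Phi[:, ::-1]

# Roughness of each eigenvector: mean squared successive difference.
rough = (np.diff(Phi, axis=0) ** 2).mean(axis=0)
print(gamma[:3], rough[:3])        # large eigenvalues, smooth eigenvectors
print(gamma[20:23], rough[20:23])  # small eigenvalues, wiggly eigenvectors

# A function putting weight c on component phi_i pays c^2 / gamma_i:
# the same weight costs far more on a small-gamma (rough) component.
print(1.0 / gamma[0], 1.0 / gamma[20])
```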
Understanding regularization
§ Using what kernel?
§ Using what feature (for a linear model)?
§ Using what h(x)?
§ Using what functional norm $\|f\|^2_{\mathcal{H}_k}$?
All point to one thing – what kind of functions are preferred a priori.
Outline
§ Representation theorem
§ Kernel trick
§ Understand regularization
§ Nonlinear logistic regression
§ General basis expansion functions
§ Summary
Nonlinear Logistic Regression
So far, the things we have discussed, including
– the representation theorem,
– the kernel trick,
– regularization,
are not limited to SVMs. They all apply to logistic regression as well. The only difference is the loss function.
Nonlinear Logistic Regression
§ Parametric form
$$\min_w \sum_{i=1}^{N} \ln\Big(1 + e^{-y_i w^\top h(x_i)}\Big) + \lambda \|w\|^2$$
§ Nonparametric form
$$\min_f \sum_{i=1}^{N} \ln\Big(1 + e^{-y_i f(x_i)}\Big) + \lambda \|f\|^2_{\mathcal{H}_k}$$
Compare with SVM via Loss + Penalty (slide: Trevor Hastie, Stanford University, January 2003):
[Figure: binomial log-likelihood and support vector (hinge) losses plotted against the margin yf(x).]
With $f(x) = h(x)^\top \beta + \beta_0$ and $y_i \in \{-1, 1\}$, consider
$$\min_{\beta_0, \beta} \sum_{i=1}^{N} [1 - y_i f(x_i)]_+ + \lambda \|\beta\|^2$$
The solution is identical to the SVM solution, with $\lambda = \lambda(B)$. In general,
$$\min_{\beta_0, \beta} \sum_{i=1}^{N} L[y_i, f(x_i)] + \lambda \|\beta\|^2$$
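A minimal sketch of the nonparametric (kernel) form, assuming plain gradient descent on the objective above written via the representer form $f = K\alpha$; the rbf helper, the ring-shaped data, the step size, and the iteration count are all illustrative choices.

```python
import numpy as np

def rbf(A, B, c=1.0):
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / c)

rng = np.random.default_rng(4)
# Two concentric rings: not linearly separable in R^2.
r = np.r_[rng.uniform(0, 1, 30), rng.uniform(2, 3, 30)]
th = rng.uniform(0, 2 * np.pi, 60)
X = np.column_stack([r * np.cos(th), r * np.sin(th)])
y = np.r_[-np.ones(30), np.ones(30)]

K, lam = rbf(X, X), 1e-3
alpha = np.zeros(60)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Gradient descent on sum_i ln(1 + exp(-y_i f_i)) + lam * alpha^T K alpha,
# where f = K alpha (representer form of f).
for _ in range(500):
    f = K @ alpha
    g = K @ (-y * sigmoid(-y * f)) + 2 * lam * K @ alpha
    alpha -= 0.01 * g

pred = np.sign(K @ alpha)
print((pred == y).mean())   # training accuracy close to 1.0
```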
Logistic Regression vs. SVM
§ Both can be linear or nonlinear, parametric or nonparametric; the main difference is the loss;
§ They are very similar in performance;
§ Logistic regression outputs probabilities, useful for scoring confidence;
§ Logistic regression is easier to extend to multiple classes.
§ Ten years ago, one was old and the other was new. Now, both are old.
Outline
§ Representation theorem
§ Kernel trick
§ Understand regularization
§ Nonlinear logistic regression
§ General basis expansion functions
§ Summary
Many known classification models follow a similar structure
§ Neural networks
§ RBF networks
§ Learning VQ (LVQ)
§ Boosting
These models all learn w and h(x) together …
Many known classification models follow a similar structure
§ Neural networks
§ RBF networks
§ Learning VQ (LVQ)
§ Boosting
§ SVMs
§ Linear classifiers
§ Logistic regression
§ …
Develop your own stuff!
By deciding
– Which loss function? hinge, least squares, …
– What form of h(x)? RBF, logistic, tree, …
– Finite or infinite h(x)?
– Learning h(x) or not?
– How to optimize? QP, L-BFGS, functional gradient, …
you can obtain various classification algorithms.
Parametric vs. nonparametric models
§ If h(x) is finite-dimensional (m dimensions), we have a parametric model $f(x) = w^\top h(x)$. Training complexity is $O(Nm^3)$.
§ If h(x) is nonlinear and infinite-dimensional, we have to use the kernel trick. This is a nonparametric model; the training complexity is around $O(N^3)$.
§ Nonparametric models, including kernel SVMs, Gaussian processes, Dirichlet processes, etc., are mathematically elegant, but nontrivial for large-scale computation.
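To illustrate the complexity gap (an illustrative sketch with random data and squared loss): the parametric route solves an m×m system while the dual/kernel route solves an N×N system, and both recover the same function when h(x) is finite-dimensional.

```python
import numpy as np
from time import perf_counter

rng = np.random.default_rng(5)
N, m = 3000, 50                      # many samples, few basis functions
H = rng.normal(size=(N, m))          # rows are h(x_i)
y = rng.normal(size=N)
lam = 1.0

# Parametric/primal solve: an m x m system, cheap when m << N.
t0 = perf_counter()
w = np.linalg.solve(H.T @ H + lam * np.eye(m), H.T @ y)
t_primal = perf_counter() - t0

# Nonparametric/dual solve: an N x N system, the O(N^3) bottleneck.
t0 = perf_counter()
alpha = np.linalg.solve(H @ H.T + lam * np.eye(N), y)
t_dual = perf_counter() - t0

print(t_primal, t_dual)              # dual is far slower here
print(np.allclose(w, H.T @ alpha))   # same solution either way
```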
Outline
§ Representation theorem
§ Kernel trick
§ Understand regularization
§ Nonlinear logistic regression
§ General basis expansion functions
§ Summary
Summary
§ Representation theorem and kernels
§ Regularization prefers the principal eigenfunctions of the kernel (induced by the basis functions)
§ Basis expansion - a general framework for classification models, e.g., nonlinear logistic regression, SVMs, …