Kernel Ridge Regression - Rensselaer Polytechnic Institute
TRANSCRIPT
Kernel Ridge Regression
Prof. Bennett
Based on Chapter 2 of Shawe-Taylor and Cristianini
Outline
- Overview
- Ridge Regression
- Kernel Ridge Regression
- Other Kernels
- Summary
Recall E&K model
$R(t) = at^2 + bt + c$ is linear in its parameters. Define the mapping $\theta(t)$ and make a linear function in $\theta(t)$, i.e. in feature space:

$$\theta(t) = [t^2, t, 1]', \quad \mathbf{s} = [a, b, c]'$$

$$R(t) = \langle \theta(t), \mathbf{s} \rangle = [t^2, t, 1] \begin{bmatrix} a \\ b \\ c \end{bmatrix} = at^2 + bt + c$$
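As a concrete illustration (my sketch, not part of the slides), the quadratic model can be fit by ordinary linear least squares once each input t is mapped through θ; the coefficients a = 2, b = -3, c = 1 are made-up sample values:

```python
import numpy as np

# Hypothetical sample of (t, R(t)) measurements with a = 2, b = -3, c = 1.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 5.0, 20)
R = 2.0 * t**2 - 3.0 * t + 1.0 + 0.05 * rng.normal(size=t.size)

# Feature map theta(t) = [t^2, t, 1]': the model is linear in s = [a, b, c]'.
Theta = np.column_stack([t**2, t, np.ones_like(t)])
s, *_ = np.linalg.lstsq(Theta, R, rcond=None)  # recovers approximately [2, -3, 1]
```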
Linear versus Nonlinear
$f(t) = bt + c$ versus $f(\theta(t)) = at^2 + bt + c$
Kernel Method
Two parts:
- Mapping into an embedding or feature space defined by the kernel.
- A learning algorithm for discovering linear patterns in that space.
We illustrate using linear ridge regression.
Linear Regression in Feature Space
Key idea: map the data to a higher dimensional space (feature space) and perform linear regression in the embedded space.
Embedding map: $\phi: \mathbf{x} \in R^n \to F \subseteq R^N$, $N \gg n$
Nonlinear Regression in Feature Space
Input space:
$$\mathbf{x} = [r, s], \quad g(\mathbf{x}) = \langle \mathbf{x}, \mathbf{w} \rangle = w_1 r + w_2 s$$
↓
Feature space:
$$\theta(\mathbf{x}) = [r^2, s^2, \sqrt{2}\,rs], \quad g(\theta(\mathbf{x})) = \langle \theta(\mathbf{x}), \mathbf{w} \rangle_F = w_1 r^2 + w_2 s^2 + w_3 \sqrt{2}\,rs$$
Nonlinear Regression in Feature Space
Input space:
$$\mathbf{x} = [r, s], \quad g(\mathbf{x}) = \langle \mathbf{x}, \mathbf{w} \rangle = w_1 r + w_2 s$$
↓
Feature space (the monomials up to degree 3):
$$\theta(\mathbf{x}) = [r^3, s^3, r^2 s, r s^2, r^2, s^2, \ldots], \quad g(\theta(\mathbf{x})) = \langle \theta(\mathbf{x}), \mathbf{w} \rangle_F = w_1 r^3 + w_2 s^3 + \cdots + w_{10}$$
Let’s try a quadratic on Aquasol
What are all the terms we need to add to our 525-dimensional space to map it into feature space?
Just the squared terms and cross terms (see the count below).
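As a worked count (my arithmetic, not from the slides), with $n = 525$ input dimensions the quadratic map adds

$$\underbrace{525}_{\text{squared terms}} + \underbrace{\frac{525 \cdot 524}{2}}_{\text{cross terms}} = 525 + 137{,}550 = 138{,}075$$

new coordinates on top of the original 525 linear ones.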
Kernel and Duality to the rescue
- Duality: an alternative but equivalent view of the problem.
- Kernel trick: makes the mapping to the feature space efficient.
Linear Regression
Given training data (points and labels):
$$S = ((\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_\ell, y_\ell)), \quad \mathbf{x}_i \in R^n, \; y_i \in R$$
Construct a linear function:
$$g(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle = \mathbf{w}'\mathbf{x} = \sum_{i=1}^{n} w_i x_i$$
Least Squares Approximation
Want $g(\mathbf{x}) \approx y$.
Define the error: $\xi = f(\mathbf{x}, y) = y - g(\mathbf{x})$.
Minimize the loss:
$$L(g, S) = L(\mathbf{w}, S) = \sum_{i=1}^{\ell} \big(y_i - g(\mathbf{x}_i)\big)^2 = \sum_{i=1}^{\ell} \xi_i^2$$
Ridge Regression
Use the least norm solution for fixed $\lambda > 0$. Regularized problem:
$$\min_{\mathbf{w}} L_\lambda(\mathbf{w}, S) = \lambda\|\mathbf{w}\|^2 + \|\mathbf{y} - X\mathbf{w}\|^2$$
Optimality condition:
$$\frac{\partial L_\lambda(\mathbf{w}, S)}{\partial \mathbf{w}} = 2\lambda\mathbf{w} - 2X'\mathbf{y} + 2X'X\mathbf{w} = 0$$
$$(X'X + \lambda I_n)\mathbf{w} = X'\mathbf{y}$$
Requires $O(n^3)$ operations.
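A minimal NumPy sketch of the primal solve (my illustration, not from the lecture; the data and the name ridge_primal are placeholders):

```python
import numpy as np

def ridge_primal(X, y, lam):
    """Solve (X'X + lam*I_n) w = X'y for the primal weights w."""
    n = X.shape[1]
    # Solving the n x n linear system costs O(n^3).
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Tiny usage example with random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # ell = 50 points, n = 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
w = ridge_primal(X, y, lam=0.1)
```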
Ridge Regression (cont.)
The inverse always exists for any $\lambda > 0$:
$$\mathbf{w} = (X'X + \lambda I)^{-1} X'\mathbf{y}$$
Alternative (dual) representation:
$$(X'X + \lambda I)\mathbf{w} = X'\mathbf{y} \;\Rightarrow\; \mathbf{w} = \lambda^{-1}X'(\mathbf{y} - X\mathbf{w}) \;\Rightarrow\; \mathbf{w} = X'\boldsymbol{\alpha}, \text{ where } \boldsymbol{\alpha} = \lambda^{-1}(\mathbf{y} - X\mathbf{w})$$
$$\boldsymbol{\alpha} = \lambda^{-1}(\mathbf{y} - X\mathbf{w}) \;\Rightarrow\; \lambda\boldsymbol{\alpha} = \mathbf{y} - X\mathbf{w} = \mathbf{y} - XX'\boldsymbol{\alpha} \;\Rightarrow\; (XX' + \lambda I)\boldsymbol{\alpha} = \mathbf{y}$$
$$\Rightarrow\; \boldsymbol{\alpha} = (G + \lambda I)^{-1}\mathbf{y}, \quad \text{where } G = XX'$$
Solving this $\ell \times \ell$ system is $O(\ell^3)$.
Gram or Kernel Matrix
The Gram matrix is composed of the inner products of the data:
$$K = G = XX', \quad K_{i,j} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle$$
Dual Ridge Regression
To predict a new point $\mathbf{x}$:
$$g(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle = \sum_{i=1}^{\ell} \alpha_i \langle \mathbf{x}_i, \mathbf{x} \rangle = \mathbf{y}'(G + \lambda I)^{-1}\mathbf{z}, \quad \text{where } z_i = \langle \mathbf{x}_i, \mathbf{x} \rangle$$
Note that we need only compute $G$, the Gram matrix:
$$G = XX', \quad G_{ij} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle$$
Ridge regression requires only inner products between data points.
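A minimal NumPy sketch of the dual solve and prediction (my illustration; ridge_dual, predict_dual, and the data are placeholder names):

```python
import numpy as np

def ridge_dual(X, y, lam):
    """Solve (G + lam*I_ell) alpha = y with G = XX' (an ell x ell system, O(ell^3))."""
    G = X @ X.T
    return np.linalg.solve(G + lam * np.eye(X.shape[0]), y)

def predict_dual(X, alpha, x_new):
    """g(x) = sum_i alpha_i <x_i, x>: only inner products with the training data."""
    return alpha @ (X @ x_new)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])
alpha = ridge_dual(X, y, lam=0.1)
w = X.T @ alpha   # the primal weights are recoverable as w = X' alpha
```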
Efficiency
- Computing $\mathbf{w}$ in primal ridge regression is $O(n^3)$.
- Computing $\boldsymbol{\alpha}$ in dual ridge regression is $O(\ell^3)$.
- Predicting a new point $\mathbf{x}$:
  - primal: $g(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle = \sum_{i=1}^{n} w_i x_i$ is $O(n)$
  - dual: $g(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle = \sum_{i=1}^{\ell} \alpha_i \langle \mathbf{x}_i, \mathbf{x} \rangle = \sum_{i=1}^{\ell} \alpha_i \Big( \sum_{j=1}^{n} x_{ij} x_j \Big)$ is $O(n\ell)$
The dual is better if $n \gg \ell$.
Notes on Ridge Regression
- "Regularization" is key to addressing stability.
- Regularization lets the method work when $n \gg \ell$.
- The dual is more efficient when $n \gg \ell$.
- The dual only requires inner products of the data.
Nonlinear Regression in Feature Space
In the dual representation:
$$g(\mathbf{x}) = \langle \phi(\mathbf{x}), \mathbf{w} \rangle_F = \sum_{i=1}^{\ell} \alpha_i \langle \phi(\mathbf{x}), \phi(\mathbf{x}_i) \rangle$$
So if we can efficiently compute the inner product, our method is efficient.
Let's try it for our sample problem
$$\langle \phi(\mathbf{u}), \phi(\mathbf{v}) \rangle = \langle (u_1^2, u_2^2, \sqrt{2}\,u_1 u_2), (v_1^2, v_2^2, \sqrt{2}\,v_1 v_2) \rangle$$
$$= u_1^2 v_1^2 + u_2^2 v_2^2 + 2 u_1 u_2 v_1 v_2 = (u_1 v_1 + u_2 v_2)^2 = \langle \mathbf{u}, \mathbf{v} \rangle^2$$
Define: $K(\mathbf{u}, \mathbf{v}) = \langle \mathbf{u}, \mathbf{v} \rangle^2$
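A quick numerical check of this identity (my sketch; phi and k_quad are illustrative names):

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for 2-D inputs: [x1^2, x2^2, sqrt(2) x1 x2]."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2.0) * x[0] * x[1]])

def k_quad(u, v):
    """Quadratic kernel K(u, v) = <u, v>^2, computed without the feature map."""
    return float(u @ v) ** 2

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(phi(u) @ phi(v), k_quad(u, v))  # both equal <u, v>^2 = 1
```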
Kernel Function
A kernel is a function $K$ such that
$$K(\mathbf{x}, \mathbf{u}) = \langle \phi(\mathbf{x}), \phi(\mathbf{u}) \rangle_F$$
where $\phi$ is a mapping from the input space to the feature space $F$.
There are many possible kernels. The simplest is the linear kernel: $K(\mathbf{x}, \mathbf{u}) = \langle \mathbf{x}, \mathbf{u} \rangle$.
Ridge Regression in Feature Space
To predict a new point $\mathbf{x}$:
$$g(\phi(\mathbf{x})) = \langle \mathbf{w}, \phi(\mathbf{x}) \rangle = \sum_{i=1}^{\ell} \alpha_i \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}) \rangle = \mathbf{y}'(G + \lambda I)^{-1}\mathbf{z}, \quad \text{where } z_i = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}) \rangle$$
To compute the Gram matrix:
$$G = \phi(X)\phi(X)', \quad G_{ij} = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle = K(\mathbf{x}_i, \mathbf{x}_j)$$
Use the kernel to compute the inner products.
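Putting the pieces together, a minimal kernel ridge regression sketch (my illustration under the slide's formulas; krr_fit and krr_predict are assumed names, and kernel is any function K(u, v)):

```python
import numpy as np

def krr_fit(X, y, kernel, lam):
    """Compute alpha = (G + lam*I)^{-1} y with G_ij = K(x_i, x_j)."""
    ell = X.shape[0]
    G = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return np.linalg.solve(G + lam * np.eye(ell), y)

def krr_predict(X, alpha, kernel, x_new):
    """g(x) = sum_i alpha_i K(x_i, x): never forms phi(x) explicitly."""
    z = np.array([kernel(xi, x_new) for xi in X])
    return alpha @ z

# Usage with the quadratic kernel K(u, v) = <u, v>^2 from the sample problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = X[:, 0]**2 + X[:, 1]**2          # a target that is quadratic in the inputs
k = lambda u, v: float(u @ v) ** 2
alpha = krr_fit(X, y, k, lam=1e-3)
print(krr_predict(X, alpha, k, np.array([1.0, 1.0])))  # approx 2.0
```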
Nonlinear Regression in Feature Space
The kernel trick works for any dual representation:
$$g(\mathbf{x}) = \langle \phi(\mathbf{x}), \mathbf{w} \rangle_F = \sum_{i=1}^{\ell} \alpha_i \langle \phi(\mathbf{x}), \phi(\mathbf{x}_i) \rangle = \sum_{i=1}^{\ell} \alpha_i K(\mathbf{x}, \mathbf{x}_i)$$
Popular Kernels based on vectors
By Hilbert-Schmidt kernels (Courant and Hilbert 1953),
$$\langle \theta(\mathbf{u}), \theta(\mathbf{v}) \rangle \equiv K(\mathbf{u}, \mathbf{v})$$
for certain $\theta$ and $K$, e.g.
- Degree $d$ polynomial: $K(\mathbf{u}, \mathbf{v}) = (\langle \mathbf{u}, \mathbf{v} \rangle + 1)^d$
- Radial basis function machine: $K(\mathbf{u}, \mathbf{v}) = \exp\!\left(-\dfrac{\|\mathbf{u} - \mathbf{v}\|^2}{2\sigma^2}\right)$
- Two-layer neural network: $K(\mathbf{u}, \mathbf{v}) = \mathrm{sigmoid}(\eta\,\langle \mathbf{u}, \mathbf{v} \rangle + c)$
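These three kernels in NumPy (a sketch; the parameter names d, sigma, eta, c follow the formulas above, and tanh is used as the conventional sigmoid for the last kernel):

```python
import numpy as np

def poly_kernel(u, v, d=3):
    """Degree-d polynomial kernel (<u, v> + 1)^d."""
    return (u @ v + 1.0) ** d

def rbf_kernel(u, v, sigma=1.0):
    """Radial basis function kernel exp(-||u - v||^2 / (2 sigma^2))."""
    diff = u - v
    return np.exp(-(diff @ diff) / (2.0 * sigma**2))

def sigmoid_kernel(u, v, eta=1.0, c=0.0):
    """Two-layer neural network kernel tanh(eta <u, v> + c) (not always PSD)."""
    return np.tanh(eta * (u @ v) + c)
```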
Kernels Intuition
Kernels encode the notion of similarity to be used for a specific application.
- Documents can use the cosine of "bag of words" vectors.
- Gene sequences can use edit distance.
Similarity defines distance:
$$\|\mathbf{u} - \mathbf{v}\|^2 = (\mathbf{u} - \mathbf{v})'(\mathbf{u} - \mathbf{v}) = \langle \mathbf{u}, \mathbf{u} \rangle - 2\langle \mathbf{u}, \mathbf{v} \rangle + \langle \mathbf{v}, \mathbf{v} \rangle$$
The trick is to get the right encoding for the domain.
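The same identity gives a distance in feature space using only kernel evaluations (a small sketch; kernel_distance is an illustrative name):

```python
import numpy as np

def kernel_distance(u, v, kernel):
    """||phi(u) - phi(v)||^2 = K(u,u) - 2 K(u,v) + K(v,v), via the kernel alone."""
    return kernel(u, u) - 2.0 * kernel(u, v) + kernel(v, v)

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
linear = lambda a, b: float(a @ b)
# With the linear kernel this reduces to the ordinary squared Euclidean distance.
assert np.isclose(kernel_distance(u, v, linear), np.sum((u - v) ** 2))
```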
Important Points
- Kernel method = linear method + embedding in feature space.
- Kernel functions are used to do the embedding efficiently.
- The feature space is a higher dimensional space, so we must regularize.
- Choose a kernel appropriate to the domain.
Kernel Ridge Regression
- Simple to derive a kernel method.
- Works great in practice with some finessing.
Next time:
- Practical issues.
- A more standard dual derivation.
Optimal Solution
Want: $\mathbf{y} \approx X\mathbf{w} + b\mathbf{e}$, where $\mathbf{e}$ is a vector of ones.
Mathematical model:
$$\min_{\mathbf{w}, b} L(\mathbf{w}, b, S) = \|\mathbf{y} - (X\mathbf{w} + b\mathbf{e})\|^2 + \lambda\|\mathbf{w}\|^2$$
Optimality conditions:
$$\frac{\partial L(\mathbf{w}, b, S)}{\partial \mathbf{w}} = -2X'(\mathbf{y} - X\mathbf{w} - b\mathbf{e}) + 2\lambda\mathbf{w} = 0$$
$$\frac{\partial L(\mathbf{w}, b, S)}{\partial b} = -2\mathbf{e}'(\mathbf{y} - X\mathbf{w} - b\mathbf{e}) = 0$$
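Solving the second optimality condition for $b$ fills in the step that motivates the recentering slides below (a short derivation, consistent with the model above):

$$\mathbf{e}'(\mathbf{y} - X\mathbf{w} - b\mathbf{e}) = 0 \;\Rightarrow\; b = \frac{\mathbf{e}'(\mathbf{y} - X\mathbf{w})}{\mathbf{e}'\mathbf{e}} = \frac{1}{\ell}\sum_{i=1}^{\ell}\big(y_i - \langle \mathbf{w}, \mathbf{x}_i \rangle\big) = \mu - \langle \mathbf{w}, \bar{\mathbf{x}} \rangle$$

So on centered data ($\mu = 0$, $\bar{\mathbf{x}} = 0$) the optimal bias is zero, and on the original data the bias is recovered as $b = \mu$.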
Let's try it: in-class lab
Basic Ridge Regression
Regularized problem:
$$\min_{\mathbf{w}} L_\lambda(\mathbf{w}, S) = \lambda\|\mathbf{w}\|^2 + \|\mathbf{y} - X\mathbf{w}\|^2, \quad \lambda > 0$$
Optimality condition:
$$\mathbf{w} = (X'X + \lambda I_n)^{-1} X'\mathbf{y}$$
Dual equivalent:
$$\boldsymbol{\alpha} = (G + \lambda I)^{-1}\mathbf{y}, \quad \text{where } G_{i,j} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle, \quad \mathbf{w} = X'\boldsymbol{\alpha}$$
Better Model: Basic Ridge Regression with Centering
Center $X$ and $\mathbf{y}$ to get $X_c$, $\mathbf{y}_c$:
$$\min_{\mathbf{w}, b} L_\lambda(\mathbf{w}, S) = \lambda\|\mathbf{w}\|^2 + \|\mathbf{y} - (X\mathbf{w} + b\mathbf{e})\|^2, \quad \lambda > 0$$
$$\mathbf{w} = (X_c'X_c + \lambda I_n)^{-1} X_c'\mathbf{y}_c$$
Dual equivalent:
$$\boldsymbol{\alpha} = (G_c + \lambda I)^{-1}\mathbf{y}_c, \quad \text{where } (G_c)_{i,j} = \langle \mathbf{x}_{c,i}, \mathbf{x}_{c,j} \rangle, \quad \mathbf{w} = X_c'\boldsymbol{\alpha}$$
Recenter Data
Shift $\mathbf{y}$ by its mean:
$$\mu = \frac{1}{\ell}\sum_{i=1}^{\ell} y_i, \quad \mathbf{y}_c:\; y_{c,i} = y_i - \mu$$
Shift $\mathbf{x}$ by its mean:
$$\bar{\mathbf{x}} = \frac{1}{\ell}\sum_{i=1}^{\ell} \mathbf{x}_i, \quad \mathbf{x}_c:\; \mathbf{x}_{c,i} = \mathbf{x}_i - \bar{\mathbf{x}}$$
Ridge Regression with bias
Center the data:
$$X_c = X - \mathbf{e}\bar{\mathbf{x}}', \quad \mathbf{y}_c = \mathbf{y} - \mu\mathbf{e}$$
Calculate $\mathbf{w}$:
$$\mathbf{w} = (X_c'X_c + \lambda I)^{-1} X_c'\mathbf{y}_c$$
Calculate $b = \mu$.
To predict a new point:
$$g(\mathbf{x}) = (\mathbf{x} - \bar{\mathbf{x}})'\mathbf{w} + b$$
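A minimal end-to-end sketch of this recipe (my illustration; ridge_with_bias is an assumed name, and the variable names follow the slide's symbols):

```python
import numpy as np

def ridge_with_bias(X, y, lam):
    """Center X and y, solve the ridge system on the centered data, set b = mu."""
    x_bar = X.mean(axis=0)                     # feature means (x bar)
    mu = y.mean()                              # label mean
    Xc, yc = X - x_bar, y - mu
    n = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(n), Xc.T @ yc)
    predict = lambda x: (x - x_bar) @ w + mu   # g(x) = (x - x_bar)'w + b, b = mu
    return w, mu, predict

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = X @ np.array([2.0, -1.0]) + 5.0            # true bias of 5
w, b, predict = ridge_with_bias(X, y, lam=0.1)
```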