Kernel Ridge Regression - Rensselaer Polytechnic Institute
TRANSCRIPT
Kernel Ridge Regression
Prof. Bennett
Based on Chapter 2 of Shawe-Taylor and Cristianini
Outline
- Overview
- Ridge Regression
- Kernel Ridge Regression
- Other Kernels
- Summary
Recall E&K model
$R(t) = at^2 + bt + c$ is linear in its parameters. Define the mapping $\theta(t)$ and make a linear function in $\theta(t)$, i.e. in feature space:

$$\theta(t) = [t^2, t, 1]', \quad \mathbf{s} = [a, b, c]'$$

$$R(t) = \langle \theta(t), \mathbf{s} \rangle = [t^2, t, 1] \begin{bmatrix} a \\ b \\ c \end{bmatrix} = at^2 + bt + c$$
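As a concrete illustration (my sketch, not part of the slides), the quadratic model can be fit by ordinary linear least squares once each input t is mapped through θ; the coefficients a = 2, b = -3, c = 1 are made-up sample values:

```python
import numpy as np

# Hypothetical sample of (t, R(t)) measurements with a = 2, b = -3, c = 1.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 5.0, 20)
R = 2.0 * t**2 - 3.0 * t + 1.0 + 0.05 * rng.normal(size=t.size)

# Feature map theta(t) = [t^2, t, 1]': the model is linear in s = [a, b, c]'.
Theta = np.column_stack([t**2, t, np.ones_like(t)])
s, *_ = np.linalg.lstsq(Theta, R, rcond=None)  # recovers approximately [2, -3, 1]
```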
Linear versus Nonlinear
$f(t) = bt + c$ versus $f(\theta(t)) = at^2 + bt + c$
Kernel Method
Two parts:
- Mapping into an embedding or feature space defined by the kernel.
- A learning algorithm for discovering linear patterns in that space.
We illustrate using linear ridge regression.
Linear Regression in Feature Space
Key idea: map the data to a higher dimensional space (feature space) and perform linear regression in the embedded space.
Embedding map: $\phi: \mathbf{x} \in R^n \to F \subseteq R^N$, $N \gg n$
Nonlinear Regression in Feature Space
Input space:
$$\mathbf{x} = [r, s], \quad g(\mathbf{x}) = \langle \mathbf{x}, \mathbf{w} \rangle = w_1 r + w_2 s$$
↓
Feature space:
$$\theta(\mathbf{x}) = [r^2, s^2, \sqrt{2}\,rs], \quad g(\theta(\mathbf{x})) = \langle \theta(\mathbf{x}), \mathbf{w} \rangle_F = w_1 r^2 + w_2 s^2 + w_3 \sqrt{2}\,rs$$
Nonlinear Regression in Feature Space
Input space:
$$\mathbf{x} = [r, s], \quad g(\mathbf{x}) = \langle \mathbf{x}, \mathbf{w} \rangle = w_1 r + w_2 s$$
↓
Feature space (the monomials up to degree 3):
$$\theta(\mathbf{x}) = [r^3, s^3, r^2 s, r s^2, r^2, s^2, \ldots], \quad g(\theta(\mathbf{x})) = \langle \theta(\mathbf{x}), \mathbf{w} \rangle_F = w_1 r^3 + w_2 s^3 + \cdots + w_{10}$$
Let’s try a quadratic on Aquasol
What are all the terms we need to add to our 525-dimensional space to map it into feature space?
Just the squared terms and cross terms (see the count below).
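As a worked count (my arithmetic, not from the slides), with $n = 525$ input dimensions the quadratic map adds

$$\underbrace{525}_{\text{squared terms}} + \underbrace{\frac{525 \cdot 524}{2}}_{\text{cross terms}} = 525 + 137{,}550 = 138{,}075$$

new coordinates on top of the original 525 linear ones.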
Kernel and Duality to the rescue
- Duality: an alternative but equivalent view of the problem.
- Kernel trick: makes the mapping to the feature space efficient.
Linear Regression
Given training data (points and labels):
$$S = ((\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_\ell, y_\ell)), \quad \mathbf{x}_i \in R^n, \; y_i \in R$$
Construct a linear function:
$$g(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle = \mathbf{w}'\mathbf{x} = \sum_{i=1}^{n} w_i x_i$$
Least Squares Approximation
Want $g(\mathbf{x}) \approx y$.
Define the error: $\xi = f(\mathbf{x}, y) = y - g(\mathbf{x})$.
Minimize the loss:
$$L(g, S) = L(\mathbf{w}, S) = \sum_{i=1}^{\ell} \big(y_i - g(\mathbf{x}_i)\big)^2 = \sum_{i=1}^{\ell} \xi_i^2$$
Ridge Regression
Use the least norm solution for fixed $\lambda > 0$. Regularized problem:
$$\min_{\mathbf{w}} L_\lambda(\mathbf{w}, S) = \lambda\|\mathbf{w}\|^2 + \|\mathbf{y} - X\mathbf{w}\|^2$$
Optimality condition:
$$\frac{\partial L_\lambda(\mathbf{w}, S)}{\partial \mathbf{w}} = 2\lambda\mathbf{w} - 2X'\mathbf{y} + 2X'X\mathbf{w} = 0$$
$$(X'X + \lambda I_n)\mathbf{w} = X'\mathbf{y}$$
Requires $O(n^3)$ operations.
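A minimal NumPy sketch of the primal solve (my illustration, not from the lecture; the data and the name ridge_primal are placeholders):

```python
import numpy as np

def ridge_primal(X, y, lam):
    """Solve (X'X + lam*I_n) w = X'y for the primal weights w."""
    n = X.shape[1]
    # Solving the n x n linear system costs O(n^3).
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Tiny usage example with random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # ell = 50 points, n = 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
w = ridge_primal(X, y, lam=0.1)
```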
Ridge Regression (cont.)
The inverse always exists for any $\lambda > 0$:
$$\mathbf{w} = (X'X + \lambda I)^{-1} X'\mathbf{y}$$
Alternative (dual) representation:
$$(X'X + \lambda I)\mathbf{w} = X'\mathbf{y} \;\Rightarrow\; \mathbf{w} = \lambda^{-1}X'(\mathbf{y} - X\mathbf{w}) \;\Rightarrow\; \mathbf{w} = X'\boldsymbol{\alpha}, \text{ where } \boldsymbol{\alpha} = \lambda^{-1}(\mathbf{y} - X\mathbf{w})$$
$$\boldsymbol{\alpha} = \lambda^{-1}(\mathbf{y} - X\mathbf{w}) \;\Rightarrow\; \lambda\boldsymbol{\alpha} = \mathbf{y} - X\mathbf{w} = \mathbf{y} - XX'\boldsymbol{\alpha} \;\Rightarrow\; (XX' + \lambda I)\boldsymbol{\alpha} = \mathbf{y}$$
$$\Rightarrow\; \boldsymbol{\alpha} = (G + \lambda I)^{-1}\mathbf{y}, \quad \text{where } G = XX'$$
Solving this $\ell \times \ell$ system is $O(\ell^3)$.
Gram or Kernel Matrix
The Gram matrix is composed of the inner products of the data:
$$K = G = XX', \quad K_{i,j} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle$$
Dual Ridge Regression
To predict a new point $\mathbf{x}$:
$$g(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle = \sum_{i=1}^{\ell} \alpha_i \langle \mathbf{x}_i, \mathbf{x} \rangle = \mathbf{y}'(G + \lambda I)^{-1}\mathbf{z}, \quad \text{where } z_i = \langle \mathbf{x}_i, \mathbf{x} \rangle$$
Note that we need only compute $G$, the Gram matrix:
$$G = XX', \quad G_{ij} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle$$
Ridge regression requires only inner products between data points.
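A minimal NumPy sketch of the dual solve and prediction (my illustration; ridge_dual, predict_dual, and the data are placeholder names):

```python
import numpy as np

def ridge_dual(X, y, lam):
    """Solve (G + lam*I_ell) alpha = y with G = XX' (an ell x ell system, O(ell^3))."""
    G = X @ X.T
    return np.linalg.solve(G + lam * np.eye(X.shape[0]), y)

def predict_dual(X, alpha, x_new):
    """g(x) = sum_i alpha_i <x_i, x>: only inner products with the training data."""
    return alpha @ (X @ x_new)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])
alpha = ridge_dual(X, y, lam=0.1)
w = X.T @ alpha   # the primal weights are recoverable as w = X' alpha
```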
Efficiency
- Computing $\mathbf{w}$ in primal ridge regression is $O(n^3)$.
- Computing $\boldsymbol{\alpha}$ in dual ridge regression is $O(\ell^3)$.
- Predicting a new point $\mathbf{x}$:
  - primal: $g(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle = \sum_{i=1}^{n} w_i x_i$ is $O(n)$
  - dual: $g(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle = \sum_{i=1}^{\ell} \alpha_i \langle \mathbf{x}_i, \mathbf{x} \rangle = \sum_{i=1}^{\ell} \alpha_i \Big( \sum_{j=1}^{n} x_{ij} x_j \Big)$ is $O(n\ell)$
The dual is better if $n \gg \ell$.
Notes on Ridge Regression
- "Regularization" is key to addressing stability.
- Regularization lets the method work when $n \gg \ell$.
- The dual is more efficient when $n \gg \ell$.
- The dual only requires inner products of the data.
Nonlinear Regression in Feature Space
In the dual representation:
$$g(\mathbf{x}) = \langle \phi(\mathbf{x}), \mathbf{w} \rangle_F = \sum_{i=1}^{\ell} \alpha_i \langle \phi(\mathbf{x}), \phi(\mathbf{x}_i) \rangle$$
So if we can efficiently compute the inner product, our method is efficient.
Let's try it for our sample problem
$$\langle \phi(\mathbf{u}), \phi(\mathbf{v}) \rangle = \langle (u_1^2, u_2^2, \sqrt{2}\,u_1 u_2), (v_1^2, v_2^2, \sqrt{2}\,v_1 v_2) \rangle$$
$$= u_1^2 v_1^2 + u_2^2 v_2^2 + 2 u_1 u_2 v_1 v_2 = (u_1 v_1 + u_2 v_2)^2 = \langle \mathbf{u}, \mathbf{v} \rangle^2$$
Define: $K(\mathbf{u}, \mathbf{v}) = \langle \mathbf{u}, \mathbf{v} \rangle^2$
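A quick numerical check of this identity (my sketch; phi and k_quad are illustrative names):

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for 2-D inputs: [x1^2, x2^2, sqrt(2) x1 x2]."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2.0) * x[0] * x[1]])

def k_quad(u, v):
    """Quadratic kernel K(u, v) = <u, v>^2, computed without the feature map."""
    return float(u @ v) ** 2

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(phi(u) @ phi(v), k_quad(u, v))  # both equal <u, v>^2 = 1
```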
Kernel Function
A kernel is a function $K$ such that
$$K(\mathbf{x}, \mathbf{u}) = \langle \phi(\mathbf{x}), \phi(\mathbf{u}) \rangle_F$$
where $\phi$ is a mapping from the input space to the feature space $F$.
There are many possible kernels. The simplest is the linear kernel: $K(\mathbf{x}, \mathbf{u}) = \langle \mathbf{x}, \mathbf{u} \rangle$.
Ridge Regression in Feature Space
To predict a new point $\mathbf{x}$:
$$g(\phi(\mathbf{x})) = \langle \mathbf{w}, \phi(\mathbf{x}) \rangle = \sum_{i=1}^{\ell} \alpha_i \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}) \rangle = \mathbf{y}'(G + \lambda I)^{-1}\mathbf{z}, \quad \text{where } z_i = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}) \rangle$$
To compute the Gram matrix:
$$G = \phi(X)\phi(X)', \quad G_{ij} = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle = K(\mathbf{x}_i, \mathbf{x}_j)$$
Use the kernel to compute the inner products.
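Putting the pieces together, a minimal kernel ridge regression sketch (my illustration under the slide's formulas; krr_fit and krr_predict are assumed names, and kernel is any function K(u, v)):

```python
import numpy as np

def krr_fit(X, y, kernel, lam):
    """Compute alpha = (G + lam*I)^{-1} y with G_ij = K(x_i, x_j)."""
    ell = X.shape[0]
    G = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return np.linalg.solve(G + lam * np.eye(ell), y)

def krr_predict(X, alpha, kernel, x_new):
    """g(x) = sum_i alpha_i K(x_i, x): never forms phi(x) explicitly."""
    z = np.array([kernel(xi, x_new) for xi in X])
    return alpha @ z

# Usage with the quadratic kernel K(u, v) = <u, v>^2 from the sample problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = X[:, 0]**2 + X[:, 1]**2          # a target that is quadratic in the inputs
k = lambda u, v: float(u @ v) ** 2
alpha = krr_fit(X, y, k, lam=1e-3)
print(krr_predict(X, alpha, k, np.array([1.0, 1.0])))  # approx 2.0
```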
Nonlinear Regression in Feature Space
The kernel trick works for any dual representation:
$$g(\mathbf{x}) = \langle \phi(\mathbf{x}), \mathbf{w} \rangle_F = \sum_{i=1}^{\ell} \alpha_i \langle \phi(\mathbf{x}), \phi(\mathbf{x}_i) \rangle = \sum_{i=1}^{\ell} \alpha_i K(\mathbf{x}, \mathbf{x}_i)$$
Popular Kernels based on vectors
By Hilbert-Schmidt kernels (Courant and Hilbert 1953),
$$\langle \theta(\mathbf{u}), \theta(\mathbf{v}) \rangle \equiv K(\mathbf{u}, \mathbf{v})$$
for certain $\theta$ and $K$, e.g.
- Degree $d$ polynomial: $K(\mathbf{u}, \mathbf{v}) = (\langle \mathbf{u}, \mathbf{v} \rangle + 1)^d$
- Radial basis function machine: $K(\mathbf{u}, \mathbf{v}) = \exp\!\left(-\dfrac{\|\mathbf{u} - \mathbf{v}\|^2}{2\sigma^2}\right)$
- Two-layer neural network: $K(\mathbf{u}, \mathbf{v}) = \mathrm{sigmoid}(\eta\,\langle \mathbf{u}, \mathbf{v} \rangle + c)$
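These three kernels in NumPy (a sketch; the parameter names d, sigma, eta, c follow the formulas above, and tanh is used as the conventional sigmoid for the last kernel):

```python
import numpy as np

def poly_kernel(u, v, d=3):
    """Degree-d polynomial kernel (<u, v> + 1)^d."""
    return (u @ v + 1.0) ** d

def rbf_kernel(u, v, sigma=1.0):
    """Radial basis function kernel exp(-||u - v||^2 / (2 sigma^2))."""
    diff = u - v
    return np.exp(-(diff @ diff) / (2.0 * sigma**2))

def sigmoid_kernel(u, v, eta=1.0, c=0.0):
    """Two-layer neural network kernel tanh(eta <u, v> + c) (not always PSD)."""
    return np.tanh(eta * (u @ v) + c)
```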
Kernels Intuition
Kernels encode the notion of similarity to be used for a specific application.
- Documents can use the cosine of "bag of words" vectors.
- Gene sequences can use edit distance.
Similarity defines distance:
$$\|\mathbf{u} - \mathbf{v}\|^2 = (\mathbf{u} - \mathbf{v})'(\mathbf{u} - \mathbf{v}) = \langle \mathbf{u}, \mathbf{u} \rangle - 2\langle \mathbf{u}, \mathbf{v} \rangle + \langle \mathbf{v}, \mathbf{v} \rangle$$
The trick is to get the right encoding for the domain.
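The same identity gives a distance in feature space using only kernel evaluations (a small sketch; kernel_distance is an illustrative name):

```python
import numpy as np

def kernel_distance(u, v, kernel):
    """||phi(u) - phi(v)||^2 = K(u,u) - 2 K(u,v) + K(v,v), via the kernel alone."""
    return kernel(u, u) - 2.0 * kernel(u, v) + kernel(v, v)

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
linear = lambda a, b: float(a @ b)
# With the linear kernel this reduces to the ordinary squared Euclidean distance.
assert np.isclose(kernel_distance(u, v, linear), np.sum((u - v) ** 2))
```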
Important Points
- Kernel method = linear method + embedding in feature space.
- Kernel functions are used to do the embedding efficiently.
- The feature space is a higher dimensional space, so we must regularize.
- Choose a kernel appropriate to the domain.
Kernel Ridge Regression
- Simple to derive a kernel method.
- Works great in practice with some finessing.
Next time:
- Practical issues.
- A more standard dual derivation.
Optimal Solution
Want: $\mathbf{y} \approx X\mathbf{w} + b\mathbf{e}$, where $\mathbf{e}$ is a vector of ones.
Mathematical model:
$$\min_{\mathbf{w}, b} L(\mathbf{w}, b, S) = \|\mathbf{y} - (X\mathbf{w} + b\mathbf{e})\|^2 + \lambda\|\mathbf{w}\|^2$$
Optimality conditions:
$$\frac{\partial L(\mathbf{w}, b, S)}{\partial \mathbf{w}} = -2X'(\mathbf{y} - X\mathbf{w} - b\mathbf{e}) + 2\lambda\mathbf{w} = 0$$
$$\frac{\partial L(\mathbf{w}, b, S)}{\partial b} = -2\mathbf{e}'(\mathbf{y} - X\mathbf{w} - b\mathbf{e}) = 0$$
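Solving the second optimality condition for $b$ fills in the step that motivates the recentering slides below (a short derivation, consistent with the model above):

$$\mathbf{e}'(\mathbf{y} - X\mathbf{w} - b\mathbf{e}) = 0 \;\Rightarrow\; b = \frac{\mathbf{e}'(\mathbf{y} - X\mathbf{w})}{\mathbf{e}'\mathbf{e}} = \frac{1}{\ell}\sum_{i=1}^{\ell}\big(y_i - \langle \mathbf{w}, \mathbf{x}_i \rangle\big) = \mu - \langle \mathbf{w}, \bar{\mathbf{x}} \rangle$$

So on centered data ($\mu = 0$, $\bar{\mathbf{x}} = 0$) the optimal bias is zero, and on the original data the bias is recovered as $b = \mu$.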
Let's try it: in-class lab
Basic Ridge Regression
Regularized problem:
$$\min_{\mathbf{w}} L_\lambda(\mathbf{w}, S) = \lambda\|\mathbf{w}\|^2 + \|\mathbf{y} - X\mathbf{w}\|^2, \quad \lambda > 0$$
Optimality condition:
$$\mathbf{w} = (X'X + \lambda I_n)^{-1} X'\mathbf{y}$$
Dual equivalent:
$$\boldsymbol{\alpha} = (G + \lambda I)^{-1}\mathbf{y}, \quad \text{where } G_{i,j} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle, \quad \mathbf{w} = X'\boldsymbol{\alpha}$$
Better Model: Basic Ridge Regression with Centering
Center $X$ and $\mathbf{y}$ to get $X_c$, $\mathbf{y}_c$:
$$\min_{\mathbf{w}, b} L_\lambda(\mathbf{w}, S) = \lambda\|\mathbf{w}\|^2 + \|\mathbf{y} - (X\mathbf{w} + b\mathbf{e})\|^2, \quad \lambda > 0$$
$$\mathbf{w} = (X_c'X_c + \lambda I_n)^{-1} X_c'\mathbf{y}_c$$
Dual equivalent:
$$\boldsymbol{\alpha} = (G_c + \lambda I)^{-1}\mathbf{y}_c, \quad \text{where } (G_c)_{i,j} = \langle \mathbf{x}_{c,i}, \mathbf{x}_{c,j} \rangle, \quad \mathbf{w} = X_c'\boldsymbol{\alpha}$$
Recenter Data
Shift $\mathbf{y}$ by its mean:
$$\mu = \frac{1}{\ell}\sum_{i=1}^{\ell} y_i, \quad \mathbf{y}_c:\; y_{c,i} = y_i - \mu$$
Shift $\mathbf{x}$ by its mean:
$$\bar{\mathbf{x}} = \frac{1}{\ell}\sum_{i=1}^{\ell} \mathbf{x}_i, \quad \mathbf{x}_c:\; \mathbf{x}_{c,i} = \mathbf{x}_i - \bar{\mathbf{x}}$$
Ridge Regression with bias
Center the data:
$$X_c = X - \mathbf{e}\bar{\mathbf{x}}', \quad \mathbf{y}_c = \mathbf{y} - \mu\mathbf{e}$$
Calculate $\mathbf{w}$:
$$\mathbf{w} = (X_c'X_c + \lambda I)^{-1} X_c'\mathbf{y}_c$$
Calculate $b = \mu$.
To predict a new point:
$$g(\mathbf{x}) = (\mathbf{x} - \bar{\mathbf{x}})'\mathbf{w} + b$$
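A minimal end-to-end sketch of this recipe (my illustration; ridge_with_bias is an assumed name, and the variable names follow the slide's symbols):

```python
import numpy as np

def ridge_with_bias(X, y, lam):
    """Center X and y, solve the ridge system on the centered data, set b = mu."""
    x_bar = X.mean(axis=0)                     # feature means (x bar)
    mu = y.mean()                              # label mean
    Xc, yc = X - x_bar, y - mu
    n = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(n), Xc.T @ yc)
    predict = lambda x: (x - x_bar) @ w + mu   # g(x) = (x - x_bar)'w + b, b = mu
    return w, mu, predict

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = X @ np.array([2.0, -1.0]) + 5.0            # true bias of 5
w, b, predict = ridge_with_bias(X, y, lam=0.1)
```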