Kernel Ridge Regression
Prof. Bennett, Rensselaer Polytechnic Institute
Based on Chapter 2 of Shawe-Taylor and Cristianini
Outline
Overview
Ridge Regression
Kernel Ridge Regression
Other Kernels
Summary
Recall E&K model
$R(t) = at^2 + bt + c$ is linear in its parameters. Define a mapping $\theta(t)$ and make a linear function in $\theta(t)$, the feature space:

$$\theta(t) = [t^2 \;\; t \;\; 1]', \qquad s = [a \;\; b \;\; c]'$$
$$R(t) = \theta(t) \cdot s = at^2 + bt + c$$
Linear versus Nonlinear
$$f(t) = bt + c \qquad \text{versus} \qquad f(\theta(t)) = at^2 + bt + c$$
Kernel Method
Two parts:
1. Mapping into an embedding or feature space defined by the kernel.
2. A learning algorithm for discovering linear patterns in that space.
Illustrate using linear ridge regression.
Linear Regression in Feature Space
Key Idea: Map data to a higher dimensional space (feature space) and perform linear regression in the embedded space.

Embedding map:
$$\phi : \mathbf{x} \in R^n \rightarrow F \subseteq R^N, \qquad N \gg n$$
Nonlinear Regression in Feature Space
Input:
$$\mathbf{x} = [r, s], \qquad \langle \mathbf{w}, \mathbf{x} \rangle = w_1 r + w_2 s$$
Feature:
$$\theta(\mathbf{x}) = \left[ r^2,\; s^2,\; \sqrt{2}\,rs \right]$$
$$g(\mathbf{x}) = \langle \theta(\mathbf{x}), \mathbf{w} \rangle_F = w_1 r^2 + w_2 s^2 + w_3 \sqrt{2}\,rs$$
Nonlinear Regression in Feature Space
Input:
$$\mathbf{x} = [r, s], \qquad \langle \mathbf{w}, \mathbf{x} \rangle = w_1 r + w_2 s$$
Feature (degree 3):
$$\theta(\mathbf{x}) = \left[ r^3,\; s^3,\; r^2 s,\; r s^2,\; r^2,\; s^2,\; rs,\; r,\; s,\; 1 \right]$$
$$g(\mathbf{x}) = \langle \theta(\mathbf{x}), \mathbf{w} \rangle_F = w_1 r^3 + w_2 s^3 + w_3 r^2 s + \ldots + w_{10}$$
Let’s try a quadratic on Aquasol
What are all the terms we need to add to our 525 dimensional space to map it into feature space?
Just do squared terms and cross terms.
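Counting the squared and cross terms makes the blow-up concrete. A minimal sketch (the dimension 525 is from the slide; the rest is simple combinatorics):

```python
from math import comb

n = 525                 # input dimension of the Aquasol data (from the slide)
squared = n             # one squared term per feature
cross = comb(n, 2)      # one cross term per pair of distinct features
print(squared + cross)  # 138075 quadratic terms to add
```

So the explicit quadratic map would add over 138,000 coordinates, which motivates the kernel trick on the next slide.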
Kernel and Duality to the rescue
Duality – an alternative but equivalent view of the problem.
Kernel trick – makes mapping to the feature space efficient.
Linear Regression
Given training data (points and labels):
$$S = \left( (\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_\ell, y_\ell) \right), \qquad \mathbf{x}_i \in R^n, \;\; y_i \in R$$
Construct a linear function:
$$g(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle = \mathbf{w}'\mathbf{x} = \sum_{i=1}^{n} w_i x_i$$
Least Squares Approximation
Want: $g(\mathbf{x}) \approx y$
Define error: $f(\mathbf{x}, y) = y - g(\mathbf{x}) = \xi$
Minimize loss:
$$L(g, S) = L(\mathbf{w}, S) = \sum_{i=1}^{\ell} \left( y_i - g(\mathbf{x}_i) \right)^2 = \sum_{i=1}^{\ell} \xi_i^2$$
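The loss above is ordinary least squares, which NumPy solves directly. A minimal sketch with synthetic data (the data and the noiseless target are illustrative, not from the lecture):

```python
import numpy as np

# Minimize sum_i (y_i - <w, x_i>)^2 over w; lstsq returns the minimizer.
rng = np.random.default_rng(5)
X = rng.normal(size=(25, 2))          # l = 25 points, n = 2 features
y = X @ np.array([3.0, -1.0])         # noiseless linear target

w, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w, [3.0, -1.0]))    # recovers the true weights
```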
Ridge Regression
Use the least norm solution for fixed $\lambda > 0$. Regularized problem:
$$\min_{\mathbf{w}} L_\lambda(\mathbf{w}, S) = \lambda \|\mathbf{w}\|^2 + \|\mathbf{y} - X\mathbf{w}\|^2$$
Optimality condition:
$$\frac{\partial L_\lambda(\mathbf{w}, S)}{\partial \mathbf{w}} = 2\lambda \mathbf{w} - 2X'\mathbf{y} + 2X'X\mathbf{w} = 0$$
$$\left( X'X + \lambda I_n \right)\mathbf{w} = X'\mathbf{y}$$
Requires O(n³) operations.
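The optimality condition is an $n \times n$ linear system. A minimal sketch of the primal solve (synthetic data and the choice $\lambda = 0.1$ are illustrative assumptions):

```python
import numpy as np

# Primal ridge regression: solve (X'X + lam*I_n) w = X'y.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # l = 50 points, n = 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=50)

lam = 0.1
n = X.shape[1]
w = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
print(w)   # close to w_true for small lam and low noise
```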
Ridge Regression (cont)
Inverse always exists for any $\lambda > 0$:
$$\mathbf{w} = \left( X'X + \lambda I \right)^{-1} X'\mathbf{y}$$
Alternative representation:
$$\left( X'X + \lambda I \right)\mathbf{w} = X'\mathbf{y} \;\Rightarrow\; \lambda\mathbf{w} = X'(\mathbf{y} - X\mathbf{w})$$
$$\Rightarrow\; \mathbf{w} = \lambda^{-1} X'(\mathbf{y} - X\mathbf{w}) = X'\boldsymbol{\alpha}, \qquad \boldsymbol{\alpha} = \lambda^{-1}(\mathbf{y} - X\mathbf{w})$$
Ridge Regression (cont)
Inverse always exists for any $\lambda > 0$:
$$\mathbf{w} = \left( X'X + \lambda I \right)^{-1} X'\mathbf{y}$$
Alternative representation:
$$\left( X'X + \lambda I \right)\mathbf{w} = X'\mathbf{y} \;\Rightarrow\; \mathbf{w} = \lambda^{-1} X'(\mathbf{y} - X\mathbf{w}) = X'\boldsymbol{\alpha}$$
$$\boldsymbol{\alpha} = \lambda^{-1}(\mathbf{y} - X\mathbf{w}) \;\Rightarrow\; \lambda\boldsymbol{\alpha} = \mathbf{y} - XX'\boldsymbol{\alpha}$$
$$\Rightarrow\; \left( XX' + \lambda I \right)\boldsymbol{\alpha} = \mathbf{y} \;\Rightarrow\; \boldsymbol{\alpha} = \left( G + \lambda I \right)^{-1}\mathbf{y}, \qquad \text{where } G = XX'$$
Solving this $\ell \times \ell$ system is O($\ell^3$).
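The primal and dual solutions must agree through $\mathbf{w} = X'\boldsymbol{\alpha}$, which is easy to check numerically. A minimal sketch (data and $\lambda$ are illustrative):

```python
import numpy as np

# Dual ridge regression: alpha = (G + lam*I_l)^{-1} y with G = X X'.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))            # l = 20, n = 5
y = rng.normal(size=20)
lam = 0.5

w_primal = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
G = X @ X.T
alpha = np.linalg.solve(G + lam * np.eye(20), y)
w_dual = X.T @ alpha                    # recover primal weights from alpha
print(np.allclose(w_primal, w_dual))    # the two solutions coincide
```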
Gram or Kernel Matrix
Gram matrix, composed of inner products of the data:
$$K = G = XX', \qquad K_{i,j} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle$$
Dual Ridge Regression
To predict a new point:
$$g(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle = \sum_{i=1}^{\ell} \alpha_i \langle \mathbf{x}_i, \mathbf{x} \rangle = \mathbf{y}'\left( G + \lambda I \right)^{-1}\mathbf{z}, \qquad \text{where } z_i = \langle \mathbf{x}_i, \mathbf{x} \rangle$$
Note: we need only compute G, the Gram matrix:
$$G = XX', \qquad G_{ij} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle$$
Ridge regression requires only inner products between data points.
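The dual prediction formula uses only inner products with the training points. A minimal sketch checking that it matches the primal prediction (data and $\lambda$ are illustrative):

```python
import numpy as np

# Dual prediction: g(x) = sum_i alpha_i <x_i, x> = y'(G + lam*I)^{-1} z.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)
lam = 1.0

G = X @ X.T
alpha = np.linalg.solve(G + lam * np.eye(30), y)

x_new = rng.normal(size=4)
z = X @ x_new                          # z_i = <x_i, x_new>
g_dual = alpha @ z

w = X.T @ alpha                        # primal weights from alpha
print(np.isclose(g_dual, w @ x_new))   # same prediction either way
```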
Efficiency
Computing $\mathbf{w}$ in primal ridge regression is O(n³).
Computing $\boldsymbol{\alpha}$ in dual ridge regression is O($\ell^3$).
To predict a new point $\mathbf{x}$:
primal, O(n): $g(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle = \sum_{i=1}^{n} w_i x_i$
dual, O(n$\ell$): $g(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle = \sum_{i=1}^{\ell} \alpha_i \langle \mathbf{x}_i, \mathbf{x} \rangle = \sum_{i=1}^{\ell} \alpha_i \left( \sum_{j=1}^{n} x_{ij} x_j \right)$
The dual is better if $n \gg \ell$.
Notes on Ridge Regression
“Regularization” is key to addressing stability and generalization.
Regularization lets the method work when $n \gg \ell$.
The dual is more efficient when $n \gg \ell$.
The dual only requires inner products of the data.
Nonlinear Regression in Feature Space
In dual representation:
$$g(\mathbf{x}) = \langle \phi(\mathbf{x}), \mathbf{w} \rangle_F = \sum_{i=1}^{\ell} \alpha_i \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}) \rangle$$
So if we can efficiently compute the inner product, our method is efficient.
Let’s try it for our sample problem
$$\langle \phi(\mathbf{u}), \phi(\mathbf{v}) \rangle = \left\langle (u_1^2, u_2^2, \sqrt{2}u_1 u_2),\; (v_1^2, v_2^2, \sqrt{2}v_1 v_2) \right\rangle$$
$$= u_1^2 v_1^2 + u_2^2 v_2^2 + 2 u_1 v_1 u_2 v_2 = \left( u_1 v_1 + u_2 v_2 \right)^2 = \langle \mathbf{u}, \mathbf{v} \rangle^2$$
Define: $K(\mathbf{u}, \mathbf{v}) = \langle \mathbf{u}, \mathbf{v} \rangle^2$
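The identity above can be checked numerically. A minimal sketch (the test vectors are arbitrary):

```python
import numpy as np

# Check <phi(u), phi(v)> = <u, v>^2 for phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2).
def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

u = np.array([1.5, -0.5])
v = np.array([2.0, 3.0])

lhs = phi(u) @ phi(v)        # explicit feature-space inner product
rhs = (u @ v) ** 2           # kernel evaluation in input space
print(np.isclose(lhs, rhs))  # True
```

The kernel computes the 3-dimensional feature-space inner product without ever forming the features.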
Kernel Function
A kernel is a function K such that
$$K(\mathbf{x}, \mathbf{u}) = \langle \phi(\mathbf{x}), \phi(\mathbf{u}) \rangle_F$$
where $\phi$ is a mapping from input space to feature space F.
There are many possible kernels. The simplest is the linear kernel:
$$K(\mathbf{x}, \mathbf{u}) = \langle \mathbf{x}, \mathbf{u} \rangle$$
Ridge Regression in Feature Space
To predict a new point:
$$g(\phi(\mathbf{x})) = \langle \mathbf{w}, \phi(\mathbf{x}) \rangle = \sum_{i=1}^{\ell} \alpha_i \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}) \rangle = \mathbf{y}'\left( G + \lambda I \right)^{-1}\mathbf{z}, \qquad \text{where } z_i = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}) \rangle$$
To compute the Gram matrix, use the kernel to compute the inner product:
$$G = \phi(X)\phi(X)', \qquad G_{ij} = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle = K(\mathbf{x}_i, \mathbf{x}_j)$$
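Putting the pieces together gives the full kernel ridge regression procedure. A minimal sketch using the quadratic kernel from the earlier slide (the synthetic quadratic target and the tiny $\lambda$ are illustrative assumptions):

```python
import numpy as np

# Kernel ridge regression with K(u, v) = <u, v>^2.
# Fit: alpha = (G + lam*I)^{-1} y.  Predict: g(x) = sum_i alpha_i K(x_i, x).
def kernel(A, B):
    return (A @ B.T) ** 2              # matrix of <a_i, b_j>^2

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))
y = X[:, 0]**2 - X[:, 0] * X[:, 1]     # quadratic target, in the feature span
lam = 1e-6

G = kernel(X, X)
alpha = np.linalg.solve(G + lam * np.eye(40), y)

X_test = rng.normal(size=(5, 2))
y_pred = kernel(X_test, X) @ alpha
y_true = X_test[:, 0]**2 - X_test[:, 0] * X_test[:, 1]
print(np.allclose(y_pred, y_true, atol=1e-3))   # near-exact fit
```

No explicit feature map is ever computed; the kernel does all the feature-space work.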
Nonlinear Regression in Feature Space
The kernel trick works for any dual representation:
$$g(\mathbf{x}) = \langle \phi(\mathbf{x}), \mathbf{w} \rangle_F = \sum_{i=1}^{\ell} \alpha_i \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}) \rangle = \sum_{i=1}^{\ell} \alpha_i K(\mathbf{x}_i, \mathbf{x})$$
Popular Kernels based on vectors
By Hilbert–Schmidt kernels (Courant and Hilbert 1953):
$$\langle \theta(\mathbf{u}), \theta(\mathbf{v}) \rangle \equiv K(\mathbf{u}, \mathbf{v})$$
for certain $\theta$ and K, e.g.:
Degree d polynomial: $K(\mathbf{u}, \mathbf{v}) = \left( \langle \mathbf{u}, \mathbf{v} \rangle + 1 \right)^d$
Radial basis function machine: $K(\mathbf{u}, \mathbf{v}) = \exp\left( -\dfrac{\|\mathbf{u} - \mathbf{v}\|^2}{2\sigma^2} \right)$
Two-layer neural network: $K(\mathbf{u}, \mathbf{v}) = \mathrm{sigmoid}\left( \eta \langle \mathbf{u}, \mathbf{v} \rangle + c \right)$
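The three kernels above are one-liners. A minimal sketch (the hyperparameter values d, sigma, eta, c are arbitrary choices, and tanh is used as the sigmoid, a common convention not specified on the slide):

```python
import numpy as np

def poly_kernel(u, v, d=2):
    return (u @ v + 1) ** d                            # degree-d polynomial

def rbf_kernel(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))  # radial basis

def sigmoid_kernel(u, v, eta=1.0, c=0.0):
    return np.tanh(eta * (u @ v) + c)                  # tanh as the sigmoid

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
print(poly_kernel(u, v), rbf_kernel(u, v), sigmoid_kernel(u, v))
```

Any of these can be dropped into the kernel ridge regression formulas in place of the quadratic kernel.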
Kernels Intuition
Kernels encode the notion of similarity to be used for a specific application.
Documents use the cosine of “bags of words”. Gene sequences can use edit distance.
Similarity defines distance:
$$\|\mathbf{u} - \mathbf{v}\|^2 = (\mathbf{u} - \mathbf{v})'(\mathbf{u} - \mathbf{v}) = \langle \mathbf{u}, \mathbf{u} \rangle - 2\langle \mathbf{u}, \mathbf{v} \rangle + \langle \mathbf{v}, \mathbf{v} \rangle$$
The trick is to get the right encoding for the domain.
Important Points
Kernel method = linear method + embedding in feature space.
Kernel functions are used to do the embedding efficiently.
The feature space is a higher dimensional space, so we must regularize.
Choose a kernel appropriate to the domain.
Kernel Ridge Regression
Simple to derive a kernel method.
Works great in practice with some finessing.
Next time: practical issues; a more standard dual derivation.
Optimal Solution
Want: $\mathbf{y} \approx X\mathbf{w} + b\mathbf{e}$, where $\mathbf{e}$ is a vector of ones.
Mathematical model:
$$\min_{\mathbf{w}, b} L(\mathbf{w}, b, S) = \|\mathbf{y} - (X\mathbf{w} + b\mathbf{e})\|^2 + \lambda \|\mathbf{w}\|^2$$
Optimality conditions:
$$\frac{\partial L(\mathbf{w}, b, S)}{\partial \mathbf{w}} = -2X'(\mathbf{y} - X\mathbf{w} - b\mathbf{e}) + 2\lambda\mathbf{w} = 0$$
$$\frac{\partial L(\mathbf{w}, b, S)}{\partial b} = -2\mathbf{e}'(\mathbf{y} - X\mathbf{w} - b\mathbf{e}) = 0$$
Let’s Try It: In-Class Lab
Basic ridge regression:
$$\min_{\mathbf{w}} L_\lambda(\mathbf{w}, S) = \lambda \|\mathbf{w}\|^2 + \|\mathbf{y} - X\mathbf{w}\|^2$$
Optimality condition:
$$\mathbf{w} = \left( X'X + \lambda I_n \right)^{-1} X'\mathbf{y}, \qquad \lambda > 0$$
Dual equivalent:
$$\boldsymbol{\alpha} = \left( G + \lambda I \right)^{-1}\mathbf{y}, \qquad \text{where } G_{i,j} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle, \qquad \mathbf{w} = X'\boldsymbol{\alpha}$$
Better model
Basic ridge regression with bias:
$$\min_{\mathbf{w}, b} L_\lambda(\mathbf{w}, S) = \lambda \|\mathbf{w}\|^2 + \|\mathbf{y} - (X\mathbf{w} + b\mathbf{e})\|^2$$
Center X and y to get $X_c$, $\mathbf{y}_c$:
$$\mathbf{w} = \left( X_c'X_c + \lambda I_n \right)^{-1} X_c'\mathbf{y}_c, \qquad \lambda > 0$$
Dual equivalent:
$$\boldsymbol{\alpha} = \left( G_c + \lambda I \right)^{-1}\mathbf{y}_c, \qquad \text{where } (G_c)_{i,j} = \langle \mathbf{x}_i^c, \mathbf{x}_j^c \rangle, \qquad \mathbf{w} = X_c'\boldsymbol{\alpha}$$
Recenter Data
Shift y by its mean:
$$\mu = \frac{1}{\ell}\sum_{i=1}^{\ell} y_i, \qquad y_i^c := y_i - \mu$$
Shift x by its mean:
$$\bar{\mathbf{x}} = \frac{1}{\ell}\sum_{i=1}^{\ell} \mathbf{x}_i, \qquad \mathbf{x}_i^c := \mathbf{x}_i - \bar{\mathbf{x}}$$
Ridge Regression with bias
Center the data by $\bar{\mathbf{x}}$ and $\mu$:
$$X_c = X - \mathbf{e}\bar{\mathbf{x}}', \qquad \mathbf{y}_c = \mathbf{y} - \mu\mathbf{e}$$
Calculate $\mathbf{w} = \left( X_c'X_c + \lambda I \right)^{-1} X_c'\mathbf{y}_c$.
Calculate $b = \mu$.
To predict a new point:
$$g(\mathbf{x}) = (\mathbf{x} - \bar{\mathbf{x}})'\mathbf{w} + b$$
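The centering recipe can be sketched end to end. A minimal example (the synthetic linear target with intercept 7 and the tiny $\lambda$ are illustrative assumptions):

```python
import numpy as np

# Ridge regression with bias via centering: fit w on centered data,
# set b to the mean of y, predict g(x) = (x - x_bar)'w + b.
rng = np.random.default_rng(4)
X = rng.normal(size=(60, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 7.0     # nonzero intercept

x_bar = X.mean(axis=0)
mu = y.mean()
Xc, yc = X - x_bar, y - mu                   # centered data

lam = 1e-6
w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(3), Xc.T @ yc)
b = mu

x_new = np.array([1.0, 1.0, 1.0])
g = (x_new - x_bar) @ w + b
print(np.isclose(g, 8.5, atol=1e-3))         # 2 - 1 + 0.5 + 7 = 8.5
```

Centering lets the unbiased ridge formulas handle the intercept, which is why the lab recommends it as the better model.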