Algorithmic stability
Lecture of the course 'Foundations of Machine learning'

Y. Ryeznik, J. Vaicenavicius, T. Wiklund

Department of Mathematics, Uppsala University

October 7, 2015
Stability-based generalisation error bound
Notation
- Let X denote the set of examples and Y the set of target values.
- Let z = (x, y) ∈ X × Y.
- Let h ∈ H := {h : X → Y′} (Y′ might differ from Y).
- Let L : Y′ × Y → R₊ be a loss function and L_z(h) := L(h(x), y).
- Given a sample S = (z₁, ..., z_m), the empirical error is

  R̂_S(h) := (1/m) ∑_{i=1}^{m} L_{z_i}(h).

- Let D denote the distribution of the sample points. Then the generalisation error is

  R(h) := E_{z∼D}[L_z(h)].

- Given an algorithm A and a sample S, let h_S ∈ H be the hypothesis returned by A.
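The definitions above can be made concrete with a small numerical sketch (illustrative only; the squared loss, the linear hypothesis, and the data distribution below are assumptions, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(prediction, y):
    # Squared loss L(h(x), y) = (h(x) - y)^2 (an illustrative choice of L)
    return (prediction - y) ** 2

def empirical_error(h, sample):
    # R_hat_S(h) = (1/m) * sum_i L_{z_i}(h)
    return np.mean([loss(h(x), y) for x, y in sample])

def generalisation_error(h, draw_z, n=100_000):
    # Monte-Carlo estimate of R(h) = E_{z ~ D}[L_z(h)]
    return np.mean([loss(h(x), y) for x, y in (draw_z() for _ in range(n))])

def draw_z():
    # One draw z = (x, y) from an assumed distribution D
    x = rng.uniform(-1, 1)
    y = 2 * x + rng.normal(scale=0.1)
    return x, y

h = lambda x: 2 * x                      # a fixed hypothesis
S = [draw_z() for _ in range(50)]
print(empirical_error(h, S))             # fluctuates around the noise variance 0.01
print(generalisation_error(h, draw_z))   # Monte-Carlo estimate, close to 0.01
```

The empirical error fluctuates around the generalisation error from sample to sample; the bound on the next slides quantifies this gap.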
Stability-based generalisation guarantee
Definition
A loss function L is said to be bounded by M ≥ 0 if

∀h ∈ H ∀z ∈ X × Y : L_z(h) ≤ M.
Definition (Uniform stability)
Let S and S′ be training samples of size m differing in a single point. A learning algorithm A is called β-stable for some β ≥ 0 if, for all such S and S′,

∀z ∈ X × Y : |L_z(h_S) − L_z(h_{S′})| ≤ β.
Theorem (Generalisation error bound from a training sample)
Let the loss function L be bounded by some M ≥ 0, let A be a β-stable learning algorithm, and let S be a sample of m points drawn i.i.d. from distribution D. Then, with probability no less than 1 − δ,

R(h_S) ≤ R̂_S(h_S) + β + (2mβ + M)·√(log(1/δ)/(2m)).
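As a sketch of how the bound behaves, the right-hand side can be evaluated numerically (the values of the empirical error, β, and M below are illustrative assumptions):

```python
import math

def stability_bound(emp_error, beta, M, m, delta):
    # R(h_S) <= R_hat_S(h_S) + beta + (2*m*beta + M) * sqrt(log(1/delta) / (2m))
    return emp_error + beta + (2 * m * beta + M) * math.sqrt(math.log(1 / delta) / (2 * m))

# beta of order 1/m keeps the 2*m*beta term constant, so the bound still
# converges to the empirical error as m grows.
for m in (100, 10_000, 1_000_000):
    print(m, stability_bound(emp_error=0.1, beta=1 / m, M=1.0, m=m, delta=0.05))
```

Note that a β that does not decay at least like 1/√m makes the 2mβ term blow up and the bound vacuous; the regularised kernel methods later in the lecture achieve β = O(1/m).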
Proof of generalisation error bound
- Define Φ(S) := R(h_S) − R̂_S(h_S), where S is a sample.
- Let S = (z₁, ..., z_m) and S′ = (z₁, ..., z_{m−1}, z′_m) be two samples of size m differing in a single point.
- A straightforward calculation using the boundedness of L and the β-stability of A yields

  |Φ(S) − Φ(S′)| ≤ 2β + M/m.

- Hence we can apply McDiarmid's inequality, which gives that

  Φ(S) ≤ E_{S∼D^m}[Φ(S)] + (2mβ + M)·√(log(1/δ)/(2m))

  occurs with probability at least 1 − δ.
Proof cont.
It is straightforward to see that

E_{S∼D^m}[R(h_S)] = E_{S∼D^m}[E_{z∼D}[L_z(h_S)]] = E_{S,z∼D^{m+1}}[L_z(h_S)],

E_{S∼D^m}[R̂_S(h_S)] = E_{(z₁,...,z_m,z)∼D^{m+1}}[L_z(h_{(z₁,...,z_{m−1},z)})].

Then the absolute value of the first term

|E_{S∼D^m}[Φ(S)]| = |E_{(z₁,...,z_m,z)∼D^{m+1}}[L_z(h_S)] − E_{(z₁,...,z_m,z)∼D^{m+1}}[L_z(h_{(z₁,...,z_{m−1},z)})]|
                  ≤ E_{(z₁,...,z_m,z)∼D^{m+1}}[|L_z(h_S) − L_z(h_{(z₁,...,z_{m−1},z)})|]
                  ≤ E_{(z₁,...,z_m,z)∼D^{m+1}}[β]
                  = β,

which finishes the proof of the claim.
Kernel methods
Non-linear separation
- In most problems linear separation is not possible.
- Perhaps linear separation in a bigger space H, after a non-linear transformation

  Φ : X → H,

  is still possible?
Positive definite symmetric kernels
Definition
- A function K : X × X → R is called a kernel over X.
- It is said to be positive definite symmetric (PDS) if for any {x₁, ..., x_m} ⊂ X, the matrix K := [K(xᵢ, xⱼ)]ᵢⱼ ∈ R^{m×m} is symmetric positive semidefinite (symmetric non-negative definite).
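The definition can be checked numerically on a finite point set by inspecting the eigenvalues of the Gram matrix (a sketch; the example kernels are assumptions for illustration):

```python
import numpy as np

def gram_matrix(kernel, xs):
    # Gram matrix K_ij = kernel(x_i, x_j)
    return np.array([[kernel(a, b) for b in xs] for a in xs])

def is_pds(kernel, xs, tol=1e-10):
    # PDS on xs <=> Gram matrix is symmetric with non-negative eigenvalues
    # (tol absorbs floating-point rounding)
    K = gram_matrix(kernel, xs)
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

xs = np.linspace(-2, 2, 7)
print(is_pds(lambda a, b: (a * b + 1) ** 2, xs))   # degree-2 polynomial kernel: True
print(is_pds(lambda a, b: -abs(a - b), xs))        # not a PDS kernel: False
```

Of course this only certifies the property on the chosen points; the PDS property itself is a statement over all finite subsets of X.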
Polynomial kernel
Definition
The kernel K : R^N × R^N → R defined by

K(x, x′) := (x · x′ + c)^d   (c > 0, d ∈ N)

is called the polynomial kernel of degree d.
Example (XOR classification)
Figure: Polynomial kernel with c = 1 used.
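For the XOR example, the degree-2 polynomial kernel with c = 1 on R² admits the standard explicit feature map, under which the four XOR points become linearly separable (a sketch; the feature-map expansion below is the usual one for (x · x′ + 1)²):

```python
import numpy as np

def phi(x):
    # Explicit feature map with <phi(x), phi(x')> = (x . x' + 1)^2 on R^2
    x1, x2 = x
    r2 = np.sqrt(2)
    return np.array([x1**2, x2**2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0])

X = np.array([(-1, -1), (-1, 1), (1, -1), (1, 1)])   # XOR inputs
y = np.array([1, -1, -1, 1])                          # XOR labels

# Sanity check: the inner product of features equals the kernel value.
for a in X:
    for b in X:
        assert np.isclose(phi(a) @ phi(b), (a @ b + 1) ** 2)

# In feature space the coordinate sqrt(2)*x1*x2 alone separates the classes,
# so the hyperplane with normal w classifies XOR correctly:
w = np.array([0, 0, 1, 0, 0, 0])
print([int(np.sign(phi(x) @ w)) for x in X])   # matches the labels y
```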
Reproducing Kernel Hilbert Space
Theorem
Let K : X × X → R be a PDS kernel. Then there exists a Hilbert space H and a map Φ : X → H such that

K(x, y) = 〈Φ(x), Φ(y)〉

for all x, y ∈ X.

- The space H is called a feature space associated to K.
- The map Φ is called a feature map.
Proof
- Define Φ : X → R^X by

  Φ(x)(y) = K(x, y)   (x, y ∈ X).

- Let H₀ := Span({Φ(x) : x ∈ X}).
- Define a map 〈·,·〉 : H₀ × H₀ → R by specifying that for any

  f = ∑_{i∈I} aᵢ Φ(xᵢ),   g = ∑_{j∈J} bⱼ Φ(yⱼ)   in H₀,

  the value

  〈f, g〉 := ∑_{i∈I, j∈J} aᵢ bⱼ K(xᵢ, yⱼ) = ∑_{j∈J} bⱼ f(yⱼ) = ∑_{i∈I} aᵢ g(xᵢ).

- From the above, the map 〈·,·〉 is well-defined, symmetric, and bilinear.
Proof cont.
- 〈·,·〉 is positive semidefinite.
  Proof: 〈f, f〉 = ∑_{i,j∈I} aᵢ aⱼ K(xᵢ, xⱼ) ≥ 0.
- 〈·,·〉 is a PDS kernel on H₀.
  Proof: for any f₁, ..., f_m ∈ H₀ and any c₁, ..., c_m ∈ R, we have

  ∑_{1≤i,j≤m} cᵢ cⱼ 〈fᵢ, fⱼ〉 = 〈∑_{i=1}^m cᵢ fᵢ, ∑_{j=1}^m cⱼ fⱼ〉 ≥ 0.

- Cauchy–Schwarz for PDS kernels: if K is a PDS kernel, then

  K(x, y)² ≤ K(x, x) K(y, y)

  for all x, y ∈ X.
  Proof: the matrix [K(xᵢ, xⱼ)]_{1≤i,j≤2} is symmetric positive semidefinite, and so its determinant is non-negative.
Proof cont.
- By Cauchy–Schwarz for PDS kernels,

  〈f, Φ(x)〉² ≤ 〈f, f〉〈Φ(x), Φ(x)〉.

- From the definition of 〈·,·〉, we have

  f(x) = 〈f, Φ(x)〉   (reproducing property)

  for all f ∈ H₀ and all x ∈ X.
- Combining the two above,

  f(x)² ≤ 〈f, f〉〈Φ(x), Φ(x)〉

  for all x ∈ X. Consequently, 〈·,·〉 is definite.
- Also, for any fixed x ∈ X, the map f ↦ 〈f, Φ(x)〉 is Lipschitz (by Cauchy–Schwarz) and so continuous.
- Finally, we define H := H̄₀ (the completion of H₀) and call it the reproducing kernel Hilbert space.

Q.E.D.
Normalised kernels
Definition
Given a kernel K, we define the normalised kernel K′ by

K′(x, x′) := 0                                    if K(x, x) = 0 or K(x′, x′) = 0,
K′(x, x′) := K(x, x′)/√(K(x, x) K(x′, x′))        otherwise,

for all x, x′ ∈ X.

- By definition, K′(x, x) = 1 for all x ∈ X with K(x, x) > 0.

Lemma
If K is a PDS kernel, then its normalised kernel K′ is also PDS.

- Let K′ be a normalised PDS kernel with feature map Φ. Then

  K′(x, x′) = 〈Φ(x), Φ(x′)〉

  can be interpreted as the cosine of the angle between the feature vectors Φ(x) and Φ(x′) in a feature Hilbert space H.
PDS kernels - closure properties
Theorem
PDS kernels are closed under:
- sum,
- product,
- tensor product,
- pointwise limit,
- composition with a power series x ↦ ∑_{n=0}^∞ aₙ xⁿ with aₙ ≥ 0 for all n ∈ N.
Corollary
The Gaussian kernel K : R^N × R^N → R defined by

K(x, x′) := exp(−‖x′ − x‖²/(2σ²)),

where x, x′ ∈ R^N, is PDS.
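The corollary can be sanity-checked numerically: the Gram matrix of the Gaussian kernel on random points has non-negative eigenvalues (a sketch; the points and the value of σ are arbitrary choices):

```python
import numpy as np

def gaussian_gram(X, sigma):
    # K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))           # 20 random points in R^3
K = gaussian_gram(X, sigma=0.7)
print(np.linalg.eigvalsh(K).min())     # non-negative up to rounding: PDS
```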
Regularised kernel methods
The hypothesis space equals H for some RKHS associated to a kernel

K : X × X → R,

and the regularised estimator (on a sample S = ((x₁, y₁), ..., (x_m, y_m))) is

h_S = argmin_h F_S(h),   where F_S(h) := R̂_S(h) + λ‖h‖².

Recall the empirical risk (the average badness of the choices made by h):

R̂_S(h) = (1/m) ∑_{i=1}^m L(h(xᵢ), yᵢ).
Theorem (Stability of regularised kernel methods)
If L is convex, differentiable, and σ-Lipschitz in its first argument, i.e.

|L(v, y) − L(v′, y)| ≤ σ|v − v′|,

and the kernel K is bounded, i.e. for some r > 0

K(x, x) ≤ r²,

then the regularised estimator

h_S = argmin_h F_S(h)

is β-stable with β = σ²r²/(mλ).
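The 1/m decay of the stability constant can be observed empirically. Below is a minimal sketch using kernel ridge regression with a Gaussian kernel as the regularised estimator; the closed-form dual solution α = (K + mλI)⁻¹y for the mean-based objective, and the specific data, are assumptions for illustration:

```python
import numpy as np

def fit(K_fun, X, y, lam):
    # Minimise (1/m) sum_i (h(x_i) - y_i)^2 + lam ||h||^2 over the RKHS.
    # Representer theorem: h(x) = sum_i alpha_i K(x_i, x) with
    # alpha = (K + m*lam*I)^{-1} y.
    m = len(X)
    K = np.array([[K_fun(a, b) for b in X] for a in X])
    alpha = np.linalg.solve(K + m * lam * np.eye(m), y)
    return lambda x: float(sum(a * K_fun(xi, x) for a, xi in zip(alpha, X)))

def perturbation(m, lam=0.1, seed=0):
    # sup_x |h_S(x) - h_S'(x)| over a grid, for S, S' differing in one point
    K_fun = lambda a, b: np.exp(-(a - b) ** 2 / 2)   # K(x, x) = 1, so r = 1
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1, 1, size=m)
    y = np.sin(3 * X)
    Xp, yp = X.copy(), y.copy()
    Xp[0], yp[0] = 0.5, -1.0                         # replace a single point
    h, hp = fit(K_fun, X, y, lam), fit(K_fun, Xp, yp, lam)
    return max(abs(h(t) - hp(t)) for t in np.linspace(-1, 1, 201))

print(perturbation(50), perturbation(200))   # shrinks roughly like 1/m
```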
Recall β-stability
For two samples

S = ((x₁, y₁), ..., (x_m, y_m))

and

S′ = ((x′₁, y′₁), ..., (x′_m, y′_m))

such that (x_k, y_k) ≠ (x′_k, y′_k) for at most one k, we must show that

|L(h_S(x), y) − L(h_{S′}(x), y)| ≤ β.
This one weird trick... (two really)
- Translate the minimisation property of h_S and h_{S′} using convexity.
- Bound the pointwise difference |h(x) − h′(x)| by the H-norm ‖h − h′‖.
Step One
Observe that h_S and h_{S′} are minima of the differentiable F_S and F_{S′}, so

∇F_S(h_S) = 0 and ∇F_{S′}(h_{S′}) = 0.
Convexity and hyperplanes
Figure: a convex differentiable function G lies above its tangent hyperplane at any h₀:

G(h₀) + 〈∇G(h₀), h − h₀〉 ≤ G(h).
Convexity and hyperplanes
Rearranged,

0 ≤ G(h) − G(h₀) − 〈∇G(h₀), h − h₀〉,

where the right-hand side is sometimes called the Bregman divergence.
Convexity and hyperplanes (for the risk!)
The same inequality applied to the empirical risk:

0 ≤ R̂_S(h) − R̂_S(h₀) − 〈∇R̂_S(h₀), h − h₀〉.
Simple RKHS property
|h(x)| = |〈h, K(x, ·)〉|               (RKHS reproducing property)
       ≤ ‖h‖ ‖K(x, ·)‖                (Cauchy–Schwarz–Bunyakovsky)
       = ‖h‖ √〈K(x, ·), K(x, ·)〉
       = ‖h‖ √K(x, x)                  (RKHS reproducing property)
       ≤ r ‖h‖.
Back to β-stability
Let S and S′ be samples differing in one point, and let x and y be arbitrary. The quantity to bound for β-stability satisfies

|L(h_S(x), y) − L(h_{S′}(x), y)| ≤ σ|h_S(x) − h_{S′}(x)|      (σ-Lipschitz)
                                = σ|(h_S − h_{S′})(x)|
                                ≤ σr ‖h_S − h_{S′}‖.
Rewrite in terms of derivative
Denote N(h) = ‖h‖²; then ∇N(h) = 2h and F_S = R̂_S + λN.

‖h_S − h_{S′}‖² = 〈h_S − h_{S′}, h_S − h_{S′}〉
               = −〈h_S, h_{S′} − h_S〉 − 〈h_{S′}, h_S − h_{S′}〉
               = (1/2)(−〈2h_S, h_{S′} − h_S〉 − 〈2h_{S′}, h_S − h_{S′}〉)
               = (1/2)(−〈∇N(h_S), h_{S′} − h_S〉 − 〈∇N(h_{S′}), h_S − h_{S′}〉)
               = (1/(2λ))(−〈∇λN(h_S), h_{S′} − h_S〉 − 〈∇λN(h_{S′}), h_S − h_{S′}〉).
Rewrite in terms of derivative (cont.)
By convexity of R̂_S, the Bregman divergence R̂_S(h_{S′}) − R̂_S(h_S) − 〈∇R̂_S(h_S), h_{S′} − h_S〉 is ≥ 0, so

−〈∇λN(h_S), h_{S′} − h_S〉
  ≤ R̂_S(h_{S′}) − R̂_S(h_S) − 〈∇R̂_S(h_S), h_{S′} − h_S〉 − 〈∇λN(h_S), h_{S′} − h_S〉
  = R̂_S(h_{S′}) − R̂_S(h_S) − 〈∇(R̂_S + λN)(h_S), h_{S′} − h_S〉
  = R̂_S(h_{S′}) − R̂_S(h_S) − 〈∇F_S(h_S), h_{S′} − h_S〉
  = R̂_S(h_{S′}) − R̂_S(h_S),

since ∇F_S(h_S) = 0.
Rewrite in terms of derivative (cont.)
Applying the previous bound to both samples:

‖h_S − h_{S′}‖² = (1/(2λ))(−〈∇λN(h_S), h_{S′} − h_S〉 − 〈∇λN(h_{S′}), h_S − h_{S′}〉)
              ≤ (1/(2λ))(R̂_S(h_{S′}) − R̂_S(h_S) + R̂_{S′}(h_S) − R̂_{S′}(h_{S′}))
              = (1/(2λ))(R̂_S(h_{S′}) − R̂_{S′}(h_{S′}) + R̂_{S′}(h_S) − R̂_S(h_S)).
Use that S ≈ S′

R̂_S(h_{S′}) − R̂_{S′}(h_{S′})
  = (1/m) ∑ᵢ L(h_{S′}(xᵢ), yᵢ) − (1/m) ∑ᵢ L(h_{S′}(x′ᵢ), y′ᵢ)
  = (1/m) ∑_{i≠k} (L(h_{S′}(xᵢ), yᵢ) − L(h_{S′}(x′ᵢ), y′ᵢ)) + (1/m)(L(h_{S′}(x_k), y_k) − L(h_{S′}(x′_k), y′_k))
  = (1/m) ∑_{i≠k} (L(h_{S′}(xᵢ), yᵢ) − L(h_{S′}(xᵢ), yᵢ)) + (1/m)(L(h_{S′}(x_k), y_k) − L(h_{S′}(x′_k), y′_k))
  = (1/m)(L(h_{S′}(x_k), y_k) − L(h_{S′}(x′_k), y′_k)),

since (xᵢ, yᵢ) = (x′ᵢ, y′ᵢ) for all i ≠ k.
Put some finishing touches

‖h_S − h_{S′}‖² ≤ (1/(2λ))(R̂_S(h_{S′}) − R̂_{S′}(h_{S′}) + R̂_{S′}(h_S) − R̂_S(h_S))
  = (1/(2λm))(L(h_{S′}(x_k), y_k) − L(h_{S′}(x′_k), y′_k) + L(h_S(x′_k), y′_k) − L(h_S(x_k), y_k))
  = (1/(2λm))(L(h_{S′}(x_k), y_k) − L(h_S(x_k), y_k) + L(h_S(x′_k), y′_k) − L(h_{S′}(x′_k), y′_k))
  ≤ (1/(2λm))(σr‖h_{S′} − h_S‖ + σr‖h_S − h_{S′}‖)
  = (σr/(λm)) ‖h_S − h_{S′}‖.

Cancelling one factor of ‖h_S − h_{S′}‖ gives ‖h_S − h_{S′}‖ ≤ σr/(λm).
Finally combine
|L(h_S(x), y) − L(h_{S′}(x), y)| ≤ σr ‖h_S − h_{S′}‖ ≤ σr · σr/(λm) = σ²r²/(λm) = β.
Differentiability
The proof only requires supporting hyperplanes with
- ∇F_S(h_S) = ∇F_{S′}(h_{S′}) = 0,
- ∇F_S(h_S) = ∇R̂_S(h_S) + ∇λN(h_S).

It is sufficient to pick subdifferentials!
Lipschitz
The Lipschitz condition is only required on elements of the form h(x), h′(x) for h, h′ ∈ H. That is, it is sufficient that L satisfies

|L(h(x), y) − L(h′(x), y)| ≤ σ|h(x) − h′(x)|.

Such an L is said to be σ-admissible with respect to H.
Final observation (Lemma 11.1)
Recall that we showed |h(x)| ≤ r‖h‖ when K is bounded. If also sup_y L(0, y) = B < ∞, then

λ‖h_S‖² ≤ F_S(h_S) ≤ F_S(0) = R̂_S(0) + λ‖0‖² ≤ B,

so

|h_S(x)| ≤ r‖h_S‖ ≤ r√(B/λ).
Applications
Application to classification algorithm: SVMs
The standard hinge loss L_hinge : R × {−1, +1} → R is given by

L_hinge(y′, y) = 0          if 1 − yy′ ≤ 0,
L_hinge(y′, y) = 1 − yy′    otherwise.
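A direct implementation of this piecewise definition (equivalently, max(0, 1 − yy′)):

```python
import numpy as np

def hinge_loss(y_pred, y):
    # L_hinge(y', y) = 0 if 1 - y*y' <= 0, else 1 - y*y'
    return np.maximum(0.0, 1.0 - y * y_pred)

print(hinge_loss(2.0, +1))    # 0.0: correct with margin at least 1
print(hinge_loss(0.3, +1))    # 0.7: correct but inside the margin
print(hinge_loss(-1.0, +1))   # 2.0: misclassified
```

The slopes of the two pieces are 0 and −y with |y| = 1, which is why the loss is 1-Lipschitz in its first argument, as used on the next slides.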
Application to classification algorithm: SVMs
Stability-based learning bound for SVMs
- Assume that K(x, x) ≤ r² for all x ∈ X for some r ≥ 0.
- Let h_S denote the hypothesis returned by SVMs when trained on an i.i.d. sample S of size m.

Then, for any δ > 0, the following inequality holds with probability at least 1 − δ:

R(h_S) ≤ R̂_S(h_S) + r²/(mλ) + (2r²/λ + r/√λ + 1)·√(log(1/δ)/(2m)).
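The SVM bound can be evaluated numerically as a sketch (the empirical error and the values of r, λ, and δ below are illustrative assumptions):

```python
import math

def svm_stability_bound(emp_error, r, lam, m, delta):
    # R(h_S) <= R_hat_S(h_S) + r^2/(m*lam)
    #           + (2 r^2/lam + r/sqrt(lam) + 1) * sqrt(log(1/delta) / (2m))
    slack = (2 * r**2 / lam + r / math.sqrt(lam) + 1) * math.sqrt(math.log(1 / delta) / (2 * m))
    return emp_error + r**2 / (m * lam) + slack

# With a normalised kernel (r = 1) the bound tightens as the sample grows:
for m in (1_000, 100_000):
    print(m, svm_stability_bound(emp_error=0.08, r=1.0, lam=0.5, m=m, delta=0.05))
```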
Proof
By its definition, L_hinge(·, y) is Lipschitz with constant 1: for any (x, y) ∈ X × Y and h, h′ ∈ H,

|L_hinge(h(x), y) − L_hinge(h′(x), y)| ≤ |h(x) − h′(x)|,

which means that the hinge loss function is σ-admissible with σ = 1.
Proof cont.
- According to the Theorem (Stability of regularised kernel methods),

  β ≤ r²/(mλ).

- Since |L(0, y)| ≤ 1 for any y ∈ Y, Lemma 11.1 gives

  |h_S(x)| ≤ r/√λ  ⇒  L_hinge(h_S(x), y) ≤ M = r/√λ + 1.

Hence

R(h_S) ≤ R̂_S(h_S) + β + (2mβ + M)·√(log(1/δ)/(2m))
      ≤ R̂_S(h_S) + r²/(mλ) + (2r²/λ + r/√λ + 1)·√(log(1/δ)/(2m)).
Stability of linear regression: kernel ridge regression (KRR)
Kernel ridge regression is defined by the minimisation of the following objective function:

min_w F(w) = ∑_{i=1}^m (w · Φ(xᵢ) − yᵢ)² + λ‖w‖².
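Since the objective is strictly convex in w, it has the well-known closed-form minimiser w = (ΦᵀΦ + λI)⁻¹Φᵀy. A quick numerical check (the random features and targets below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(30, 5))   # rows are the feature vectors Phi(x_i)
y = rng.normal(size=30)
lam = 0.5

def F(w):
    # F(w) = sum_i (w . Phi(x_i) - y_i)^2 + lam * ||w||^2
    return np.sum((Phi @ w - y) ** 2) + lam * np.dot(w, w)

# Closed-form minimiser: the gradient 2 Phi^T (Phi w - y) + 2 lam w vanishes.
w_star = np.linalg.solve(Phi.T @ Phi + lam * np.eye(5), Phi.T @ y)

# Random perturbations never decrease the objective:
print(all(F(w_star) <= F(w_star + 0.1 * rng.normal(size=5)) for _ in range(100)))
# prints True
```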
Stability of linear regression: kernel ridge regression (KRR)
Square loss L₂ : R × R → R:

L₂(y′, y) = (y′ − y)².
Stability of linear regression: kernel ridge regression (KRR)
Stability-based learning bound for KRR
- Assume that K(x, x) ≤ r² for all x ∈ X for some r ≥ 0, and that L₂ is bounded by M ≥ 0.
- Let h_S denote the hypothesis returned by KRR when trained on an i.i.d. sample S of size m.

Then, for any δ > 0, the following inequality holds with probability at least 1 − δ:

R(h_S) ≤ R̂_S(h_S) + 4Mr²/(mλ) + (8Mr²/λ + M)·√(log(1/δ)/(2m)).
Proof
Since L₂ is bounded by M, we have |h(x) − y| ≤ √M on the relevant range, so L₂(·, y) is Lipschitz there with constant 2√M: for any (x, y) ∈ X × Y and h, h′ ∈ H,

|L₂(h(x), y) − L₂(h′(x), y)| = |h(x) + h′(x) − 2y| · |h(x) − h′(x)| ≤ 2√M |h(x) − h′(x)|,

which means that the square loss function is σ-admissible with σ = 2√M.
Proof cont.
- According to the Theorem (Stability of regularised kernel methods),

  β ≤ 4Mr²/(mλ).

Hence

R(h_S) ≤ R̂_S(h_S) + β + (2mβ + M)·√(log(1/δ)/(2m))
      ≤ R̂_S(h_S) + 4Mr²/(mλ) + (8Mr²/λ + M)·√(log(1/δ)/(2m)).