17.06.13 1
Machine Learning
Support Vector Machine (SVM)
Prof. Dr. Volker Sperschneider
AG Maschinelles Lernen und Natürlichsprachliche Systeme
Institut für Informatik Technische Fakultät
Albert-Ludwigs-Universität Freiburg
17.06.13 2

SVM

I. Large margin linear separability
II. Optimization theory
III. Maximum margin classifier at work
IV. Kernel functions and kernel trick
V. SVM learnability theory
VI. Extension to soft margin
17.06.13 3
SVM
I. Large margin linear separability
17.06.13 4

Architecture

A linear threshold unit with inputs $x_1, \dots, x_n$, weights $w_1, \dots, w_n$ and bias $b$:

$$x = (x_1, \dots, x_n) \in \mathbb{R}^n, \qquad w = (w_1, \dots, w_n) \in \mathbb{R}^n, \qquad b \in \mathbb{R}$$

$$\mathrm{net}(w, b, x) = w^T x + b = \sum_{i=1}^{n} w_i x_i + b$$

$$y = \begin{cases} +1 & \text{if } \mathrm{net}(w, b, x) \ge 0 \\ -1 & \text{if } \mathrm{net}(w, b, x) < 0 \end{cases}$$
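As a minimal sketch of this architecture (ours, not from the slides; the names `net` and `classify` are hypothetical):

```python
import numpy as np

def net(w, b, x):
    """Net input of the linear unit: w^T x + b."""
    return np.dot(w, x) + b

def classify(w, b, x):
    """Threshold the net input: +1 on or above the hyperplane, -1 below."""
    return 1 if net(w, b, x) >= 0 else -1

# Toy check in R^2: hyperplane x_1 + x_2 - 1 = 0
w, b = np.array([1.0, 1.0]), -1.0
print(classify(w, b, np.array([2.0, 2.0])))    # +1 (positive halfspace)
print(classify(w, b, np.array([-1.0, -1.0])))  # -1 (negative halfspace)
```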
17.06.13 5

Training set

$$(x_1, d_1), (x_2, d_2), \dots, (x_l, d_l), \qquad x_1, \dots, x_l \in \mathbb{R}^n, \qquad d_1, \dots, d_l \in \{-1, +1\}$$

Set of $l$ labelled (classified) vectors. We assume that both positive and negative training vectors are present.
17.06.13 6

Hyperplanes and Halfspaces

$$H(w, b) = \{\, x \in \mathbb{R}^n \mid w^T x + b = 0 \,\}$$

$$H^{+}(w, b) = \{\, x \in \mathbb{R}^n \mid w^T x + b > 0 \,\}$$

$$H^{-}(w, b) = \{\, x \in \mathbb{R}^n \mid w^T x + b < 0 \,\}$$
17.06.13 7

Linear Separability

(Figure: hyperplane $H(w,b)$ with halfspaces $H^{+}(w,b)$ and $H^{-}(w,b)$; the point $-\frac{b}{\|w\|^2}\, w$ lies on the hyperplane:)

$$w^T\!\left(-\frac{b}{\|w\|^2}\, w\right) + b = -\frac{b}{\|w\|^2}\, w^T w + b = -b + b = 0$$
17.06.13 8

The distance of an arbitrary vector $z$ to the hyperplane is

$$\frac{|w^T z + b|}{\|w\|}$$

The signed distance (> 0 for vectors in the positive halfspace, < 0 for vectors in the negative halfspace) of an arbitrary vector $z$ to the hyperplane is

$$\frac{w^T z + b}{\|w\|}$$
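A quick numerical check of the two distance formulas (our example, not from the slides): take $w = (3, 4)^T$, $b = -5$, $z = (2, 1)^T$, so $\|w\| = 5$. Then

$$\frac{w^T z + b}{\|w\|} = \frac{3 \cdot 2 + 4 \cdot 1 - 5}{5} = \frac{5}{5} = 1,$$

so $z$ lies at distance 1 on the positive side of the hyperplane.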
17.06.13 9

Take two arbitrary vectors $x, y$ on the hyperplane:

$$w^T x + b = 0 = w^T y + b$$

The difference vector is perpendicular to $w$:

$$w^T(x - y) = w^T x - w^T y = (-b) - (-b) = 0$$

Thus the distance of the origin to the hyperplane is

$$\frac{|b|}{\|w\|}$$
17.06.13 10

(Figure: $z$ in the positive halfspace, decomposed as $z = u + v$ with $u$ on the hyperplane and $v$ parallel to $w$; the foot point of the hyperplane is $-\frac{b}{\|w\|^2}\, w$.)

$$\frac{w^T z + b}{\|w\|} = \frac{w^T(u + v) + b}{\|w\|} = \frac{w^T u + b + w^T v}{\|w\|} = \frac{w^T v}{\|w\|} = \frac{\|w\| \cdot \|v\|}{\|w\|} = \|v\|$$
17.06.13 11

(Figure: $z$ in the negative halfspace, decomposed as $z = u - v$ with $u$ on the hyperplane and $v$ parallel to $w$.)

$$\frac{w^T z + b}{\|w\|} = \frac{w^T(u - v) + b}{\|w\|} = \frac{w^T u + b - w^T v}{\|w\|} = \frac{-w^T v}{\|w\|} = \frac{-\|w\| \cdot \|v\|}{\|w\|} = -\|v\|$$
17.06.13 12

Unfavourable separating lines
17.06.13 13

Favourable separating line
17.06.13 14
due to large margin
17.06.13 15

Maximum Margin Separation

Given training set $T = \{(x_1, d_1), (x_2, d_2), \dots, (x_l, d_l)\}$, find weight vector $w$ and threshold $b$ that maximize the margin of hyperplane $H(w,b)$ w.r.t. $T$:

$$\mu(w, b) = \min_{k=1,\dots,l} \frac{|w^T x_k + b|}{\|w\|}$$

$$(w^*, b^*) = \arg\max_{w,b}\, \mu(w, b) = \arg\max_{w,b}\, \min_{k=1,\dots,l} \frac{|w^T x_k + b|}{\|w\|}$$
17.06.13 16

Normal form of maximum margin

The double occurrence of $w$ in the definition of the margin, in numerator and denominator, can be avoided. Simply scale $w, b$ with a suitable factor $\lambda > 0$. The scaled parameters define the same hyperplane and halfspaces as before. Use scaled $w, b$ such that

$$\min_{k=1,\dots,l} |w^T x_k + b| = 1$$
17.06.13 17

Normal form of maximum margin

Constraints after scaling:

$$d_k = +1 \;\Rightarrow\; w^T x_k + b \ge +1 \qquad \forall k = 1, \dots, l$$
$$d_k = -1 \;\Rightarrow\; w^T x_k + b \le -1 \qquad \forall k = 1, \dots, l$$

Training vectors $x_k$ with $|w^T x_k + b| = 1$ are called support vectors.
17.06.13 18

Positive support vectors are separated from negative support vectors by a corridor of width $\frac{2}{\|w\|}$. Exercise: Prove this! Why is the statement not completely trivial? The term $\frac{2}{\|w\|}$ is to be maximized under the normalized constraints. Alternatively one can minimize $\frac{1}{2}\|w\|^2$ under the normalized constraints.
17.06.13 19

The constraints can be transformed into a uniform format:

$$d_k = +1 \;\Rightarrow\; w^T x_k + b \ge +1, \qquad d_k = -1 \;\Rightarrow\; w^T x_k + b \le -1 \qquad \forall k = 1, \dots, l$$

$$d_k = +1 \;\Rightarrow\; -(w^T x_k + b) \le -1, \qquad d_k = -1 \;\Rightarrow\; w^T x_k + b \le -1 \qquad \forall k = 1, \dots, l$$

$$d_k = +1 \;\Rightarrow\; -d_k(w^T x_k + b) + 1 \le 0, \qquad d_k = -1 \;\Rightarrow\; -d_k(w^T x_k + b) + 1 \le 0 \qquad \forall k = 1, \dots, l$$

$$-d_k(w^T x_k + b) + 1 \le 0 \qquad \forall k = 1, \dots, l$$
17.06.13 20

Normal form of maximum margin

$$\frac{1}{2}\|w\|^2 \to \min$$

under constraints

$$-d_k(w^T x_k + b) + 1 \le 0 \qquad \forall k = 1, \dots, l$$

Parameter $b$ has vanished from the objective function. Does this cause a problem? Does it make the optimization senseless? Why can we not simply let the norm of $w$ tend to infinity?
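A worked miniature instance (ours, not from the slides): take the two training points $x_1 = (1, 0)^T$ with $d_1 = +1$ and $x_2 = (-1, 0)^T$ with $d_2 = -1$. The constraints read $w_1 + b \ge 1$ and $w_1 - b \ge 1$; minimizing $\frac{1}{2}\|w\|^2$ gives $w = (1, 0)^T$, $b = 0$. Both constraints are tight, so both points are support vectors, and the margin is $\frac{1}{\|w\|} = 1$ (corridor width 2).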
17.06.13 21

SVM

II. Optimization theory

This is presented only as far as required (without proofs) for an understanding of support vector machines. For a more detailed presentation use Martin Riedmiller's slides:

Riedmiller_svm
17.06.13 22

Convexity makes life easier

A subset $\Omega \subseteq \mathbb{R}^n$ is convex if the following holds:

$$\forall x \in \Omega \;\; \forall y \in \Omega \;\; \forall \lambda \in [0, 1]: \quad x + \lambda(y - x) \in \Omega$$

(Figure: segment through the points $x$, $x + \lambda(y - x)$, $y$.)
17.06.13 23

Convexity makes life easier

A function $f$ is convex if the following holds:

$$f(x + \lambda(y - x)) \le f(x) + \lambda(f(y) - f(x)) \qquad \forall \lambda \in [0, 1]$$

(Figure: the chord between $x$ and $y$ lies above the graph of $f$.)
17.06.13 24

Convexity makes life easier

Consider a convex function on a convex domain: $f : \Omega \to \mathbb{R}$

A local minimum is a vector $x \in \Omega$ such that:

$$\exists r > 0 \;\; \forall y \; \left(\|x - y\| \le r \wedge y \in \Omega \;\Rightarrow\; f(x) \le f(y)\right)$$

A global minimum is a vector $x \in \Omega$ such that:

$$\forall y \; \left(y \in \Omega \;\Rightarrow\; f(x) \le f(y)\right)$$
17.06.13 25

Convexity makes life easier

Consider a convex function on a convex domain: $f : \Omega \to \mathbb{R}$

Theorem: Every local minimum is a global minimum.

Proof:
• Let $x \in \Omega$ be a local minimum and $y \in \Omega$ arbitrary.
• Choose $\lambda > 0$, $\lambda \le 1$ small enough such that

$$f(x) \le f(x + \lambda(y - x))$$
17.06.13 26

Convexity makes life easier

Using convexity we conclude:

$$f(x) \le f(x + \lambda(y - x)) \le f(x) + \lambda(f(y) - f(x))$$
$$0 \le \lambda(f(y) - f(x))$$
$$0 \le f(y) - f(x)$$
$$f(x) \le f(y)$$
17.06.13 27

Convexity makes life easier

Examples of convex functions:
• linear functions – trivial
• affine functions (= linear + constant) – trivial
• square function (1-dimensional) – proof follows
• sum of convex functions
• squared Euclidean norm (n-dimensional) – from the results above
• convex function scaled with a positive factor – easy
17.06.13 28

Convexity makes life easier

The square function is convex. Consider $x \ne y$ and $0 < \lambda < 1$:

$$\begin{aligned}
(x + \lambda(y - x))^2 \le x^2 + \lambda(y^2 - x^2)
&\Leftrightarrow x^2 + 2\lambda x(y - x) + \lambda^2(y - x)^2 \le x^2 + \lambda(y^2 - x^2)\\
&\Leftrightarrow 2\lambda x(y - x) + \lambda^2(y - x)^2 \le \lambda(y - x)(y + x)\\
&\Leftrightarrow 2x(y - x) + \lambda(y - x)^2 \le (y - x)(y + x)\\
&\Leftrightarrow \lambda(y - x)^2 \le (y - x)(y + x) - 2x(y - x) = (y - x)^2\\
&\Leftrightarrow \lambda \le 1\\
&\Leftrightarrow \text{true}
\end{aligned}$$
17.06.13 29

Minimization under equalities

Differentiable function to be minimized: $f : \Omega \to \mathbb{R}$

Equality constraints: $h_p(x) = 0 \quad \forall p = 1, \dots, l$

Lagrange function:

$$L(x, \alpha_1, \dots, \alpha_l) = f(x) + \sum_{p=1}^{l} \alpha_p h_p(x)$$
17.06.13 30

Minimization under equalities

A necessary condition on a minimum $x$ with constraints is the existence of Lagrange multipliers $\alpha_1, \dots, \alpha_l$ with:

$$\nabla_x(L(x, \alpha_1, \dots, \alpha_l)) = \nabla_x(f(x)) + \sum_{p=1}^{l} \alpha_p \nabla_x(h_p(x)) = 0$$

$$h_p(x) = 0 \qquad \forall p = 1, \dots, l$$

Under certain conditions this is also sufficient.
17.06.13 31

Minimization under equalities

In explicit terms:

$$\frac{\partial f(x)}{\partial x_i} + \sum_{p=1}^{l} \alpha_p \frac{\partial h_p(x)}{\partial x_i} = 0$$

$$h_p(x) = 0 \qquad \forall p = 1, \dots, l$$
17.06.13 32

Example 1: max area rectangle

• Find a rectangle with side lengths x and y, fixed circumference 2x + 2y = c, and maximum area xy.

• Function to be minimized: $f(x, y) = -xy$

• Equality constraint: $2x + 2y - c = 0$
17.06.13 33

Solution: the square

$$-y + 2\alpha = 0 \qquad -x + 2\alpha = 0 \qquad 2x + 2y - c = 0$$

$$x = 2\alpha \qquad y = 2\alpha \qquad 8\alpha = c$$

$$x = y = \frac{c}{4} \qquad \alpha = \frac{c}{8}$$
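The stationarity system can also be checked mechanically, e.g. with SymPy (a sketch of ours, not from the slides):

```python
import sympy as sp

x, y, a, c = sp.symbols('x y alpha c', positive=True)
# Lagrange function of Example 1: f(x, y) = -xy, constraint 2x + 2y - c = 0
L = -x*y + a*(2*x + 2*y - c)
sol = sp.solve([sp.diff(L, x), sp.diff(L, y), 2*x + 2*y - c], [x, y, a], dict=True)
print(sol)  # x = c/4, y = c/4, alpha = c/8 -> the square
```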
17.06.13 34

Example 2: Entropy maximization

• Function to be maximized: $f : [0, 1]^n \to \mathbb{R}$

$$f(x_1, \dots, x_n) = -\sum_{k=1}^{n} x_k \log x_k$$

• Equality constraint:

$$\sum_{k=1}^{n} x_k - 1 = 0$$
17.06.13 35

Solution

$$L(x_1, \dots, x_n, \alpha) = -\sum_{k=1}^{n} x_k \log x_k + \alpha\left(\sum_{k=1}^{n} x_k - 1\right)$$

$$\frac{\partial L(x_1, \dots, x_n, \alpha)}{\partial x_1} = -\log x_1 - \log e + \alpha = 0$$
$$\vdots$$
$$\frac{\partial L(x_1, \dots, x_n, \alpha)}{\partial x_n} = -\log x_n - \log e + \alpha = 0$$

$$\sum_{k=1}^{n} x_k = 1$$
17.06.13 36

$$\log x_1 + \log e = \alpha = \dots = \log x_n + \log e \qquad \Rightarrow \qquad x_1 = \dots = x_n$$

$$\sum_{k=1}^{n} x_k = 1 \qquad \Rightarrow \qquad x_1 = \dots = x_n = \frac{1}{n}$$

The solution is the uniform probability distribution.
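A small numerical sanity check of this result (ours, not from the slides): among random probability vectors, none beats the uniform distribution in entropy.

```python
import numpy as np

def entropy(p):
    """Shannon entropy -sum_k p_k log2 p_k, with 0 * log 0 := 0."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
n = 4
h_uniform = entropy(np.full(n, 1.0 / n))  # log2(4) = 2 bits
# Random points of the probability simplex
h_random = max(entropy(p) for p in rng.dirichlet(np.ones(n), size=10000))
print(h_uniform, h_random)  # h_random stays below h_uniform
```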
17.06.13 37

Example 3: Likelihood maximization

• A random process with k independent possible events is observed, with numbers of occurrences $n_1, \dots, n_k$ for the events.
• If the probabilities of the events were known to be $p_1, \dots, p_k$,
• then the likelihood of this probability model under the observations above is defined by the likelihood function (next slide).
• Equality constraint (next slide).
17.06.13 38

• The likelihood function is to be maximized: $f : [0, 1]^k \to \mathbb{R}$

$$f(p_1, \dots, p_k) = L(n_1, \dots, n_k, p_1, \dots, p_k) = \prod_{i=1}^{k} p_i^{n_i}$$

• Equality constraint:

$$\sum_{i=1}^{k} p_i - 1 = 0$$
17.06.13 39

Exercise: Show that the empirical relative frequencies give the most likely probability model:

$$p_i = \frac{n_i}{n_1 + \dots + n_k} \qquad \forall i = 1, \dots, k$$

The calculations are a little bit more complicated than in the examples before.
17.06.13 40

Minimization under inequalities

Differentiable function to be minimized: $f : \Omega \to \mathbb{R}$

Inequality constraints: $g_p(x) \le 0 \quad \forall p = 1, \dots, l$

Lagrange function:

$$L(x, \alpha_1, \dots, \alpha_l) = f(x) + \sum_{p=1}^{l} \alpha_p g_p(x)$$
17.06.13 41

Minimization under inequalities

A necessary condition on a minimum $x$ with inequality constraints is the existence of Lagrange multipliers $\alpha_1, \dots, \alpha_l$ which fulfill the following KKT constraints (Karush, Kuhn, Tucker):
17.06.13 42

Minimization under inequalities

Karush-Kuhn-Tucker constraints:

$$\alpha_1, \dots, \alpha_l \ge 0$$

$$\nabla_x(L(x, \alpha_1, \dots, \alpha_l)) = \nabla_x(f(x)) + \sum_{p=1}^{l} \alpha_p \nabla_x(g_p(x)) = 0$$

$$g_p(x) \le 0 \qquad \forall p = 1, \dots, l$$

$$\alpha_p g_p(x) = 0 \qquad \forall p = 1, \dots, l$$

Note that there are as many equations as variables.
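A one-dimensional example (ours, not from the slides) illustrates the KKT system: minimize $f(x) = x^2$ subject to $g(x) = 1 - x \le 0$.

$$L(x, \alpha) = x^2 + \alpha(1 - x), \qquad \frac{\partial L}{\partial x} = 2x - \alpha = 0, \quad \alpha(1 - x) = 0, \quad 1 - x \le 0, \quad \alpha \ge 0$$

$\alpha = 0$ would force $x = 0$, violating $1 - x \le 0$; so the constraint is active, $x = 1$ and $\alpha = 2$.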
17.06.13 43

Duality

Primal problem:

$$f : \Omega \to \mathbb{R}, \qquad \Omega \subseteq \mathbb{R}^n$$

$$f(x) \to \min_{x \in \Omega} \qquad \text{subject to requirements} \quad g_p(x) \le 0 \quad \forall p = 1, \dots, l$$
17.06.13 44

Duality

Lagrange function:

$$L(x, \alpha_1, \dots, \alpha_l) = f(x) + \sum_{p=1}^{l} \alpha_p g_p(x), \qquad \alpha_1, \dots, \alpha_l \ge 0$$

More compact:

$$L(x, \alpha) = f(x) + \sum_{p=1}^{l} \alpha_p g_p(x), \qquad \alpha \ge 0$$

The KKT conditions force us to solve equations under inequality constraints. This is often uncomfortable.
17.06.13 45

Duality

The dual problem ignores all inequality constraints and, for fixed α, minimizes the Lagrange function over x. This defines a lower bound for the primal problem:

$$Q(\alpha) = \inf_x L(x, \alpha)$$

Lemma 1: For arbitrary $\alpha \ge 0$:

$$Q(\alpha) \le \inf_{x \text{ with } g(x) \le 0} f(x)$$
17.06.13 46

Duality

Proof: Consider an arbitrary $y$ with $g_p(y) \le 0 \;\; \forall p = 1, \dots, l$. Then:

$$Q(\alpha) = \inf_x L(x, \alpha) \le L(y, \alpha) = f(y) + \sum_{p=1}^{l} \alpha_p g_p(y) \le f(y)$$

Since this holds for all such $y$ we conclude:

$$Q(\alpha) \le \inf_{x \text{ with } g(x) \le 0} f(x)$$
17.06.13 47

Duality

Having ignored the requirements is compensated in a second step by taking the greatest lower bound, that is, by maximizing over all Lagrange multipliers:

$$Q(\alpha) \to \max_{\alpha \ge 0}$$

Lemma 2: Existence of the infima and the max supposed,

$$\max_{\alpha \ge 0} Q(\alpha) \le \inf_{x \text{ with } g(x) \le 0} f(x)$$
17.06.13 48
Computing max inf L means to find a saddle-point of function L.
17.06.13 49

Duality

Lemma 3: Assume you found $\beta \ge 0$ and $y$ with

$$g_p(y) \le 0 \;\; \forall p = 1, \dots, l, \qquad Q(\beta) = f(y)$$

For short: „Dual value meets primal value.“ Then

$$Q(\beta) = \max_{\alpha \ge 0} Q(\alpha) = \inf_{x \text{ with } g(x) \le 0} f(x) = f(y) = L(y, \beta)$$

For short: „Optimal dual = optimal primal = solution of KKT has been obtained.“
17.06.13 50

Duality

Proof:

$$g_p(y) \le 0 \qquad \forall p = 1, \dots, l$$

was used in Lemma 2 to conclude:

$$\max_{\alpha \ge 0} Q(\alpha) \le \inf_{x \text{ with } g(x) \le 0} f(x)$$

Using $Q(\beta) = f(y)$ we obtain

$$f(y) = Q(\beta) \le \max_{\alpha \ge 0} Q(\alpha) \le \inf_{x \text{ with } g(x) \le 0} f(x) \le f(y)$$
17.06.13 51

Duality

Proof continued: In the proof of Lemma 1 we showed:

$$Q(\beta) \le L(y, \beta) \le f(y)$$

Thus

$$f(y) = Q(\beta) = L(y, \beta)$$

So far, things were rather simple. The non-trivial part is (without proofs):
17.06.13 52

Duality

Lemma 4: Under certain conditions (that are fulfilled for the margin maximization problem: quadratic function, linear constraints, compact domains) equality holds:

$$\max_{\alpha \ge 0} Q(\alpha) = \inf_{x \text{ with } g(x) \le 0} f(x)$$

In particular, this means that max and min both exist.
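For the one-dimensional example from slide 42 the dual can be computed in closed form (our calculation): $Q(\alpha) = \inf_x \left(x^2 + \alpha(1 - x)\right)$ is attained at $x = \frac{\alpha}{2}$, so

$$Q(\alpha) = \alpha - \frac{\alpha^2}{4}, \qquad \max_{\alpha \ge 0} Q(\alpha) = Q(2) = 1 = f(1),$$

so the dual meets the primal exactly as Lemmas 3 and 4 promise.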
17.06.13 53
Duality Thus we have the choice: • Solve primal problem
• Solve dual problem
• Solve KKT
17.06.13 54
Minimization under equality and inequality constraints
Exercise:
Combine the formulas for the case of equality constraints and the case of inequality constraints into formulas for the combination of equality and inequality constraints.
17.06.13 55

Margin maximization: Primal form

$$\frac{1}{2}\|w\|^2 \to \min$$

$$-d_p(w^T x_p + b) + 1 \le 0 \qquad \forall p = 1, \dots, l$$
17.06.13 56

Lagrange function

$$L(w, b, \alpha_1, \dots, \alpha_l) = \frac{1}{2}\|w\|^2 + \sum_{p=1}^{l} \alpha_p\left(-d_p(w^T x_p + b) + 1\right) = \frac{1}{2}\sum_{i=1}^{n} w_i^2 - \sum_{p=1}^{l} \alpha_p\left(d_p(w^T x_p + b) - 1\right)$$
17.06.13 57

Partial derivatives:

$$\frac{\partial L(w, b, \alpha_1, \dots, \alpha_l)}{\partial w_i} = w_i - \sum_{p=1}^{l} \alpha_p d_p x_{pi}$$

$$\frac{\partial L(w, b, \alpha_1, \dots, \alpha_l)}{\partial b} = -\sum_{p=1}^{l} \alpha_p d_p$$
17.06.13 58

Karush-Kuhn-Tucker conditions

$$(1) \quad \alpha_1, \dots, \alpha_l \ge 0$$

$$(2) \quad w = \sum_{p=1}^{l} \alpha_p d_p x_p$$

$$(3) \quad \sum_{p=1}^{l} \alpha_p d_p = 0$$

$$(4) \quad -d_p(w^T x_p + b) + 1 \le 0 \qquad \forall p = 1, \dots, l$$

$$(5) \quad \alpha_p\left(d_p(w^T x_p + b) - 1\right) = 0 \qquad \forall p = 1, \dots, l$$
17.06.13 59

Margin maximization: Dual form

Consider the dual function Q and maximize:

$$Q(\alpha_1, \alpha_2, \dots, \alpha_l) = \inf_{w, b} L(w, b, \alpha_1, \alpha_2, \dots, \alpha_l)$$

$$Q(\alpha_1, \alpha_2, \dots, \alpha_l) \to \max_{\alpha_1, \dots, \alpha_l \ge 0}$$
17.06.13 60

Lagrange function:

$$\begin{aligned}
L(w, b, \alpha_1, \dots, \alpha_l)
&= \frac{1}{2}\|w\|^2 + \sum_{p=1}^{l} \alpha_p\left(-d_p(w^T x_p + b) + 1\right)\\
&= \frac{1}{2}\sum_{i=1}^{n} w_i^2 - \sum_{p=1}^{l} \alpha_p\left(d_p(w^T x_p + b) - 1\right)\\
&= \frac{1}{2}\sum_{i=1}^{n} w_i^2 - \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p - b\sum_{p=1}^{l} \alpha_p d_p + \sum_{p=1}^{l} \alpha_p
\end{aligned}$$
17.06.13 61

If

$$\sum_{p=1}^{l} \alpha_p d_p \ne 0$$

then, due to the subterm $-b\sum_{p=1}^{l} \alpha_p d_p$ in the function L and the fact that the minimization of L also runs over b, we conclude that

$$\inf_{w, b} L(w, b, \alpha_1, \alpha_2, \dots, \alpha_l) = -\infty$$

Thus this case does not participate in the process of maximization of inf L in the definition of Q. So we may assume:

$$\sum_{p=1}^{l} \alpha_p d_p = 0$$
17.06.13 62

The Lagrange function reduces to:

$$L(w, b, \alpha_1, \dots, \alpha_l) = \frac{1}{2}\|w\|^2 + \sum_{p=1}^{l} \alpha_p\left(-d_p(w^T x_p + b) + 1\right) = \frac{1}{2}\|w\|^2 - \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p + \sum_{p=1}^{l} \alpha_p$$

For every fixed α try to explain why:

$$\inf_w L(w, b, \alpha) \ne -\infty$$
17.06.13 63

Existence of the infimum fixes w by setting the gradient of L w.r.t. w to zero:

$$\nabla_w L(w, b, \alpha_1, \dots, \alpha_l) = w - \sum_{p=1}^{l} \alpha_p d_p x_p = 0$$

Thus:

$$w = \sum_{p=1}^{l} \alpha_p d_p x_p$$

This allows further simplification of the function Q:
17.06.13 64

$$\begin{aligned}
Q(\alpha_1, \alpha_2, \dots, \alpha_l)
&= \frac{1}{2}\, w^T w - \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p + \sum_{p=1}^{l} \alpha_p\\
&= \frac{1}{2}\, w^T w - w^T \sum_{p=1}^{l} \alpha_p d_p x_p + \sum_{p=1}^{l} \alpha_p\\
&= \frac{1}{2}\, w^T w - w^T w + \sum_{p=1}^{l} \alpha_p\\
&= -\frac{1}{2}\, w^T w + \sum_{p=1}^{l} \alpha_p\\
&= -\frac{1}{2} \sum_{p=1}^{l}\sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q + \sum_{p=1}^{l} \alpha_p
\end{aligned}$$
17.06.13 65

Summary

$$Q(\alpha_1, \alpha_2, \dots, \alpha_l) = -\frac{1}{2} \sum_{p=1}^{l}\sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q + \sum_{p=1}^{l} \alpha_p$$

$$Q(\alpha_1, \alpha_2, \dots, \alpha_l) \to \max_{\alpha_1, \dots, \alpha_l \ge 0}$$
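The slides do not prescribe any software, but as a hedged sketch the dual can be handed to a generic solver. The following uses scipy's SLSQP on a toy data set (all names ours); the equality constraint $\sum_p \alpha_p d_p = 0$ is the one derived on slide 61, and w and b are recovered via KKT conditions (2) and (5):

```python
import numpy as np
from scipy.optimize import minimize

# Toy training set in R^2: two separable classes
X = np.array([[2.0, 2.0], [2.0, 0.0], [-2.0, -2.0], [-2.0, 0.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
l = len(d)
G = (d[:, None] * d[None, :]) * (X @ X.T)  # d_p d_q x_p^T x_q

def neg_Q(a):
    # Negate Q because scipy minimizes
    return 0.5 * a @ G @ a - a.sum()

cons = ({'type': 'eq', 'fun': lambda a: a @ d},)  # sum_p alpha_p d_p = 0
res = minimize(neg_Q, np.zeros(l), method="SLSQP",
               bounds=[(0, None)] * l, constraints=cons)
alpha = res.x

w = (alpha * d) @ X        # KKT (2): w = sum_p alpha_p d_p x_p
sv = np.argmax(alpha)      # a support vector: alpha_p > 0
b = d[sv] - w @ X[sv]      # KKT (5): d_p (w^T x_p + b) = 1
print(alpha.round(3), w.round(3), b.round(3))
```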
17.06.13 66

The solution of the primal problem is obtained from this:

$$w = \sum_{p=1}^{l} \alpha_p d_p x_p$$

Bias b is determined as follows: Select a non-zero Lagrange multiplier $\alpha_p$ – why must it exist? Now use

$$\alpha_p\left(d_p(w^T x_p + b) - 1\right) = 0$$

and solve for b.
17.06.13 67
SVM
III. Maximum margin classifier at work
17.06.13 68

Generalization

• Let a fresh vector z be presented to the net.
• The inner product with the weight vector w is computed:

$$z^T w = z^T \sum_{p=1}^{l} \alpha_p d_p x_p = \sum_{p=1}^{l} \alpha_p d_p\, (z^T x_p)$$

• The inner product of z with support vector $x_p$ measures the similarity between these vectors: parallel vectors give a large inner product, orthogonal vectors give zero inner product.
17.06.13 69

Generalization

• Term $d_p$ gives the inner product the correct sign.
• Term $\alpha_p$ weights the terms above appropriately (hopefully) to compute the net input.
• The net input can alternatively be seen as computed by the following architecture, with a hidden linear neuron for each support vector whose weight vector is that support vector.
17.06.13 70

$$z^T w = z^T \sum_{p=1}^{l} \alpha_p d_p x_p = \sum_{p=1}^{l} \alpha_p d_p\, (z^T x_p) = \mathrm{net}(z, w)$$

(Figure: network with inputs $z_1, \dots, z_n$, one hidden linear neuron per support vector $x^1, \dots, x^k, \dots, x^s$ whose weight vector is that support vector, output weights $\alpha_1 d_1, \dots, \alpha_k d_k, \dots, \alpha_s d_s$, and a step function applied to the net input computed here.)
17.06.13 71

We obtain a quite natural and intuitive architecture:

• Among the training vectors the support vectors are determined; only these play a role; they are the most representative among the training vectors.
• Similarity with each support vector is computed. This is some sort of „case based reasoning“: support vectors = cases.
• The Lagrange multipliers weight the similarities.
17.06.13 72

Main disadvantage – plan for a solution

• Linear separability is seldom the case.
• Embedding vectors of the low-dimensional input space into a higher-dimensional feature space may help: create additional features, whatever you think could be problem relevant.
• Keep the additional complexity limited.
17.06.13 73
SVM
IV. Kernel functions and kernel trick
17.06.13 74

Kernel functions

Use extra features that appear to be relevant for the problem solution, formally described as a function from the (lower-dimensional) input space to the (higher-dimensional) feature space:

$$\Phi : \mathbb{R}^n \to \mathbb{R}^N$$
17.06.13 75

Kernel functions

The originally given learning problem in input space

$$(x_1, d_1), \dots, (x_l, d_l)$$

reads in feature space equivalently as follows:

$$(\Phi(x_1), d_1), \dots, (\Phi(x_l), d_l)$$
17.06.13 76

Kernel functions

The dual optimization now reads in feature space as follows:

$$Q_\Phi(\alpha_1, \alpha_2, \dots, \alpha_l) = -\frac{1}{2} \sum_{p=1}^{l}\sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, \Phi(x_p)^T \Phi(x_q) + \sum_{p=1}^{l} \alpha_p$$

$$Q_\Phi(\alpha_1, \alpha_2, \dots, \alpha_l) \to \max_{\alpha_1, \dots, \alpha_l \ge 0}$$
17.06.13 77

Kernel functions

Classification of a fresh vector z from input space proceeds by computing the inner products of the embedded vectors as follows and comparing the result with the threshold:

$$\sum_{p=1}^{l} \alpha_p d_p\, \left(\Phi(z)^T \Phi(x_p)\right)$$
17.06.13 78

Kernel functions

For the process of learning (convex optimization) as well as for the process of classifying fresh vectors (generalization), the following operation is central and occurs in high number:

$$K(x, y) = \Phi(x)^T \Phi(y)$$

Function K is called a kernel function.
17.06.13 79

Kernel functions

Computing lots of inner products in feature space may be expensive, since the dimension N might be large compared to n. We should look for means to compute the inner products in input space:

$$K(x, y) = \Phi(x)^T \Phi(y) = k(x^T y) \qquad \text{with some function } k : \mathbb{R} \to \mathbb{R}$$
17.06.13 80

Kernel functions

Example:

$$\Phi : \mathbb{R}^2 \to \mathbb{R}^5, \qquad \Phi(x, y) = \left(x^2,\; y^2,\; \sqrt{2}\,xy,\; x,\; y\right)$$

Inner product in feature space:

$$\begin{aligned}
\Phi(x, y)^T \Phi(x', y') &= x^2 x'^2 + y^2 y'^2 + 2xy\,x'y' + xx' + yy'\\
&= (xx' + yy')^2 + (xx' + yy')\\
&= k\left((x, y)(x', y')^T\right) \qquad \text{with } k(r) = r^2 + r
\end{aligned}$$
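A quick numeric verification of this example (ours, not from the slides):

```python
import numpy as np

def phi(v):
    x, y = v
    # Embedding R^2 -> R^5 from the slide
    return np.array([x*x, y*y, np.sqrt(2)*x*y, x, y])

def k(r):
    return r*r + r

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(u) @ phi(v))  # inner product in feature space: 2.0
print(k(u @ v))         # same value computed in input space: 2.0
```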
17.06.13 81

Classification:

(Figure: the network from slide 70 with inputs $z_1, \dots, z_n$; each hidden neuron now computes a kernel value $K(z, x^1), \dots, K(z, x^k), \dots, K(z, x^s)$, weighted by $\alpha_1 d_1, \dots, \alpha_k d_k, \dots, \alpha_s d_s$ and fed through a step function.)
17.06.13 82

Interpretation:

• Hidden neurons measure the similarity between the input vector and the training vectors after embedding into feature space.
• The output neuron uses the class labels and Lagrange multipliers to weight the similarities and integrate them into a summarized net input.
• A particularly natural way to measure the similarity of the input vector with some training vector is by Gaussian bell shaped functions:

$$K(z, x_p) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{\|z - x_p\|^2}{2\sigma^2}}$$
17.06.13 83

• It can be shown that this indeed is a kernel function (Mercer's theorem).
• Other widely used kernels are polynomial kernels of degree d

$$K(z, x) = (z^T x + 1)^d$$

• and tanh-kernels that mimic the behaviour of MLPs (weight vector $\beta w$ and threshold $\theta$):

$$K(z, x) = \tanh(\beta\, z^T x - \theta)$$

The latter are kernels only for certain combinations of $\beta$ and $\theta$.
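A hedged sketch (ours) of the kernelized classifier drawn on slide 81: `xs`, `ds`, `alphas` and `b` are assumed to come from the dual optimization; the Gaussian kernel omits the normalizing prefactor of slide 82, since a constant positive factor only rescales the similarities:

```python
import numpy as np

def gaussian(z, x, sigma=1.0):
    return np.exp(-np.sum((z - x)**2) / (2 * sigma**2))

def polynomial(z, x, deg=3):
    return (z @ x + 1) ** deg

def tanh_kernel(z, x, beta=1.0, theta=0.0):
    return np.tanh(beta * (z @ x) - theta)

def classify(z, xs, ds, alphas, b, K=gaussian):
    # sign of sum_p alpha_p d_p K(z, x_p) + b
    s = sum(a * dp * K(z, xp) for a, dp, xp in zip(alphas, ds, xs))
    return 1 if s + b >= 0 else -1
```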
17.06.13 84

• There are lots of further useful kernel functions, some more general, some tailored to specific applications.
• There are lots of cooking recipes for building fresh kernels out of already constructed ones.
17.06.13 85
SVM
V. SVM learnability theory
17.06.13 86

Generalization ability of the maximum margin classifier

• Using lots of additional features (remember the embedding into a high-dimensional feature space) usually carries, as for MLPs, the danger of overfitting.
• Remember the estimates for the VC-dimension of MLPs, of order w·log(w) or w²n².
17.06.13 87

• SVMs do not suffer from this problem: the expected error scales with the ratio R/m, where m is the maximum margin of a training set and R the radius of the training vectors, but it does not depend on the dimension of the feature space.
• A more concrete estimate for the expected error ε is (with m maximum margin, l size of training set, R radius of training set, δ confidence):

$$\varepsilon(l, m, R, \delta) = \frac{2}{l}\left(\frac{64R^2}{m^2}\,\log\frac{e\,l\,m}{8R^2}\,\log\frac{32\,l}{m^2} + \log\frac{4}{\delta}\right)$$
17.06.13 88

A single formula

What confidence would you like to have?
Fix δ > 0, δ < 1 (for example δ = 1%).

What margin do you expect?
m would be nice (for example m = 2).

How many training data l are available?
17.06.13 89

Under the settings above, what error must be expected under a randomly drawn training set in the worst case?

$$\Pr_{T = ((x_1, d_1), \dots, (x_l, d_l))}\left\{\, \forall (w, b): \; \mu_T(w, b) \ge \mu \;\Rightarrow\; \mathrm{error}_{gen}(w, b) - \mathrm{error}_{emp,T}(w, b) \le \varepsilon \,\right\} \ge 1 - \delta$$

with

$$\varepsilon = \frac{2}{l}\left(\frac{64R^2}{m^2}\,\log\frac{e\,l\,m}{8R^2}\,\log\frac{32\,l}{m^2} + \log\frac{4}{\delta}\right) \qquad \text{and} \qquad R = \max_{p=1,\dots,l}\|x_p\|$$
17.06.13 90
Do you see any problem with this estimation? What if you draw a random training set of size l and margin 1 instead of the desired margin 2? Using formula again with m = 1 changes l to l‘. Again draw a random training set of size l‘. Does this game finally come to an end?
17.06.13 91

• The dependence on m/R instead of m alone is obvious: by scaling the training set with a positive factor one could increase the maximum margin m as much as desired without affecting the learning problem.
• The existence of some difficulties with a sound mathematical formalization of this result should be mentioned – the maximum margin cannot be fixed in advance before randomly drawing a training set. The interested reader should take a look into the literature on structural risk minimization.
17.06.13 92
SVM
VI. Soft margin
17.06.13 93
• Allowing limited classification error (that is bad), larger margin becomes possible (that is good).
• The amount of classification error is measured
by slack variables ξp, one for each training vector – error may vary from training vector to training vector.
• Inequality constraints now read as follows:
17.06.13 94

$$d_p = +1 \;\Rightarrow\; w^T x_p + b \ge +1 - \xi_p \qquad \forall p = 1, \dots, l$$
$$d_p = -1 \;\Rightarrow\; w^T x_p + b \le -1 + \xi_p \qquad \forall p = 1, \dots, l$$

$$d_p(w^T x_p + b) \ge 1 - \xi_p \qquad \forall p = 1, \dots, l$$

$$-d_p(w^T x_p + b) + 1 - \xi_p \le 0 \qquad \forall p = 1, \dots, l$$
17.06.13 95

(Figure: larger margin; the errors are measured by slack variables.)
17.06.13 96

• In margin maximization use error values as small as possible.
• This is expressed by the following function with a constant C that is chosen by the user:

$$f(w, \xi_1, \dots, \xi_l) = \frac{1}{2}\|w\|^2 + C\sum_{p=1}^{l} \xi_p^2, \qquad \xi_1, \dots, \xi_l \ge 0$$

• C controls the balance between margin maximization and error tolerance.
17.06.13 97

Soft margin: Primal problem

$$f(w, \xi_1, \dots, \xi_l) = \frac{1}{2}\|w\|^2 + C\sum_{p=1}^{l} \xi_p^2 \to \min_{w, \xi}$$

Under constraints

$$\xi_1, \dots, \xi_l \ge 0$$
$$-d_p(w^T x_p + b) + 1 - \xi_p \le 0 \qquad \forall p = 1, \dots, l$$
17.06.13 98

Soft margin: Lagrange function

$$L(w, b, \xi_1, \dots, \xi_l, \alpha_1, \dots, \alpha_l, \beta_1, \dots, \beta_l) = \frac{1}{2}\|w\|^2 + C\sum_{p=1}^{l} \xi_p^2 + \sum_{p=1}^{l} \alpha_p\left(1 - \xi_p - d_p(w^T x_p + b)\right) - \sum_{p=1}^{l} \beta_p \xi_p$$
17.06.13 99

Soft margin: KKT conditions

$$-d_p(w^T x_p + b) + 1 - \xi_p \le 0 \qquad \forall p = 1, \dots, l$$
$$-\xi_p \le 0 \qquad \forall p = 1, \dots, l$$
$$\alpha_p\left(1 - \xi_p - d_p(w^T x_p + b)\right) = 0 \qquad \forall p = 1, \dots, l$$
$$\beta_p \xi_p = 0 \qquad \forall p = 1, \dots, l$$
$$\frac{\partial L}{\partial w_i} = w_i - \sum_{p=1}^{l} \alpha_p d_p x_{pi} = 0 \qquad \forall i = 1, \dots, n$$
$$\frac{\partial L}{\partial b} = -\sum_{p=1}^{l} \alpha_p d_p = 0$$
$$\frac{\partial L}{\partial \xi_i} = 2C\xi_i - \alpha_i - \beta_i = 0 \qquad \forall i = 1, \dots, l$$
$$\alpha_i \ge 0, \quad \beta_i \ge 0 \qquad \forall i = 1, \dots, l$$
17.06.13 100

Soft margin: Deriving the solution of the primal problem from KKT

$$w = \sum_{p=1}^{l} \alpha_p d_p x_p$$

$$\xi_p = \frac{\alpha_p + \beta_p}{2C} \qquad \forall p = 1, \dots, l$$

Bias b is indirectly derived from an equality constraint for a non-zero $\alpha_i$.
17.06.13 101

Soft margin: Dual function

$$Q(\alpha_1, \dots, \alpha_l, \beta_1, \dots, \beta_l) = \inf_{w, b, \xi_1, \dots, \xi_l} L(w, b, \xi_1, \dots, \xi_l, \alpha_1, \dots, \alpha_l, \beta_1, \dots, \beta_l)$$

where, as before,

$$w = \sum_{p=1}^{l} \alpha_p d_p x_p, \qquad \xi_p = \frac{\alpha_p + \beta_p}{2C} \;\; \forall p = 1, \dots, l, \qquad \sum_{p=1}^{l} \alpha_p d_p = 0$$
17.06.13 102

Soft margin: Inserting values in dual function

$$Q(\alpha_1, \dots, \alpha_l, \beta_1, \dots, \beta_l) = \frac{1}{2}\, w^T w + C\sum_{p=1}^{l} \xi_p^2 + \sum_{p=1}^{l} \alpha_p\left(1 - \xi_p - d_p(w^T x_p + b)\right) - \sum_{p=1}^{l} \beta_p \xi_p$$
17.06.13 103

Inserting values in dual function continued

$$\begin{aligned}
&\frac{1}{2}\, w^T w + C\sum_{p=1}^{l} \xi_p^2 + \sum_{p=1}^{l} \alpha_p\left(1 - \xi_p - d_p(w^T x_p + b)\right) - \sum_{p=1}^{l} \beta_p \xi_p\\
&= \frac{1}{2}\, w^T w + \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{4C} + \sum_{p=1}^{l} \alpha_p\left(1 - \frac{\alpha_p + \beta_p}{2C}\right) - \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p - b\sum_{p=1}^{l} \alpha_p d_p - \sum_{p=1}^{l} \beta_p\, \frac{\alpha_p + \beta_p}{2C}
\end{aligned}$$
17.06.13 104

Inserting values in dual function continued

Using $w = \sum_{p} \alpha_p d_p x_p$ and $\sum_{p} \alpha_p d_p = 0$, and combining $\sum_p \alpha_p \frac{\alpha_p + \beta_p}{2C} + \sum_p \beta_p \frac{\alpha_p + \beta_p}{2C} = \sum_p \frac{(\alpha_p + \beta_p)^2}{2C}$:

$$= \frac{1}{2}\sum_{p=1}^{l}\sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q - \sum_{p=1}^{l}\sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q + \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{4C} - \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{2C} + \sum_{p=1}^{l} \alpha_p$$
17.06.13 105

Inserting values in dual function continued

$$= -\frac{1}{2}\sum_{p=1}^{l}\sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q + \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{4C} - \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{2C} + \sum_{p=1}^{l} \alpha_p$$
17.06.13 106

Inserting values in dual function continued

$$= -\frac{1}{2}\sum_{p=1}^{l}\sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q - \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{4C} + \sum_{p=1}^{l} \alpha_p$$
17.06.13 107

Inserting values in dual function continued

$$Q(\alpha_1, \dots, \alpha_l, \beta_1, \dots, \beta_l) = -\frac{1}{2}\sum_{p=1}^{l}\sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q - \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{4C} + \sum_{p=1}^{l} \alpha_p$$
17.06.13 108

Final form of the dual function with constraints

$$Q(\alpha_1, \dots, \alpha_l, \beta_1, \dots, \beta_l) = -\frac{1}{2}\sum_{p=1}^{l}\sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q - \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{4C} + \sum_{p=1}^{l} \alpha_p$$

$$\alpha_1, \dots, \alpha_l, \beta_1, \dots, \beta_l \ge 0$$

$$\sum_{p=1}^{l} \alpha_p d_p = 0 \qquad \forall p = 1, \dots, l$$
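An observation of ours, not on the slides: since Q decreases in each $\beta_p$ and the only constraint on $\beta_p$ is $\beta_p \ge 0$, the maximum is attained at $\beta_p = 0$, and the dual collapses to

$$Q(\alpha) = -\frac{1}{2}\sum_{p=1}^{l}\sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q \left(x_p^T x_q + \frac{\delta_{pq}}{2C}\right) + \sum_{p=1}^{l} \alpha_p,$$

where $\delta_{pq}$ is the Kronecker delta: the hard-margin dual with the modified inner product $x_p^T x_q + \frac{\delta_{pq}}{2C}$.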
17.06.13 109

Wild mixture of kernel functions

$$K(x, y) = x^T y + c \qquad \text{linear}$$
$$K(x, y) = (\alpha\, x^T y + c)^d \qquad \text{polynomial}$$
$$K(x, y) = e^{-\frac{\|x - y\|^2}{2\sigma^2}} \qquad \text{Gauss}$$
$$K(x, y) = e^{-\frac{\|x - y\|}{2\sigma^2}} \qquad \text{exponential}$$
$$K(x, y) = e^{-\frac{\|x - y\|}{\sigma}} \qquad \text{Laplace}$$
17.06.13 110

Wild mixture of kernel functions

$$K(x, y) = \frac{1}{1 + \frac{\|x - y\|^2}{\sigma^2}} \qquad \text{Cauchy}$$
$$K(x, y) = \tanh(\alpha(x^T y + c)) \qquad \text{sigmoidal}$$
$$K(x, y) = \|x - y\|^2 + c^2 \qquad \text{quadratic}$$
$$K(x, y) = \frac{1}{\|x - y\|^2 + c^2} \qquad \text{inverse quadratic}$$
17.06.13 111

Wild mixture of kernel functions

$$K(x, y) = \frac{\theta}{\|x - y\|}\,\sin\frac{\|x - y\|}{\theta} \qquad \text{Wave}$$
$$K(x, y) = -\|x - y\|^d \qquad \text{Power}$$
$$K(x, y) = -\log\left(\|x - y\|^d + 1\right) \qquad \text{Log}$$
$$K(x, y) = \prod_{i=1}^{n}\left(1 + x_i y_i + x_i y_i \min(x_i, y_i) - \frac{x_i + y_i}{2}\min(x_i, y_i)^2 + \frac{\min(x_i, y_i)^3}{3}\right) \qquad \text{Spline}$$
17.06.13 112

Wild mixture of kernel functions

$$K(x, y) = \sum_{i=1}^{n} \min(x_i, y_i) \qquad \text{histogram}$$
$$K(x, y) = \sum_{i=1}^{n} \min(x_i^\alpha, y_i^\beta) \qquad \text{generalized histogram}$$
$$K(x, y) = \frac{1}{1 + \|x - y\|^d} \qquad \text{T-student}$$
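For experimentation, a few of the listed kernels transcribed to numpy (our transcription, not from the slides; the parameters s, c, d, a mirror the slides' σ, c, d, α):

```python
import numpy as np

def linear(x, y, c=0.0):           return x @ y + c
def poly(x, y, a=1.0, c=1.0, d=2): return (a * (x @ y) + c) ** d
def gauss(x, y, s=1.0):            return np.exp(-np.sum((x - y)**2) / (2 * s**2))
def laplace(x, y, s=1.0):          return np.exp(-np.linalg.norm(x - y) / s)
def cauchy(x, y, s=1.0):           return 1.0 / (1.0 + np.sum((x - y)**2) / s**2)
def histogram(x, y):               return np.sum(np.minimum(x, y))
def t_student(x, y, d=2):          return 1.0 / (1.0 + np.linalg.norm(x - y) ** d)
```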