TRANSCRIPT
-
1. Mathematical Foundations of Machine Learning
-
1.1 Basic Concepts
-
Definition of Learning
Definition [Mitchell (1997)]
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
...recall that this is not pure mathematics...
-
The Task T
Classification
Compute f : Rn → {1, . . . , k} which maps data x ∈ Rn to a category in {1, . . . , k}. Alternative: Compute f : Rn → Rk which maps data x ∈ Rn to a histogram with respect to k categories.
x = (image of a handwritten digit) ↦ f(x) = 5.
-
The Task T
Regression
Predict a numerical value f : Rn → R.
Expected claim of insured person
Algorithmic trading
-
The Task T
Density Estimation
Estimate a probability density p : R → R+ which can be interpreted as a probability distribution on the space that the examples were drawn from.
Useful for many tasks in data processing, for example if we observe corrupted data x̃ we may estimate the original x as the argmax of p(x̃|x).
-
The Experience E
The experience typically consists of a dataset which consists of many examples (aka data points).
If these data points are labeled (for example, in the classification problem, if we know the class of our given data points) we speak of supervised learning.
If these data points are not labeled (for example, in the classification problem, the algorithm would have to find the clusters itself from the given dataset) we speak of unsupervised learning.
-
The Performance Measure P
In classification problems this is typically the accuracy, i.e., the proportion of examples for which the model produces the correct output.
Often the given dataset is split into a training set on which the algorithm operates and a test set on which its performance is measured.
-
An Example: Linear Regression
The Task
Regression: Predict f : Rn → R.
The Experience
Training data $((x_i^{\mathrm{train}}, y_i^{\mathrm{train}}))_{i=1}^m$.
The Performance Measure
Given test data $((x_i^{\mathrm{test}}, y_i^{\mathrm{test}}))_{i=1}^n$ we evaluate the performance of an estimator $\hat f : \mathbb{R}^n \to \mathbb{R}$ as the mean squared error
$\frac{1}{n}\sum_{i=1}^n |\hat f(x_i^{\mathrm{test}}) - y_i^{\mathrm{test}}|^2.$
-
An Example: Linear Regression
The Computer Program
Define a Hypothesis Space
$\mathcal{H} = \mathrm{span}\{\varphi_1, \dots, \varphi_l\} \subset C(\mathbb{R}^n)$
and, given training data
$z = ((x_i, y_i))_{i=1}^m,$
define the empirical risk
$\mathcal{E}_z(f) := \frac{1}{m}\sum_{i=1}^m (f(x_i) - y_i)^2.$
We let our algorithm find the minimizer (a.k.a. empirical target function)
$f_{\mathcal{H},z} := \operatorname*{argmin}_{f \in \mathcal{H}} \mathcal{E}_z(f).$
-
An Example: Linear Regression
Computing the Empirical Target Function
Let $A = (\varphi_j(x_i))_{i,j} \in \mathbb{R}^{m \times l}$.
Every $f \in \mathcal{H}$ can be written as $\sum_{i=1}^l w_i \varphi_i$ and we denote $w := (w_i)_{i=1}^l$.
We let $y := (y_i)_{i=1}^m$.
We get that $\mathcal{E}_z(f) = \frac{1}{m}\|Aw - y\|^2$.
A minimizer is given by $w^* := A^\dagger y$, and we get our estimate
$f^* := \sum_{i=1}^l (w^*)_i \varphi_i.$
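As a concrete illustration, here is a minimal MATLAB sketch of this computation for scalar data with the monomial basis φ_j(x) = x^(j−1) (the function name and the choice of basis are illustrative, not part of the lecture):

function w = FitLeastSquares(x, y, l)
% Fit the empirical target function in H = span{1, x, ..., x^(l-1)}
% from training data x, y (vectors of equal length) via the pseudoinverse.
m = length(x);
A = zeros(m, l);
for j = 1:l
    A(:, j) = x(:).^(j-1);       % A(i,j) = phi_j(x_i)
end
w = pinv(A) * y(:);              % w* = A^dagger * y

The resulting estimate can then be evaluated, e.g., as f_hat = @(t) polyval(flipud(w)', t).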
-
Degree too low: underfitting. Degree too high: overfitting!
-
Figure: Error with Polynomial Degree
Bias-Variance Problem
“Capacity” of the hypothesis space has to be adapted to the complexity of the target function and the sample size!
-
Until Next Time...
Please brush up on your probability knowledge. Required notions: probability measure, expectation, conditional measure, marginal measure, ...
If you need help, start with Chapter I.3 of www.deeplearningbook.org.
You may want to read Chapter I.1 of [Cucker–Smale, On the Mathematical Foundations of Learning].
-
1.2 Mathematical Foundations of General Regression Problems
-
1.2.1 Basic Definitions
-
The Mathematical Learning Problem
Given (compact) $X \subset \mathbb{R}^d$ and $Y = \mathbb{R}^k$ and a probability measure $\rho$ on $Z := X \times Y$. Define the least squares error
$\mathcal{E}(f) := \int_Z (f(x) - y)^2 \, d\rho(x, y).$
The learning problem asks for the function f which minimizes E(f).
-
Example 1: Regression
Suppose our data is generated from noisy observations
y = f(x) + ξ,
where ξ is an r.v. with probability density η on $Y = \mathbb{R}^k$ satisfying $\int_Y y \, \eta(y) \, dy = 0$.
Then the density of ρ is given by $p_\rho(x, y) = \frac{1}{|X|} \eta(y - f(x))$.
We have that
$\mathcal{E}(g) = \frac{1}{|X|} \int_{X \times Y} (g(x) - y)^2 \, \eta(y - f(x)) \, dy \, dx = \frac{1}{|X|} \int_{X \times Y} (g(x) - f(x) - y)^2 \, \eta(y) \, dy \, dx$
$= \frac{1}{|X|} \int_{X \times Y} (g(x) - f(x))^2 \, \eta(y) \, dy \, dx + \frac{1}{|X|} \int_{X \times Y} y^2 \, \eta(y) \, dy \, dx,$
where the cross term drops out since $\int_Y y \, \eta(y) \, dy = 0$.
The learning problem finds f !
-
Example 2: Classification
Suppose that there is a function f which maps a matrix $x \in [0,1]^{256 \times 256}$ to a histogram $f(x) \in \mathbb{R}^{10}_+$. We consider the vector $f(x) / \sum_{i=1}^{10} f(x)_i$ as a histogram describing which digit the image x represents.
Let ρ be any measure on Z = X × Y which generates the measurement data we get to see (ρ will not be known to us!!!)
Now, a function f as above will in general not exist for our problem. But we can look for the function fρ which minimizes the least squares error E – this will be the optimal explanation of the measurements in terms of a functional relation between X and Y!
-
Regression Function
Conditional Probability
Let $p_\rho$ be the density of ρ. Then the measure $\rho(\cdot|x)$ on Y has density
$\frac{p_\rho(x, \cdot)}{\int_Y p_\rho(x, y) \, dy}.$
Regression Function
The regression function is defined as
$f_\rho(x) := \int_Y y \, d\rho(y|x).$
Variance
The variance is defined as
$\sigma_\rho^2 := \mathcal{E}(f_\rho).$
-
The Regression Function solves the Learning Problem
Marginal
The marginal of ρ on X is defined as the measure
$\rho_X(A) := \rho(A \times Y).$
Proposition
For every f : X → Y it holds that
$\mathcal{E}(f) = \int_X (f(x) - f_\rho(x))^2 \, d\rho_X(x) + \sigma_\rho^2.$
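A short proof sketch (standard argument, filling in the step the slides omit): write $f(x) - y = (f(x) - f_\rho(x)) + (f_\rho(x) - y)$ and expand the square,
$\mathcal{E}(f) = \int_X (f(x) - f_\rho(x))^2 \, d\rho_X(x) + 2\int_Z (f(x) - f_\rho(x))(f_\rho(x) - y) \, d\rho(x, y) + \int_Z (f_\rho(x) - y)^2 \, d\rho(x, y).$
The last term is $\mathcal{E}(f_\rho) = \sigma_\rho^2$, and the middle term vanishes because $\int_Y (f_\rho(x) - y) \, d\rho(y|x) = 0$ by the definition of the regression function.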
Corollary
The regression function solves the learning problem!
Are we done??
We don’t know ρ!!!
-
The Next Goal
We do have access to a finite number of random samples of ρ.
Goal
Approximate (or “learn”) the regression function fρ from a finite number of random samples w.r.t. the measure ρ on Z.
-
1.2.2 Empirical Minimization, Hypothesis Space, and Bias-Variance Tradeoff
-
Sampling
Empirical Error
Let $z := ((x_1, y_1), \dots, (x_m, y_m)) \in Z^m$ be i.i.d. samples w.r.t. ρ. Define the empirical error
$\mathcal{E}_z(f) := \frac{1}{m}\sum_{i=1}^m (f(x_i) - y_i)^2.$
The empirical error can actually be computed!
Defect
The defect of f is defined as
Lz(f) := E(f)− Ez(f).
Can we control the defect? If yes, we actually have some hope of approximating the regression function.
-
Concentration Inequalities
Bernstein Inequality
Suppose that ξ is a r.v. on a probability space (Z, ρ) with mean $\mathbb{E}(\xi) = \mu$ and variance $\mathbb{V}(\xi) = \sigma^2$. Suppose that $|\xi(z) - \mu| \le M$ with probability 1. Then
$\mathbb{P}_{z \in Z^m}\left\{ \left| \frac{1}{m}\sum_{i=1}^m \xi(z_i) - \mu \right| \ge \varepsilon \right\} \le 2 e^{-\frac{m\varepsilon^2}{2(\sigma^2 + \frac{1}{3}M\varepsilon)}}.$
-
Bounding the Defect
Theorem
For f : X → Y define $f_Y : Z \to Y$ via $f_Y(x, y) = f(x) - y$ and let $\sigma^2 = \mathbb{V}(f_Y^2)$. Suppose that $|f(x) - y| \le M$ almost everywhere. Then, for all ε > 0,
$\mathbb{P}_{z \in Z^m}\{|L_z(f)| \le \varepsilon\} \ge 1 - 2 e^{-\frac{m\varepsilon^2}{2(\sigma^2 + \frac{1}{3}M^2\varepsilon)}}.$
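Sketch of the argument (not spelled out on the slide): apply the Bernstein inequality to $\xi(z) := f_Y(z)^2 = (f(x) - y)^2$. Its mean is $\mu = \mathcal{E}(f)$, its empirical average over the sample z is $\mathcal{E}_z(f)$, and $|\xi - \mu| \le M^2$ almost surely, so the defect $L_z(f) = \mathcal{E}(f) - \mathcal{E}_z(f)$ is exactly the deviation that Bernstein controls.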
Are we done??
Any f that interpolates the sample points (f(xi) = yi) makes the empirical error vanish!!!
-
Hypothesis Space
Definition
Let H be a compact subset of the Banach space {f : X → Y continuous} with norm ‖f‖ := max_{x∈X} |f(x)|. We call H the hypothesis space or model space.
Target Function in H
Define the target function in H via
$f_{\mathcal{H}} := \operatorname*{argmin}_{f \in \mathcal{H}} \mathcal{E}(f).$
Empirical Target Function
Given $z = ((x_1, y_1), \dots, (x_m, y_m)) \in Z^m$ random samples, define the empirical target function as
$f_{\mathcal{H},z} := \operatorname*{argmin}_{f \in \mathcal{H}} \mathcal{E}_z(f).$
The empirical target function can be computed!
-
Sample- and Approximation Error
For a hypothesis class H let
$f_{\mathcal{H}} := \operatorname*{argmin}_{f \in \mathcal{H}} \mathcal{E}(f) = \operatorname*{argmin}_{f \in \mathcal{H}} \int_X (f(x) - f_\rho(x))^2 \, d\rho_X(x).$
For $f \in \mathcal{H}$ let $\mathcal{E}_{\mathcal{H}}(f) := \mathcal{E}(f) - \mathcal{E}(f_{\mathcal{H}}) \ge 0$.
The error $\mathcal{E}(f_{\mathcal{H},z})$ of the empirical target function decomposes into
$\mathcal{E}(f_{\mathcal{H},z}) = \mathcal{E}_{\mathcal{H}}(f_{\mathcal{H},z}) + \mathcal{E}(f_{\mathcal{H}}).$
The first term is called the sample error and the second term is called the approximation error.
Our goal is to make the error $\mathcal{E}(f_{\mathcal{H},z})$ as small as possible! Think about what happens if we fix m, the number of samples, and enlarge the hypothesis space H.
-
Figure: Blue: fH, Red: fH,z, m = 10, H = polynomials of degree 5, 15, 20 (from top left to bottom).
If H is too complex, the sampling error increases.
-
The Bias-Variance Trade-Off
If we keep the sample size m fixed and enlarge the hypothesis space H, the approximation error will certainly decrease, BUT the sample error will increase – this is exactly what we observed experimentally!
Bishop [Neural Networks for Pattern Recognition (1995)]
“A model which is too simple, or too inflexible, will have a large bias, while one which has too much flexibility in relation to the particular data set will have a large variance. Bias and variance are complementary quantities, and the best generalization is obtained when we have the best compromise between the conflicting requirements of small bias and small variance.”
Bias-Variance Problem
What are the precise relations between the number of samples m and the “capacity” of our hypothesis space H?
-
Covering Numbers
Definition
Let S be a metric space and s > 0. Define the covering number N(S, s) to be the minimal l ∈ N such that there exist l disks in S with radius s covering S.
The scaling of N(S, s) with s is a measure of the complexity of S, termed metric entropy.
-
Abstract Analysis of Sample Error
Theorem
Let $\mathcal{H} \subset C(X)$ be a hypothesis class. Assume that for all $f \in \mathcal{H}$ it holds that $|f(x) - y| < M$ a.e. Then, for all ε > 0,
$\mathbb{P}_{z \in Z^m}\left( \sup_{f \in \mathcal{H}} |L_z(f)| \le \varepsilon \right) \ge 1 - \mathcal{N}\left(\mathcal{H}, \frac{\varepsilon}{8M}\right) 2 e^{-\frac{m\varepsilon^2}{4(2\sigma^2 + \frac{1}{3}M^2\varepsilon)}},$
where $\sigma^2 := \sup_{f \in \mathcal{H}} \sigma^2(f_Y^2)$.
Proof is left as an exercise!
-
Abstract Analysis of Sample Error
Lemma
Let ε > 0 and 0 < δ < 1 such that
$\mathbb{P}_{z \in Z^m}\left( \sup_{f \in \mathcal{H}} |L_z(f)| \le \varepsilon \right) \ge 1 - \delta.$
Then
$\mathbb{P}_{z \in Z^m}\left( \mathcal{E}_{\mathcal{H}}(f_{\mathcal{H},z}) \le 2\varepsilon \right) \ge 1 - \delta.$
Theorem
Let H be a hypothesis class. Assume that for all f ∈ H it holds that $|f(x) - y| < M$ a.e. Then, for all ε > 0,
$\mathbb{P}_{z \in Z^m}\left( \mathcal{E}_{\mathcal{H}}(f_{\mathcal{H},z}) \le \varepsilon \right) \ge 1 - \mathcal{N}\left(\mathcal{H}, \frac{\varepsilon}{16M}\right) 2 e^{-\frac{m\varepsilon^2}{8(2\sigma^2 + \frac{1}{3}M^2\varepsilon)}},$
where $\sigma^2 := \sup_{f \in \mathcal{H}} \sigma^2(f_Y^2)$.
-
Abstract Analysis of Sample Error
Question
Given ε, δ > 0, how many samples m do we need such that the probability that the sample error is ≤ ε is at least 1 − δ?
Answer
By the previous theorem it suffices to choose
$m \ge \frac{8\left(4\sigma^2 + \frac{1}{3}M^2\varepsilon\right)}{\varepsilon^2} \left( \ln\left(2\mathcal{N}\left(\mathcal{H}, \frac{\varepsilon}{16M}\right)\right) + \ln\left(\frac{1}{\delta}\right) \right).$
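A quick MATLAB evaluation of this bound for illustrative values (all numbers below, including the covering number N, are made-up placeholders just to show the arithmetic):

% evaluate the sample-size bound of the previous slide
epsilon = 0.1; delta = 0.05; sigma2 = 1; M = 1;   % illustrative values
N = 1e6;                                          % assumed covering number N(H, epsilon/(16M))
m = 8*(4*sigma2 + M^2*epsilon/3)/epsilon^2 * (log(2*N) + log(1/delta));
fprintf('m >= %d samples suffice\n', ceil(m));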
Question
How to bound the covering number?
-
Linear Regression
Recall
$\mathcal{H}_R = \mathrm{span}\{\varphi_1, \dots, \varphi_l\} \cap \{f \in C(X) : \|f\| \le R\} \subset C(X).$
Theorem
Let $T := \left\| \sum_{j=1}^l |\varphi_j| \right\|$. Then
$\ln(\mathcal{N}(\mathcal{H}_R, \eta)) \le l \cdot \ln\left(\frac{4RT}{\eta}\right).$
In the motivational section on linear regression we have seen that $f_{\mathcal{H},z}$ can be found by solving an l-dimensional linear system.
-
1.2.3 Reproducing Kernel HilbertSpaces (RKHS)
-
Reproducing Kernel Hilbert Spaces (RKHS)
Motivation
Suppose we have a ‘similarity measure’ K : X × X → R on X and we would like to do things like nearest neighbour search, PCA etc. with respect to this measure of similarity. One idea is to associate each x ∈ X with a feature map Φx which is an element of a high-dimensional inner-product space, and which ‘linearizes’ the similarity measure in the sense that
K(x, x′) = 〈Φx,Φx′〉.
Question
Which conditions on K guarantee the existence of a feature map?
-
Mercer Theorem
Definition
K : X × X → R is symmetric if K(x, x′) = K(x′, x) for all x, x′ ∈ X. Let x = {x1, . . . , xk} ⊂ X and K[x] ∈ Rk×k with entries K(xi, xj) the Gramian of K at x. K is called positive semidefinite if its Gramian is always positive semidefinite. K is called a Mercer kernel if it is symmetric, positive semidefinite and continuous.
Theorem
There exists a unique Hilbert space HK of functions on X satisfying
1. the functions $K_x : x' \mapsto K(x, x')$ are in HK,
2. the span of the Kx’s is dense in HK, and
3. for all f ∈ HK we have $f(x) = \langle f, K_x \rangle$.
In particular, $K(x, x') = \langle K_x, K_{x'} \rangle$ and in this sense, the RKHS HK can be regarded as a feature space.
-
Examples I
Dot-Product Kernels
Let X be the ball of radius T in Rn and $K(x, x') = \sum_{d=1}^\infty a_d (x \cdot x')^d$, where $a_d \ge 0$ for all d and $\sum_d a_d T^{2d} < \infty$. Then K is a Mercer kernel, called a dot-product kernel.
-
Examples II
Translation-Invariant Kernels
Suppose that k : Rn → R is such that its Fourier transform is real-valued and non-negative. Then K(x, x′) := k(x − x′) is a Mercer kernel, called a translation-invariant kernel.
Example
Let $k = \chi_{[-1,1]} * \chi_{[-1,1]}$ be the cardinal B-spline of degree one. Then
$K(x, x') = \begin{cases} 1 - \frac{|x - x'|}{2} & |x - x'| \le 2 \\ 0 & \text{else.} \end{cases}$
-
Examples III
Radial Basis Functions (RBF)
Suppose that $f : \mathbb{R}_+ \to \mathbb{R}$ is completely monotonic (i.e. $(-1)^k f^{(k)} \ge 0$). Then $K(x, x') := f(|x - x'|^2)$ is a Mercer kernel, called an RBF kernel.
Example
A Gaussian $f(t) := e^{-t/c^2}$ and an inverse multiquadric $(c^2 + |t|)^{-\alpha}$, α > 0, are completely monotonic and define corresponding RBF kernels.
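In MATLAB, such kernels can be written as anonymous functions and passed as the ‘inline function kernel’ to the GramAssemble/KernelInterpol routines shown later in this section (the bandwidth c below is an illustrative choice, not a recommendation):

c = 0.5;                                                   % illustrative bandwidth
gauss_kernel = @(x, y) exp(-sum((x - y).^2, 1)/c^2);       % Gaussian RBF kernel
imq_kernel   = @(x, y) (c^2 + sum((x - y).^2, 1)).^(-1);   % inverse multiquadric, alpha = 1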
-
Covering Numbers
Theorem
For R > 0 denote by BR the ball of radius R in an RKHS HK. Then BR is a compact subset of C(X) and thus a valid hypothesis space.
Theorem
If $K \in C^s$, then
$\ln(\mathcal{N}(B_R, \eta)) \le C \cdot \mathrm{diam}(X)^n \, \|K\|_{C^s}^{n/s} \left(\frac{R}{\eta}\right)^{2n/s}.$
For specific kernels (such as Gaussian RBF kernels), much better results exist.
-
Computational Issues
Question
How can we determine fH,z?
Let $H_{K,z} := \mathrm{span}\{K_{x_1}, \dots, K_{x_m}\}$ and $P : H_K \to H_{K,z}$ the orthogonal projection. Then, since $f(x_i) = \langle f, K_{x_i} \rangle = \langle P(f), K_{x_i} \rangle = P(f)(x_i)$, we have $\mathcal{E}_z(f) = \mathcal{E}_z(P(f))$!!!
Theorem
We have $f_{\mathcal{H},z} = \sum_{i=1}^m c_i^* K_{x_i}$, where $(c_i^*)$ is a minimizer of
$\frac{1}{m} \sum_{j=1}^m \left( \sum_{i=1}^m c_i K(x_i, x_j) - y_j \right)^2 \quad \text{s.t.} \quad c^T K[x] c \le R^2.$
This is a convex quadratic program that can be efficiently solved by interior point methods! Check out http://cvxr.com/cvx/!
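A sketch of how this constrained program could be set up in CVX (this assumes CVX is installed and on the MATLAB path; K, y and R are our own variable names for the Gram matrix K[x], the output vector and the radius):

% constrained kernel least squares via CVX
% K: m-by-m Gram matrix K[x]; y: column vector of outputs; R: radius
m = length(y);
cvx_begin quiet
    variable c(m)
    minimize( sum_square(K*c - y)/m )
    subject to
        quad_form(c, K) <= R^2    % requires K to be (numerically) positive semidefinite
cvx_end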
-
1.2.4 Regularization
-
Lagrange Multipliers
Theorem
Let U be a Hilbert space and F, H : U → R smooth. Let c ∈ U be a solution of the problem
$\min_f F(f) \quad \text{s.t.} \quad H(f) \le 0.$
Then there exist µ, λ ∈ R with
$\mu \, DF(c) + \lambda \, DH(c) = 0.$
If H(c) < 0 then λ = 0 and µ ≠ 0 (we would have an unconstrained stationary point). If DH(c) ≠ 0 then we can take µ = 1 and λ ≥ 0.
-
From Constrained to Regularized Least Squares
Let z be such that K[x] is invertible and denote by $f_{B_R,z}$ as above the minimizer of $F = \mathcal{E}_z$ s.t. $H(f) \le 0$ with $H(f) := \|f\|_{H_K}^2 - R^2$.
Let $f^*$ be the unconstrained minimizer of $\mathcal{E}_z$ in $H_K$ such that $R_0 := \|f^*\|_{H_K}$ is minimal. Since K[x] is invertible, one can show that for all $R < R_0$ we have $DF(f_{B_R,z}) \neq 0$ and $DH(f_{B_R,z}) \neq 0$, and therefore, by the result on Lagrange multipliers, there exists λ(R) > 0 with
$DF(f_{B_R,z}) + \lambda(R) \, DH(f_{B_R,z}) = 0.$
Theorem
The function $f_{B_R,z}$ is also a solution to the unconstrained regularized problem to minimize
$\mathcal{E}_z(f) + \lambda(R) \cdot \|f\|_{H_K}^2.$
-
From Regularized to Constrained Least Squares
Let z be such that K[x] is invertible and denote by $f_{\lambda,z}$ as above the minimizer of the convex unconstrained problem to minimize $\mathcal{E}_z(f) + \lambda \cdot \|f\|_{H_K}^2$. Let $R(\lambda) := \|f_{\lambda,z}\|_{H_K}$.
Theorem
The function $f_{\lambda,z}$ is also a solution to the constrained problem to minimize
$\mathcal{E}_z(f) \quad \text{s.t.} \quad \|f\|_{H_K}^2 \le R(\lambda)^2.$
Regularized and Constrained Least Squares are equivalent!
-
Numerical Solution of Regularized Least Squares
Theorem
We have $f_{\lambda,z} = \sum_{i=1}^m c_i^* K_{x_i}$, where $c^* = (c_i^*)$ is a solution of the regular linear equation
$(\lambda m I + K[x]) \, c = y.$
Extremely easy to implement!
-
Regularized Least Squares
Figure: Example for kernel regression with Gaussian kernel. In clockwise order: correct choice of λ, overfitting (λ too small), underfitting (λ too large).
-
It’s really simple
function G = GramAssemble(x, kernel)
% Assemble the Gram matrix associated with data points x.
% The parameters are a data matrix x (one point per column) and an
% inline function kernel. Output is the Gram matrix.
n = size(x, 2);
G = zeros(n);
for i = 1:n
    for j = 1:n
        G(i, j) = kernel(x(:, i), x(:, j));
    end
end
-
It’s really simple
function v = KernelInterpol(x, y, t, kernel, gamma)
% The regularized least squares problem is solved.
% Inputs are sample data x, y, an inline function kernel which describes
% the kernel, and a regularization parameter gamma.
% Output is a vector v given by the estimated regression function,
% evaluated on a vector t of test points.
n = length(x);
% solve regularized least squares
G = GramAssemble(x, kernel);
A = G + n*gamma*eye(n);
c = pinv(A)*y';
% create function values of the empirical target function at grid t
v = zeros(size(t, 2), 1);
for i = 1:n
    % v_l = v_l + c(i)*K_{x_i}(t_l); reshape forces a column vector
    v = v + c(i)*reshape(kernel(repmat(squeeze(x(:, i)), 1, size(t, 2)), t), [], 1);
end
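An illustrative call on synthetic 1-d data (sample size, kernel bandwidth and gamma are example choices, not prescriptions):

x = linspace(0, 1, 30);                         % sample points (row vector)
y = sin(2*pi*x) + 0.1*randn(1, 30);             % noisy observations
t = linspace(0, 1, 200);                        % test grid
kernel = @(u, v) exp(-sum((u - v).^2, 1)/0.1);  % Gaussian kernel
gamma = 1e-3;                                   % regularization parameter
v = KernelInterpol(x, y, t, kernel, gamma);
plot(t, v, x, y, 'o');                          % estimated regression function vs. data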
-
1.2.5 A Bayesian Interpretation
-
Bayes’ Theorem
Suppose that the probability of measuring f ∈ HK is equal to $\mathbb{P}(f) = c \cdot \exp(-\|f\|_{H_K}^2)$.
Suppose that we observe a function f, corrupted with Gaussian noise. Then the probability of making the observations $z = ((x_1, y_1), \dots, (x_m, y_m))$ from the signal f is equal to
$\mathbb{P}(z|f) = d \cdot \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^m (f(x_i) - y_i)^2 \right).$
Bayes’ Theorem yields that
$\mathbb{P}(f|z) = \frac{\mathbb{P}(z|f) \cdot \mathbb{P}(f)}{\mathbb{P}(z)} \sim \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^m (f(x_i) - y_i)^2 - \|f\|_{H_K}^2 \right).$
-
Maximum A Posteriori (MAP) Estimate
MAP Estimate
The MAP estimate maximizes the a posteriori probability P(f|z), given an a priori distribution on f and on the noise.
For the a priori distribution $\mathbb{P}(f) = c \cdot \exp(-\|f\|_{H_K}^2)$ and Gaussian noise, the solution of the regularized least squares problem is also the MAP estimate!
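The correspondence becomes explicit by taking negative logarithms (a short sketch): maximizing $\mathbb{P}(f|z)$ is the same as minimizing $\frac{1}{2\sigma^2} \sum_{i=1}^m (f(x_i) - y_i)^2 + \|f\|_{H_K}^2$, and multiplying by $\frac{2\sigma^2}{m}$ turns this into the regularized problem $\mathcal{E}_z(f) + \lambda \|f\|_{H_K}^2$ with $\lambda = \frac{2\sigma^2}{m}$.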
-
1.3 Classification
-
1.3.1 Regularized Classification
-
The Classification Problem
We now aim at classifying data into two classes and thus look for f : X → {−1, 1}. Therefore, let’s put Y = {−1, 1} and Z := X × Y.
Misclassification Error
Given a probability measure ρ on Z and f : X → Y, define the misclassification error as
$R(f) := \mathbb{P}_{z \in Z}(f(x) \neq y) = \int_X \mathbb{P}(f(x) \neq y \,|\, x) \, d\rho_X(x).$
The classification problem asks to minimize the misclassification error.
-
Bayes Rule
Bayes Rule
Define the Bayes rule as
fc := sgn(fρ).
Theorem
The Bayes rule minimizes the misclassification error
We can re-use everything!
-
Bayes Rule
Figure: Bayes rule for Gaussian kernel regression. Left: sample data. Right: estimate using the Bayes rule.
-
Case Study: Breast Cancer Detection
-
Classification Results based on Kernel Regression
Size of dataset m = 683 and dimensionality of feature space d = 10.
We obtain 95 percent classification accuracy from only 68 training samples and a linear kernel (see MATLAB codes on the webpage)!
-
Visualizing Data
Data is very well-clustered...
But how did we obtain this visualization of our 10-dimensional dataset?
-
1.3.2 (Kernel) Principal ComponentAnalysis (PCA)
-
Dimensionality Reduction
Dimensionality Reduction Problem
Given dataset $x = (x_i)_{i=1}^m \subset \mathbb{R}^d$ with d large.
Goal: Construct a map $\Phi : \mathbb{R}^d \to \mathbb{R}^s$ with $s \ll d$.
-
What is a good projection?
Pick subspace which maximizes variance of the projected dataset.
-
PCA
PCA Problem
Look for an s-dimensional affine subspace with associated orthogonal projection Φ such that the variance $\sum_{i=1}^m \left| \Phi\left(x_i - \frac{1}{m}\sum_{j=1}^m x_j\right) \right|^2$ is maximized.
PCA Solution
Suppose w.l.o.g. (why?) that the data is centered, i.e., $\frac{1}{m}\sum_{j=1}^m x_j = 0$.
Define the empirical covariance matrix $G = \frac{1}{m}\sum_{i=1}^m x_i x_i^T \in \mathbb{R}^{d \times d}$.
Then the solution is given by the subspace spanned by the first s normalized eigenvectors $u_1, \dots, u_s$ of G and $\Phi(x) = \sum_{l=1}^s (x \cdot u_l) u_l$.
-
It’s really simple
function z = pca(X)
% project the data X (one point per row) on its principal components
C = cov(X);                                 % empirical covariance matrix
[U, D, pc] = svd(C);                        % columns of pc are the principal directions
Xc = X - repmat(mean(X), size(X, 1), 1);    % center the data
z = Xc * pc;                                % project onto the principal components
scatter(z(:, 1), z(:, 2));                  % plot 2d projection
-
When PCA fails...
-
When PCA fails...
-
Kernel PCA
Construct a nonlinear Φ by applying linear PCA on the RKHS HK, mapping data points xi to their feature vectors Kxi!
This is a powerful idea, called kernelization!
Kernel PCA
Define the matrix $G = K[x] - 1_m K[x] - K[x] 1_m + 1_m K[x] 1_m \in \mathbb{R}^{m \times m}$, where $(1_m)_{i,j} = \frac{1}{m}$ for $i, j \in \{1, \dots, m\}$, and denote by $u_1, \dots, u_s$ the first s normalized (w.r.t. the inner product $u^T K[x] u$) eigenvectors of G. Then the projection Φ is defined as
$\Phi(x) = \left( \sum_{i=1}^m (u_1)_i K(x_i, x), \; \dots, \; \sum_{i=1}^m (u_s)_i K(x_i, x) \right)^T.$
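A minimal MATLAB sketch of these formulas, reusing the GramAssemble routine from above (the function name KernelPCA is ours, and numerical safeguards are omitted):

function Y = KernelPCA(x, kernel, s)
% Project the data points x (one point per column) onto their first s
% kernel principal components, following the formulas above.
m = size(x, 2);
K = GramAssemble(x, kernel);
E = ones(m)/m;                          % the matrix 1_m
G = K - E*K - K*E + E*K*E;              % centered Gram matrix
[U, D] = eig((G + G')/2);               % symmetrize against round-off
[~, idx] = sort(diag(D), 'descend');
U = U(:, idx(1:s));
for l = 1:s                             % normalize w.r.t. u' * K[x] * u = 1
    U(:, l) = U(:, l) / sqrt(U(:, l)' * K * U(:, l));
end
Y = (K*U)';                             % column j holds Phi(x_j)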
-
Example
-
Kernel PCA Denoising
Figure: Top: Linear PCA reconstruction from n principal components. Bottom: Gaussian kernel reconstruction from n principal components (find z with ‖Kz − Φ(x)‖HK minimal).
-
Literature: Sebastian Mika, Bernhard Schölkopf, Alex Smola, Klaus-Robert Müller, Matthias Scholz, Gunnar Rätsch. Kernel PCA and De-Noising in Feature Space. NIPS (1999).
Other methods: Multidimensional Scaling, Isomap, Diffusion Maps, ...
Try to appreciate the power of kernelization!
Go to https://archive.ics.uci.edu/ml/datasets.html for further datasets and play around with them!
-
1.3.3 (Kernel) Support Vector Machine(SVM)
-
Basic Idea
Suppose that the data points $(x_i)_{i=1}^m \subset \mathbb{R}^n$ to be classified are linearly separable, i.e. there exists a separating hyperplane defined by $w \in \mathbb{R}^n$, $|w| = 1$, and $b \in \mathbb{R}$ such that
$y_i = 1 \iff w \cdot x_i > b.$
Define the margin of a separating hyperplane defined by w, b as above by
$\Delta(w, b) := \min_{i=1,\dots,m} |w \cdot x_i - b|.$
Try to find a separating hyperplane with maximal margin!
-
The SVM
The SVM problem can be formalized by the following minimization problem
$\operatorname*{argmin}_{w, b} |w| \quad \text{s.t.} \quad y_i (w \cdot x_i - b) \ge 1 \text{ for all } i \in \{1, \dots, m\}.$
Data is in general not linearly separable...
Soft Margin SVM
Relax to
$\operatorname*{argmin}_{w, b} \frac{1}{m} \sum_{i=1}^m \Phi_{hl}(y_i (w \cdot x_i - b)) + \lambda |w|^2,$
where $\Phi_{hl}(t) := \max(0, 1 - t)$ is the hinge loss.
This is a convex quadratic program that can be efficiently solved!
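A sketch of the soft-margin problem in CVX, using the standard slack-variable reformulation of the hinge loss (assumes CVX is installed; X is the m×n data matrix with one point per row, y the ±1 label vector, and lambda an illustrative value):

% soft-margin SVM via CVX: xi_i plays the role of max(0, 1 - y_i*(w.x_i - b))
[m, n] = size(X);
lambda = 0.1;                       % illustrative regularization parameter
cvx_begin quiet
    variables w(n) b xi(m)
    minimize( sum(xi)/m + lambda*sum_square(w) )
    subject to
        y .* (X*w - b) >= 1 - xi;
        xi >= 0;
cvx_end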
-
K-SVM
K-SVM
Kernelization yields the problem
$\operatorname*{argmin}_{f \in H_K, \, b} \frac{1}{m} \sum_{i=1}^m \Phi_{hl}(y_i (f(x_i) - b)) + \lambda \|f\|_{H_K}^2.$
Separable Measures
This provably works for separable measures ρ, in the sense that there is $f_s \in H_K$ with $y f_s(x) > 0$ almost surely. It means that the data is separated by the zero level set of fs. Clearly, the Bayes classifier is then equal to sgn(fs).
For most data points the hinge loss will be zero, which implies that f will be sparse in $\{K_{x_i} : i = 1, \dots, m\}$, resulting in potentially big computational savings!
-
Literature: Felipe Cucker and Ding Xuan Zhou. Learning Theory: An Approximation Theory Viewpoint. Chapter 9.
The main advantage as compared to the classification method in Section 1.2 is that the solution will be sparse.
Experiment and Compare!