
  • 1. Mathematical Foundations of Machine Learning

  • 1.1 Basic Concepts

  • Definition of Learning

    Definition [Mitchell (1997)]

    “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

    ...recall that this is not pure mathematics...

  • The Task T

    Classification

    Compute f : R^n → {1, . . . , k} which maps data x ∈ R^n to a category in {1, . . . , k}. Alternative: compute f : R^n → R^k which maps data x ∈ R^n to a histogram with respect to k categories.

    Example: an image x of a handwritten digit is mapped to f(x) = 5.

  • The Task T

    Regression

    Predict a numerical value f : Rn → R.

    Examples: the expected claim amount of an insured person; algorithmic trading.

  • The Task T

    Density Estimation

    Estimate a probability density p : R^n → R_+ which can be interpreted as a probability distribution on the space that the examples were drawn from.

    Useful for many tasks in data processing; for example, if we observe corrupted data x̃, we may estimate the original x as the argmax of p(x̃ | x).

  • The Experience E

    The experience typically consists of a dataset containing many examples (a.k.a. data points).

    If these data points are labeled (for example, in the classification problem, if we know the class of each given data point), we speak of supervised learning.

    If these data points are not labeled (for example, in the classification problem, the algorithm would have to find the clusters itself from the given dataset), we speak of unsupervised learning.

  • The Performance Measure P

    In classification problems this is typically the accuracy, i.e., the proportion of examples for which the model produces the correct output.

    Often the given dataset is split into a training set on which the algorithm operates and a test set on which its performance is measured.

  • An Example: Linear Regression

    The Task

    Regression: Predict f : Rn → R.

    The Experience

    Training data ((x_i^train, y_i^train))_{i=1}^m.

    The Performance Measure

    Given test data ((x_i^test, y_i^test))_{i=1}^n we evaluate the performance of an estimator f̂ : R^n → R as the mean squared error

    (1/n) ∑_{i=1}^n |f̂(x_i^test) − y_i^test|².
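    As a small illustration (my own sketch, not from the slides; fhat, Xtest, ytest are hypothetical names), this performance measure is a one-liner in MATLAB:

    % Sketch: test MSE of an estimator fhat on held-out data.
    % Xtest is d x n (one column per test point), ytest is 1 x n.
    fhat  = @(X) sum(X, 1);                    % hypothetical estimator, for illustration only
    Xtest = rand(3, 50); ytest = rand(1, 50);  % hypothetical test data
    mse   = mean((fhat(Xtest) - ytest).^2);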

  • An Example: Linear Regression

    The Computer Program

    Define a Hypothesis Space

    H = span{ϕ_1, . . . , ϕ_l} ⊂ C(R^n)

    and, given training data

    z = ((x_i, y_i))_{i=1}^m,

    define the empirical risk

    E_z(f) := (1/m) ∑_{i=1}^m (f(x_i) − y_i)².

    We then let our algorithm find the minimizer (a.k.a. empirical target function)

    f_{H,z} := argmin_{f∈H} E_z(f).

  • An Example: Linear Regression

    Computing the Empirical Target Function

    Let A = (ϕ_j(x_i))_{i,j} ∈ R^{m×l}.

    Every f ∈ H can be written as ∑_{i=1}^l w_i ϕ_i, and we denote w := (w_i)_{i=1}^l.

    We let y := (y_i)_{i=1}^m.

    We get that E_z(f) = (1/m) ‖Aw − y‖².

    A minimizer is given by w* := A†y, and we get our estimate

    f* := ∑_{i=1}^l (w*)_i ϕ_i.
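    A minimal MATLAB sketch of this computation (my own illustration; the monomial basis and the synthetic data below are made up, not part of the lecture):

    % Sketch: empirical target function for linear regression via the pseudoinverse.
    m = 20; x = linspace(0, 1, m)';           % hypothetical 1d training inputs
    y = sin(2*pi*x) + 0.1*randn(m, 1);        % hypothetical noisy training labels
    l = 4;                                    % basis phi_j(x) = x^(j-1), j = 1..l
    A = x.^(0:l-1);                           % A(i,j) = phi_j(x_i), size m x l
    w = pinv(A) * y;                          % w* = A^dagger * y
    fstar = @(t) (t(:).^(0:l-1)) * w;         % f*(t) = sum_j w_j phi_j(t)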

  • Degree too low: underfitting. Degree too high: overfitting!

  • Figure: Error with Polynomial Degree

    Bias-Variance Problem

    “Capacity” of the hypothesis space has to be adapted to the complexity of the target function and the sample size!

  • Until Next Time...

    Please brush up on your probability knowledge. Required notions: probability measure, expectation, conditional measure, marginal measure, ...

    If you need help, start with Chapter I.3 of www.deeplearningbook.org.

    You may want to read Chapter I.1 of [Cucker-Smale, On the Mathematical Foundations of Learning].


  • 1.2 Mathematical Foundations of General Regression Problems

  • 1.2.1 Basic Definitions

  • The Mathematical Learning Problem

    Given (compact) X (⊂ R^d) and Y = R^k and a probability measure ρ on Z := X × Y, define the least squares error

    E(f) := ∫_Z (f(x) − y)² dρ(x, y).

    The learning problem asks for the function f which minimizes E(f).

  • Example 1: Regression

    Suppose our data is generated from noisy observations

    y = f(x) + ξ,

    where ξ is a r.v. with probability density η on Y = R^k satisfying ∫_Y y η(y) dy = 0.

    Then the density of ρ is given by p_ρ(x, y) = (1/|X|) η(y − f(x)).

    We have that

    E(g) = (1/|X|) ∫_{X×Y} (g(x) − y)² η(y − f(x)) dy dx
         = (1/|X|) ∫_{X×Y} (g(x) − f(x) − y)² η(y) dy dx
         = (1/|X|) ∫_{X×Y} (g(x) − f(x))² η(y) dy dx + (1/|X|) ∫_{X×Y} y² η(y) dy dx,

    where the second equality substitutes y ↦ y + f(x) and the cross term vanishes because ∫_Y y η(y) dy = 0.

    The learning problem finds f!

  • Example 2: Classification

    Suppose that there is a function f which maps a matrix x ∈ [0, 1]^{256×256} to a histogram f(x) ∈ R^10_+. We consider the normalized vector f(x) / ∑_{i=1}^{10} f(x)_i as a histogram describing which digit the image x represents.

    Let ρ be any measure on Z = X × Y which generates the measurement data we get to see (ρ will not be known to us!!!).

    Now, a function f as above will in general not exist for our problem. But we can look for the function f_ρ which minimizes the least squares error E – this will be the optimal explanation of the measurements in terms of a functional relation between X and Y!

  • Regression Function

    Conditional Probability

    Let p_ρ be the density of ρ. Then the measure ρ(·|x) on Y has density

    p_ρ(x, ·) / ∫_Y p_ρ(x, y) dy.

    Regression Function

    The regression function is defined as

    f_ρ(x) := ∫_Y y dρ(y|x).

    Variance

    The variance is defined as

    σ²_ρ := E(f_ρ).

  • The Regression Function solves the Learning Problem

    Marginal

    The marginal of ρ on X is defined as the measure

    ρ_X(A) := ρ(A × Y).

    Proposition

    For every f : X → Y it holds that

    E(f) = ∫_X (f(x) − f_ρ(x))² dρ_X(x) + σ²_ρ.

    Corollary

    The regression function solves the learning problem!

    Are we done??

    We don’t know ρ!!!
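    A short proof sketch of the proposition, filled in here for completeness (it is not spelled out on the slide): write f(x) − y = (f(x) − f_ρ(x)) + (f_ρ(x) − y) and integrate over y first,

    E(f) = ∫_X ∫_Y ((f(x) − f_ρ(x)) + (f_ρ(x) − y))² dρ(y|x) dρ_X(x)
         = ∫_X (f(x) − f_ρ(x))² dρ_X(x) + 2 ∫_X (f(x) − f_ρ(x)) ∫_Y (f_ρ(x) − y) dρ(y|x) dρ_X(x) + E(f_ρ).

    The middle term vanishes since ∫_Y y dρ(y|x) = f_ρ(x), and the last term is σ²_ρ by definition.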

  • The Next Goal

    We do have access to a finite number of random samples of ρ.

    Goal

    Approximate (or “learn”) the regression function f_ρ from a finite number of random samples w.r.t. the measure ρ on Z.

  • 1.2.2 Empirical Minimization, Hypothesis Space, and Bias-Variance Tradeoff

  • Sampling

    Empirical Error

    Let z := ((x_1, y_1), . . . , (x_m, y_m)) ∈ Z^m be i.i.d. samples w.r.t. ρ. Define the empirical error

    E_z(f) := (1/m) ∑_{i=1}^m (f(x_i) − y_i)².

    The empirical error can actually be computed!

    Defect

    The defect of f is defined as

    Lz(f) := E(f)− Ez(f).

    Can we control the defect? If yes, we actually have some hope of approximating the regression function.

  • Concentration Inequalities

    Bernstein Inequality

    Suppose that ξ is a r.v. on a probability space (Z, ρ) with mean E(ξ) = µ and variance V(ξ) = σ². Suppose that |ξ(z) − µ| ≤ M with probability 1. Then

    P_{z∈Z^m} { |(1/m) ∑_{i=1}^m ξ(z_i) − µ| ≥ ε } ≤ 2 e^{−mε² / (2(σ² + (1/3)Mε))}.

  • Bounding the Defect

    Theorem

    For f : X → Y define f_Y : Z → Y via f_Y(x, y) = f(x) − y and let σ² = V(f_Y²). Suppose that |f(x) − y| ≤ M almost everywhere. Then, for all ε > 0,

    P_{z∈Z^m} { |L_z(f)| ≤ ε } ≥ 1 − 2 e^{−mε² / (2(σ² + (1/3)Mε))}.

    Are we done??

    Any f vanishing on the sample points makes the empirical error vanish!!!

  • Hypothesis Space

    Definition

    Let H be a compact subset of the Banach space {f : X → Y continuous} with norm ‖f‖ := max_{x∈X} |f(x)|. We call H the hypothesis space or model space.

    Target Function in H

    Define the target function in H via

    f_H := argmin_{f∈H} E(f).

    Empirical Target Function

    Given random samples z = ((x_1, y_1), . . . , (x_m, y_m)) ∈ Z^m, define the empirical target function as

    f_{H,z} := argmin_{f∈H} E_z(f).

    The empirical target function can be computed!

  • Sample- and Approximation Error

    For a hypothesis class H let

    f_H := argmin_{f∈H} E(f) = argmin_{f∈H} ∫_X (f(x) − f_ρ(x))² dρ_X(x).

    For f ∈ H let E_H(f) := E(f) − E(f_H) ≥ 0.

    The error E(f_{H,z}) of the empirical target function decomposes into

    E(f_{H,z}) = E_H(f_{H,z}) + E(f_H).

    The first term is called the sample error and the second term is called the approximation error.

    Our goal is to make this error as small as possible! Think about what happens if we fix m, the number of samples, and enlarge the hypothesis space H.

  • Figure: Blue: f_H, Red: f_{H,z}, m = 10, H = polynomials of degree 5, 15, 20 (from top left to bottom).

    If H is too complex, the sampling error increases.

  • The Bias-Variance Trade-Off

    If we keep the sample size m fixed and enlarge the hypothesis space H, the approximation error will certainly decrease, BUT the sample error will increase – this is exactly what we observed experimentally!

    Bishop [Neural Networks for Pattern Recognition (1995)]

    “A model which is too simple, or too inflexible, will have a large bias, while one which has too much flexibility in relation to the particular data set will have a large variance. Bias and variance are complementary quantities, and the best generalization is obtained when we have the best compromise between the conflicting requirements of small bias and small variance.”

    Bias-Variance Problem

    What are the precise relations between the number of samples m and the “capacity” of our hypothesis space H?

  • Covering Numbers

    Definition

    Let S be a metric space and s > 0. Define the covering number N(S, s) to be the minimal l ∈ N such that there exist l disks in S with radius s covering S.

    The scaling of N(S, s) as s → 0 is a measure of the complexity of S; the quantity ln N(S, s) is termed the metric entropy.
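    A simple worked example, added here for concreteness (not on the slide): the interval [0, T] ⊂ R can be covered by ⌈T/(2s)⌉ intervals of radius s, so N([0, T], s) ≤ ⌈T/(2s)⌉ and ln N grows like ln(1/s) as s → 0; for a bounded ball in R^d one instead gets growth like d · ln(1/s), so the dimension enters through the metric entropy.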

  • Abstract Analysis of Sample Error

    Theorem

    Let H ⊂ C(X) be a hypothesis class. Assume that for all f ∈ H it holds that |f(x) − y| < M a.e. Then, for all ε > 0,

    P_{z∈Z^m} ( sup_{f∈H} |L_z(f)| ≤ ε ) ≥ 1 − N(H, ε/(8M)) · 2 e^{−mε² / (4(2σ² + (1/3)M²ε))},

    where σ² := sup_{f∈H} σ²(f_Y²).

    Proof is left as an exercise!

  • Abstract Analysis of Sample Error

    Lemma

    Let ε > 0 and 0 < δ < 1 be such that

    P_{z∈Z^m} ( sup_{f∈H} |L_z(f)| ≤ ε ) ≥ 1 − δ.

    Then

    P_{z∈Z^m} ( E_H(f_{H,z}) ≤ 2ε ) ≥ 1 − δ.

    Theorem

    Let H be a hypothesis class. Assume that for all f ∈ H it holds that |f(x) − y| < M a.e. Then, for all ε > 0,

    P_{z∈Z^m} ( E_H(f_{H,z}) ≤ ε ) ≥ 1 − N(H, ε/(16M)) · 2 e^{−mε² / (8(2σ² + (1/3)M²ε))},

    where σ² := sup_{f∈H} σ²(f_Y²).

  • Abstract Analysis of Sample Error

    Question

    Given ε, δ > 0, how many samples m do we need such that the probability that the sample error is ≤ ε is at least 1 − δ?

    Answer

    By the previous theorem it suffices to choose

    m ≥ (8(4σ² + (1/3)M²ε) / ε²) · ( ln(2 N(H, ε/(16M))) + ln(1/δ) ).

    Question

    How to bound the covering number?

  • Linear Regression

    Recall

    H_R = span{ϕ_1, . . . , ϕ_l} ∩ {f ∈ C(X) : ‖f‖ ≤ R} ⊂ C(X).

    Theorem

    Let T := ‖∑_{j=1}^l |ϕ_j|‖. Then

    ln(N(H_R, η)) ≤ l · ln(4RT/η).

    In the motivational section on linear regression we have seen that f_{H,z} can be found by solving an l-dimensional linear system.
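    Combining this bound with the sample-size estimate from the previous slide gives something we can evaluate numerically. A small illustrative MATLAB sketch (all concrete numbers below are made up, not from the lecture):

    % Sketch: how many samples suffice for sample error <= epsilon with prob. >= 1-delta
    % on H_R, using the bound ln N(H_R, eta) <= l*ln(4*R*T/eta).
    l = 10; R = 1; T = 1;                % hypothetical hypothesis-space parameters
    M = 2; sigma2 = 1;                   % hypothetical bounds: |f(x)-y| <= M, sup of variances
    epsilon = 0.1; delta = 0.01;         % target accuracy and confidence
    logN = l * log(4*R*T*16*M/epsilon);  % ln N(H_R, epsilon/(16*M))
    m = 8*(4*sigma2 + M^2*epsilon/3)/epsilon^2 * (log(2) + logN + log(1/delta));
    fprintf('m >= %d samples suffice\n', ceil(m));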

  • 1.2.3 Reproducing Kernel Hilbert Spaces (RKHS)

  • Reproducing Kernel Hilbert Spaces (RKHS)

    Motivation

    Suppose we have a ‘similarity measure’ K : X × X → R on X and we would like to do things like nearest neighbour search, PCA etc. with respect to this measure of similarity. One idea is to associate each x ∈ X with a feature map Φ_x which is an element of a high-dimensional inner-product space, and which ‘linearizes’ the similarity measure in the sense that

    K(x, x′) = 〈Φx,Φx′〉.

    Question

    Which conditions on K guarantee the existence of a feature map?

  • Mercer Theorem

    Definition

    K : X × X → R is symmetric if K(x, x′) = K(x′, x) for all x, x′ ∈ X. For x = {x_1, . . . , x_k} ⊂ X we call K[x] ∈ R^{k×k} with entries K(x_i, x_j) the Gramian of K at x. K is called positive semidefinite if its Gramian is always positive semidefinite. K is called a Mercer kernel if it is symmetric, positive semidefinite and continuous.

    Theorem

    There exists a unique Hilbert space H_K of functions on X satisfying

    1. the functions K_x : x′ ↦ K(x, x′) are in H_K,
    2. the span of the K_x's is dense in H_K, and
    3. for all f ∈ H_K we have f(x) = 〈f, K_x〉.

    In particular, K(x, x′) = 〈K_x, K_{x′}〉, and in this sense the RKHS H_K can be regarded as a feature space.

  • Examples I

    Dot-Product Kernels

    Let X be the ball of radius T in R^n and K(x, x′) = ∑_{d=1}^∞ a_d (x · x′)^d, where a_d ≥ 0 for all d and ∑_d a_d T^{2d} < ∞. Then K is a Mercer kernel, called a dot-product kernel.

  • Examples II

    Translation-Invariant Kernels

    Suppose that k : R^n → R is such that its Fourier transform is real-valued and non-negative. Then K(x, x′) := k(x − x′) is a Mercer kernel, called a translation-invariant kernel.

    Example

    Let k = χ_{[−1,1]} ∗ χ_{[−1,1]} be the cardinal B-spline of degree one. Then

    K(x, x′) = 1 − |x − x′|/2  if |x − x′| ≤ 2,  and K(x, x′) = 0 else.

  • Examples III

    Radial Basis Functions (RBF)

    Suppose that f : R_+ → R is completely monotonic (i.e. (−1)^k f^(k) ≥ 0 for all k). Then K(x, x′) := f(|x − x′|²) is a Mercer kernel, called an RBF kernel.

    Example

    A Gaussian f(t) := e^{−t/c²} and an inverse multiquadric f(t) := (c² + |t|)^{−α}, α > 0, are completely monotonic and define corresponding RBF kernels.
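    Kernels like these can be passed as inline (anonymous) functions to the Gram-matrix routine shown later in this section; a small sketch of my own (the variable names and the bandwidth value are illustrative):

    % Hypothetical anonymous-function kernels, usable with GramAssemble below.
    c = 1;                                            % bandwidth parameter
    gauss   = @(x, xp) exp(-sum((x - xp).^2, 1)/c^2); % Gaussian RBF kernel
    bspline = @(x, xp) max(1 - abs(x - xp)/2, 0);     % degree-one B-spline kernel (1d data)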

  • Covering Numbers

    Theorem

    For R > 0 denote by B_R the ball of radius R in an RKHS H_K. Then B_R is a compact subset of C(X) and thus a valid hypothesis space.

    Theorem

    If K ∈ C^s then

    ln(N(B_R, η)) ≤ C · diam(X)^n · ‖K‖_{C^s}^{n/s} · (R/η)^{2n/s}.

    For specific kernels (such as Gaussian RBF kernels), much better results exist.

  • Computational Issues

    Question

    How can we determine fH,z?

    Let H_{K,z} := span{K_{x_1}, . . . , K_{x_m}} and P : H_K → H_{K,z} the orthogonal projection. Then, since f(x_i) = 〈f, K_{x_i}〉 = 〈P(f), K_{x_i}〉 = P(f)(x_i), we have E_z(f) = E_z(P(f))!!!

    Theorem

    We have f_{H,z} = ∑_{i=1}^m c*_i K_{x_i}, where (c*_i) is a minimizer of

    (1/m) ∑_{j=1}^m ( ∑_{i=1}^m c_i K(x_i, x_j) − y_j )²   s.t.   c^T K[x] c ≤ R².

    This is a convex quadratic program that can be efficiently solved by interior point methods! Check out http://cvxr.com/cvx/!

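    A minimal CVX sketch of this program (my own illustration, assuming the CVX package from the link above is installed; K denotes the Gram matrix K[x], y the label vector, R the chosen radius):

    % Sketch: solve min (1/m)*||K*c - y||^2  s.t.  c'*K*c <= R^2  with CVX.
    y = y(:);                        % ensure a column vector
    m = length(y);
    cvx_begin quiet
        variable c(m)
        minimize( sum_square(K*c - y)/m )
        subject to
            quad_form(c, K) <= R^2   % requires K to be (numerically) positive semidefinite
    cvx_end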

  • 1.2.4 Regularization

  • Lagrange Multipliers

    Theorem

    Let U be a Hilbert space and F, H : U → R smooth. Let c ∈ U be a solution of the problem

    min F(f)   s.t.   H(f) ≤ 0.

    Then there exist µ, λ ∈ R with

    µ DF(c) + λ DH(c) = 0.

    If H(c) < 0 then λ = 0 and µ ≠ 0 (we would have an unconstrained stationary point). If DH(c) ≠ 0 then we can take µ = 1 and λ ≥ 0.

  • From Constrained to Regularized Least Squares

    Let z be such that K[x] is invertible and denote by f_{B_R,z}, as above, the minimizer of F = E_z s.t. H(f) ≤ 0, where H(f) := ‖f‖²_{H_K} − R².

    Let f* be the unconstrained minimizer of E_z in H_K such that R_0 := ‖f*‖_{H_K} is minimal. Since K[x] is invertible, one can show that for all R < R_0 we have DF(f_{B_R,z}) ≠ 0 and DH(f_{B_R,z}) ≠ 0, and therefore, by the result on Lagrange multipliers, there exists λ(R) > 0 with

    DF(f_{B_R,z}) + λ(R) DH(f_{B_R,z}) = 0.

    Theorem

    The function f_{B_R,z} is also a solution of the unconstrained regularized problem of minimizing

    E_z(f) + λ(R) · ‖f‖²_{H_K}.

  • From Regularized to Constrained Least Squares

    Let z be such that K[x] is invertible and denote by f_{λ,z}, as above, the minimizer of the convex unconstrained problem of minimizing E_z(f) + λ · ‖f‖²_{H_K}. Let R(λ) := ‖f_{λ,z}‖_{H_K}.

    Theorem

    The function f_{λ,z} is also a solution of the constrained problem of minimizing

    E_z(f)   s.t.   ‖f‖²_{H_K} ≤ R(λ)².

    Regularized and Constrained Least Squares are equivalent!

  • Numerical Solution of Regularized Least Squares

    Theorem

    We have f_{λ,z} = ∑_{i=1}^m c*_i K_{x_i}, where c* = (c*_i) is a solution of the regular linear system

    (λ m I + K[x]) c = y.

    Extremely easy to implement!

  • Regularized Least Squares

    Figure: Example of kernel regression with a Gaussian kernel. In clockwise order: correct choice of λ, overfitting (λ too small), underfitting (λ too large).

  • It’s really simple

    function G = GramAssemble(x, kernel)
    % Assemble the Gram matrix associated with data points x.
    % The parameters are a data vector x and an inline function kernel.
    % Output is the Gram matrix.
    n = size(x, 2);
    G = zeros(n);
    for i = 1:n
        for j = 1:n
            G(i, j) = kernel(x(:, i), x(:, j));
        end
    end

  • It’s really simple

    function v = KernelInterpol(x, y, t, kernel, gamma)
    % Solve the regularized least squares problem.
    % Inputs are sample data x, y, an inline function kernel which describes
    % the kernel, and a regularization parameter gamma.
    % Output is a vector v given by the estimated regression function,
    % evaluated on a vector t of test points.
    n = length(x);
    % solve regularized least squares
    G = GramAssemble(x, kernel);
    A = G + n*gamma*eye(n);
    c = pinv(A)*y';
    % create function values of the empirical target function at the grid t
    v = zeros(size(t, 2), 1);
    for i = 1:n
        % v_l = v_l + c(i)*K_{x_i}(t_l); reshape ensures a column vector
        v = v + c(i)*reshape(kernel(repmat(squeeze(x(:, i)), 1, size(t, 2)), t), [], 1);
    end

  • 1.2.5 A Bayesian Interpretation

  • Bayes’ Theorem

    Suppose that the probability of measuring f ∈ H_K is equal to P(f) = c · exp(−‖f‖²_{H_K}). Suppose that we observe a function f, corrupted with Gaussian noise. Then the probability of making the observations z = ((x_1, y_1), . . . , (x_m, y_m)) from the signal f is equal to

    P(z|f) = d · exp( −(1/(2σ²)) ∑_{i=1}^m (f(x_i) − y_i)² ).

    Bayes’ Theorem yields that

    P(f|z) = P(z|f) · P(f) / P(z) ∼ exp( −(1/(2σ²)) ∑_{i=1}^m (f(x_i) − y_i)² − ‖f‖²_{H_K} ).

  • Maximum A Posteriori (MAP) Estimate

    MAP Estimate

    The MAP estimate maximizes the a posteriori probability P(f|z), given an a priori distribution on f and on the noise.

    For the a priori distribution P(f) = c · exp(−‖f‖²_{H_K}) and Gaussian noise, the solution of the regularized least squares problem is also the MAP estimate!
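    To see the connection explicitly (a one-line calculation filled in here, not spelled out on the slide): taking negative logarithms,

    −ln P(f|z) = (1/(2σ²)) ∑_{i=1}^m (f(x_i) − y_i)² + ‖f‖²_{H_K} + const,

    so maximizing P(f|z) amounts to minimizing E_z(f) + λ ‖f‖²_{H_K} with λ = 2σ²/m.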

  • 1.3 Classification

  • 1.3.1 Regularized Classification

  • The Classification Problem

    We now aim at classifying data into two classes and thus look for f : X → {−1, 1}. Therefore, let's put Y = {−1, 1} and Z := X × Y.

    Misclassification Error

    Given a probability measure ρ on Z and f : X → Y, define the misclassification error as

    R(f) := P_{z∈Z}(f(x) ≠ y) = ∫_X P(f(x) ≠ y | x) dρ_X.

    The classification problem asks to minimize the misclassification error.

  • Bayes Rule

    Bayes Rule

    Define the Bayes rule as

    fc := sgn(fρ).

    Theorem

    The Bayes rule minimizes the misclassification error.

    We can re-use everything!

  • Bayes Rule

    Figure: Bayes rule for Gaussian kernel regression. Left: sample data. Right: estimate using the Bayes rule.

  • Case Study: Breast Cancer Detection

  • Classification Results based on Kernel Regression

    Size of dataset m = 683 and dimensionality of feature space d = 10.

    We obtain 95 percent classification accuracy from only 68 training samples and a linear kernel (see the MATLAB codes on the webpage)!

  • Visualizing Data

    Data is very well-clustered...

    But how did we obtain this visualization of our 10-dimensional dataset?

  • 1.3.2 (Kernel) Principal Component Analysis (PCA)

  • Dimensionality Reduction

    Dimensionality Reduction Problem

    Given a dataset x = (x_i)_{i=1}^m ⊂ R^d with d large.

    Goal: Construct a map Φ : R^d → R^s with s ≪ d.

  • What is a good projection?

    Pick subspace which maximizes variance of the projected dataset.

  • PCA

    PCA Problem

    Look for an s-dimensional affine subspace with associated orthogonal projection Φ such that the variance ∑_{i=1}^m |Φ(x_i − (1/m) ∑_{j=1}^m x_j)|² is maximized.

    PCA Solution

    Suppose w.l.o.g. (why?) that the data is centered, i.e., (1/m) ∑_{j=1}^m x_j = 0.

    Define the empirical covariance matrix G = (1/m) ∑_{i=1}^m x_i x_i^T ∈ R^{d×d}.

    Then the solution is given by the subspace spanned by the first s normalized eigenvectors u_1, . . . , u_s of G, and Φ(x) = ∑_{l=1}^s (x · u_l) u_l.

  • It’s really simple

    function z = pca(X)
    % Project data X (one row per sample) onto its principal components.
    C = cov(X);                    % empirical covariance matrix
    [U, D, pc] = svd(C);           % columns of pc are the principal directions
    z = (X - mean(X, 1))*pc;       % coordinates of the centered data in the PCA basis
    scatter(z(:, 1), z(:, 2));     % plot 2d projection

  • When PCA fails...

  • When PCA fails...

  • Kernel PCA

    Construct a nonlinear Φ by applying linear PCA on the RKHS H_K, mapping data points x_i to their feature vectors K_{x_i}!

    This is a powerful idea, called kernelization!

    Kernel PCA

    Define the matrix G = K[x] − 1_m K[x] − K[x] 1_m + 1_m K[x] 1_m ∈ R^{m×m}, where (1_m)_{i,j} = 1/m for i, j ∈ {1, . . . , m}, and denote by u_1, . . . , u_s the first s normalized (w.r.t. the inner product u^T K[x] u) eigenvectors of G. Then the projection Φ is defined as

    Φ(x) = ( ∑_{i=1}^m (u_1)_i K(x_i, x), . . . , ∑_{i=1}^m (u_s)_i K(x_i, x) )^T.
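    A minimal MATLAB sketch of this construction in the style of the earlier snippets (my own illustration; it reuses GramAssemble from above, and x, kernel, s, xnew are assumed to be given):

    % Sketch: kernel PCA projection of a new point xnew onto the first s components.
    m  = size(x, 2);
    K  = GramAssemble(x, kernel);
    Om = ones(m)/m;                          % the matrix 1_m
    G  = K - Om*K - K*Om + Om*K*Om;          % centered Gram matrix
    [U, D] = eig((G + G')/2);                % symmetrize against round-off
    [~, idx] = sort(diag(D), 'descend');
    U = U(:, idx(1:s));                      % first s eigenvectors
    for l = 1:s
        U(:, l) = U(:, l)/sqrt(U(:, l)'*K*U(:, l));   % normalize w.r.t. u'*K[x]*u
    end
    kx = zeros(m, 1);
    for i = 1:m
        kx(i) = kernel(x(:, i), xnew);       % K(x_i, xnew)
    end
    phi = U'*kx;                             % the s-dimensional projection Phi(xnew)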

  • Example

  • Kernel PCA Denoising

    Figure: Top: linear PCA reconstruction from n principal components. Bottom: Gaussian kernel reconstruction from n principal components (find z with ‖K_z − Φ(x)‖_{H_K} minimal).

  • Literature: Sebastian Mika, Bernhard Schölkopf, Alex Smola, Klaus-Robert Müller, Matthias Scholz, Gunnar Rätsch. Kernel PCA and De-Noising in Feature Space. NIPS (1999).

    Other methods: Multidimensional Scaling, Isomap, Diffusion Maps, ...

    Try to appreciate the power of kernelization!

    Go to https://archive.ics.uci.edu/ml/datasets.html for further datasets and play around with them!


  • 1.3.3 (Kernel) Support Vector Machine (SVM)

  • Basic Idea

    Suppose that the data points (x_i)_{i=1}^m ⊂ R^n to be classified are linearly separable, i.e. there exists a separating hyperplane defined by w ∈ R^n, |w| = 1, and b ∈ R such that

    y_i = 1 ⇔ w · x_i > b.

    Define the margin of a separating hyperplane given by w, b as above by

    ∆(w, b) := min_{i=1,...,m} |w · x_i − b|.

    Try to find separating hyperplane with maximal margin!

  • The SVM

    The SVM problem can be formalized by the following minimization problem:

    argmin_{w,b} |w|   s.t.   y_i(w · x_i − b) ≥ 1 for all i ∈ {1, . . . , m}.

    Data is in general not linearly separable...

    Soft Margin SVM

    Relax to

    argmin_{w,b} (1/m) ∑_{i=1}^m Φ_hl(y_i(w · x_i − b)) + λ|w|²,

    where Φ_hl(t) := max(0, 1 − t) is the hinge loss.

    This is a convex quadratic program that can be efficiently solved!
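    A minimal CVX sketch of this soft-margin program (my own illustration, assuming CVX is installed; X is n × m with one data point per column, y is a ±1 row vector, and lambda is a chosen regularization parameter):

    % Sketch: soft-margin SVM via CVX; pos(t) = max(t, 0) is the hinge.
    [n, m] = size(X);
    cvx_begin quiet
        variables w(n) b
        minimize( sum(pos(1 - y'.*(X'*w - b)))/m + lambda*sum_square(w) )
    cvx_end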

  • K-SVM

    K-SVM

    Kernelization yields the problem

    argmin_{f∈H_K, b} (1/m) ∑_{i=1}^m Φ_hl(y_i(f(x_i) − b)) + λ‖f‖²_{H_K}.

    Separable Measures

    This provably works for separable measures ρ, in the sense that there is f_s ∈ H_K with y f_s(x) > 0 almost surely. It means that the data is separated by the zero level set of f_s. Clearly, the Bayes classifier is then equal to sgn(f_s).

    For most data points the hinge loss will be zero, which implies that f will be sparse in {K_{x_i} : i = 1, . . . , m}, resulting in potentially big computational savings!

  • Literature: Felipe Cucker and Ding Xuan Zhou. Learning Theory: An Approximation Theory Viewpoint. Chapter 9.

    The main advantage compared to the classification method in Section 1.2 is that the solution will be sparse.

    Experiment and Compare!