
  • 1. Mathematical Foundations of Machine Learning

  • 1.1 Basic Concepts

  • Definition of Learning

    Definition [Mitchell (1997)]

    “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

    ...recall that this is not pure mathematics...

  • The Task T

    Classification

    Compute f : R^n → {1, . . . , k} which maps data x ∈ R^n to a category in {1, . . . , k}. Alternative: compute f : R^n → R^k which maps data x ∈ R^n to a histogram with respect to k categories.

    Example: an image x of a handwritten digit is mapped to f(x) = 5.

  • The Task T

    Regression

    Predict a numerical value f : Rn → R.

    Examples: the expected claim amount of an insured person; algorithmic trading.

  • The Task T

    Density Estimation

    Estimate a probability density p : R^n → R_+ which can be interpreted as a probability distribution on the space that the examples were drawn from.

    Useful for many tasks in data processing; for example, if we observe corrupted data x̃, we may estimate the original x as the argmax of p(x̃ | x).

  • The Experience E

    The experience typically consists of a dataset containing many examples (a.k.a. data points).

    If these data points are labeled (for example, in the classification problem, if we know the class of each given data point), we speak of supervised learning.

    If these data points are not labeled (for example, in the classification problem, the algorithm would have to find the clusters itself from the given dataset), we speak of unsupervised learning.

  • The Performance Measure P

    In classification problems this is typically the accuracy, i.e., the proportion of examples for which the model produces the correct output.

    Often the given dataset is split into a training set on which the algorithm operates and a test set on which its performance is measured.

  • An Example: Linear Regression

    The Task

    Regression: Predict f : Rn → R.

    The Experience

    Training data ((x_i^train, y_i^train))_{i=1}^m.

    The Performance Measure

    Given test data ((x_i^test, y_i^test))_{i=1}^n we evaluate the performance of an estimator f̂ : R^n → R as the mean squared error

    (1/n) ∑_{i=1}^n |f̂(x_i^test) − y_i^test|².
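    As a small illustration (my own sketch, not from the slides; fhat, Xtest, ytest are hypothetical names), this performance measure is a one-liner in MATLAB:

    % Sketch: test MSE of an estimator fhat on held-out data.
    % Xtest is d x n (one column per test point), ytest is 1 x n.
    fhat  = @(X) sum(X, 1);                    % hypothetical estimator, for illustration only
    Xtest = rand(3, 50); ytest = rand(1, 50);  % hypothetical test data
    mse   = mean((fhat(Xtest) - ytest).^2);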

  • An Example: Linear Regression

    The Computer Program

    Define a Hypothesis Space

    H = span{ϕ_1, . . . , ϕ_l} ⊂ C(R^n)

    and, given training data

    z = ((x_i, y_i))_{i=1}^m,

    define the empirical risk

    E_z(f) := (1/m) ∑_{i=1}^m (f(x_i) − y_i)².

    We then let our algorithm find the minimizer (a.k.a. empirical target function)

    f_{H,z} := argmin_{f∈H} E_z(f).

  • An Example: Linear Regression

    Computing the Empirical Target Function

    Let A = (ϕ_j(x_i))_{i,j} ∈ R^{m×l}.

    Every f ∈ H can be written as ∑_{i=1}^l w_i ϕ_i, and we denote w := (w_i)_{i=1}^l.

    We let y := (y_i)_{i=1}^m.

    We get that E_z(f) = (1/m) ‖Aw − y‖².

    A minimizer is given by w* := A†y, and we get our estimate

    f* := ∑_{i=1}^l (w*)_i ϕ_i.
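    A minimal MATLAB sketch of this computation (my own illustration; the monomial basis and the synthetic data below are made up, not part of the lecture):

    % Sketch: empirical target function for linear regression via the pseudoinverse.
    m = 20; x = linspace(0, 1, m)';           % hypothetical 1d training inputs
    y = sin(2*pi*x) + 0.1*randn(m, 1);        % hypothetical noisy training labels
    l = 4;                                    % basis phi_j(x) = x^(j-1), j = 1..l
    A = x.^(0:l-1);                           % A(i,j) = phi_j(x_i), size m x l
    w = pinv(A) * y;                          % w* = A^dagger * y
    fstar = @(t) (t(:).^(0:l-1)) * w;         % f*(t) = sum_j w_j phi_j(t)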

  • Degree too low: underfitting. Degree too high: overfitting!

  • Figure: Error with Polynomial Degree

    Bias-Variance Problem

    “Capacity” of the hypothesis space has to be adapted to the complexity of the target function and the sample size!

  • Until Next Time...

    Please brush up on your probability knowledge. Required notions: probability measure, expectation, conditional measure, marginal measure, ...

    If you need help, start with Chapter I.3 of www.deeplearningbook.org.

    You may want to read Chapter I.1 of [Cucker-Smale, On the Mathematical Foundations of Learning].


  • 1.2 Mathematical Foundations of General Regression Problems

  • 1.2.1 Basic Definitions

  • The Mathematical Learning Problem

    Given (compact) X (⊂ R^d) and Y = R^k and a probability measure ρ on Z := X × Y, define the least squares error

    E(f) := ∫_Z (f(x) − y)² dρ(x, y).

    The learning problem asks for the function f which minimizes E(f).

  • Example 1: Regression

    Suppose our data is generated from noisy observations

    y = f(x) + ξ,

    where ξ is a r.v. with probability density η on Y = R^k satisfying ∫_Y y η(y) dy = 0.

    Then the density of ρ is given by p_ρ(x, y) = (1/|X|) η(y − f(x)).

    We have that

    E(g) = (1/|X|) ∫_{X×Y} (g(x) − y)² η(y − f(x)) dy dx
         = (1/|X|) ∫_{X×Y} (g(x) − f(x) − y)² η(y) dy dx
         = (1/|X|) ∫_{X×Y} (g(x) − f(x))² η(y) dy dx + (1/|X|) ∫_{X×Y} y² η(y) dy dx,

    where the second equality substitutes y ↦ y + f(x) and the cross term vanishes because ∫_Y y η(y) dy = 0.

    The learning problem finds f!

  • Example 2: Classification

    Suppose that there is a function f which maps a matrix x ∈ [0, 1]^{256×256} to a histogram f(x) ∈ R^10_+. We consider the normalized vector f(x) / ∑_{i=1}^{10} f(x)_i as a histogram describing which digit the image x represents.

    Let ρ be any measure on Z = X × Y which generates the measurement data we get to see (ρ will not be known to us!!!).

    Now, a function f as above will in general not exist for our problem. But we can look for the function f_ρ which minimizes the least squares error E – this will be the optimal explanation of the measurements in terms of a functional relation between X and Y!

  • Regression Function

    Conditional Probability

    Let p_ρ be the density of ρ. Then the measure ρ(·|x) on Y has density

    p_ρ(x, ·) / ∫_Y p_ρ(x, y) dy.

    Regression Function

    The regression function is defined as

    f_ρ(x) := ∫_Y y dρ(y|x).

    Variance

    The variance is defined as

    σ²_ρ := E(f_ρ).

  • The Regression Function solves the Learning Problem

    Marginal

    The marginal of ρ on X is defined as the measure

    ρ_X(A) := ρ(A × Y).

    Proposition

    For every f : X → Y it holds that

    E(f) = ∫_X (f(x) − f_ρ(x))² dρ_X(x) + σ²_ρ.

    Corollary

    The regression function solves the learning problem!

    Are we done??

    We don’t know ρ!!!
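    A short proof sketch of the proposition, filled in here for completeness (it is not spelled out on the slide): write f(x) − y = (f(x) − f_ρ(x)) + (f_ρ(x) − y) and integrate over y first,

    E(f) = ∫_X ∫_Y ((f(x) − f_ρ(x)) + (f_ρ(x) − y))² dρ(y|x) dρ_X(x)
         = ∫_X (f(x) − f_ρ(x))² dρ_X(x) + 2 ∫_X (f(x) − f_ρ(x)) ∫_Y (f_ρ(x) − y) dρ(y|x) dρ_X(x) + E(f_ρ).

    The middle term vanishes since ∫_Y y dρ(y|x) = f_ρ(x), and the last term is σ²_ρ by definition.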

  • The Next Goal

    We do have access to a finite number of random samples of ρ.

    Goal

    Approximate (or “learn”) the regression function f_ρ from a finite number of random samples w.r.t. the measure ρ on Z.

  • 1.2.2 Empirical Minimization, Hypothesis Space, and Bias-Variance Tradeoff

  • Sampling

    Empirical Error

    Let z := ((x_1, y_1), . . . , (x_m, y_m)) ∈ Z^m be i.i.d. samples w.r.t. ρ. Define the empirical error

    E_z(f) := (1/m) ∑_{i=1}^m (f(x_i) − y_i)².

    The empirical error can actually be computed!

    Defect

    The defect of f is defined as

    Lz(f) := E(f)− Ez(f).

    Can we control the defect? If yes, we actually have some hope of approximating the regression function.

  • Concentration Inequalities

    Bernstein Inequality

    Suppose that ξ is a r.v. on a probability space (Z, ρ) with mean E(ξ) = µ and variance V(ξ) = σ². Suppose that |ξ(z) − µ| ≤ M with probability 1. Then

    P_{z∈Z^m} { |(1/m) ∑_{i=1}^m ξ(z_i) − µ| ≥ ε } ≤ 2 e^{−mε² / (2(σ² + (1/3)Mε))}.

  • Bounding the Defect

    Theorem

    For f : X → Y define f_Y : Z → Y via f_Y(x, y) = f(x) − y and let σ² = V(f_Y²). Suppose that |f(x) − y| ≤ M almost everywhere. Then, for all ε > 0,

    P_{z∈Z^m} { |L_z(f)| ≤ ε } ≥ 1 − 2 e^{−mε² / (2(σ² + (1/3)Mε))}.

    Are we done??

    Any f vanishing on the sample points makes the empirical error vanish!!!

  • Hypothesis Space

    Definition

    Let H be a compact subset of the Banach space {f : X → Y continuous} with norm ‖f‖ := max_{x∈X} |f(x)|. We call H the hypothesis space or model space.

    Target Function in H

    Define the target function in H via

    f_H := argmin_{f∈H} E(f).

    Empirical Target Function

    Given random samples z = ((x_1, y_1), . . . , (x_m, y_m)) ∈ Z^m, define the empirical target function as

    f_{H,z} := argmin_{f∈H} E_z(f).

    The empirical target function can be computed!

  • Sample- and Approximation Error

    For a hypothesis class H let

    f_H := argmin_{f∈H} E(f) = argmin_{f∈H} ∫_X (f(x) − f_ρ(x))² dρ_X(x).

    For f ∈ H let E_H(f) := E(f) − E(f_H) ≥ 0.

    The error E(f_{H,z}) of the empirical target function decomposes into

    E(f_{H,z}) = E_H(f_{H,z}) + E(f_H).

    The first term is called the sample error and the second term is called the approximation error.

    Our goal is to make this error as small as possible! Think about what happens if we fix m, the number of samples, and enlarge the hypothesis space H.

  • Figure: Blue: f_H, Red: f_{H,z}, m = 10, H = polynomials of degree 5, 15, 20 (from top left to bottom).

    If H is too complex, the sampling error increases.

  • The Bias-Variance Trade-Off

    If we keep the sample size m fixed and enlarge the hypothesis space H, the approximation error will certainly decrease, BUT the sample error will increase – this is exactly what we observed experimentally!

    Bishop [Neural Networks for Pattern Recognition (1995)]

    “A model which is too simple, or too inflexible, will have a large bias, while one which has too much flexibility in relation to the particular data set will have a large variance. Bias and variance are complementary quantities, and the best generalization is obtained when we have the best compromise between the conflicting requirements of small bias and small variance.”

    Bias-Variance Problem

    What are the precise relations between the number of samples m and the “capacity” of our hypothesis space H?

  • Covering Numbers

    Definition

    Let S be a metric space and s > 0. Define the covering number N(S, s) to be the minimal l ∈ N such that there exist l disks in S with radius s covering S.

    The scaling of N(S, s) as s → 0 is a measure of the complexity of S; the quantity ln N(S, s) is termed the metric entropy.
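    A simple worked example, added here for concreteness (not on the slide): the interval [0, T] ⊂ R can be covered by ⌈T/(2s)⌉ intervals of radius s, so N([0, T], s) ≤ ⌈T/(2s)⌉ and ln N grows like ln(1/s) as s → 0; for a bounded ball in R^d one instead gets growth like d · ln(1/s), so the dimension enters through the metric entropy.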

  • Abstract Analysis of Sample Error

    Theorem

    Let H ⊂ C(X) be a hypothesis class. Assume that for all f ∈ H it holds that |f(x) − y| < M a.e. Then, for all ε > 0,

    P_{z∈Z^m} ( sup_{f∈H} |L_z(f)| ≤ ε ) ≥ 1 − N(H, ε/(8M)) · 2 e^{−mε² / (4(2σ² + (1/3)M²ε))},

    where σ² := sup_{f∈H} σ²(f_Y²).

    Proof is left as an exercise!

  • Abstract Analysis of Sample Error

    Lemma

    Let ε > 0 and 0 < δ < 1 be such that

    P_{z∈Z^m} ( sup_{f∈H} |L_z(f)| ≤ ε ) ≥ 1 − δ.

    Then

    P_{z∈Z^m} ( E_H(f_{H,z}) ≤ 2ε ) ≥ 1 − δ.

    Theorem

    Let H be a hypothesis class. Assume that for all f ∈ H it holds that |f(x) − y| < M a.e. Then, for all ε > 0,

    P_{z∈Z^m} ( E_H(f_{H,z}) ≤ ε ) ≥ 1 − N(H, ε/(16M)) · 2 e^{−mε² / (8(2σ² + (1/3)M²ε))},

    where σ² := sup_{f∈H} σ²(f_Y²).

  • Abstract Analysis of Sample Error

    Question

    Given ε, δ > 0, how many samples m do we need such that the probability that the sample error is ≤ ε is at least 1 − δ?

    Answer

    By the previous theorem it suffices to choose

    m ≥ (8(4σ² + (1/3)M²ε) / ε²) · ( ln(2 N(H, ε/(16M))) + ln(1/δ) ).

    Question

    How to bound the covering number?

  • Linear Regression

    Recall

    H_R = span{ϕ_1, . . . , ϕ_l} ∩ {f ∈ C(X) : ‖f‖ ≤ R} ⊂ C(X).

    Theorem

    Let T := ‖∑_{j=1}^l |ϕ_j|‖. Then

    ln(N(H_R, η)) ≤ l · ln(4RT/η).

    In the motivational section on linear regression we have seen that f_{H,z} can be found by solving an l-dimensional linear system.
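    Combining this bound with the sample-size estimate from the previous slide gives something we can evaluate numerically. A small illustrative MATLAB sketch (all concrete numbers below are made up, not from the lecture):

    % Sketch: how many samples suffice for sample error <= epsilon with prob. >= 1-delta
    % on H_R, using the bound ln N(H_R, eta) <= l*ln(4*R*T/eta).
    l = 10; R = 1; T = 1;                % hypothetical hypothesis-space parameters
    M = 2; sigma2 = 1;                   % hypothetical bounds: |f(x)-y| <= M, sup of variances
    epsilon = 0.1; delta = 0.01;         % target accuracy and confidence
    logN = l * log(4*R*T*16*M/epsilon);  % ln N(H_R, epsilon/(16*M))
    m = 8*(4*sigma2 + M^2*epsilon/3)/epsilon^2 * (log(2) + logN + log(1/delta));
    fprintf('m >= %d samples suffice\n', ceil(m));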

  • 1.2.3 Reproducing Kernel Hilbert Spaces (RKHS)

  • Reproducing Kernel Hilbert Spaces (RKHS)

    Motivation

    Suppose we have a ‘similarity measure’ K : X × X → R on X and we would like to do things like nearest neighbour search, PCA etc. with respect to this measure of similarity. One idea is to associate each x ∈ X with a feature map Φ_x which is an element of a high-dimensional inner-product space, and which ‘linearizes’ the similarity measure in the sense that

    K(x, x′) = 〈Φx,Φx′〉.

    Question

    Which conditions on K guarantee the existence of a feature map?

  • Mercer Theorem

    Definition

    K : X × X → R is symmetric if K(x, x′) = K(x′, x) for all x, x′ ∈ X. For x = {x_1, . . . , x_k} ⊂ X we call K[x] ∈ R^{k×k} with entries K(x_i, x_j) the Gramian of K at x. K is called positive semidefinite if its Gramian is always positive semidefinite. K is called a Mercer kernel if it is symmetric, positive semidefinite and continuous.

    Theorem

    There exists a unique Hilbert space H_K of functions on X satisfying

    1. the functions K_x : x′ ↦ K(x, x′) are in H_K,
    2. the span of the K_x's is dense in H_K, and
    3. for all f ∈ H_K we have f(x) = 〈f, K_x〉.

    In particular, K(x, x′) = 〈K_x, K_{x′}〉, and in this sense the RKHS H_K can be regarded as a feature space.

  • Examples I

    Dot-Product Kernels

    Let X be the ball of radius T in R^n and K(x, x′) = ∑_{d=1}^∞ a_d (x · x′)^d, where a_d ≥ 0 for all d and ∑_d a_d T^{2d} < ∞. Then K is a Mercer kernel, called a dot-product kernel.

  • Examples II

    Translation-Invariant Kernels

    Suppose that k : R^n → R is such that its Fourier transform is real-valued and non-negative. Then K(x, x′) := k(x − x′) is a Mercer kernel, called a translation-invariant kernel.

    Example

    Let k = χ_{[−1,1]} ∗ χ_{[−1,1]} be the cardinal B-spline of degree one. Then

    K(x, x′) = 1 − |x − x′|/2  if |x − x′| ≤ 2,  and K(x, x′) = 0 else.

  • Examples III

    Radial Basis Functions (RBF)

    Suppose that f : R_+ → R is completely monotonic (i.e. (−1)^k f^(k) ≥ 0 for all k). Then K(x, x′) := f(|x − x′|²) is a Mercer kernel, called an RBF kernel.

    Example

    A Gaussian f(t) := e^{−t/c²} and an inverse multiquadric f(t) := (c² + |t|)^{−α}, α > 0, are completely monotonic and define corresponding RBF kernels.
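    Kernels like these can be passed as inline (anonymous) functions to the Gram-matrix routine shown later in this section; a small sketch of my own (the variable names and the bandwidth value are illustrative):

    % Hypothetical anonymous-function kernels, usable with GramAssemble below.
    c = 1;                                            % bandwidth parameter
    gauss   = @(x, xp) exp(-sum((x - xp).^2, 1)/c^2); % Gaussian RBF kernel
    bspline = @(x, xp) max(1 - abs(x - xp)/2, 0);     % degree-one B-spline kernel (1d data)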

  • Covering Numbers

    Theorem

    For R > 0 denote by B_R the ball of radius R in an RKHS H_K. Then B_R is a compact subset of C(X) and thus a valid hypothesis space.

    Theorem

    If K ∈ C^s then

    ln(N(B_R, η)) ≤ C · diam(X)^n · ‖K‖_{C^s}^{n/s} · (R/η)^{2n/s}.

    For specific kernels (such as Gaussian RBF kernels), much better results exist.

  • Computational Issues

    Question

    How can we determine fH,z?

    Let H_{K,z} := span{K_{x_1}, . . . , K_{x_m}} and P : H_K → H_{K,z} the orthogonal projection. Then, since f(x_i) = 〈f, K_{x_i}〉 = 〈P(f), K_{x_i}〉 = P(f)(x_i), we have E_z(f) = E_z(P(f))!!!

    Theorem

    We have f_{H,z} = ∑_{i=1}^m c*_i K_{x_i}, where (c*_i) is a minimizer of

    (1/m) ∑_{j=1}^m ( ∑_{i=1}^m c_i K(x_i, x_j) − y_j )²   s.t.   c^T K[x] c ≤ R².

    This is a convex quadratic program that can be efficiently solved by interior point methods! Check out http://cvxr.com/cvx/!

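    A minimal CVX sketch of this program (my own illustration, assuming the CVX package from the link above is installed; K denotes the Gram matrix K[x], y the label vector, R the chosen radius):

    % Sketch: solve min (1/m)*||K*c - y||^2  s.t.  c'*K*c <= R^2  with CVX.
    y = y(:);                        % ensure a column vector
    m = length(y);
    cvx_begin quiet
        variable c(m)
        minimize( sum_square(K*c - y)/m )
        subject to
            quad_form(c, K) <= R^2   % requires K to be (numerically) positive semidefinite
    cvx_end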

  • 1.2.4 Regularization

  • Lagrange Multipliers

    Theorem

    Let U be a Hilbert space and F, H : U → R smooth. Let c ∈ U be a solution of the problem

    min F(f)   s.t.   H(f) ≤ 0.

    Then there exist µ, λ ∈ R with

    µ DF(c) + λ DH(c) = 0.

    If H(c) < 0 then λ = 0 and µ ≠ 0 (we would have an unconstrained stationary point). If DH(c) ≠ 0 then we can take µ = 1 and λ ≥ 0.

  • From Constrained to Regularized Least Squares

    Let z be such that K[x] is invertible and denote by f_{B_R,z}, as above, the minimizer of F = E_z s.t. H(f) ≤ 0, where H(f) := ‖f‖²_{H_K} − R².

    Let f* be the unconstrained minimizer of E_z in H_K such that R_0 := ‖f*‖_{H_K} is minimal. Since K[x] is invertible, one can show that for all R < R_0 we have DF(f_{B_R,z}) ≠ 0 and DH(f_{B_R,z}) ≠ 0, and therefore, by the result on Lagrange multipliers, there exists λ(R) > 0 with

    DF(f_{B_R,z}) + λ(R) DH(f_{B_R,z}) = 0.

    Theorem

    The function f_{B_R,z} is also a solution of the unconstrained regularized problem of minimizing

    E_z(f) + λ(R) · ‖f‖²_{H_K}.

  • From Regularized to Constrained Least Squares

    Let z be such that K[x] is invertible and denote by f_{λ,z}, as above, the minimizer of the convex unconstrained problem of minimizing E_z(f) + λ · ‖f‖²_{H_K}. Let R(λ) := ‖f_{λ,z}‖_{H_K}.

    Theorem

    The function f_{λ,z} is also a solution of the constrained problem of minimizing

    E_z(f)   s.t.   ‖f‖²_{H_K} ≤ R(λ)².

    Regularized and Constrained Least Squares are equivalent!

  • Numerical Solution of Regularized Least Squares

    Theorem

    We have f_{λ,z} = ∑_{i=1}^m c*_i K_{x_i}, where c* = (c*_i) is a solution of the regular linear system

    (λ m I + K[x]) c = y.

    Extremely easy to implement!

  • Regularized Least Squares

    Figure: Example of kernel regression with a Gaussian kernel. In clockwise order: correct choice of λ, overfitting (λ too small), underfitting (λ too large).

  • It’s really simple

    function G = GramAssemble(x, kernel)
    % Assemble the Gram matrix associated with data points x.
    % The parameters are a data vector x and an inline function kernel.
    % Output is the Gram matrix.
    n = size(x, 2);
    G = zeros(n);
    for i = 1:n
        for j = 1:n
            G(i, j) = kernel(x(:, i), x(:, j));
        end
    end

  • It’s really simple

    function v = KernelInterpol(x, y, t, kernel, gamma)
    % Solve the regularized least squares problem.
    % Inputs are sample data x, y, an inline function kernel which describes
    % the kernel, and a regularization parameter gamma.
    % Output is a vector v given by the estimated regression function,
    % evaluated on a vector t of test points.
    n = length(x);
    % solve regularized least squares
    G = GramAssemble(x, kernel);
    A = G + n*gamma*eye(n);
    c = pinv(A)*y';
    % create function values of the empirical target function at the grid t
    v = zeros(size(t, 2), 1);
    for i = 1:n
        % v_l = v_l + c(i)*K_{x_i}(t_l); reshape ensures a column vector
        v = v + c(i)*reshape(kernel(repmat(squeeze(x(:, i)), 1, size(t, 2)), t), [], 1);
    end

  • 1.2.5 A Bayesian Interpretation

  • Bayes’ Theorem

    Suppose that the probability of measuring f ∈ H_K is equal to P(f) = c · exp(−‖f‖²_{H_K}). Suppose that we observe a function f, corrupted with Gaussian noise. Then the probability of making the observations z = ((x_1, y_1), . . . , (x_m, y_m)) from the signal f is equal to

    P(z|f) = d · exp( −(1/(2σ²)) ∑_{i=1}^m (f(x_i) − y_i)² ).

    Bayes’ Theorem yields that

    P(f|z) = P(z|f) · P(f) / P(z) ∼ exp( −(1/(2σ²)) ∑_{i=1}^m (f(x_i) − y_i)² − ‖f‖²_{H_K} ).

  • Maximum A Posteriori (MAP) Estimate

    MAP Estimate

    The MAP estimate maximizes the a posteriori probability P(f|z), given an a priori distribution on f and on the noise.

    For the a priori distribution P(f) = c · exp(−‖f‖²_{H_K}) and Gaussian noise, the solution of the regularized least squares problem is also the MAP estimate!
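    To see the connection explicitly (a one-line calculation filled in here, not spelled out on the slide): taking negative logarithms,

    −ln P(f|z) = (1/(2σ²)) ∑_{i=1}^m (f(x_i) − y_i)² + ‖f‖²_{H_K} + const,

    so maximizing P(f|z) amounts to minimizing E_z(f) + λ ‖f‖²_{H_K} with λ = 2σ²/m.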

  • 1.3 Classification

  • 1.3.1 Regularized Classification

  • The Classification Problem

    We now aim at classifying data into two classes and thus look for f : X → {−1, 1}. Therefore, let's put Y = {−1, 1} and Z := X × Y.

    Misclassification Error

    Given a probability measure ρ on Z and f : X → Y, define the misclassification error as

    R(f) := P_{z∈Z}(f(x) ≠ y) = ∫_X P(f(x) ≠ y | x) dρ_X.

    The classification problem asks to minimize the misclassification error.

  • Bayes Rule

    Bayes Rule

    Define the Bayes rule as

    fc := sgn(fρ).

    Theorem

    The Bayes rule minimizes the misclassification error.

    We can re-use everything!

  • Bayes Rule

    Figure: Bayes rule for Gaussian kernel regression. Left: sample data. Right: estimate using the Bayes rule.

  • Case Study: Breast Cancer Detection

  • Classification Results based on Kernel Regression

    Size of dataset m = 683 and dimensionality of feature space d = 10.

    We obtain 95 percent classification accuracy from only 68 training samples and a linear kernel (see the MATLAB codes on the webpage)!

  • Visualizing Data

    Data is very well-clustered...

    But how did we obtain this visualization of our 10-dimensional dataset?

  • 1.3.2 (Kernel) Principal Component Analysis (PCA)

  • Dimensionality Reduction

    Dimensionality Reduction Problem

    Given a dataset x = (x_i)_{i=1}^m ⊂ R^d with d large.

    Goal: Construct a map Φ : R^d → R^s with s ≪ d.

  • What is a good projection?

    Pick subspace which maximizes variance of the projected dataset.

  • PCA

    PCA Problem

    Look for an s-dimensional affine subspace with associated orthogonal projection Φ such that the variance ∑_{i=1}^m |Φ(x_i − (1/m) ∑_{j=1}^m x_j)|² is maximized.

    PCA Solution

    Suppose w.l.o.g. (why?) that the data is centered, i.e., (1/m) ∑_{j=1}^m x_j = 0.

    Define the empirical covariance matrix G = (1/m) ∑_{i=1}^m x_i x_i^T ∈ R^{d×d}.

    Then the solution is given by the subspace spanned by the first s normalized eigenvectors u_1, . . . , u_s of G, and Φ(x) = ∑_{l=1}^s (x · u_l) u_l.

  • It’s really simple

    function z = pca(X)
    % Project data X (one row per sample) onto its principal components.
    C = cov(X);                    % empirical covariance matrix
    [U, D, pc] = svd(C);           % columns of pc are the principal directions
    z = (X - mean(X, 1))*pc;       % coordinates of the centered data in the PCA basis
    scatter(z(:, 1), z(:, 2));     % plot 2d projection

  • When PCA fails...

  • When PCA fails...

  • Kernel PCA

    Construct a nonlinear Φ by applying linear PCA on the RKHS H_K, mapping data points x_i to their feature vectors K_{x_i}!

    This is a powerful idea, called kernelization!

    Kernel PCA

    Define the matrix G = K[x] − 1_m K[x] − K[x] 1_m + 1_m K[x] 1_m ∈ R^{m×m}, where (1_m)_{i,j} = 1/m for i, j ∈ {1, . . . , m}, and denote by u_1, . . . , u_s the first s normalized (w.r.t. the inner product u^T K[x] u) eigenvectors of G. Then the projection Φ is defined as

    Φ(x) = ( ∑_{i=1}^m (u_1)_i K(x_i, x), . . . , ∑_{i=1}^m (u_s)_i K(x_i, x) )^T.
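    A minimal MATLAB sketch of this construction in the style of the earlier snippets (my own illustration; it reuses GramAssemble from above, and x, kernel, s, xnew are assumed to be given):

    % Sketch: kernel PCA projection of a new point xnew onto the first s components.
    m  = size(x, 2);
    K  = GramAssemble(x, kernel);
    Om = ones(m)/m;                          % the matrix 1_m
    G  = K - Om*K - K*Om + Om*K*Om;          % centered Gram matrix
    [U, D] = eig((G + G')/2);                % symmetrize against round-off
    [~, idx] = sort(diag(D), 'descend');
    U = U(:, idx(1:s));                      % first s eigenvectors
    for l = 1:s
        U(:, l) = U(:, l)/sqrt(U(:, l)'*K*U(:, l));   % normalize w.r.t. u'*K[x]*u
    end
    kx = zeros(m, 1);
    for i = 1:m
        kx(i) = kernel(x(:, i), xnew);       % K(x_i, xnew)
    end
    phi = U'*kx;                             % the s-dimensional projection Phi(xnew)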

  • Example

  • Kernel PCA Denoising

    Figure: Top: linear PCA reconstruction from n principal components. Bottom: Gaussian kernel reconstruction from n principal components (find z with ‖K_z − Φ(x)‖_{H_K} minimal).

  • Literature: Sebastian Mika, Bernhard Schölkopf, Alex Smola, Klaus-Robert Müller, Matthias Scholz, Gunnar Rätsch. Kernel PCA and De-Noising in Feature Space. NIPS (1999).

    Other methods: Multidimensional Scaling, Isomap, Diffusion Maps, ...

    Try to appreciate the power of kernelization!

    Go to https://archive.ics.uci.edu/ml/datasets.html for further datasets and play around with them!


  • 1.3.3 (Kernel) Support Vector Machine (SVM)

  • Basic Idea

    Suppose that the data points (x_i)_{i=1}^m ⊂ R^n to be classified are linearly separable, i.e. there exists a separating hyperplane defined by w ∈ R^n, |w| = 1, and b ∈ R such that

    y_i = 1 ⇔ w · x_i > b.

    Define the margin of a separating hyperplane given by w, b as above by

    ∆(w, b) := min_{i=1,...,m} |w · x_i − b|.

    Try to find separating hyperplane with maximal margin!

  • The SVM

    The SVM problem can be formalized by the following minimization problem:

    argmin_{w,b} |w|   s.t.   y_i(w · x_i − b) ≥ 1 for all i ∈ {1, . . . , m}.

    Data is in general not linearly separable...

    Soft Margin SVM

    Relax to

    argmin_{w,b} (1/m) ∑_{i=1}^m Φ_hl(y_i(w · x_i − b)) + λ|w|²,

    where Φ_hl(t) := max(0, 1 − t) is the hinge loss.

    This is a convex quadratic program that can be efficiently solved!
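    A minimal CVX sketch of this soft-margin program (my own illustration, assuming CVX is installed; X is n × m with one data point per column, y is a ±1 row vector, and lambda is a chosen regularization parameter):

    % Sketch: soft-margin SVM via CVX; pos(t) = max(t, 0) is the hinge.
    [n, m] = size(X);
    cvx_begin quiet
        variables w(n) b
        minimize( sum(pos(1 - y'.*(X'*w - b)))/m + lambda*sum_square(w) )
    cvx_end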

  • K-SVM

    K-SVM

    Kernelization yields the problem

    argmin_{f∈H_K, b} (1/m) ∑_{i=1}^m Φ_hl(y_i(f(x_i) − b)) + λ‖f‖²_{H_K}.

    Separable Measures

    This provably works for separable measures ρ, in the sense that there is f_s ∈ H_K with y f_s(x) > 0 almost surely. It means that the data is separated by the zero level set of f_s. Clearly, the Bayes classifier is then equal to sgn(f_s).

    For most data points the hinge loss will be zero, which implies that f will be sparse in {K_{x_i} : i = 1, . . . , m}, resulting in potentially big computational savings!

  • Literature: Felipe Cucker and Ding Xuan Zhou. Learning Theory: An Approximation Theory Viewpoint. Chapter 9.

    The main advantage compared to the classification method in Section 1.2 is that the solution will be sparse.

    Experiment and Compare!