Deep Learning
Philipp Grohs
March 4th 2019
Organisational Stuff...
3h lecture (MSc Level)
Monday 11:30 - 12:15 (seminar room 11) & Wednesday 11:30 - 13:00 (seminar room 7)
Oral exam at the end of the semester
Exercise classes during the semester
Prerequisites: Linear algebra, Analysis, Probability
Desirable: Functional Analysis, Experience with Python (!!!!!) or MATLAB
I will create a website at http://mat.univie.ac.at/~grohs/DeepLearningCourse2019.html.
Face recognition
Figure: face images labeled D. Trump, B. Sanders, B. Johnson, A. Merkel
Describing the content of an image
AI generates sentences describing the content of an image [Vinyals et al., 2015]
Go!
AI beats Go champion Lee Sedol [Silver et al., 2016]
Atari games
AI beats professional human Atari players [Mnih et al., 2015]
Rating Attractiveness
Swiss Dating App Blinq (developed in cooperation with ETHZ)
Goal of This Course
How does this work???
Syllabus
1. Statistical Learning Theory
2. Classical Models
3. Neural Networks
4. Expressive Power of Neural Networks
5. Breaking the Curse of Dimensionality with Neural Networks
Caution
This is not a pure Mathematics course!
Some Literature
I. Goodfellow, Y. Bengio and A. Courville. Deep Learning (2016). Available from http://www.deeplearningbook.org.
L. Devroye, L. Györfi, G. Lugosi. A Probabilistic Theory of Pattern Recognition (2013). Springer.
F. Cucker and S. Smale. On the Mathematical Foundations of Learning. Bulletin of the AMS 39/1, pp. 1–49 (2001).
F. Chollet. Deep Learning with Python (2017). Manning.
P. Grohs, D. Perekrestenko, D. Elbrächter, H. Bölcskei. Deep Neural Network Approximation Theory. Available from https://arxiv.org/abs/1901.02220.
J. Berner, P. Grohs, A. Jentzen. Analysis of the generalization error: Empirical risk minimization over deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. Available from https://arxiv.org/abs/1809.03062.
0. Prelude
1.1 Basic Concepts
Definition of Learning
Definition [Mitchell (1997)]
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
...recall that this is not pure mathematics...
The Task T
Classification
Compute $f : \mathbb{R}^n \to \{1, \dots, k\}$ which maps data $x \in \mathbb{R}^n$ to a category in $\{1, \dots, k\}$. Alternative: compute $f : \mathbb{R}^n \to \mathbb{R}^k$ which maps data $x \in \mathbb{R}^n$ to a histogram with respect to the $k$ categories.
$x =$ [image] $\mapsto f(x) = 5$.
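A minimal sketch (NumPy, with a made-up score vector) of how the histogram variant relates to the categorical one: the predicted category is the argmax of the $k$ scores.

```python
import numpy as np

# Hypothetical output of the histogram-valued classifier f: R^n -> R^k
# for a single input x: one score per category, here k = 6.
scores = np.array([0.05, 0.10, 0.02, 0.03, 0.70, 0.10])

# The category-valued classifier is recovered by picking the most likely
# category (index shifted by 1 because categories are 1, ..., k).
predicted_category = int(np.argmax(scores)) + 1
print(predicted_category)  # -> 5
```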
The Task T
Regression
Predict a numerical value $f : \mathbb{R}^n \to \mathbb{R}$.
Expected claim of insured person
Algorithmic trading
The Task T
Density Estimation
Estimate a probability density $p : \mathbb{R} \to \mathbb{R}_+$ which can be interpreted as a probability distribution on the space that the examples were drawn from.
Useful for many tasks in data processing; for example, if we observe corrupted data $\tilde{x}$ we may estimate the original $x$ as the argmax of $p(x \mid \tilde{x})$.
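A minimal sketch of one-dimensional density estimation; the use of SciPy's Gaussian kernel density estimator and the simulated samples are illustrative choices, not part of the slides.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Simulated samples standing in for the examples the density is drawn from.
samples = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(1.0, 1.0, 500)])

# Kernel density estimate p_hat : R -> R_+ fitted to the samples.
p_hat = gaussian_kde(samples)

# Evaluate the estimated density on a grid and locate its mode.
grid = np.linspace(-5.0, 5.0, 201)
density = p_hat(grid)
print(grid[np.argmax(density)])
```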
The Experience E
The experience typically consists of a dataset made up of many examples (a.k.a. data points).
If these data points are labeled (for example, in the classification problem, if we know the class of each given data point), we speak of supervised learning.
If these data points are not labeled (for example, in the classification problem, the algorithm would have to find the clusters itself from the given dataset), we speak of unsupervised learning.
The Performance Measure P
In classification problems this is typically the accuracy, i.e., the proportion of examples for which the model produces the correct output.
Often the given dataset is split into a training set on which the algorithm operates and a test set on which its performance is measured.
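A minimal sketch (NumPy, with hypothetical data and placeholder predictions) of the accuracy measure and a simple train/test split.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 1000 examples with labels in {1, 2, 3}.
X = rng.normal(size=(1000, 5))
y = rng.integers(1, 4, size=1000)

# Split into a training set (for the algorithm) and a test set (for evaluation).
perm = rng.permutation(len(X))
train_idx, test_idx = perm[:800], perm[800:]
X_train, y_train = X[train_idx], y[train_idx]   # the algorithm operates on these
X_test, y_test = X[test_idx], y[test_idx]       # performance is measured on these

# Accuracy: proportion of test examples for which the output is correct.
y_pred = rng.integers(1, 4, size=len(y_test))   # placeholder predictions
accuracy = np.mean(y_pred == y_test)
print(accuracy)
```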
An Example: Linear Regression
The Task
Regression: Predict $f : \mathbb{R}^n \to \mathbb{R}$.
The Experience
Training data $((x_i^{\mathrm{train}}, y_i^{\mathrm{train}}))_{i=1}^{m}$.
The Performance Measure
Given test data $((x_i^{\mathrm{test}}, y_i^{\mathrm{test}}))_{i=1}^{n}$ we evaluate the performance of an estimator $f : \mathbb{R}^n \to \mathbb{R}$ as the mean squared error
$$\frac{1}{n} \sum_{i=1}^{n} \bigl| f(x_i^{\mathrm{test}}) - y_i^{\mathrm{test}} \bigr|^2.$$
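A minimal sketch of this performance measure in NumPy; the test data and the estimator $f$ below are placeholders.

```python
import numpy as np

def mse(f, x_test, y_test):
    """Mean squared error (1/n) * sum_i |f(x_i^test) - y_i^test|^2."""
    predictions = np.array([f(x) for x in x_test])
    return np.mean((predictions - y_test) ** 2)

# Placeholder test data and estimator f(x) = sum of the coordinates of x.
rng = np.random.default_rng(0)
x_test = rng.normal(size=(50, 3))
y_test = x_test.sum(axis=1) + 0.1 * rng.normal(size=50)
print(mse(lambda x: x.sum(), x_test, y_test))
```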
An Example: Linear Regression
The Computer Program
Define a Hypothesis Space
$$\mathcal{H} = \operatorname{span}\{\varphi_1, \dots, \varphi_l\} \subset C(\mathbb{R}^n)$$
and, given training data
$$z = (x_i, y_i)_{i=1}^{m},$$
define the empirical risk
$$E_z(f) := \frac{1}{m} \sum_{i=1}^{m} \bigl(f(x_i) - y_i\bigr)^2.$$
We let our algorithm find the minimizer (a.k.a. empirical target function)
$$f_{\mathcal{H},z} := \operatorname{argmin}_{f \in \mathcal{H}} E_z(f).$$
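A minimal sketch of empirical risk minimization; the two-element hypothesis space spanned by $1$ and $x$ and the use of plain gradient descent are illustrative choices, not prescribed by the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data z = (x_i, y_i)_{i=1}^m and the hypothesis space
# H = span{phi_1, phi_2} with phi_1(x) = 1, phi_2(x) = x (an illustration).
x = rng.uniform(-1.0, 1.0, 40)
y = 0.5 + 2.0 * x + 0.1 * rng.normal(size=x.size)
Phi = np.stack([np.ones_like(x), x], axis=1)  # m x 2 matrix of basis evaluations

def empirical_risk(w):
    """E_z(f) for f = w[0] * phi_1 + w[1] * phi_2."""
    return np.mean((Phi @ w - y) ** 2)

# Plain gradient descent on the empirical risk.
w = np.zeros(2)
for _ in range(2000):
    grad = (2.0 / len(y)) * Phi.T @ (Phi @ w - y)
    w -= 0.1 * grad

print(w, empirical_risk(w))  # w should be close to (0.5, 2.0)
```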
An Example: Linear Regression
Computing the Empirical Target Function
Let $A = (\varphi_j(x_i))_{i,j} \in \mathbb{R}^{m \times l}$.
Every $f \in \mathcal{H}$ can be written as $\sum_{i=1}^{l} w_i \varphi_i$, and we denote $w := (w_i)_{i=1}^{l}$.
We let $y := (y_i)_{i=1}^{m}$.
We get that $E_z(f) = \frac{1}{m} \| Aw - y \|^2$, so minimizing $E_z$ over $\mathcal{H}$ is a least-squares problem in $w$.
A minimizer is given by $w^* := A^{\dagger} y$ (with $A^{\dagger}$ the Moore–Penrose pseudoinverse), and we get our estimate
$$f^* := \sum_{i=1}^{l} (w^*)_i \varphi_i.$$
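A minimal sketch of this computation in NumPy, using monomial basis functions $\varphi_j(x) = x^{j-1}$ on $\mathbb{R}$ as an illustrative choice of hypothesis space; np.linalg.pinv computes the pseudoinverse $A^{\dagger}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data z = (x_i, y_i)_{i=1}^m with scalar inputs.
m, l = 30, 4
x = rng.uniform(-1.0, 1.0, m)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=m)

# Design matrix A = (phi_j(x_i))_{i,j} for the monomial basis phi_j(x) = x^(j-1).
A = np.vander(x, N=l, increasing=True)  # shape (m, l)

# Minimizer of the empirical risk: w* = pseudoinverse(A) @ y.
w_star = np.linalg.pinv(A) @ y

# Empirical target function f* = sum_j (w*)_j phi_j.
def f_star(t):
    return np.vander(np.atleast_1d(t), N=l, increasing=True) @ w_star

print(w_star)
print(f_star(0.5))
```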
Degree too low: underfitting. Degree too high: overfitting!
Figure: Error with Polynomial Degree
Bias-Variance Problem
“Capacity” of the hypothesis space has to be adapted to the complexity of the target function and the sample size!
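A minimal sketch (NumPy polynomial fitting on synthetic data, an illustration rather than the figure's actual setup) of how training and test error behave as the polynomial degree grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem: a smooth target plus noise.
def target(t):
    return np.sin(2.0 * np.pi * t)

x_train = rng.uniform(0.0, 1.0, 15)
y_train = target(x_train) + 0.2 * rng.normal(size=x_train.size)
x_test = rng.uniform(0.0, 1.0, 200)
y_test = target(x_test) + 0.2 * rng.normal(size=x_test.size)

for degree in (1, 3, 9, 13):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")

# Low degree: both errors stay large (underfitting).
# High degree: the train error is small while the test error grows (overfitting).
```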