Deep Learning
Philipp Grohs
March 4th 2019
Organisational Stuff...
3h lecture (MSc Level)
Monday 11:30 - 12:15 (seminar room 11) & Wednesday 11:30 - 13:00 (seminar room 7)
Oral exam at the end of the semester
Exercise classes during the semester
Prerequisites: Linear algebra, Analysis, Probability
Desirable: Functional Analysis, Experience with Python (!!!!!) or MATLAB
I will create a website at http://mat.univie.ac.at/~grohs/DeepLearningCourse2019.html.
Face recognition
Figure: face images labeled D. Trump, B. Sanders, B. Johnson, A. Merkel
Describing the content of an image
AI generates sentences describing the content of an image [Vinyals et al., 2015]
Go!
AI beats Go champion Lee Sedol [Silver et al., 2016]
Atari games
AI beats professional human Atari players [Mnih et al., 2015]
Rating Attractiveness
Swiss Dating App Blinq (developed in cooperation with ETHZ)
Goal of This Course
How does this work???
Syllabus
1. Statistical Learning Theory
2. Classical Models
3. Neural Networks
4. Expressive Power of Neural Networks
5. Breaking the Curse of Dimensionality with Neural Networks
Caution
This is not a pure Mathematics course!
Some Literature
I. Goodfellow, Y. Bengio and A. Courville. Deep Learning (2016). Available from http://www.deeplearningbook.org.
L. Devroye, L. Györfi, G. Lugosi. A Probabilistic Theory of Pattern Recognition (2013). Springer.
F. Cucker and S. Smale. On the Mathematical Foundations of Learning. Bulletin of the AMS 39/1, pp. 1–49 (2001).
F. Chollet. Deep Learning with Python (2017). Manning.
P. Grohs, D. Perekrestenko, D. Elbrächter, H. Bölcskei. Deep Neural Network Approximation Theory. Available from https://arxiv.org/abs/1901.02220.
J. Berner, P. Grohs, A. Jentzen. Analysis of the generalization error: Empirical risk minimization over deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. Available from https://arxiv.org/abs/1809.03062.
0. Prelude
1.1 Basic Concepts
Definition of Learning
Definition [Mitchell (1997)]
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
...recall that this is not pure mathematics...
The Task T
Classification
Compute $f : \mathbb{R}^n \to \{1, \dots, k\}$ which maps data $x \in \mathbb{R}^n$ to a category in $\{1, \dots, k\}$. Alternative: compute $f : \mathbb{R}^n \to \mathbb{R}^k$ which maps data $x \in \mathbb{R}^n$ to a histogram with respect to the $k$ categories.
$x =$ [image] $\mapsto f(x) = 5$.
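A minimal sketch (NumPy, with a made-up score vector) of how the histogram variant relates to the categorical one: the predicted category is the argmax of the $k$ scores.

```python
import numpy as np

# Hypothetical output of the histogram-valued classifier f: R^n -> R^k
# for a single input x: one score per category, here k = 6.
scores = np.array([0.05, 0.10, 0.02, 0.03, 0.70, 0.10])

# The category-valued classifier is recovered by picking the most likely
# category (index shifted by 1 because categories are 1, ..., k).
predicted_category = int(np.argmax(scores)) + 1
print(predicted_category)  # -> 5
```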
The Task T
Regression
Predict a numerical value $f : \mathbb{R}^n \to \mathbb{R}$.
Expected claim of insured person
Algorithmic trading
The Task T
Density Estimation
Estimate a probability density $p : \mathbb{R} \to \mathbb{R}_+$ which can be interpreted as a probability distribution on the space that the examples were drawn from.
Useful for many tasks in data processing; for example, if we observe corrupted data $\tilde{x}$ we may estimate the original $x$ as the argmax of $p(x \mid \tilde{x})$.
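A minimal sketch of one-dimensional density estimation; the use of SciPy's Gaussian kernel density estimator and the simulated samples are illustrative choices, not part of the slides.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Simulated samples standing in for the examples the density is drawn from.
samples = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(1.0, 1.0, 500)])

# Kernel density estimate p_hat : R -> R_+ fitted to the samples.
p_hat = gaussian_kde(samples)

# Evaluate the estimated density on a grid and locate its mode.
grid = np.linspace(-5.0, 5.0, 201)
density = p_hat(grid)
print(grid[np.argmax(density)])
```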
The Experience E
The experience typically consists of a dataset made up of many examples (a.k.a. data points).
If these data points are labeled (for example, in the classification problem, if we know the class of each given data point), we speak of supervised learning.
If these data points are not labeled (for example, in the classification problem, the algorithm would have to find the clusters itself from the given dataset), we speak of unsupervised learning.
The Performance Measure P
In classification problems this is typically the accuracy, i.e., the proportion of examples for which the model produces the correct output.
Often the given dataset is split into a training set on which the algorithm operates and a test set on which its performance is measured.
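A minimal sketch (NumPy, with hypothetical data and placeholder predictions) of the accuracy measure and a simple train/test split.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 1000 examples with labels in {1, 2, 3}.
X = rng.normal(size=(1000, 5))
y = rng.integers(1, 4, size=1000)

# Split into a training set (for the algorithm) and a test set (for evaluation).
perm = rng.permutation(len(X))
train_idx, test_idx = perm[:800], perm[800:]
X_train, y_train = X[train_idx], y[train_idx]   # the algorithm operates on these
X_test, y_test = X[test_idx], y[test_idx]       # performance is measured on these

# Accuracy: proportion of test examples for which the output is correct.
y_pred = rng.integers(1, 4, size=len(y_test))   # placeholder predictions
accuracy = np.mean(y_pred == y_test)
print(accuracy)
```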
An Example: Linear Regression
The Task
Regression: Predict $f : \mathbb{R}^n \to \mathbb{R}$.
The Experience
Training data $((x_i^{\mathrm{train}}, y_i^{\mathrm{train}}))_{i=1}^{m}$.
The Performance Measure
Given test data $((x_i^{\mathrm{test}}, y_i^{\mathrm{test}}))_{i=1}^{n}$ we evaluate the performance of an estimator $f : \mathbb{R}^n \to \mathbb{R}$ as the mean squared error
$$\frac{1}{n} \sum_{i=1}^{n} \bigl| f(x_i^{\mathrm{test}}) - y_i^{\mathrm{test}} \bigr|^2.$$
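A minimal sketch of this performance measure in NumPy; the test data and the estimator $f$ below are placeholders.

```python
import numpy as np

def mse(f, x_test, y_test):
    """Mean squared error (1/n) * sum_i |f(x_i^test) - y_i^test|^2."""
    predictions = np.array([f(x) for x in x_test])
    return np.mean((predictions - y_test) ** 2)

# Placeholder test data and estimator f(x) = sum of the coordinates of x.
rng = np.random.default_rng(0)
x_test = rng.normal(size=(50, 3))
y_test = x_test.sum(axis=1) + 0.1 * rng.normal(size=50)
print(mse(lambda x: x.sum(), x_test, y_test))
```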
An Example: Linear Regression
The Computer Program
Define a Hypothesis Space
$$\mathcal{H} = \operatorname{span}\{\varphi_1, \dots, \varphi_l\} \subset C(\mathbb{R}^n)$$
and, given training data
$$z = (x_i, y_i)_{i=1}^{m},$$
define the empirical risk
$$E_z(f) := \frac{1}{m} \sum_{i=1}^{m} \bigl(f(x_i) - y_i\bigr)^2.$$
We let our algorithm find the minimizer (a.k.a. empirical target function)
$$f_{\mathcal{H},z} := \operatorname{argmin}_{f \in \mathcal{H}} E_z(f).$$
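A minimal sketch of empirical risk minimization; the two-element hypothesis space spanned by $1$ and $x$ and the use of plain gradient descent are illustrative choices, not prescribed by the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data z = (x_i, y_i)_{i=1}^m and the hypothesis space
# H = span{phi_1, phi_2} with phi_1(x) = 1, phi_2(x) = x (an illustration).
x = rng.uniform(-1.0, 1.0, 40)
y = 0.5 + 2.0 * x + 0.1 * rng.normal(size=x.size)
Phi = np.stack([np.ones_like(x), x], axis=1)  # m x 2 matrix of basis evaluations

def empirical_risk(w):
    """E_z(f) for f = w[0] * phi_1 + w[1] * phi_2."""
    return np.mean((Phi @ w - y) ** 2)

# Plain gradient descent on the empirical risk.
w = np.zeros(2)
for _ in range(2000):
    grad = (2.0 / len(y)) * Phi.T @ (Phi @ w - y)
    w -= 0.1 * grad

print(w, empirical_risk(w))  # w should be close to (0.5, 2.0)
```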
An Example: Linear Regression
Computing the Empirical Target Function
Let $A = (\varphi_j(x_i))_{i,j} \in \mathbb{R}^{m \times l}$.
Every $f \in \mathcal{H}$ can be written as $\sum_{i=1}^{l} w_i \varphi_i$, and we denote $w := (w_i)_{i=1}^{l}$.
We let $y := (y_i)_{i=1}^{m}$.
We get that $E_z(f) = \frac{1}{m} \| Aw - y \|^2$, so minimizing $E_z$ over $\mathcal{H}$ is a least-squares problem in $w$.
A minimizer is given by $w^* := A^{\dagger} y$ (with $A^{\dagger}$ the Moore–Penrose pseudoinverse), and we get our estimate
$$f^* := \sum_{i=1}^{l} (w^*)_i \varphi_i.$$
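A minimal sketch of this computation in NumPy, using monomial basis functions $\varphi_j(x) = x^{j-1}$ on $\mathbb{R}$ as an illustrative choice of hypothesis space; np.linalg.pinv computes the pseudoinverse $A^{\dagger}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data z = (x_i, y_i)_{i=1}^m with scalar inputs.
m, l = 30, 4
x = rng.uniform(-1.0, 1.0, m)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=m)

# Design matrix A = (phi_j(x_i))_{i,j} for the monomial basis phi_j(x) = x^(j-1).
A = np.vander(x, N=l, increasing=True)  # shape (m, l)

# Minimizer of the empirical risk: w* = pseudoinverse(A) @ y.
w_star = np.linalg.pinv(A) @ y

# Empirical target function f* = sum_j (w*)_j phi_j.
def f_star(t):
    return np.vander(np.atleast_1d(t), N=l, increasing=True) @ w_star

print(w_star)
print(f_star(0.5))
```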
Degree too low: underfitting. Degree too high: overfitting!
Figure: Error with Polynomial Degree
Bias-Variance Problem
“Capacity” of the hypothesis space has to be adapted to the complexity of the target function and the sample size!
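A minimal sketch (NumPy polynomial fitting on synthetic data, an illustration rather than the figure's actual setup) of how training and test error behave as the polynomial degree grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem: a smooth target plus noise.
def target(t):
    return np.sin(2.0 * np.pi * t)

x_train = rng.uniform(0.0, 1.0, 15)
y_train = target(x_train) + 0.2 * rng.normal(size=x_train.size)
x_test = rng.uniform(0.0, 1.0, 200)
y_test = target(x_test) + 0.2 * rng.normal(size=x_test.size)

for degree in (1, 3, 9, 13):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")

# Low degree: both errors stay large (underfitting).
# High degree: the train error is small while the test error grows (overfitting).
```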