Source: pfister.ee.duke.edu/nasit16/Boelcskei.pdf (NASIT 2016; slides dated June 28, 2016)
The Mathematics of Deep Learning
Part 1: Continuous-time Theory
Helmut Bölcskei
Department of Information Technology and Electrical Engineering
June 2016
joint work with Thomas Wiatowski, Philipp Grohs, and Michael Tschannen
Face recognition
(Photos: C. E. Shannon, J. von Neumann, F. Hausdorff, N. Wiener)

Feature extraction through deep convolutional neural networks (DCNs)
Go!
DCNs beat Go champion Lee Sedol [Silver et al., 2016]
Atari games
DCNs beat professional human Atari players [Mnih et al., 2015]
Describing the content of an image
DCNs generate sentences describing the content of an image [Vinyals et al., 2015]
“Carlos Kleiber conducting the Vienna Philharmonic’s New Year’s Concert 1989.”
Feature extraction and learning task
DCNs can be used

i) as stand-alone feature extractors [Huang and LeCun, 2006]

ii) to perform feature extraction and the learning task directly [LeCun et al., 1990]

Pipeline: input f ∈ R^(n×n) (an image) → feature extraction → feature vector Φ(f) ∈ R^N → learning task → output (e.g., "C. E. Shannon")
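As a toy illustration of option i), a stand-alone feature extractor Φ feeding a separate learning task, the sketch below uses fixed random convolution filters, a ReLU, and average pooling, followed by a nearest-class-mean classifier. All filter sizes, the 8×8 input, and the function names are hypothetical, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(f, filters):
    """Toy stand-alone feature extractor Phi: one layer of fixed random
    convolution filters, a ReLU non-linearity, and global average pooling."""
    feats = []
    for g in filters:
        # circular convolution of the image f with filter g, via the FFT
        conv = np.real(np.fft.ifft2(np.fft.fft2(f) * np.fft.fft2(g, s=f.shape)))
        feats.append(np.maximum(conv, 0.0).mean())  # ReLU + average pool
    return np.array(feats)

def nearest_mean(v, class_means):
    """Separate learning task on top of Phi(f): nearest class mean."""
    return int(np.argmin([np.linalg.norm(v - m) for m in class_means]))

filters = [rng.standard_normal((3, 3)) for _ in range(4)]
f = rng.standard_normal((8, 8))         # stand-in for an n x n input image
v = phi(f, filters)                     # feature vector Phi(f) in R^4
label = nearest_mean(v, [v, v + 10.0])  # trivially assigns f to class 0
```

The split mirrors the pipeline above: Φ is fixed, and only the (here trivial) classifier would be trained.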
Why are DCNs so successful?
"It is the guiding principle of many applied mathematicians that if something mathematical works really well, there must be a good underlying mathematical reason for it, and we ought to be able to understand it." [I. Daubechies, 2015]
Translation invariance
Handwritten digits from the MNIST database [LeCun & Cortes, 1998]

Feature vector should be invariant to spatial location ⇒ translation invariance
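A minimal numerical check of what translation invariance means, using the Fourier modulus as a classic translation-invariant feature map (a circular shift only multiplies the Fourier transform by a unit-modulus phase). This is an illustrative stand-in, not the feature extractor analyzed in the talk.

```python
import numpy as np

def fourier_modulus(f):
    """Translation-invariant feature map: modulus of the 2-D Fourier
    transform (a shift changes only the phase, not the modulus)."""
    return np.abs(np.fft.fft2(f))

f = np.random.default_rng(1).standard_normal((16, 16))
shifted = np.roll(f, shift=(3, 5), axis=(0, 1))   # translated copy T_t f

# raw pixels change a lot under translation; the feature vector does not
raw_gap = np.linalg.norm(f - shifted)
feat_gap = np.max(np.abs(fourier_modulus(f) - fourier_modulus(shifted)))
```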
Deformation insensitivity
Handwritten digits from the MNIST database [LeCun & Cortes, 1998]

Different handwriting styles correspond to deformations of signals ⇒ deformation insensitivity
Mallat’s wavelet-modulus DCN
[Mallat, 2012] initiated the mathematical analysis of feature extraction through DCNs
Network tree (one branch per sequence of wavelet indices; every node is mapped to an output via the low-pass filter φ_J):

layer 0: f, with output f ∗ φ_J
layer 1: |f ∗ ψ_{λ^(k)}|, with output |f ∗ ψ_{λ^(k)}| ∗ φ_J
layer 2: ||f ∗ ψ_{λ^(k)}| ∗ ψ_{λ^(l)}|, with output ||f ∗ ψ_{λ^(k)}| ∗ ψ_{λ^(l)}| ∗ φ_J
layer 3: |||f ∗ ψ_{λ^(k)}| ∗ ψ_{λ^(l)}| ∗ ψ_{λ^(m)}|, ...
(a second branch, indexed by λ^(p), λ^(r), λ^(s), is analogous)
Mallat’s wavelet-modulus DCN
Features generated in the n-th network layer:

Φ_W^n(f) := { |···||f ∗ ψ_{λ^(1)}| ∗ ψ_{λ^(2)}| ··· ∗ ψ_{λ^(n)}| ∗ φ_J }_{λ^(1),...,λ^(n) ∈ Λ_W}
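The tree above can be mimicked in a few lines of one-dimensional code: iterate "convolve with a band-pass filter, take the modulus", and read every node out through the low-pass filter φ_J. The Gabor-type band-pass filters below are crude stand-ins for the directional wavelets ψ_λ; all sizes and parameters are arbitrary.

```python
import numpy as np

def gaussian_lowpass(n, sigma):
    """Discrete stand-in for the low-pass output filter phi_J."""
    x = np.arange(n) - n // 2
    g = np.exp(-x**2 / (2 * sigma**2))
    return g / g.sum()

def bandpass(n, freq, sigma):
    """Crude Gabor-type band-pass filter, a stand-in for a wavelet psi_lambda."""
    x = np.arange(n) - n // 2
    return np.exp(1j * freq * x) * np.exp(-x**2 / (2 * sigma**2))

def cconv(f, g):
    """Circular convolution with a centered filter g."""
    return np.fft.ifft(np.fft.fft(f) * np.fft.fft(np.fft.ifftshift(g)))

def scattering(f, wavelets, phi, depth=2):
    """Collect the features {|...||f * psi| * psi| ...| * phi} up to 'depth'."""
    layers, feats = [f], [np.real(cconv(f, phi))]        # layer-0 output: f * phi_J
    for _ in range(depth):
        nxt = []
        for u in layers:
            for psi in wavelets:
                prop = np.abs(cconv(u, psi))             # propagate |u * psi_lambda|
                nxt.append(prop)
                feats.append(np.real(cconv(prop, phi)))  # output via phi_J
        layers = nxt
    return feats

n = 64
phi_J = gaussian_lowpass(n, sigma=8.0)
wavelets = [bandpass(n, freq=w, sigma=4.0) for w in (0.5, 1.5)]
feats = scattering(np.sin(np.linspace(0.0, 4.0 * np.pi, n)), wavelets, phi_J)
```

With two wavelets and depth 2, the tree has 1 + 2 + 4 = 7 output nodes, matching the branching structure sketched above.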
Mallat’s wavelet-modulus DCN
Directional wavelet system φJ ∪ ψλλ∈ΛW ,
ΛW :=λ = (j, k) | j > −J, k ∈ 1, . . . ,K
‖f ∗φJ‖22+
∑λ∈ΛW
‖f ∗ψλ‖22 = ‖f‖22, ∀f ∈ L2(Rd)
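The Parseval condition can be checked numerically for an idealized discrete filter bank whose transfer functions partition the frequency axis; this is a crude stand-in for {φ_J} ∪ {ψ_λ}, with arbitrary band edges.

```python
import numpy as np

n = 128
freqs = np.fft.fftfreq(n)              # digital frequencies in [-1/2, 1/2)

# Transfer functions that partition the frequency axis: one low-pass band
# (the role of hat(phi_J)) and three band-pass bands (the roles of hat(psi_lambda)).
bands = [np.abs(freqs) < 1/16,
         (np.abs(freqs) >= 1/16) & (np.abs(freqs) < 1/8),
         (np.abs(freqs) >= 1/8) & (np.abs(freqs) < 1/4),
         np.abs(freqs) >= 1/4]
hat_g = [b.astype(float) for b in bands]

# Littlewood-Paley condition: sum over lambda of |hat g_lambda(w)|^2 = 1 for all w
lp = sum(h**2 for h in hat_g)

# Energy conservation for a random signal: sum ||f * g_lambda||^2 = ||f||^2
f = np.random.default_rng(2).standard_normal(n)
energies = [np.linalg.norm(np.fft.ifft(np.fft.fft(f) * h)) ** 2 for h in hat_g]
```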
...and its edge detection capability [Mallat and Zhong, 1992]: the wavelet moduli |f ∗ ψ_{λ^(v)}|, |f ∗ ψ_{λ^(h)}|, and |f ∗ ψ_{λ^(d)}| extract edges of different orientations (images omitted).
Mallat’s wavelet-modulus DCN
[Mallat, 2012] proved that Φ_W is "horizontally" translation-invariant,

lim_{J→∞} |||Φ_W(T_t f) − Φ_W(f)||| = 0, ∀f ∈ L²(R^d), ∀t ∈ R^d,

and stable w.r.t. deformations (F_τ f)(x) := f(x − τ(x)):

|||Φ_W(F_τ f) − Φ_W(f)||| ≤ C (2^(−J) ‖τ‖_∞ + J ‖Dτ‖_∞ + ‖D²τ‖_∞) ‖f‖_W,

where ‖·‖_W is a wavelet-dependent norm.
Generalizations
The basic operations between consecutive layers
(Diagram: wavelet-modulus layers with filters ψ_λ and modulus |·|, side by side with general layers with filters g_λ, non-linearities NL, and pooling Pool.)
General DCNs employ a wide variety of filters g_λ:

pre-specified and structured (e.g., wavelets [Serre et al., 2005])

pre-specified and unstructured (e.g., random filters [Jarrett et al., 2009])

learned in a supervised [Huang and LeCun, 2006] or an unsupervised [Ranzato et al., 2007] fashion
General DCNs employ a wide variety of non-linearities:

modulus [Mutch and Lowe, 2006]

hyperbolic tangent [Huang and LeCun, 2006]

rectified linear unit [Nair and Hinton, 2010]

logistic sigmoid [Glorot and Bengio, 2010]
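All four non-linearities are pointwise and Lipschitz-continuous; a quick numerical spot-check of the Lipschitz property, using the standard constants L = 1 for modulus, ReLU, and tanh, and L = 1/4 for the logistic sigmoid:

```python
import numpy as np

# pointwise non-linearities rho and their Lipschitz constants L
nonlinearities = {
    "modulus": (np.abs, 1.0),
    "relu": (lambda x: np.maximum(x, 0.0), 1.0),
    "tanh": (np.tanh, 1.0),
    "sigmoid": (lambda x: 1.0 / (1.0 + np.exp(-x)), 0.25),
}

rng = np.random.default_rng(3)
a, b = rng.standard_normal(1000), rng.standard_normal(1000)

def lipschitz_ok(rho, L):
    """Spot-check |rho(a) - rho(b)| <= L |a - b| on random sample pairs."""
    return bool(np.all(np.abs(rho(a) - rho(b)) <= L * np.abs(a - b) + 1e-12))

results = {name: lipschitz_ok(rho, L) for name, (rho, L) in nonlinearities.items()}
```

Note that the logistic sigmoid takes the value 1/2 at zero, so it has to be shifted (ρ(x) − 1/2) to also satisfy the vanishing-at-zero property used in the analysis later on.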
General DCNs employ intra-layer pooling:

sub-sampling [Pinto et al., 2008]

average pooling [Jarrett et al., 2009]

max-pooling [Ranzato et al., 2007]
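The three pooling variants, in one dimension and with window and stride both equal to the pooling factor R (a minimal sketch):

```python
import numpy as np

def subsample(f, R):
    """Pooling by sub-sampling: keep every R-th sample."""
    return f[::R]

def average_pool(f, R):
    """Average pooling with window and stride R."""
    return f[: len(f) // R * R].reshape(-1, R).mean(axis=1)

def max_pool(f, R):
    """Max-pooling with window and stride R."""
    return f[: len(f) // R * R].reshape(-1, R).max(axis=1)

f = np.array([1.0, 3.0, 2.0, 8.0, 5.0, 4.0])
s, a, m = subsample(f, 2), average_pool(f, 2), max_pool(f, 2)
```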
General DCNs employ different filters, non-linearities, and pooling operations in different network layers [LeCun et al., 2015]
General DCNs employ various output filters χ [He et al., 2015]
General filters: Semi-discrete frames
Observation: Convolutions yield semi-discrete frame coefficients

(f ∗ g_λ)(b) = ⟨f, g_λ(b − ·)⟩ = ⟨f, T_b I g_λ⟩, (λ, b) ∈ Λ × R^d

Definition. Let {g_λ}_{λ∈Λ} ⊆ L¹(R^d) ∩ L²(R^d) be indexed by a countable set Λ. The collection

Ψ_Λ := {T_b I g_λ}_{(λ,b) ∈ Λ × R^d}

is a semi-discrete frame for L²(R^d) if there exist constants A, B > 0 such that

A‖f‖₂² ≤ Σ_{λ∈Λ} ∫_{R^d} |⟨f, T_b I g_λ⟩|² db = Σ_{λ∈Λ} ‖f ∗ g_λ‖₂² ≤ B‖f‖₂²,

for all f ∈ L²(R^d).
General filters: Semi-discrete frames
Semi-discrete frames are rooted in continuous frame theory [Antoine et al., 1993], [Kaiser, 1994]

Sampling the translation parameter b ∈ R^d in (T_b I g_λ) on Z^d leads to shift-invariant frames [Ron and Shen, 1995]

The frame condition can equivalently be expressed as

A ≤ Σ_{λ∈Λ} |ĝ_λ(ω)|² ≤ B, a.e. ω ∈ R^d

Structured semi-discrete frames: Weyl-Heisenberg frames, wavelets, (α)-curvelets, shearlets, and ridgelets

Λ is typically a collection of scales, directions, or frequency shifts
General non-linearities
Observation: Essentially all non-linearities M : L²(R^d) → L²(R^d) employed in the deep learning literature are

i) pointwise, i.e., (Mf)(x) = ρ(f(x)), x ∈ R^d, for some ρ : C → C,

ii) Lipschitz-continuous, i.e., ‖M(f) − M(h)‖ ≤ L‖f − h‖, ∀f, h ∈ L²(R^d), for some L > 0,

iii) such that M(f) = 0 for f = 0.
Incorporating pooling by sub-sampling
Pooling by sub-sampling can be emulated in continuous time by the (unitary) dilation operator

f ↦ R^(d/2) f(R ·), f ∈ L²(R^d),

where R ≥ 1 is the sub-sampling factor.
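Unitarity of the dilation operator can be verified numerically in dimension d = 1: the L² norm of R^(1/2) f(R·) matches that of f. The Gaussian test signal and grid are arbitrary choices; integration is done with a hand-rolled trapezoidal rule.

```python
import numpy as np

def l2_norm(y, x):
    """L2 norm via trapezoidal integration of |y|^2 on the grid x."""
    y2 = np.abs(y) ** 2
    return np.sqrt(np.sum(0.5 * (y2[:-1] + y2[1:]) * np.diff(x)))

f = lambda x: np.exp(-x**2)        # test signal in L2(R)
R = 2.0                            # sub-sampling factor
fR = lambda x: R**0.5 * f(R * x)   # unitary dilation x -> R^(d/2) f(Rx), d = 1

x = np.linspace(-10.0, 10.0, 20001)
n_f, n_fR = l2_norm(f(x), x), l2_norm(fR(x), x)
```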
Different modules in different layers
Module-sequence Ω = ((Ψ_n, M_n, R_n))_{n∈N}:

i) in the n-th network layer, replace the wavelet-modulus convolution operation |f ∗ ψ_λ| by

U_n[λ_n]f := R_n^(d/2) (M_n(f ∗ g_{λ_n}))(R_n ·)

ii) extend the operator U_n[λ_n] to paths on index sets

q = (λ_1, λ_2, ..., λ_n) ∈ Λ_1 × Λ_2 × ··· × Λ_n =: Λ_1^n, n ∈ N,

according to

U[q]f := U_n[λ_n] ··· U_2[λ_2] U_1[λ_1] f
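A discrete sketch of one module (Ψ_n, M_n, R_n) and of the path operator U[q], using circular convolutions, a generic pointwise non-linearity, and sub-sampling by R_n; all sizes, filters, and helper names are hypothetical.

```python
import numpy as np

def make_module(filters, rho, R):
    """One module (Psi_n, M_n, R_n): U_n[lam] f = R^(1/2) (M_n(f * g_lam))(R .),
    realized as circular convolution, pointwise rho, and sub-sampling by R."""
    def U(lam, f):
        g = filters[lam]
        conv = np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)))
        return R ** 0.5 * rho(conv)[::R]
    return U

def apply_path(modules, q, f):
    """U[q] f = U_n[lam_n] ... U_2[lam_2] U_1[lam_1] f for q = (lam_1, ..., lam_n)."""
    for U, lam in zip(modules, q):
        f = U(lam, f)
    return f

rng = np.random.default_rng(4)
n = 32
mod1 = make_module([rng.standard_normal(n) for _ in range(2)], np.abs, R=2)
mod2 = make_module([rng.standard_normal(n // 2) for _ in range(2)], np.abs, R=2)

out = apply_path([mod1, mod2], q=(0, 1), f=rng.standard_normal(n))
```

Each layer halves the length (R_n = 2), so a length-32 input yields a length-8 signal after the two-module path.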
Output filters
[Mallat, 2012] employed the same low-pass filter φ_J in every network layer n to generate the output according to

Φ_W^n(f) := { |···||f ∗ ψ_{λ^(1)}| ∗ ψ_{λ^(2)}| ··· ∗ ψ_{λ^(n)}| ∗ φ_J }_{λ^(1),...,λ^(n) ∈ Λ_W}

Here, designate one of the atoms {g_{λ_n}}_{λ_n∈Λ_n} as the output-generating atom χ_{n−1} := g_{λ_n*}, λ_n* ∈ Λ_n, of the (n−1)-th layer.

⇒ The atoms {g_{λ_n}}_{λ_n∈Λ_n \ {λ_n*}} ∪ {χ_{n−1}} are used across two consecutive layers!
Generalized feature extractor
Features generated in the n-th network layer:

Φ_Ω^n(f) := { (U[q]f) ∗ χ_n }_{q ∈ Λ_1^n}

Network tree: the root U[e]f = f yields the output f ∗ χ_0; each node U[q]f with path q ∈ Λ_1^k yields the output (U[q]f) ∗ χ_k and feeds the next layer via U[(q, λ_{k+1})]f = U_{k+1}[λ_{k+1}] U[q]f.
![Page 49: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/49.jpg)
Generalized feature extractor
Features generated in the n-th network layer
ΦnΩ(f) :=
(U [q]f) ∗ χn
q∈Λn1
Ψ2
U [e]f = f
U[λ
(j)1
]f
(U[λ
(j)1
]f)∗ χ1
U[(λ
(j)1 , λ
(l)2
)]f
(U[(λ
(j)1 , λ
(l)2
)]f)∗ χ2
U[(λ
(j)1 , λ
(l)2 , λ
(m)3
)]f
U[λ
(p)1
]f
(U[λ
(p)1
]f)∗ χ1
U[(λ
(p)1 , λ
(r)2
)]f
(U[(λ
(p)1 , λ
(r)2
)]f)∗ χ2
U[(λ
(p)1 , λ
(r)2 , λ
(s)3
)]f
f ∗ χ0
![Page 50: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/50.jpg)
Vertical translation invariance
Theorem (Wiatowski and HB, 2015)

Assume that Ω = ((Ψ_n, M_n, R_n))_{n∈N} satisfies the admissibility condition B_n ≤ min{1, L_n^(−2)}, for all n ∈ N. If there exists a constant K > 0 such that

|χ̂_n(ω)| |ω| ≤ K, a.e. ω ∈ R^d, ∀n ∈ N_0,

then

|||Φ_Ω^n(T_t f) − Φ_Ω^n(f)||| ≤ (2π |t| K / (R_1 ··· R_n)) ‖f‖₂,

for all f ∈ L²(R^d), t ∈ R^d, n ∈ N.
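Reading the bound: the departure from exact translation invariance is controlled by 2π|t|K/(R_1 ··· R_n)·‖f‖₂, so sub-sampling factors strictly greater than one drive it to zero geometrically with depth. A quick evaluation (the values of t, K, and R_n are chosen arbitrarily):

```python
import numpy as np

def invariance_bound(t_abs, K, Rs, f_norm):
    """The theorem's right-hand side: 2*pi*|t|*K / (R_1 ... R_n) * ||f||_2."""
    return 2.0 * np.pi * t_abs * K * f_norm / np.prod(Rs)

# sub-sampling factor R = 2 in every layer, depths n = 1, ..., 5:
# the bound halves with every additional layer
bounds = [invariance_bound(1.0, 1.0, [2.0] * n, 1.0) for n in range(1, 6)]
```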
The admissibility condition B_n ≤ min{1, L_n^(−2)}, ∀n ∈ N, is easily satisfied by normalizing Ψ_n.

The decay condition |χ̂_n(ω)| |ω| ≤ K, a.e. ω ∈ R^d, ∀n ∈ N_0, is satisfied, e.g., if sup_{n∈N_0} (‖χ_n‖₁ + ‖∇χ_n‖₁) < ∞.

If, in addition, lim_{n→∞} R_1 · R_2 · ... · R_n = ∞, then

lim_{n→∞} |||Φ_Ω^n(T_t f) − Φ_Ω^n(f)||| = 0, ∀f ∈ L²(R^d), ∀t ∈ R^d.
Philosophy behind invariance results

Mallat's "horizontal" translation invariance:
lim_{J→∞} |||Φ_W(T_t f) − Φ_W(f)||| = 0, ∀f ∈ L²(ℝ^d), ∀t ∈ ℝ^d

features become invariant in every network layer, but needs J → ∞
applies to the wavelet transform and modulus non-linearity without pooling

"Vertical" translation invariance:
lim_{n→∞} |||Φ_Ω^n(T_t f) − Φ_Ω^n(f)||| = 0, ∀f ∈ L²(ℝ^d), ∀t ∈ ℝ^d

features become more invariant with increasing network depth
applies to general filters, general non-linearities, and pooling through sub-sampling
![Page 57: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/57.jpg)
Deformation sensitivity bounds

[Mallat, 2012] proved that Φ_W is stable w.r.t. non-linear deformations (F_τ f)(x) = f(x − τ(x)) according to
|||Φ_W(F_τ f) − Φ_W(f)||| ≤ C (2⁻ᴶ ‖τ‖_∞ + J ‖Dτ‖_∞ + ‖D²τ‖_∞) ‖f‖_W,
where H_W := {f ∈ L²(ℝ^d) | ‖f‖_W < ∞} with
‖f‖_W := Σ_{n=0}^∞ ( Σ_{q∈(Λ_W)₁ⁿ} ‖U[q]f‖₂² )^{1/2}
![Page 58: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/58.jpg)
Deformation sensitivity for signal classes

Consider (F_τ f)(x) = f(x − τ(x)) = f(x − e^{−x²})

[Figure: f₁(x) together with (F_τ f₁)(x), and f₂(x) together with (F_τ f₂)(x)]

For given τ, the amount of deformation induced can depend drastically on f ∈ L²(ℝ^d)
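This slide's point is easy to check numerically. A minimal sketch (the signals f1, f2 and the deformation tau below are illustrative choices, not the ones pictured in the talk):

```python
import numpy as np

# The SAME deformation tau(x) = 0.1*exp(-x^2) (so ||tau||_inf = 0.1 and
# ||Dtau||_inf < 1/2) induces very different L2 errors on a slowly varying
# signal and on a high-frequency one with the same envelope.
x = np.linspace(-8.0, 8.0, 4096)
dx = x[1] - x[0]
tau = 0.1 * np.exp(-x**2)

f1 = np.exp(-x**2 / 2.0)            # slowly varying
f2 = np.cos(20.0 * x) * f1          # same envelope, rapid oscillations

def deformation_error(f):
    """||F_tau f - f||_2 with (F_tau f)(x) = f(x - tau(x))."""
    f_def = np.interp(x - tau, x, f)   # sample f at the warped positions
    return np.sqrt(np.sum((f_def - f)**2) * dx)

err1 = deformation_error(f1)
err2 = deformation_error(f2)        # much larger than err1 for the same tau
```

The high-frequency signal is deformed far more strongly, which is exactly why a single deformation bound must either weight the signal (Mallat's ‖f‖_W) or restrict to a signal class.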
![Page 59: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/59.jpg)
Deformation sensitivity bounds: Band-limited signals

Theorem (Wiatowski and HB, 2015)
Assume that Ω = ((Ψ_n, M_n, R_n))_{n∈ℕ} satisfies the admissibility condition B_n ≤ min{1, L_n⁻²}, for all n ∈ ℕ. There exists a constant C > 0 (that does not depend on Ω) such that for all
f ∈ {f ∈ L²(ℝ^d) | supp(f̂) ⊆ B_R(0)}
and all τ ∈ C¹(ℝ^d, ℝ^d) with ‖Dτ‖_∞ ≤ 1/(2d), it holds that
|||Φ_Ω(F_τ f) − Φ_Ω(f)||| ≤ C R ‖τ‖_∞ ‖f‖₂.
![Page 61: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/61.jpg)
Deformation sensitivity bounds: Cartoon functions

... and what about non-band-limited signals?

Image credit: middle [Mnih et al., 2015], right [Silver et al., 2016]

Take into account structural properties of natural images
⇒ consider cartoon functions [Donoho, 2001]
![Page 62: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/62.jpg)
Deformation sensitivity bounds: Cartoon functions

The class of cartoon functions of maximal size K > 0:
C^K_CART := {f₁ + 1_B f₂ | f_i ∈ L²(ℝ^d) ∩ C¹(ℝ^d, ℂ), i = 1, 2, |∇f_i(x)| ≤ K(1 + |x|²)^{−d/2}, vol_{d−1}(∂B) ≤ K, ‖f₂‖_∞ ≤ K}
![Page 63: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/63.jpg)
Deformation sensitivity bounds: Cartoon functions

Theorem (Grohs et al., 2016)
Assume that Ω = ((Ψ_n, M_n, R_n))_{n∈ℕ} satisfies the admissibility condition B_n ≤ min{1, L_n⁻²}, for all n ∈ ℕ. For every K > 0 there exists a constant C_K > 0 (that does not depend on Ω) such that for all f ∈ C^K_CART and all τ ∈ C¹(ℝ^d, ℝ^d) with ‖τ‖_∞ < 1/2 and ‖Dτ‖_∞ ≤ 1/(2d), it holds that
|||Φ_Ω(F_τ f) − Φ_Ω(f)||| ≤ C_K ‖τ‖_∞^{1/2}.
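The exponent 1/2 is natural for cartoon functions: a deformation moves the discontinuity, and the L² error picks up the square root of the displaced area. A minimal numerical sketch, assuming the special case of a constant shift τ(x) ≡ ε and the hypothetical cartoon function f = 1_{[0,1)}:

```python
import numpy as np

# For f = 1_[0,1) and a constant shift tau(x) = eps, the deformed and
# original functions differ on two strips of width eps, so
# ||F_tau f - f||_2 = sqrt(2*eps): the alpha = 1/2 rate of the theorem.
dx = 1e-4
x = np.arange(-1.0, 2.0, dx)
f = ((x >= 0.0) & (x < 1.0)).astype(float)

def shift_error(eps):
    f_shifted = ((x - eps >= 0.0) & (x - eps < 1.0)).astype(float)
    return np.sqrt(np.sum((f_shifted - f)**2) * dx)

e_small, e_large = shift_error(0.01), shift_error(0.04)
ratio = e_large / e_small   # quadrupling eps doubles the error: exponent 1/2
```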
![Page 64: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/64.jpg)
Deformation sensitivity bounds: Lipschitz functions

Cartoon functions reduce to Lipschitz functions upon setting f₂ = 0 in f₁ + 1_B f₂ ∈ C^K_CART

Corollary (Grohs et al., 2016)
Assume that Ω = ((Ψ_n, M_n, R_n))_{n∈ℕ} satisfies the admissibility condition B_n ≤ min{1, L_n⁻²}, for all n ∈ ℕ. For every K > 0 there exists a constant C_K > 0 (that does not depend on Ω) such that for all
f ∈ {f ∈ L²(ℝ^d) | f Lipschitz-continuous, |∇f(x)| ≤ K(1 + |x|²)^{−d/2}}
and all τ ∈ C¹(ℝ^d, ℝ^d) with ‖τ‖_∞ < 1/2 and ‖Dτ‖_∞ ≤ 1/(2d), it holds that
|||Φ_Ω(F_τ f) − Φ_Ω(f)||| ≤ C_K ‖τ‖_∞.
![Page 65: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/65.jpg)
... and what about textures?

neither band-limited, nor a cartoon function, nor Lipschitz-continuous
![Page 67: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/67.jpg)
Philosophy behind deformation stability/sensitivity bounds

Mallat's deformation stability bound:
|||Φ_W(F_τ f) − Φ_W(f)||| ≤ C (2⁻ᴶ ‖τ‖_∞ + J ‖Dτ‖_∞ + ‖D²τ‖_∞) ‖f‖_W,
for all f ∈ H_W ⊆ L²(ℝ^d)

The signal class H_W and the corresponding norm ‖·‖_W depend on the mother wavelet (and hence on the network)
Signal class description complexity is implicit via the norm ‖·‖_W
The bound depends explicitly on higher-order derivatives of τ
The bound is coupled to horizontal translation invariance
lim_{J→∞} |||Φ_W(T_t f) − Φ_W(f)||| = 0, ∀f ∈ L²(ℝ^d), ∀t ∈ ℝ^d

Our deformation sensitivity bound:
|||Φ_Ω(F_τ f) − Φ_Ω(f)||| ≤ C_C ‖τ‖_∞^α, ∀f ∈ C ⊆ L²(ℝ^d)

The signal class C (band-limited functions or cartoon functions) is independent of the network
Signal class description complexity is explicit via C_C (R-band-limited functions: C_C = O(R); cartoon functions of maximal size K: C_C = O(K^{3/2}); K-Lipschitz functions: C_C = O(K))
The decay rate α > 0 of the deformation error is signal-class-specific (band-limited functions: α = 1, cartoon functions: α = 1/2, Lipschitz functions: α = 1)
The bound implicitly depends on derivatives of τ via the condition ‖Dτ‖_∞ ≤ 1/(2d)
The bound is decoupled from vertical translation invariance
lim_{n→∞} |||Φ_Ω^n(T_t f) − Φ_Ω^n(f)||| = 0, ∀f ∈ L²(ℝ^d), ∀t ∈ ℝ^d
![Page 72: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/72.jpg)
Proof sketch: Decoupling

|||Φ_Ω(F_τ f) − Φ_Ω(f)||| ≤ C_C ‖τ‖_∞^α, ∀f ∈ C ⊆ L²(ℝ^d)

1) Lipschitz continuity:
|||Φ_Ω(f) − Φ_Ω(h)||| ≤ ‖f − h‖₂, ∀f, h ∈ L²(ℝ^d),
established through (i) the frame property of Ψ_n, (ii) Lipschitz continuity of the non-linearities, and (iii) the admissibility condition B_n ≤ min{1, L_n⁻²}

2) Signal-class-specific deformation sensitivity bound:
‖F_τ f − f‖₂ ≤ C_C ‖τ‖_∞^α, ∀f ∈ C ⊆ L²(ℝ^d)

3) Combine 1) and 2) to get
|||Φ_Ω(F_τ f) − Φ_Ω(f)||| ≤ ‖F_τ f − f‖₂ ≤ C_C ‖τ‖_∞^α,
for all f ∈ C ⊆ L²(ℝ^d)
![Page 75: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/75.jpg)
Noise robustness

Lipschitz continuity of Φ_Ω according to
|||Φ_Ω(f) − Φ_Ω(h)||| ≤ ‖f − h‖₂, ∀f, h ∈ L²(ℝ^d),
also implies robustness w.r.t. additive noise η ∈ L²(ℝ^d) according to
|||Φ_Ω(f + η) − Φ_Ω(f)||| ≤ ‖η‖₂
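Both inequalities can be verified numerically for a toy one-layer feature extractor Φ(f) = (|f ∗ g_λ|)_λ whose filters form a tight frame with B = 1, so the admissibility condition holds trivially. The two ideal band-pass filters below are an illustrative construction, not the module-sequences of the talk:

```python
import numpy as np

# Toy one-layer Phi(f) = (|f * g_lam|)_lam: two ideal band-pass filters whose
# squared DFT magnitudes sum to 1 at every frequency, i.e. a tight frame with
# A = B = 1. The reverse triangle inequality for |.| then gives
# |||Phi(f) - Phi(h)||| <= ||f - h||_2, and hence noise robustness.
rng = np.random.default_rng(0)
N = 256
k = np.arange(N)
g1_hat = (k < N // 2).astype(float)   # low frequency band
g2_hat = 1.0 - g1_hat                 # complementary high band

def phi(f):
    f_hat = np.fft.fft(f)
    return np.concatenate([np.abs(np.fft.ifft(f_hat * g1_hat)),
                           np.abs(np.fft.ifft(f_hat * g2_hat))])

f = rng.standard_normal(N)
h = rng.standard_normal(N)
eta = 0.1 * rng.standard_normal(N)    # additive noise

lip_lhs = np.linalg.norm(phi(f) - phi(h))    # |||Phi(f) - Phi(h)|||
lip_rhs = np.linalg.norm(f - h)              # ||f - h||_2
noise_lhs = np.linalg.norm(phi(f + eta) - phi(f))
noise_rhs = np.linalg.norm(eta)              # noise robustness bound
```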
![Page 76: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/76.jpg)
Energy conservation

It is desirable to have
f ≠ 0 ⇒ Φ(f) ≠ 0,
or, even better,
|||Φ(f)||| ≥ A_Φ ‖f‖₂, ∀f ∈ L²(ℝ^d),
for some A_Φ > 0.

[Waldspurger, 2015] proved, under analyticity assumptions on the mother wavelet, that for real-valued signals f ∈ L²(ℝ^d), Φ_W conserves energy according to
|||Φ_W(f)||| = ‖f‖₂
![Page 78: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/78.jpg)
Energy conservation

Theorem (Grohs et al., 2016)
Let Ω = ((Ψ_n, |·|, 1))_{n∈ℕ} be a module-sequence employing modulus non-linearities and no sub-sampling. For every n ∈ ℕ, let the atoms of Ψ_n satisfy the following conditions:
i) Σ_{λ_n∈Λ_n\{λ*_n}} |ĝ_{λ_n}(ω)|² + |χ̂_{n−1}(ω)|² = 1, a.e. ω ∈ ℝ^d
ii) Σ_{λ_n∈Λ_n\{λ*_n}} |ĝ_{λ_n}(ω)|² = 0, a.e. ω ∈ B_{δ_n}(0), for some δ_n > 0
iii) all atoms g_{λ_n} are analytic.
Then,
|||Φ_Ω(f)||| = ‖f‖₂, ∀f ∈ L²(ℝ^d)

Various structured frames satisfy conditions i)-iii)
![Page 80: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/80.jpg)
Proof sketch: Energy conservation, or "What does the modulus non-linearity do?"

[Figure: the spectrum f̂(ω) is passed through a band-pass atom ĝ_λ; the spectrum R_{f·g_λ}(ω) of |f ∗ g_λ|² concentrates around ω = 0, where it is captured by the low-pass output filter χ, i.e., |f ∗ g_λ|² ∗ χ retains the corresponding energy]
![Page 84: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/84.jpg)
Two Meta-Theorems

Meta-Theorem
Vertical translation invariance and Lipschitz continuity (hence, by decoupling, also deformation insensitivity) are guaranteed by the network structure per se rather than by the specific convolution kernels, non-linearities, and pooling operations.

Meta-Theorem
For networks employing the modulus non-linearity and no intra-layer pooling, energy conservation is guaranteed for quite general convolution kernels.
![Page 86: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/86.jpg)
Deep Frame Net
Open source software:
MATLAB: http://www.nari.ee.ethz.ch/commth/research
Python: Coming soon!
![Page 87: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/87.jpg)
The Mathematics of Deep Learning
Part 2: Discrete-time Theory
Helmut Bolcskei
Department of Information Technology and Electrical Engineering
June 2016
joint work with Thomas Wiatowski, Michael Tschannen, and Philipp Grohs
![Page 88: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/88.jpg)
Continuous-time theory
[Mallat, 2012] and [Wiatowski and HB, 2015] developed a continuous-time theory for feature extraction through DCNs:

translation invariance results for L²(ℝ^d)-functions
deformation sensitivity bounds for signal classes C ⊆ L²(ℝ^d)
energy conservation for L²(ℝ^d)-functions
![Page 89: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/89.jpg)
Practice is digital
In practice ... we need to handle discrete data!
f = [image] ∈ ℝ^{n×n}
![Page 90: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/90.jpg)
Practice is digital
In practice ... a wide variety of network architectures is used!
[Network diagram, cf. LeCun et al., 1990: f → |f ∗ g_k| → ||f ∗ g_k| ∗ h_l| → class probabilities]
![Page 91: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/91.jpg)
Practice is digital
In practice ... a wide variety of network architectures is used!
[Network diagram: feature maps |f ∗ g_k|, ||f ∗ g_k| ∗ h_l|, |||f ∗ g_k| ∗ h_l| ∗ k_m| across layers, per-layer outputs generated by filtering with µ, plus a low-pass output f ∗ ϕ_low]
![Page 92: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/92.jpg)
Architecture of general DCNs
The basic operations between consecutive layers:
[Diagram: f → convolution with g_k → non-linearity (NL) → pooling (Pool), repeated across layers]

DCNs employ a wide variety of filters g_k
pre-specified and structured (e.g., wavelets [Serre et al., 2005])
pre-specified and unstructured (e.g., random filters [Jarrett et al., 2009])
learned in a supervised [Huang and LeCun, 2006] or an unsupervised [Ranzato et al., 2007] fashion
![Page 93: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/93.jpg)
Architecture of general DCNs
The basic operations between consecutive layers (diagram as above)

DCNs employ a wide variety of non-linearities
modulus [Mutch and Lowe, 2006]
hyperbolic tangent [Huang and LeCun, 2006]
rectified linear unit [Nair and Hinton, 2010]
logistic sigmoid [Glorot and Bengio, 2010]
![Page 94: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/94.jpg)
Architecture of general DCNs
The basic operations between consecutive layers (diagram as above)

DCNs employ pooling
sub-sampling [Pinto et al., 2008]
average pooling [Jarrett et al., 2009]
max-pooling [Ranzato et al., 2007]
![Page 95: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/95.jpg)
Architecture of general DCNs
The basic operations between consecutive layers (diagram as above)

DCNs employ different filters, non-linearities, and pooling operations in different network layers [LeCun et al., 2015]
![Page 96: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/96.jpg)
Architecture of general DCNs
[Diagram: f → convolution with g_k → non-linearity (NL) → pooling (Pool) in each layer, with an output-generating filter χ attached to every layer]

Which layers contribute to the network's output?
the last layer only (e.g., class probabilities [LeCun et al., 1990])
a subset of layers (e.g., shortcut connections [He et al., 2015])
all layers (e.g., low-pass filtering [Bruna and Mallat, 2013])
![Page 97: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/97.jpg)
Challenges

Challenges for a discrete theory:

flexible architectures
signals of varying dimensions are propagated through the network
how to incorporate general pooling operators into the theory?
cannot rely on asymptotics (the network depth is finite) to prove network properties (e.g., translation invariance)
nature is analog
what are appropriate signal classes to be considered?
![Page 103: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/103.jpg)
Definitions

Signal space
H_N := {f : ℤ → ℂ | f[n] = f[n + N], ∀n ∈ ℤ}

p-norm
‖f‖_p := ( Σ_{n∈I_N} |f[n]|^p )^{1/p}, I_N := {0, …, N − 1}

Circular convolution
(f ∗ g)[n] := Σ_{k∈I_N} f[k] g[n − k], f, g ∈ H_N

Discrete Fourier transform
f̂[k] := Σ_{n∈I_N} f[n] e^{−2πikn/N}, f ∈ H_N
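As a quick consistency check of these definitions: circular convolution computed directly from the sum agrees with the DFT convolution theorem (f ∗ g)^[k] = f̂[k] ĝ[k] (numpy's fft uses exactly the DFT convention above). This is a sketch with arbitrary test signals:

```python
import numpy as np

# Circular convolution on H_N from the definition, and the same result via
# the DFT convolution theorem (f * g)^ = fhat * ghat.
N = 8
rng = np.random.default_rng(1)
f = rng.standard_normal(N)
g = rng.standard_normal(N)

def circ_conv(f, g):
    N = len(f)
    return np.array([sum(f[k] * g[(n - k) % N] for k in range(N))
                     for n in range(N)])

direct = circ_conv(f, g)                                  # from the sum
via_dft = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real # via the DFT
```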
![Page 104: The Mathematics of Deep Learning Part 1: Continuous-time Theorypfister.ee.duke.edu/nasit16/Boelcskei.pdf · 2016. 6. 28. · Part 1: Continuous-time Theory Helmut B}olcskei ... Mallat’s](https://reader035.vdocuments.net/reader035/viewer/2022071609/6147d694a830d0442101b206/html5/thumbnails/104.jpg)
Filters: Shift-invariant frames for H_N

Observation: Convolutions yield shift-invariant frame coefficients
(f ∗ g_λ)[n] = ⟨f, g_λ(n − ·)⟩ = ⟨f, T_n I g_λ⟩, (λ, n) ∈ Λ × I_N

Definition
Let {g_λ}_{λ∈Λ} ⊆ H_N be indexed by a finite set Λ. The collection
Ψ_Λ := {T_n I g_λ}_{(λ,n)∈Λ×I_N}
is a shift-invariant frame for H_N if there exist constants A, B > 0 such that
A ‖f‖₂² ≤ Σ_{λ∈Λ} Σ_{n∈I_N} |⟨f, T_n I g_λ⟩|² = Σ_{λ∈Λ} ‖f ∗ g_λ‖₂² ≤ B ‖f‖₂²,
for all f ∈ H_N
Filters: Shift-invariant frames for H_N

A‖f‖₂² ≤ ∑_{λ∈Λ} ∑_{n∈I_N} |〈f, T_n I g_λ〉|² = ∑_{λ∈Λ} ‖f ∗ g_λ‖₂² ≤ B‖f‖₂²

Shift-invariant frames for L²(R^d) [Ron and Shen, 1995], for ℓ²(Z) [HB et al., 1998] and [Cvetkovic and Vetterli, 1998]

The frame condition can equivalently be expressed as

A ≤ ∑_{λ∈Λ} |ĝ_λ[k]|² ≤ B,  ∀ k ∈ I_N

Frame lower bound A > 0 guarantees that no essential features of f are “lost” in the network

Structured shift-invariant frames: Weyl–Heisenberg frames, wavelets, (α)-curvelets, shearlets, and ridgelets

Λ is typically a collection of scales, directions, or frequency shifts
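The frequency-domain form of the frame condition is easy to verify numerically. The sketch below (our own toy example with random filters, not from the slides) computes A = min_k ∑_λ |ĝ_λ[k]|² and B = max_k ∑_λ |ĝ_λ[k]|², then checks the time-domain frame inequality for a random signal:

```python
import numpy as np

rng = np.random.default_rng(1)
N, num_filters = 16, 5
G = rng.standard_normal((num_filters, N))       # filter bank {g_lambda}
G_hat = np.fft.fft(G, axis=1)
S = np.sum(np.abs(G_hat) ** 2, axis=0)          # sum_lambda |g_hat_lambda[k]|^2
A, B = S.min(), S.max()                         # tightest frame bounds

f = rng.standard_normal(N)
# sum_lambda ||f * g_lambda||_2^2, convolutions done in the frequency domain
energy = sum(np.linalg.norm(np.fft.ifft(np.fft.fft(f) * G_hat[i])) ** 2
             for i in range(num_filters))
nf2 = np.linalg.norm(f) ** 2
assert A * nf2 - 1e-9 <= energy <= B * nf2 + 1e-9
```

By Parseval, ∑_λ ‖f ∗ g_λ‖₂² = (1/N) ∑_k |f̂[k]|² ∑_λ |ĝ_λ[k]|², which is exactly why the two formulations of the frame condition are equivalent.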
Filters: Shift-invariant frames for H_N

[Network tree: the input f is propagated through the layers as |f ∗ g_{λ₁^{(1)}}|, ||f ∗ g_{λ₁^{(1)}}| ∗ g_{λ₂^{(1)}}|, |||f ∗ g_{λ₁^{(1)}}| ∗ g_{λ₂^{(1)}}| ∗ g_{λ₃^{(1)}}|, . . . , with one branch per choice of filter indices (λ₁^{(m)}, λ₂^{(n)}, λ₃^{(p)}, . . . ) in each layer]
Filters: Shift-invariant frames for H_N

[Network tree as before, now with an output generated in every layer: f ∗ χ₀ at the root, |f ∗ g_{λ₁^{(m)}}| ∗ χ₁ in the first layer, ||f ∗ g_{λ₁^{(m)}}| ∗ g_{λ₂^{(n)}}| ∗ χ₂ in the second layer, . . . ]

How to generate network output in the d-th layer? Convolution with a general χ_d ∈ H_{N_{d+1}} gives flexibility!
Network output

A wide variety of architectures is encompassed, e.g.,

output: none ⇒ χ_d = 0

output: propagated signals |···|f ∗ g_{λ₁^{(m)}}| ∗ ··· ∗ g_{λ_d^{(n)}}| ⇒ χ_d = δ

output: filtered signals ⇒ χ_d = filter (e.g., low-pass)

⇒ Ψ_{d+1} ∪ {T_n I χ_d}_{n∈I_{N_{d+1}}} forms a shift-invariant frame for H_{N_{d+1}}

Start with

A_{d+1} ≤ ∑_{λ_{d+1}∈Λ_{d+1}} |ĝ_{λ_{d+1}}[k]|² ≤ B_{d+1},  ∀ k ∈ I_{N_{d+1}},

and note that

A_{d+1} ≤ |χ̂_d[k]|² + ∑_{λ_{d+1}∈Λ_{d+1}} |ĝ_{λ_{d+1}}[k]|² ≤ B′_{d+1},  ∀ k ∈ I_{N_{d+1}}
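The argument above can be checked numerically: appending the output-generating filter χ_d to the module's filter bank cannot lower the frame lower bound, and raises the upper bound by at most max_k |χ̂_d[k]|². A toy sketch (our own construction, with a crude low-pass χ):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 16
G_hat = np.fft.fft(rng.standard_normal((4, N)), axis=1)  # bank Psi_{d+1}
chi_hat = np.fft.fft(np.ones(N) / N)                     # crude low-pass chi_d

S = np.sum(np.abs(G_hat) ** 2, axis=0)     # sum_lambda |g_hat[k]|^2
S_chi = S + np.abs(chi_hat) ** 2           # with |chi_hat[k]|^2 added
A, B = S.min(), S.max()

assert S_chi.min() >= A                    # lower frame bound survives
assert S_chi.max() <= B + np.max(np.abs(chi_hat) ** 2)  # B' is controlled
```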
Filters: Shift-invariant frames for H_N

A wide variety of architectures is encompassed!

[Example tree: χ₁ = δ outputs the propagated signals |f ∗ g_{λ₁}| ∗ δ in the first layer, χ₂ = 0 produces no output in the second layer, and the root emits the low-pass filtered input f ∗ φ_low]
Non-linearities

Observation: Essentially all non-linearities ρ : H_N → H_N employed in the deep learning literature are

i) pointwise, i.e.,

(ρ(f))[n] = ρ(f[n]),  n ∈ I_N ,

ii) Lipschitz-continuous, i.e.,

‖ρ(f) − ρ(h)‖₂ ≤ L‖f − h‖₂,  ∀ f, h ∈ H_N ,

for some L > 0
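Two standard examples (our own illustration, not an exhaustive list) are the ReLU and the modulus, both pointwise and Lipschitz with L = 1, since |max(a,0) − max(b,0)| ≤ |a − b| and ||a| − |b|| ≤ |a − b| hold pointwise:

```python
import numpy as np

rng = np.random.default_rng(3)
f, h = rng.standard_normal(32), rng.standard_normal(32)

for rho in (lambda x: np.maximum(x, 0),   # ReLU
            np.abs):                      # modulus
    # pointwise 1-Lipschitz bound implies the l2 bound with L = 1
    assert np.linalg.norm(rho(f) - rho(h)) <= np.linalg.norm(f - h) + 1e-12
```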
Pooling P : H_N → H_{N/S}

Pooling: Combining nearby values / picking one representative value

Averaging:

(Pf)[n] = ∑_{k=Sn}^{Sn+S−1} α_{k−Sn} f[k]

weights {α_k}_{k=0}^{S−1} can be learned [LeCun et al., 1998] or be pre-specified [Pinto et al., 2008]

uniform averaging corresponds to α_k = 1/S, for k ∈ {0, . . . , S − 1}

[Plot: a signal f[n] and its averaged version (Pf)[n]]
Pooling P : H_N → H_{N/S}

Pooling: Combining nearby values / picking one representative value

Maximization:

(Pf)[n] = max_{k∈{nS,...,nS+S−1}} |f[k]|

[Plot: a signal f[n] and its max-pooled version (Pf)[n]]
Pooling P : H_N → H_{N/S}

Pooling: Combining nearby values / picking one representative value

Sub-sampling:

(Pf)[n] = f[Sn]

S = 1 corresponds to “no pooling”

[Plot: a signal f[n] and its sub-sampled version (Pf)[n]]
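All three pooling operators are one-liners in numpy. A minimal sketch (our own implementation of the definitions above):

```python
import numpy as np

def pool_avg(f, S, alpha=None):
    """(Pf)[n] = sum_{k=Sn}^{Sn+S-1} alpha[k-Sn] f[k]; uniform alpha_k = 1/S."""
    alpha = np.full(S, 1.0 / S) if alpha is None else np.asarray(alpha)
    return np.array([np.dot(alpha, f[S * n:S * n + S])
                     for n in range(len(f) // S)])

def pool_max(f, S):
    """(Pf)[n] = max of |f[k]| over k in {nS, ..., nS+S-1}."""
    return np.array([np.max(np.abs(f[S * n:S * n + S]))
                     for n in range(len(f) // S)])

def pool_sub(f, S):
    """(Pf)[n] = f[Sn]; S = 1 is 'no pooling'."""
    return f[::S]

f = np.array([1.0, -3.0, 2.0, 2.0])
print(pool_avg(f, 2))   # [-1.  2.]
print(pool_max(f, 2))   # [3. 2.]
print(pool_sub(f, 2))   # [1. 2.]
```

Each operator maps a length-N signal to length N/S, matching P : H_N → H_{N/S}.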
Pooling

Common to all pooling operators P_d:

Lipschitz continuity with Lipschitz constant R_d:

averaging: R_d = S_d^{1/2} max_{k∈{0,...,S_d−1}} |α_k^{(d)}|

maximization: R_d = 1

sub-sampling: R_d = 1

Pooling factor S_d:

“size” of the neighborhood values are combined in

dimensionality reduction from d-th to (d+1)-th layer, i.e., N_{d+1} = N_d / S_d
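These Lipschitz constants can be probed empirically. The sketch below (our own test harness, not from the slides) draws random signal pairs and verifies ‖Pf − Ph‖₂ ≤ R_d ‖f − h‖₂ for all three operators:

```python
import numpy as np

rng = np.random.default_rng(4)
S, N = 2, 16
alpha = rng.standard_normal(S)             # arbitrary averaging weights

def pool_avg(f):
    return np.array([np.dot(alpha, f[S * n:S * n + S]) for n in range(N // S)])

def pool_max(f):
    return np.array([np.max(np.abs(f[S * n:S * n + S])) for n in range(N // S)])

R_avg = np.sqrt(S) * np.max(np.abs(alpha))  # R_d for averaging
for _ in range(100):
    f, h = rng.standard_normal(N), rng.standard_normal(N)
    d = np.linalg.norm(f - h)
    assert np.linalg.norm(pool_avg(f) - pool_avg(h)) <= R_avg * d + 1e-12
    assert np.linalg.norm(pool_max(f) - pool_max(h)) <= 1.0 * d + 1e-12   # R_d = 1
    assert np.linalg.norm(f[::S] - h[::S]) <= 1.0 * d + 1e-12             # R_d = 1
```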
Different modules in different layers

Module-sequence Ω = ((Ψ_d, ρ_d, P_d))_{d=1}^D

i) in the d-th network layer, we compute

U_d[λ_d]f := P_d(ρ_d(f ∗ g_{λ_d}))

ii) extend the operator U_d[λ_d] to paths on index sets

q = (λ₁, λ₂, . . . , λ_d) ∈ Λ₁ × Λ₂ × ··· × Λ_d := Λ₁^d,  d ∈ {1, . . . , D},

according to

U[q]f := U_d[λ_d] ··· U₂[λ₂] U₁[λ₁]f
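A single module and its path composition fit in a few lines. The sketch below (our own minimal instance, fixing ρ = modulus and P = sub-sampling by S; the filters are random stand-ins):

```python
import numpy as np

def circ_conv(f, g):
    """Circular convolution, g zero-padded to the length of f."""
    return np.fft.ifft(np.fft.fft(f) * np.fft.fft(g, len(f))).real

def U(f, g, S=2):
    """One module: U_d[lambda_d] f = P_d(rho_d(f * g_lambda_d)),
    here with modulus non-linearity and sub-sampling by S."""
    return np.abs(circ_conv(f, g))[::S]

def U_path(f, filters, S=2):
    """U[q]f = U_d[lambda_d] ... U_1[lambda_1] f along the path q."""
    for g in filters:
        f = U(f, g, S=S)
    return f

rng = np.random.default_rng(5)
f = rng.standard_normal(16)
q = [rng.standard_normal(4), rng.standard_normal(4)]  # path (lambda_1, lambda_2)
out = U_path(f, q)
assert out.shape == (4,)   # dimensions shrink 16 -> 8 -> 4, i.e. N_{d+1} = N_d / S_d
```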
Local and global properties

Features generated in the d-th network layer

Φ_Ω^d(f) := { (U[q]f) ∗ χ_d }_{q∈Λ₁^d}

[Network tree: each path q = (λ₁^{(j)}, λ₂^{(l)}, . . . ) yields a propagated signal U[q]f and a layer output (U[q]f) ∗ χ_d; the root emits f ∗ χ₀]
Global properties: Lipschitz continuity

Theorem (Wiatowski et al., 2016)

Assume that Ω = ((Ψ_d, ρ_d, P_d))_{d=1}^D satisfies the admissibility condition B_d ≤ min{1, R_d^{−2} L_d^{−2}}, for all d ∈ {1, . . . , D}. Then, the feature extractor is Lipschitz-continuous, i.e.,

|||Φ_Ω(f) − Φ_Ω(h)||| ≤ ‖f − h‖₂,  ∀ f, h ∈ H_{N₁}.
... this implies ...

robustness w.r.t. additive noise η ∈ H_{N₁} according to

|||Φ_Ω(f + η) − Φ_Ω(f)||| ≤ ‖η‖₂,  ∀ f ∈ H_{N₁}

an upper bound on the feature vector’s energy according to

|||Φ_Ω(f)||| ≤ ‖f‖₂,  ∀ f ∈ H_{N₁}
The admissibility condition

B_d ≤ min{1, R_d^{−2} L_d^{−2}},  ∀ d ∈ {1, . . . , D},

is easily satisfied by normalizing the frame elements in Ψ_d
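Normalization is just a global rescaling of the filters: scaling every g_λ by c scales the frame bounds by c². A sketch (our own toy example) that rescales a random filter bank so that B_d meets the admissibility threshold for modulus non-linearity (L_d = 1) and sub-sampling (R_d = 1):

```python
import numpy as np

rng = np.random.default_rng(6)
N, L_d, R_d = 16, 1.0, 1.0                 # modulus (L = 1), sub-sampling (R = 1)
G = rng.standard_normal((4, N))
S = np.sum(np.abs(np.fft.fft(G, axis=1)) ** 2, axis=0)
B = S.max()                                # current frame upper bound

target = min(1.0, R_d ** -2 * L_d ** -2)   # admissibility threshold
G_norm = G * np.sqrt(target / B)           # rescale the frame elements
B_new = np.sum(np.abs(np.fft.fft(G_norm, axis=1)) ** 2, axis=0).max()
assert B_new <= target + 1e-9              # admissibility now holds
```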
Global properties: Deformation sensitivity bounds

Network output should be independent of cameras (of different resolutions), and insensitive to small acquisition jitters

⇒ Want to analyze sensitivity w.r.t. continuous-time deformations

(F_τ f)(x) = f(x − τ(x)),  x ∈ R,

and hence consider

(F_τ f)[n] = f(n/N − τ(n/N)),  n ∈ I_N
Global properties: Deformation sensitivity bounds

Goal: Deformation sensitivity bounds for practically relevant signal classes

Image credit: middle [Mnih et al., 2015], right [Silver et al., 2016]
Take into account structural properties of natural images

⇒ consider cartoon functions [Donoho, 2001]
Continuous-time [Donoho, 2001]:

Cartoon functions of maximal variation K > 0:

C_CART^K := { c₁ + 1_{[a,b]} c₂ | |c_i(x) − c_i(y)| ≤ K|x − y|, ∀ x, y ∈ R, i = 1, 2, ‖c₂‖_∞ ≤ K }
Discrete-time [Wiatowski et al., 2016]:

Sampled cartoon functions of length N and maximal variation K > 0:

C_CART^{N,K} := { f[n] = c(n/N), n ∈ I_N | c = (c₁ + 1_{[a,b]} c₂) ∈ C_CART^K }
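A concrete toy instance (our own construction) of a sampled cartoon function, with a K-Lipschitz smooth part c₁ and a jump carried by 1_{[a,b]} c₂; the jump is precisely the feature that a purely smoothness-based signal model would miss:

```python
import numpy as np

N, K, a, b = 64, 2.0, 0.25, 0.75
n = np.arange(N) / N                     # sampling points n/N in [0, 1)
c1 = 0.5 * n                             # |c1(x) - c1(y)| = 0.5|x - y| <= K|x - y|
c2 = np.full(N, 1.0)                     # constant piece, ||c2||_inf = 1 <= K
f = c1 + ((a <= n) & (n <= b)) * c2      # f[n] = c(n/N), c in C_CART^K

# the sampled function inherits the discontinuity at the interval boundary
assert abs(f[int(a * N)] - f[int(a * N) - 1]) > 0.5
```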
Global properties: Deformation sensitivity bounds

Theorem (Wiatowski et al., 2016)

Assume that Ω = ((Ψ_d, ρ_d, P_d))_{d=1}^D satisfies the admissibility condition B_d ≤ min{1, R_d^{−2} L_d^{−2}}, for all d ∈ {1, . . . , D}. For every N₁ ∈ N, every K > 0, and every τ : [0, 1] → [−1, 1], it holds that

|||Φ_Ω(F_τ f) − Φ_Ω(f)||| ≤ 4K N₁^{1/2} ‖τ‖_∞^{1/2},

for all f ∈ C_CART^{N₁,K}.
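The proof route (under the admissibility condition, Φ_Ω is 1-Lipschitz, so it suffices to bound the signal-level distortion ‖F_τ f − f‖₂) can be probed numerically. A sketch with our own toy cartoon function and a constant deformation τ(x) = t, checking the signal-level distortion against the 4K N₁^{1/2} ‖τ‖_∞^{1/2} scaling that appears in the theorem:

```python
import numpy as np

N, K, a, b, t = 64, 2.0, 0.25, 0.75, 0.03

def c(x):
    """Cartoon function c = c1 + 1_{[a,b]} c2 with K = 2."""
    return 0.5 * x + ((a <= x) & (x <= b)) * 1.0

n = np.arange(N) / N
f, f_t = c(n), c(n - t)            # f[n] and (F_tau f)[n] = c(n/N - t)
lhs = np.linalg.norm(f - f_t)      # signal-level distortion
rhs = 4 * K * np.sqrt(N) * np.sqrt(t)
assert lhs <= rhs
```

Note that ‖f − F_τ f‖₂ stays bounded even though the jump at a makes f badly non-smooth, which is why the bound holds on the cartoon class rather than on a smoothness class.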
Philosophy behind deformation sensitivity bounds

|||Φ_Ω(F_τ f) − Φ_Ω(f)||| ≤ 4K N₁^{1/2} ‖τ‖_∞^{1/2},  ∀ f ∈ C_CART^{N₁,K}

Bound depends explicitly on the analog signal’s description complexity via K and N₁

Lipschitz exponent α = 1/2 for ‖τ‖_∞ is signal-class-specific (larger Lipschitz exponents for smoother functions)

Particularizing to translations τ_t(x) = t, x ∈ [0, 1], results in a translation sensitivity bound according to

|||Φ_Ω(F_{τ_t} f) − Φ_Ω(f)||| ≤ 4K N₁^{1/2} |t|^{1/2},  ∀ f ∈ C_CART^{N₁,K}
Global properties: Energy conservation

Theorem (Wiatowski et al., 2016)

Let Ω = ((Ψ_n, | · |, P^{sub}_{S=1}))_{n∈N} be a module-sequence employing modulus non-linearities and no pooling. For every d ∈ {1, . . . , D}, let the atoms of Ψ_d satisfy

∑_{λ_d∈Λ_d} |ĝ_{λ_d}[k]|² + |χ̂_{d−1}[k]|² = 1,  ∀ k ∈ I_{N_d}.

Let the output-generating atom of the last layer be the delta function, i.e., χ_{D−1} = δ; then

|||Φ_Ω(f)||| = ‖f‖₂,  ∀ f ∈ H_{N₁}.
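The mechanism behind the theorem is a Parseval identity per layer: when ∑_λ |ĝ_λ[k]|² + |χ̂[k]|² = 1 for all k, the filter bank together with χ splits the signal energy exactly. A one-layer sketch (our own construction with an ideal low-pass χ and a complementary filter):

```python
import numpy as np

N = 16
chi_hat = np.zeros(N)
chi_hat[:4] = 1.0                        # ideal low-pass chi (frequency domain)
g_hat = np.sqrt(1.0 - chi_hat ** 2)      # complement: |g_hat|^2 + |chi_hat|^2 = 1

rng = np.random.default_rng(7)
f = rng.standard_normal(N)
f_hat = np.fft.fft(f)
# Parseval: ||f * g||^2 + ||f * chi||^2 = (1/N) sum_k |f_hat[k]|^2 = ||f||^2
e = (np.linalg.norm(np.fft.ifft(f_hat * g_hat)) ** 2
     + np.linalg.norm(np.fft.ifft(f_hat * chi_hat)) ** 2)
assert np.isclose(e, np.linalg.norm(f) ** 2)   # energy split exactly
```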
Local properties

[Network tree as before: propagated signals U[q]f along paths q and per-layer outputs (U[q]f) ∗ χ_d, with f ∗ χ₀ at the root]
Local properties: Lipschitz continuity

Theorem (Wiatowski et al., 2016)

The features generated in the d-th network layer are Lipschitz-continuous with Lipschitz constant

L_Ω^d := ‖χ_d‖₁ ( ∏_{k=1}^d B_k L_k² R_k² )^{1/2},

i.e.,

|||Φ_Ω^d(f) − Φ_Ω^d(h)||| ≤ L_Ω^d ‖f − h‖₂,  ∀ f, h ∈ H_{N₁}.
The Lipschitz constant L_Ω^d

determines the noise sensitivity of Φ_Ω^d(f) according to

|||Φ_Ω^d(f + η) − Φ_Ω^d(f)||| ≤ L_Ω^d ‖η‖₂,  ∀ f ∈ H_{N₁}
The Lipschitz constant L_Ω^d

impacts the energy of Φ_Ω^d(f) according to

|||Φ_Ω^d(f)||| ≤ L_Ω^d ‖f‖₂,  ∀ f ∈ H_{N₁}
The Lipschitz constant L_Ω^d

quantifies the impact of deformations τ according to

|||Φ_Ω^d(F_τ f) − Φ_Ω^d(f)||| ≤ 4 L_Ω^d K N₁^{1/2} ‖τ‖_∞^{1/2},  ∀ f ∈ C_CART^{N₁,K}
The Lipschitz constant L_Ω^d

is hence a characteristic constant for the features Φ_Ω^d(f) generated in the d-th network layer
Local properties: Lipschitz continuity

$$L_\Omega^d = \frac{\|\chi_d\|_1\, B_d^{1/2} L_d R_d}{\|\chi_{d-1}\|_1}\, L_\Omega^{d-1}$$

If $\|\chi_d\|_1 < \|\chi_{d-1}\|_1 / (B_d^{1/2} L_d R_d)$, then $L_\Omega^d < L_\Omega^{d-1}$, and hence

the features Φ^d_Ω(f) are less deformation-sensitive than Φ^{d−1}_Ω(f), thanks to

$$|||\Phi_\Omega^d(F_\tau f) - \Phi_\Omega^d(f)||| \le 4 L_\Omega^d K N_1^{1/2} \|\tau\|_\infty^{1/2}, \quad \forall f \in \mathcal{C}_{\mathrm{CART}}^{N_1, K}$$

the features Φ^d_Ω(f) contain less energy than Φ^{d−1}_Ω(f), owing to

$$|||\Phi_\Omega^d(f)||| \le L_\Omega^d \|f\|_2, \quad \forall f \in H_{N_1}$$

⇒ Tradeoff between deformation sensitivity and energy preservation!
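The closed-form product and the layer-wise recursion for L^d_Ω can be cross-checked numerically. A minimal sketch; the per-layer constants B_k, L_k, R_k and the norms ‖χ_k‖₁ below are hypothetical values chosen only for illustration:

```python
import numpy as np

# Hypothetical per-layer constants: frame bounds B_k, Lipschitz constants
# L_k of the non-linearities, pooling constants R_k, and l1-norms of the
# output-generating atoms chi_k (one value per layer).
B = [1.0, 0.9, 0.8]
L = [1.0, 1.0, 1.0]      # e.g. modulus and ReLU are 1-Lipschitz
R = [1.0, 1.41, 1.41]
chi1 = [1.0, 0.6, 0.3]   # ||chi_d||_1 for d = 1, 2, 3

def lipschitz_closed_form(d):
    # L^d_Omega = ||chi_d||_1 * (prod_{k=1}^d B_k L_k^2 R_k^2)^(1/2)
    prod = np.prod([B[k] * L[k]**2 * R[k]**2 for k in range(d)])
    return chi1[d - 1] * np.sqrt(prod)

def lipschitz_recursive(d):
    # L^d_Omega = (||chi_d||_1 B_d^(1/2) L_d R_d / ||chi_{d-1}||_1) L^{d-1}_Omega
    Ld = chi1[0] * np.sqrt(B[0] * L[0]**2 * R[0]**2)
    for k in range(2, d + 1):
        Ld *= chi1[k - 1] * np.sqrt(B[k - 1]) * L[k - 1] * R[k - 1] / chi1[k - 2]
    return Ld

for d in (1, 2, 3):
    assert np.isclose(lipschitz_closed_form(d), lipschitz_recursive(d))
```

With these (made-up) numbers, a rapidly decaying ‖χ_d‖₁ drives L^d_Ω down across layers, which is exactly the tradeoff noted above.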
Local properties: Covariance-Invariance

Theorem (Wiatowski et al., 2016)

Let $\{S_k\}_{k=1}^d$ be pooling factors. The features generated in the d-th network layer are translation-covariant according to

$$\Phi_\Omega^d(T_m f) = T_{\frac{m}{S_1 \cdots S_d}}\, \Phi_\Omega^d(f),$$

for all $f \in H_{N_1}$ and all $m \in \mathbb{Z}$ with $\frac{m}{S_1 \cdots S_d} \in \mathbb{Z}$.

Translation covariance on the signal grid induced by the pooling factors

In the absence of pooling, i.e., S_k = 1 for k ∈ {1, . . . , d}, we get translation covariance w.r.t. the fine grid the input signal f ∈ H_{N₁} lives on
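The covariance statement can be illustrated in one dimension: circular convolution with a filter, a modulus non-linearity, and sub-sampling by S together form one layer that is covariant to shifts m only when S divides m, and the output then shifts by m/S. A minimal numpy sketch with an arbitrary filter and signal:

```python
import numpy as np

def layer(f, g, S):
    # One network layer: circular convolution with filter g (via FFT),
    # modulus non-linearity, then sub-sampling by factor S.
    y = np.abs(np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)))
    return y[::S]

rng = np.random.default_rng(0)
N, S = 16, 2
f = rng.standard_normal(N)
g = rng.standard_normal(N)

m = 4                                  # a shift that is a multiple of S
lhs = layer(np.roll(f, m), g, S)       # Phi(T_m f)
rhs = np.roll(layer(f, g, S), m // S)  # T_{m/S} Phi(f)
assert np.allclose(lhs, rhs)           # covariance on the coarser grid
```

For m not divisible by S the two sides generally differ, which is the "signal grid induced by the pooling factors" in the remark above.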
Experiments
Experiments

The implementation in a nutshell

Filters: Tensorized wavelets

extract visual features w.r.t. 3 directions (horizontal, vertical, diagonal)

efficiently implemented using the algorithme à trous [Holschneider et al., 1989]

Non-linearities: Modulus, rectified linear unit, hyperbolic tangent, logistic sigmoid

Pooling: no pooling, sub-sampling, max-pooling, average-pooling

Output-generating atoms: Low-pass filters
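Tensorized (separable) wavelets apply a 1-D filter pair along rows and along columns; the mixed low/high combinations yield the three directional subbands. A minimal single-level Haar sketch of this idea, not the authors' à-trous implementation (direction-naming conventions vary across the literature):

```python
import numpy as np

lo = np.array([1.0, 1.0]) / np.sqrt(2)   # Haar low-pass
hi = np.array([1.0, -1.0]) / np.sqrt(2)  # Haar high-pass

def sep_filter(img, f_rows, f_cols):
    # Separable 2-D filtering: filter along each row with f_rows, then
    # along each column with f_cols ('valid' avoids boundary artifacts).
    tmp = np.apply_along_axis(lambda r: np.convolve(r, f_rows, 'valid'), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, f_cols, 'valid'), 0, tmp)

img = np.outer(np.arange(8.0), np.ones(8))  # intensity varies top-to-bottom only

band_lh = sep_filter(img, lo, hi)  # high-pass down columns: sees the variation
band_hl = sep_filter(img, hi, lo)  # high-pass along rows: nothing to detect here
band_hh = sep_filter(img, hi, hi)  # diagonal subband

assert np.sum(band_lh**2) > np.sum(band_hl**2)
assert np.allclose(band_hl, 0.0)
```

Applying modulus/ReLU/etc. pointwise to such subbands and pooling the result gives one layer of the network described above.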
Experiment: Handwritten digit classification

Dataset: MNIST database of handwritten digits [LeCun & Cortes, 1998]; 60,000 training and 10,000 test images

Setup for Φ_Ω: D = 3 layers; same filters, non-linearities, and pooling operators in all layers

Classifier: SVM with radial basis function kernel [Vapnik, 1995]

Dimensionality reduction: Supervised orthogonal least squares scheme [Chen et al., 1991]
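The supervised dimensionality-reduction step can be sketched as a greedy orthogonal least squares selection: repeatedly pick the feature that best explains the (one-hot) label matrix, then orthogonalize the remaining features against it. This is a generic sketch of the idea, not necessarily the exact scheme of Chen et al., and the toy data below is made up:

```python
import numpy as np

def ols_select(X, Y, k):
    # Greedy OLS feature selection: at each step choose the column of X
    # with the largest normalized correlation with Y, then deflate all
    # columns by the chosen (orthonormalized) direction.
    X = X.astype(float).copy()
    selected = []
    for _ in range(k):
        norms = np.sum(X**2, axis=0)
        norms[norms < 1e-12] = np.inf          # skip exhausted columns
        score = np.sum((X.T @ Y)**2, axis=1) / norms
        j = int(np.argmax(score))
        selected.append(j)
        q = X[:, j] / np.linalg.norm(X[:, j])
        X -= np.outer(q, q @ X)                # deflate all columns
    return selected

# Toy check: the targets depend only on (hypothetical) features 0 and 3.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 8))
Y = X[:, [0]] + 2.0 * X[:, [3]]
assert set(ols_select(X, Y, 2)) == {0, 3}
```

The selected feature subset is what would then be fed to the RBF-kernel SVM.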
Experiment: Handwritten digit classification

Classification error in percent:

|      | Haar abs | Haar ReLU | Haar tanh | Haar LogSig | Bi-orth. abs | Bi-orth. ReLU | Bi-orth. tanh | Bi-orth. LogSig |
|------|----------|-----------|-----------|-------------|--------------|---------------|---------------|-----------------|
| n.p. | 0.57 | 0.57 | 1.35 | 1.49 | 0.51 | 0.57 | 1.12 | 1.22 |
| sub. | 0.69 | 0.66 | 1.25 | 1.46 | 0.61 | 0.61 | 1.20 | 1.18 |
| max. | 0.58 | 0.65 | 0.75 | 0.74 | 0.52 | 0.64 | 0.78 | 0.73 |
| avg. | 0.55 | 0.60 | 1.27 | 1.35 | 0.58 | 0.59 | 1.07 | 1.26 |

modulus and ReLU perform better than tanh and LogSig

pooling results (S = 2) are competitive with those without pooling, at significantly lower computational cost

State-of-the-art: 0.43 [Bruna and Mallat, 2013]

similar feature extraction network with directional, but non-separable, wavelets and no pooling; significantly higher computational complexity
Experiment: Feature importance evaluation

Question: Which features are important in

handwritten digit classification?

detection of facial landmarks (eyes, nose, mouth) through regression?

Compare importance of features corresponding to (i) different layers, (ii) wavelet scales, and (iii) wavelet directions.
Experiment: Feature importance evaluation

Setup for Φ_Ω:

D = 4 layers; Haar wavelets with J = 3 scales and modulus non-linearity in every network layer

no pooling in the first layer, average pooling with uniform weights in the second and third layer (S = 2)
Experiment: Feature importance evaluation

Handwritten digit classification:

Dataset: MNIST database (10,000 training and 10,000 test images)

Random forest classifier [Breiman, 2001] with 30 trees

Feature importance: Gini importance [Breiman, 1984]

Facial landmark detection:

Dataset: Caltech Web Faces database (7092 images; 80% for training, 20% for testing)

Random forest regressor [Breiman, 2001] with 30 trees

Feature importance: Gini importance [Breiman, 1984]
Experiment: Feature importance evaluation

Average cumulative feature importance: Digit classification

[Bar plots for "digits" and "displaced digits": each triplet of bars [d/r] corresponds to horizontal (r = 0), vertical (r = 1), and diagonal (r = 2) features in layer d; bar shading distinguishes wavelet scales j = 0, 1, 2.]
Experiment: Feature importance evaluation

Average cumulative feature importance: Facial landmarks

[Bar plots for "left eye" and "nose": each triplet of bars [d/r] corresponds to horizontal (r = 0), vertical (r = 1), and diagonal (r = 2) features in layer d; bar shading distinguishes wavelet scales j = 0, 1, 2.]
Experiment: Feature importance evaluation

Average cumulative feature importance per layer:

|         | left eye | right eye | nose  | mouth | digits | disp. digits |
|---------|----------|-----------|-------|-------|--------|--------------|
| Layer 0 | 0.020 | 0.023 | 0.016 | 0.014 | 0.046 | 0.004 |
| Layer 1 | 0.629 | 0.646 | 0.576 | 0.490 | 0.426 | 0.094 |
| Layer 2 | 0.261 | 0.236 | 0.298 | 0.388 | 0.337 | 0.280 |
| Layer 3 | 0.090 | 0.095 | 0.110 | 0.108 | 0.192 | 0.622 |

Digit classification: Features in deeper layers have higher importance

⇒ exploit vertical reduction in translation / deformation sensitivity

Facial landmark detection: Features in shallower layers have higher importance, as they are translation-covariant on a finer grid

Given a particular machine learning task, it may be attractive to generate output in individual layers only!
Deep Frame Net
Open source software:
MATLAB: http://www.nari.ee.ethz.ch/commth/research
Python: Coming soon!
Thank you
“If you ask me anything I don’t know, I’m not going to answer.”
Y. Berra