TTIC 31230, Fundamentals of Deep Learning
David McAllester, Winter 2019
The Fundamental Equations of Deep Learning
Early History
1943: McCulloch and Pitts introduced the linear threshold “neuron”.
1962: Rosenblatt applies a “Hebbian” learning rule. Novikoff proved the perceptron convergence theorem.
1969: Minsky and Papert publish the book Perceptrons.
The Perceptrons book greatly discourages work in artificial neural networks. Symbolic methods dominate AI research through the 1970s.
80s Renaissance
1980: Fukushima introduces the neocognitron (a form of CNN)
1984: Valiant defines PAC learnability and stimulates learning theory. Wins the Turing Award in 2010.
1985: Hinton and Sejnowski introduce the Boltzmann machine.
1986: Rumelhart, Hinton and Williams demonstrate empirical success with backpropagation (itself dating back to 1961).
90s and 00s: Research In the Shadows
1997: Hochreiter and Schmidhuber introduce LSTMs.
1998: LeCun introduces convolutional neural networks (CNNs) (LeNet).
2003: Bengio introduces neural language modeling.
Current Era
2012: AlexNet dominates the ImageNet computer vision challenge.
Google speech recognition converts to deep learning.
Both developments come out of Hinton’s group in Toronto.
2013: Refinement of AlexNet continues to dramatically improve computer vision.
Current Era
2014: Neural machine translation appears (Seq2Seq models).
Variational auto-encoders (VAEs) appear.
Generative Adversarial Networks (GANs) appear.
Graph neural networks (GNNs) appear, revolutionizing the prediction of molecular properties.
Dramatic improvement in computer vision and speech recognition continues.
Current Era
2015: Google converts to neural machine translation, leading to dramatic improvements.
ResNets (residual connections) appear. This yields yet another dramatic improvement in computer vision.
2016: AlphaGo defeats Lee Sedol.
Current Era
2017: AlphaZero learns both Go and chess at superhuman levels in a matter of hours entirely from self-play and advances computer Go far beyond human abilities.
Unsupervised machine translation is demonstrated.
Progressive GANs demonstrate high-resolution realistic face generation.
Current Era
2018: Unsupervised pre-training significantly improves a broad range of NLP tasks, including question answering (but dialogue remains unsolved).
AlphaFold revolutionizes protein structure prediction.
2019: Vector-quantized VAEs (VQ-VAE) demonstrate that VAEs can be competitive with GANs for high-resolution image generation.
Super-human performance is achieved on the GLUE natural language understanding benchmark.
2019: Natural Language Understanding
GLUE: General Language Understanding Evaluation
ArXiv 1804.07461
BERT and GLUE
BERT and SuperGLUE
Generative Adversarial Nets (GANs)
Goodfellow et al., 2014
Moore’s Law of AI
ArXiv 1406.2661, 1511.06434, 1607.07536, 1710.10196, 1812.04948
Goodfellow, ICLR 2019 Invited Talk
GANs for ImageNet
BigGANs, Brock et al., 2018
Variational Autoencoders (VAEs, 2015)
[Alec Radford, 2015]
VAEs in 2019
VQ-VAE-2, Razavi et al., June 2019
What is a Deep Network?
VGG, Simonyan and Zisserman, 2014
Davi Frossard
138 Million Parameters
What is a Deep Network?
We assume some set $X$ of possible inputs, some set $Y$ of possible outputs, and a parameter vector $\Phi \in \mathbb{R}^d$.

For $\Phi \in \mathbb{R}^d$, $x \in X$, and $y \in Y$, a deep network computes a probability $P_\Phi(y|x)$.
The Fundamental Equation of Deep Learning
We assume a “population” probability distribution Pop on pairs $(x, y)$.

$$\Phi^* = \mathrm{argmin}_\Phi\; E_{(x,y)\sim\mathrm{Pop}}\left[-\ln P_\Phi(y|x)\right]$$

This loss function $L(x, y, \Phi) = -\ln P_\Phi(y|x)$ is called cross-entropy loss.
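As a concrete illustration (a minimal numpy sketch, not from the slides), the expectation over Pop is replaced in practice by an average over sampled pairs, and the loss for each pair is the negative log-probability the model assigns to the observed label. The function and array names below are illustrative.

```python
import numpy as np

# Minimal sketch: average cross-entropy loss -ln P_Phi(y|x) over a batch of
# (x, y) pairs, the empirical stand-in for the expectation over Pop.
def cross_entropy_loss(probs, labels):
    """probs: (N, k) model probabilities P_Phi(y|x) for each class;
    labels: (N,) indices of the observed classes y."""
    return float(np.mean(-np.log(probs[np.arange(len(labels)), labels])))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
print(cross_entropy_loss(probs, labels))  # average of -ln 0.7 and -ln 0.8
```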
A Second Fundamental Equation
Softmax: Converting Scores to Probabilities
We start from a “score” function $s_\Phi(y|x) \in \mathbb{R}$.

$$P_\Phi(y|x) = \frac{1}{Z}\, e^{s_\Phi(y|x)}, \qquad Z = \sum_{y} e^{s_\Phi(y|x)}$$

$$P_\Phi(y|x) = \mathrm{softmax}_y\; s_\Phi(y|x)$$
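A minimal numpy sketch of this conversion from scores to probabilities (not from the slides; subtracting the maximum score is a standard numerical-stability trick that cancels in the normalization):

```python
import numpy as np

# Softmax: exponentiate the scores s_Phi(y|x) and normalize by Z.
def softmax(scores):
    shifted = scores - np.max(scores)   # stability; does not change the result
    exp = np.exp(shifted)
    return exp / np.sum(exp)

scores = np.array([2.0, 1.0, -1.0])
print(softmax(scores))        # probabilities over the possible y
print(softmax(scores).sum())  # 1.0
```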
Note the Final Softmax Layer
Davi Frossard
How Many Possibilities
We have $y \in Y$ where $Y$ is some set of “possibilities”.

Binary: $Y = \{-1, 1\}$
Multiclass: $Y = \{y_1, \ldots, y_k\}$ with $k$ manageable.
Structured: $y$ is a “structured object” like a sentence. Here $|Y|$ is unmanageable.
Binary Classification
We have a population distribution over $(x, y)$ with $y \in \{-1, 1\}$.

We compute a single score $s_\Phi(x)$ where

for $s_\Phi(x) \geq 0$ predict $y = 1$
for $s_\Phi(x) < 0$ predict $y = -1$.
Softmax for Binary Classification
$$P_\Phi(y|x) = \frac{1}{Z}\, e^{y\,s(x)} = \frac{e^{y\,s(x)}}{e^{y\,s(x)} + e^{-y\,s(x)}} = \frac{1}{1 + e^{-2y\,s(x)}} = \frac{1}{1 + e^{-m(y|x)}}$$

where $m(y|x) = 2y\,s(x)$ is the margin.
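A quick numerical check (a sketch, names illustrative) that the two-class softmax over scores $\{+s(x), -s(x)\}$ agrees with the logistic sigmoid of the margin $m(y|x) = 2y\,s(x)$:

```python
import numpy as np

# Two-class softmax versus sigmoid of the margin m = 2*y*s(x).
def two_class_softmax(s, y):
    return np.exp(y * s) / (np.exp(y * s) + np.exp(-y * s))

def sigmoid(m):
    return 1.0 / (1.0 + np.exp(-m))

s, y = 1.3, -1
print(two_class_softmax(s, y), sigmoid(2 * y * s))  # identical values
```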
Logistic Regression for Binary Classification
$$\begin{aligned}
\Phi^* &= \mathrm{argmin}_\Phi\; E_{(x,y)\sim\mathrm{Pop}}\; L(x, y, \Phi) \\
&= \mathrm{argmin}_\Phi\; E_{(x,y)\sim\mathrm{Pop}}\left[-\ln P_\Phi(y|x)\right] \\
&= \mathrm{argmin}_\Phi\; E_{(x,y)\sim\mathrm{Pop}}\; \ln\!\left(1 + e^{-m(y|x)}\right)
\end{aligned}$$

$$\ln\!\left(1 + e^{-m(y|x)}\right) \approx 0 \quad \text{for } m(y|x) \gg 1$$

$$\ln\!\left(1 + e^{-m(y|x)}\right) \approx -m(y|x) \quad \text{for } -m(y|x) \gg 1$$
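The two approximations can be checked numerically; the sketch below (not from the slides) evaluates the log loss with `np.logaddexp` for stability and compares it with $\max(0, -m)$:

```python
import numpy as np

# Log loss ln(1 + exp(-m)) and its asymptotes: ~0 for large positive margin,
# ~-m for large negative margin.
def log_loss(m):
    return np.logaddexp(0.0, -m)   # stable ln(1 + exp(-m))

for m in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"m={m:6.1f}  log_loss={log_loss(m):8.4f}  max(0,-m)={max(0.0, -m):5.1f}")
```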
Log Loss vs. Hinge Loss (SVM loss)
Image Classification (Multiclass Classification)
We have a population distribution over $(x, y)$ with $y \in \{y_1, \ldots, y_k\}$.

$$P_\Phi(y|x) = \mathrm{softmax}_y\; s_\Phi(y|x)$$

$$\begin{aligned}
\Phi^* &= \mathrm{argmin}_\Phi\; E_{(x,y)\sim\mathrm{Pop}}\; L(x, y, \Phi) \\
&= \mathrm{argmin}_\Phi\; E_{(x,y)\sim\mathrm{Pop}}\left[-\ln P_\Phi(y|x)\right]
\end{aligned}$$
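A minimal numpy sketch (names illustrative) combining the two equations: softmax over per-class scores followed by the average negative log-probability of the observed class over a batch:

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def multiclass_cross_entropy(scores, labels):
    probs = softmax(scores)                       # P_Phi(y|x) for every class
    return float(np.mean(-np.log(probs[np.arange(len(labels)), labels])))

scores = np.array([[3.0, 0.5, -1.0],
                   [0.2, 2.2,  0.1]])   # batch of 2 examples, 3 classes
labels = np.array([0, 1])
print(multiclass_cross_entropy(scores, labels))
```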
Machine Translation (Structured Labeling)

We have a population of translation pairs $(x, y)$ with $x \in V_x^*$ and $y \in V_y^*$, where $V_x$ and $V_y$ are the source and target vocabularies respectively.

$$P_\Phi(w_{t+1} \mid x, w_1, \ldots, w_t) = \mathrm{softmax}_{w \in V_y \cup \{\text{<EOS>}\}}\; s_\Phi(w \mid x, w_1, \ldots, w_t)$$

$$P_\Phi(y|x) = \prod_{t=0}^{|y|} P_\Phi(y_{t+1} \mid x, y_1, \ldots, y_t)$$

$$\begin{aligned}
\Phi^* &= \mathrm{argmin}_\Phi\; E_{(x,y)\sim\mathrm{Pop}}\; L(x, y, \Phi) \\
&= \mathrm{argmin}_\Phi\; E_{(x,y)\sim\mathrm{Pop}}\left[-\ln P_\Phi(y|x)\right]
\end{aligned}$$
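A sketch of the autoregressive factorization (not from the slides): the log-probability of a whole target sequence is the sum of per-step log-probabilities, each a softmax over the target vocabulary plus <EOS>. Here `step_scores` is a hypothetical stand-in for whatever decoder produces $s_\Phi(w \mid x, w_1, \ldots, w_t)$.

```python
import numpy as np

def sequence_log_prob(step_scores, target_ids):
    """step_scores: (T, V) scores for each decoding step over V_y plus <EOS>;
    target_ids: (T,) indices of the actual next tokens, the last being <EOS>."""
    total = 0.0
    for scores, w in zip(step_scores, target_ids):
        shifted = scores - scores.max()
        log_probs = shifted - np.log(np.exp(shifted).sum())  # log softmax
        total += log_probs[w]
    return total   # training minimizes the negative of this, averaged over pairs

step_scores = np.random.randn(4, 6)   # 4 decoding steps, vocab of 6 (incl. <EOS>)
target_ids = np.array([2, 0, 5, 3])   # hypothetical token indices
print(sequence_log_prob(step_scores, target_ids))
```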
Fundamental Equation: Unconditional Form
$$\Phi^* = \mathrm{argmin}_\Phi\; E_{y\sim\mathrm{Pop}}\left[-\ln P_\Phi(y)\right]$$
Entropy of a Distribution
The entropy of a distribution P is defined by
$$H(P) = E_{y\sim P}\left[-\ln P(y)\right] \quad \text{in units of ``nats''}$$

$$H_2(P) = E_{y\sim P}\left[-\log_2 P(y)\right] \quad \text{in units of bits}$$

Example: Let $Q$ be a uniform distribution on 256 values.

$$E_{y\sim Q}\left[-\log_2 Q(y)\right] = -\log_2 \frac{1}{256} = \log_2 256 = 8 \text{ bits} = 1 \text{ byte}$$

$$1 \text{ nat} = \frac{1}{\ln 2} \text{ bits} \approx 1.44 \text{ bits}$$
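A small numpy sketch reproducing the example (it assumes strictly positive probabilities):

```python
import numpy as np

def entropy_nats(p):
    p = np.asarray(p)
    return float(np.sum(-p * np.log(p)))       # H(P) in nats

def entropy_bits(p):
    return entropy_nats(p) / np.log(2)          # 1 nat = 1/ln 2 ≈ 1.44 bits

q = np.full(256, 1.0 / 256)                     # uniform on 256 values
print(entropy_bits(q))                          # 8.0 bits = 1 byte
print(entropy_nats(q))                          # ln 256 ≈ 5.55 nats
```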
The Coding Interpretation of Entropy
We can interpret $H_2(Q)$ as the number of bits required on average to represent items drawn from distribution $Q$.

We want to use fewer bits for common items.

There exists a representation where, for all $y$, the number of bits used to represent $y$ is no larger than $-\log_2 Q(y) + 1$ (Shannon's source coding theorem).

$$H_2(Q) = \frac{1}{\ln 2}\, H(Q) \approx 1.44\, H(Q)$$
Cross Entropy
Let $P$ and $Q$ be two distributions on the same set.

$$H(P, Q) = E_{y\sim P}\left[-\ln Q(y)\right]$$

$$\Phi^* = \mathrm{argmin}_\Phi\; H(\mathrm{Pop}, P_\Phi)$$

$H(P, Q)$ also has a data compression interpretation: $1.44\, H(P, Q)$ can be interpreted as the number of bits used to code draws from $P$ when using the imperfect code defined by $Q$.
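A numpy sketch of $H(P, Q)$ for two discrete distributions (illustrative values), showing that coding with the wrong distribution $Q$ costs more than $H(P)$:

```python
import numpy as np

def cross_entropy(p, q):
    p, q = np.asarray(p), np.asarray(q)
    return float(np.sum(-p * np.log(q)))        # E_{y~P}[-ln Q(y)] in nats

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.4, 0.4, 0.2])
print(cross_entropy(p, q))                      # H(P, Q)
print(cross_entropy(p, q) / np.log(2))          # average code length in bits under Q
print(cross_entropy(p, p))                      # H(P), the minimum over Q
```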
Entropy, Cross Entropy and KL Divergence
Let $P$ and $Q$ be two distributions on the same set.

$$\text{Entropy:} \quad H(P) = E_{y\sim P}\left[-\ln P(y)\right]$$

$$\text{Cross entropy:} \quad H(P, Q) = E_{y\sim P}\left[-\ln Q(y)\right]$$

$$\text{KL divergence:} \quad KL(P, Q) = H(P, Q) - H(P) = E_{y\sim P}\, \ln \frac{P(y)}{Q(y)}$$

We have $H(P, Q) \geq H(P)$, or equivalently $KL(P, Q) \geq 0$.
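The identity and the inequality can be checked numerically; a sketch with illustrative distributions:

```python
import numpy as np

def H(p, q=None):
    p = np.asarray(p)
    q = p if q is None else np.asarray(q)
    return float(np.sum(-p * np.log(q)))        # H(P) when q is None, else H(P, Q)

def KL(p, q):
    return H(p, q) - H(p)

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])
print(H(p), H(p, q), KL(p, q))   # H(P) <= H(P,Q), and KL(P,Q) >= 0
print(KL(p, p))                  # 0.0 when the two distributions agree
```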
The Universality Assumption
$$\Phi^* = \mathrm{argmin}_\Phi\; H(\mathrm{Pop}, P_\Phi) = \mathrm{argmin}_\Phi\; \left[H(\mathrm{Pop}) + KL(\mathrm{Pop}, P_\Phi)\right]$$

Universality assumption: $P_\Phi$ can represent any distribution and $\Phi$ can be fully optimized.

This is clearly false for deep networks. But it gives important insights like:

$$P_{\Phi^*} = \mathrm{Pop}$$

This is the motivation for the fundamental equation.
Asymmetry of Cross Entropy
Consider
$$\Phi^* = \mathrm{argmin}_\Phi\; H(P, Q_\Phi) \qquad (1)$$

$$\Phi^* = \mathrm{argmin}_\Phi\; H(Q_\Phi, P) \qquad (2)$$

For (1), $Q_\Phi$ must cover all of the support of $P$.

For (2), $Q_\Phi$ concentrates all mass on the point maximizing $P$.
Asymmetry of KL Divergence

Consider

$$\Phi^* = \mathrm{argmin}_\Phi\; KL(P, Q_\Phi) = \mathrm{argmin}_\Phi\; H(P, Q_\Phi) \qquad (1)$$

$$\Phi^* = \mathrm{argmin}_\Phi\; KL(Q_\Phi, P) = \mathrm{argmin}_\Phi\; \left[H(Q_\Phi, P) - H(Q_\Phi)\right] \qquad (2)$$

If $Q_\Phi$ is not universally expressive, (1) still forces $Q_\Phi$ to cover all of $P$ (or else the KL divergence is infinite), while (2) allows $Q_\Phi$ to be restricted to a single mode of $P$ (a common outcome).
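A small numerical sketch of this asymmetry (not from the slides): fit a single Gaussian $Q$ to a bimodal mixture $P$ by grid search, once minimizing $KL(P, Q_\Phi)$ and once minimizing $KL(Q_\Phi, P)$. The particular densities and grids are illustrative assumptions.

```python
import numpy as np

xs = np.linspace(-10, 10, 2001)
dx = xs[1] - xs[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p = 0.5 * gauss(xs, -3.0, 1.0) + 0.5 * gauss(xs, 3.0, 1.0)   # bimodal target P

def kl(a, b):
    a, b = np.maximum(a, 1e-300), np.maximum(b, 1e-300)      # avoid log(0)
    return float(np.sum(a * np.log(a / b)) * dx)             # Riemann-sum KL

best_fwd, best_rev = None, None
for mu in np.linspace(-4, 4, 41):
    for sigma in np.linspace(0.5, 5.0, 46):
        q = gauss(xs, mu, sigma)
        fwd, rev = kl(p, q), kl(q, p)
        if best_fwd is None or fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sigma)
        if best_rev is None or rev < best_rev[0]:
            best_rev = (rev, mu, sigma)

print("argmin KL(P,Q):", best_fwd)  # broad Q covering both modes (mu ~ 0)
print("argmin KL(Q,P):", best_rev)  # narrow Q sitting on a single mode (mu ~ +/-3)
```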
Proving KL(P,Q) ≥ 0: Jensen’s Inequality
For $f$ convex (upward curving) we have

$$E[f(x)] \geq f(E[x])$$
Proving KL(P,Q) ≥ 0
$$\begin{aligned}
KL(P, Q) &= E_{y\sim P}\left[-\log \frac{Q(y)}{P(y)}\right] \\
&\geq -\log\, E_{y\sim P}\left[\frac{Q(y)}{P(y)}\right] \\
&= -\log \sum_y P(y)\, \frac{Q(y)}{P(y)} \\
&= -\log \sum_y Q(y) \\
&= 0
\end{aligned}$$
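A quick numerical sanity check of the Jensen step (a sketch with random distributions): for the convex $f(t) = -\log t$, $E_{y\sim P}\!\left[f\!\left(\frac{Q(y)}{P(y)}\right)\right] \geq f\!\left(E_{y\sim P}\!\left[\frac{Q(y)}{P(y)}\right]\right) = -\log 1 = 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random(5); p /= p.sum()
q = rng.random(5); q /= q.sum()

ratio = q / p
lhs = float(np.sum(p * -np.log(ratio)))   # E_{y~P}[-log Q(y)/P(y)] = KL(P, Q)
rhs = float(-np.log(np.sum(p * ratio)))   # -log E_{y~P}[Q(y)/P(y)] = -log 1 = 0
print(lhs, rhs, lhs >= rhs)               # Jensen: KL(P, Q) >= 0
```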
Summary
$$\Phi^* = \mathrm{argmin}_\Phi\; H(\mathrm{Pop}, P_\Phi) \qquad \text{(unconditional)}$$

$$\Phi^* = \mathrm{argmin}_\Phi\; E_{x\sim\mathrm{Pop}}\; H(\mathrm{Pop}(y|x),\, P_\Phi(y|x)) \qquad \text{(conditional)}$$

$$\text{Entropy:} \quad H(P) = E_{y\sim P}\left[-\ln P(y)\right]$$

$$\text{Cross entropy:} \quad H(P, Q) = E_{y\sim P}\left[-\ln Q(y)\right]$$

$$\text{KL divergence:} \quad KL(P, Q) = H(P, Q) - H(P) = E_{y\sim P}\, \ln \frac{P(y)}{Q(y)}$$

$$H(P, Q) \geq H(P), \qquad KL(P, Q) \geq 0, \qquad \mathrm{argmin}_Q\, H(P, Q) = P$$
Appendix: The Rearrangement Trick
$$\begin{aligned}
KL(P, Q) &= E_{x\sim P}\, \ln \frac{P(x)}{Q(x)} \\
&= E_{x\sim P}\left[(-\ln Q(x)) - (-\ln P(x))\right] \\
&= \left(E_{x\sim P}\left[-\ln Q(x)\right]\right) - \left(E_{x\sim P}\left[-\ln P(x)\right]\right) \\
&= H(P, Q) - H(P)
\end{aligned}$$

In general, $E_{x\sim P}\, \ln \left(\prod_i A_i\right) = E_{x\sim P} \sum_i \ln A_i$.
Appendix: The Rearrangement Trick
$$\begin{aligned}
\mathrm{ELBO} &= E_{z\sim P_\Psi(z|y)}\, \ln \frac{P_\Phi(z, y)}{P_\Psi(z|y)} \\
&= E_{z\sim P_\Psi(z|y)}\, \ln \frac{P_\Phi(z)\, P_\Phi(y|z)}{P_\Psi(z|y)} \\
&= E_{z\sim P_\Psi(z|y)}\, \ln \frac{P_\Phi(y)\, P_\Phi(z|y)}{P_\Psi(z|y)}
\end{aligned}$$

Each of the last two expressions can be grouped three different ways, leading to six ways of writing the ELBO.
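For example (an illustration, not spelled out on the slides), grouping the second expression as a reconstruction term minus a KL term, and the third as the evidence minus a posterior KL term, gives two of those six forms:

$$\mathrm{ELBO} = E_{z\sim P_\Psi(z|y)}\left[\ln P_\Phi(y|z)\right] - KL\!\left(P_\Psi(z|y),\, P_\Phi(z)\right)$$

$$\mathrm{ELBO} = \ln P_\Phi(y) - KL\!\left(P_\Psi(z|y),\, P_\Phi(z|y)\right)$$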
END