Autoencoders and Representation Learning
TRANSCRIPT
Autoencoders and Representation Learning
Deep Learning Decal, hosted by Machine Learning at Berkeley
Overview
Agenda
Background
Autoencoders
Regularized Autoencoders
Representation Learning
Representation Learning Techniques
Questions
Background
Review: Typical Neural Net Characteristics
So far, deep learning models have had several things in common:
• Input layer: a (possibly vectorized) quantitative representation of the data
• Hidden layer(s): apply transformations with nonlinearities
• Output layer: the result for classification, regression, translation, segmentation, etc.
• Models are used for supervised learning
Example Through Diagram
Changing the Objective
Today’s lecture: unsupervised learning with neural networks.
Autoencoders
Autoencoders: Definition
Autoencoders are neural networks trained to copy their inputs to their outputs.
• They are usually constrained in particular ways to make this task more difficult.
• The structure is almost always organized into an encoder network f and a decoder network g: model = g(f(x))
• Trained by gradient descent with a reconstruction loss that measures the difference between input and output, e.g. MSE: J(θ) = ||g(f(x)) − x||²
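The definition above can be sketched in a few lines of numpy. This is a hypothetical toy setup (names like `W_enc`, `W_dec` and the layer sizes are illustrative, not from the lecture): an encoder f, a decoder g, and the MSE reconstruction loss J(θ).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 samples, 8 features
X = rng.normal(size=(100, 8))

# Encoder f: 8 -> 3, decoder g: 3 -> 8 (weights here are arbitrary, untrained)
W_enc = rng.normal(scale=0.1, size=(8, 3))
W_dec = rng.normal(scale=0.1, size=(3, 8))

def f(x):
    """Encoder: map input to latent code h."""
    return np.tanh(x @ W_enc)

def g(h):
    """Decoder: map latent code back to input space."""
    return h @ W_dec

def reconstruction_loss(x):
    """MSE reconstruction loss J = mean ||g(f(x)) - x||^2."""
    return np.mean(np.sum((g(f(x)) - x) ** 2, axis=1))

print(reconstruction_loss(X))
```

Training would consist of running gradient descent on `W_enc` and `W_dec` to minimize this loss.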
Not an Entirely New Idea
Undercomplete Autoencoders
Undercomplete autoencoders are defined to have a hidden layer h with smaller dimension than the input layer.
• The network must model x in a lower-dimensional space and map the latent space accurately back to the input space.
• The encoder network is a function that returns a useful, compressed representation of the input.
• If the network has only linear transformations, the encoder learns PCA. With typical nonlinearities, the network learns a generalized, more powerful version of PCA.
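The PCA connection can be illustrated numerically (a sketch assuming squared-error loss, linear layers, and centered data): by the Eckart-Young theorem, the best rank-k linear reconstruction of the data is exactly the truncated-SVD / PCA projection, so an optimal linear autoencoder with a k-dimensional bottleneck can do no better than this.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
X = X - X.mean(axis=0)           # center, as PCA does

k = 3                            # latent dimension (undercomplete: k < 10)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T                   # top-k principal directions

H = X @ V_k                      # "encoder": project to k dims
X_hat = H @ V_k.T                # "decoder": map back to input space

# Residual squared error equals the energy in the discarded singular values
err = np.sum((X - X_hat) ** 2)
print(err, np.sum(s[k:] ** 2))   # the two quantities match
```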
Visualizing Undercomplete Autoencoders
Caveats and Dangers
Unless we are careful, autoencoders will not learn meaningful representations.
• The reconstruction loss is indifferent to the characteristics of the latent space (not true for PCA).
• Higher representational power gives the flexibility to learn suboptimal encodings.
• Pathological case: the hidden layer is only one dimension and the network learns index mappings: x⁽ⁱ⁾ → i → x⁽ⁱ⁾
• Not very realistic, but entirely possible in principle.
How Constraints Correspond to Effective Manifold Learning
We need to impose additional constraints beyond the reconstruction loss to learn manifolds.
• Data manifold → a region of concentrated high probability for the training data.
• Constraining complexity or imposing regularization promotes learning a more defined "surface" and the variations that shape the manifold.
• → Autoencoders should learn only the variations necessary to reconstruct the training examples.
Visualizing Manifolds
Extract a 2D manifold from data that lives in 3D:
Regularized Autoencoders
Stochastic Autoencoders
Rethink the underlying idea of autoencoders. Instead of encoding/decoding functions, we can view them as describing encoding/decoding probability distributions:
p_encoder(h | x) = p_model(h | x)
p_decoder(x | h) = p_model(x | h)
These distributions are called stochastic encoders and decoders, respectively.
Distribution View of Autoencoders
Consider the stochastic decoder g(h) as a generative model and its relationship to the joint distribution:
p_model(x, h) = p_model(h) · p_model(x | h)
ln p_model(x, h) = ln p_model(h) + ln p_model(x | h)
• If h is given by the encoding network, we want to output the most likely x.
• Finding the MLE of (x, h) ≈ maximizing p_model(x, h).
• p_model(h) is a prior over latent space values. This term can act as a regularizer.
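As a small numerical check of the factorization above (with hypothetical unit-variance Gaussian choices for both the prior and the decoder distribution, chosen here purely for illustration), the log-joint splits into the prior term and the likelihood term:

```python
import numpy as np

def log_gauss(z, mu=0.0, var=1.0):
    """Log density of a scalar Gaussian N(mu, var)."""
    return -0.5 * (np.log(2 * np.pi * var) + (z - mu) ** 2 / var)

h, x = 0.7, -1.2
log_prior = log_gauss(h)                 # ln p_model(h), prior term
log_likelihood = log_gauss(x, mu=h)      # ln p_model(x | h), decoder mean = h
log_joint = log_prior + log_likelihood   # ln p_model(x, h)
print(log_joint)
```

Maximizing the joint over x trades off the likelihood against the prior, which is where the regularizing effect of p_model(h) comes from.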
Meaning of Generative
By assuming a prior over the latent space, we can sample values from the underlying probability distribution!
Sparse Autoencoders
Sparse autoencoders have a modified loss function with a sparsity penalty on the latent variables: J(θ) = L(x, g(f(x))) + Ω(h)
• L1 regularization as an example: assume a Laplacian prior on the latent space variables:
p_model(h_i) = (λ/2) e^(−λ|h_i|)
The negative log likelihood becomes:
−ln p_model(h) = λ Σ_i |h_i| + const. = Ω(h)
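A sketch of that objective in numpy (the numbers and names here are hypothetical): the L1 penalty λ Σ|h_i| added to the reconstruction loss is exactly the negative log of the Laplacian prior, up to a constant.

```python
import numpy as np

lam = 0.5                                 # Laplacian rate / L1 strength

def sparsity_penalty(h):
    """Omega(h) = lambda * sum_i |h_i| (negative Laplace log-prior, minus const.)."""
    return lam * np.sum(np.abs(h))

def sparse_loss(x, x_hat, h):
    """J = squared reconstruction error + sparsity penalty on the latent code h."""
    return np.sum((x_hat - x) ** 2) + sparsity_penalty(h)

x = np.array([1.0, 2.0])
x_hat = np.array([0.9, 2.1])
h = np.array([0.0, 0.4, -0.2])            # a mostly-sparse code
print(sparse_loss(x, x_hat, h))           # 0.02 + 0.5 * 0.6 = 0.32
```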
Variational Autoencoders
Idea: allocate space for storing the parameters of a probability distribution.
• Latent space variables for the mean and standard deviation of a distribution
• Flow: input → encode to statistics vectors → sample a latent vector → decode for reconstruction
• Loss: reconstruction + KL divergence
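The sampling step in the flow above is usually implemented with the reparameterization trick, sketched below with hypothetical encoder outputs (the vectors `mu` and `log_var` stand in for what the encoder would produce): writing z = μ + σ·ε keeps the sampling path differentiable with respect to the statistics vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs: statistics vectors for a 3-dim latent space
mu = np.array([0.5, -1.0, 0.0])          # mean vector
log_var = np.array([0.0, -0.5, 0.2])     # log-variance (logs for numerical stability)

# Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
# Gradients flow through mu and sigma; the randomness is isolated in eps.
eps = rng.normal(size=mu.shape)
z = mu + np.exp(0.5 * log_var) * eps     # latent vector fed to the decoder
print(z)
```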
Visualizing Variational Autoencoders
The latent space explicitly encodes distribution statistics! It is typically made to encode a unit Gaussian.
![Page 48: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/48.jpg)
K-L Divergence
The variational autoencoder loss also needs a KL divergence term, which measures the difference between two distributions.
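For the usual choice of a diagonal Gaussian posterior measured against a unit Gaussian prior, the KL term has a well-known closed form, sketched here: KL(N(μ, σ²) || N(0, I)) = ½ Σ(σ² + μ² − 1 − ln σ²). It is zero exactly when the posterior already matches the prior.

```python
import numpy as np

def kl_to_unit_gaussian(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) )
       = 0.5 * sum( sigma^2 + mu^2 - 1 - log sigma^2 )."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

# Zero when the posterior is already the unit Gaussian
print(kl_to_unit_gaussian(np.zeros(3), np.zeros(3)))          # 0.0
# Shifting the mean by 1 costs 0.5 nats
print(kl_to_unit_gaussian(np.array([1.0]), np.array([0.0])))  # 0.5
```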
Denoising Autoencoders
Sparse autoencoders were motivated by a particular purpose (generative modeling). Denoising autoencoders are useful for... denoising.
• For every input x, we apply a corrupting function C(·) to create a noisy version: x̃ = C(x).
• The loss function changes: J(x, g(f(x))) → J(x, g(f(x̃))).
• f and g must then learn p_data(x), because learning the identity function will no longer give a good loss.
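A minimal sketch of the setup (additive Gaussian noise is one common choice for C(·); the function names are illustrative): the loss compares the reconstruction of the corrupted input against the clean x, so the identity function no longer achieves zero loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, noise_std=0.3):
    """Corrupting function C(.): here, additive Gaussian noise."""
    return x + rng.normal(scale=noise_std, size=x.shape)

def denoising_loss(x, reconstruct):
    """J(x, g(f(x_tilde))): reconstruct the CLEAN x from the corrupted input."""
    x_tilde = corrupt(x)
    return np.mean((reconstruct(x_tilde) - x) ** 2)

x = np.ones(100)
# The identity map just reproduces the injected noise, so its loss
# stays near noise_std^2 instead of reaching zero.
print(denoising_loss(x, lambda v: v))
```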
Visualizing Denoising Autoencoders
By having to remove noise, the model must learn the difference between the noise and the actual image.
Visualizing Denoising Autoencoders
The corrupting function C(·) can corrupt in any direction → the autoencoder must learn the "location" of the data manifold and its distribution p_data(x).
Contractive Autoencoders
Contractive autoencoders are explicitly encouraged to learn a manifold through their loss function.
Desirable property: points close to each other in the input space stay close in the latent space.
• This will be true if f(x) = h is continuous and has small derivatives.
• We can use the Frobenius norm of the Jacobian matrix as a regularization term:
Ω(f, x) = λ ||∂f(x)/∂x||²_F
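The penalty can be computed in closed form for simple encoders. Below is a sketch for a hypothetical one-layer tanh encoder (the weights `W` and λ are illustrative): for h = tanh(xW), the Jacobian is ∂h_i/∂x_j = (1 − h_i²) W_ji, and Ω is λ times the sum of its squared entries.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4, 3))   # hypothetical encoder weights: 4 -> 3
lam = 0.1

def f(x):
    """Encoder h = tanh(x @ W)."""
    return np.tanh(x @ W)

def contractive_penalty(x):
    """Omega(f, x) = lam * || df/dx ||_F^2 for the tanh encoder above.
       For h = tanh(x W), dh_i/dx_j = (1 - h_i^2) * W[j, i]."""
    h = f(x)
    J = (1.0 - h ** 2)[:, None] * W.T    # Jacobian, shape (3, 4)
    return lam * np.sum(J ** 2)

x = rng.normal(size=4)
print(contractive_penalty(x))
```

Adding this term to the reconstruction loss pushes the encoder's derivatives toward zero except along the directions the data actually varies in.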
Jacobian and Frobenius Norm
The Jacobian matrix for a vector-valued function f(x):

J = [ ∂f₁/∂x₁  ∂f₁/∂x₂  ...  ∂f₁/∂xₙ ]
    [ ∂f₂/∂x₁  ∂f₂/∂x₂  ...  ∂f₂/∂xₙ ]
    [    ...      ...    ...     ...  ]
    [ ∂fₙ/∂x₁  ∂fₙ/∂x₂  ...  ∂fₙ/∂xₙ ]

The Frobenius norm for a matrix M:

||M||_F = √( Σ_{i,j} M_ij² )
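Both definitions translate directly into numpy (a small self-contained sketch; the central-difference step size is an arbitrary choice). The sanity check uses a linear map f(x) = Ax, whose Jacobian is A itself.

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    """Numerical Jacobian J[i, j] = df_i/dx_j via central differences."""
    x = np.asarray(x, dtype=float)
    fx = np.asarray(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        J[:, j] = (f(x + step) - f(x - step)) / (2 * eps)
    return J

def frobenius_norm(M):
    """||M||_F = sqrt( sum_ij M_ij^2 )."""
    return np.sqrt(np.sum(np.asarray(M) ** 2))

# Sanity check: for f(x) = A x, the Jacobian is A
A = np.array([[1.0, 2.0], [3.0, 4.0]])
J = jacobian(lambda v: A @ v, np.array([0.5, -0.5]))
print(np.round(J, 6))        # recovers A
print(frobenius_norm(A))     # sqrt(1 + 4 + 9 + 16) = sqrt(30)
```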
In-Depth Look at Contractive Autoencoders
Called contractive because they contract a neighborhood of the input space into a smaller, localized group in the latent space.
• This contractive effect is designed to occur only locally.
• Most of the Jacobian matrix's eigenvalues drop below 1 → contracted directions.
• But some directions will have eigenvalues (significantly) above 1 → directions that explain most of the variance in the data.
Example: MNIST in a 2D Manifold
The Big Idea of Regularized Autoencoders
The previous slides underscore the central balance of regularized
autoencoders:
• Be sensitive to inputs (reconstruction loss) → generate good
reconstructions of data drawn from the data distribution
• Be insensitive to inputs (regularization penalty) → learn the
actual data distribution
27
![Page 70: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/70.jpg)
Connecting Denoising and Contractive Autoencoders
Alain and Bengio (2013) showed that the denoising penalty with tiny
Gaussian noise is, in the limit, ≈ a contractive penalty on the
reconstruction function g(f(x)).
• Denoising autoencoders make the reconstruction function resist
small, finite-sized perturbations of the input.
• Contractive autoencoders make the feature encoding function
resist infinitesimal perturbations of the input.
28
![Page 73: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/73.jpg)
Connecting Denoising and Contractive Autoencoders
Handling noise ∼ Contractive property
29
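A quick Monte Carlo check of the Alain–Bengio relationship: for small Gaussian noise with variance σ², E‖r(x+ε) − x‖² ≈ ‖r(x) − x‖² + σ²‖J_r‖²_F, where J_r is the Jacobian of the reconstruction r = g∘f. The sketch below assumes a toy *linear* reconstruction, for which the identity holds exactly rather than approximately; all weights are random illustrative values:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
A = rng.normal(size=(d, d)) * 0.5      # Jacobian of the (linear) reconstruction
c = rng.normal(size=d) * 0.1

def r(x):
    """Toy linear reconstruction r(x) = g(f(x)); linearity makes the
    denoising <-> contractive relationship exact, not just a limit."""
    return A @ x + c

x = rng.normal(size=d)
sigma = 0.1
eps = rng.normal(size=(50_000, d)) * sigma

# Monte Carlo estimate of the denoising objective E||r(x + eps) - x||^2;
# for a linear map, r(x + e) = r(x) + A e
denoising = float(np.mean(np.sum((r(x) + eps @ A.T - x) ** 2, axis=1)))

# reconstruction loss + sigma^2 * (contractive penalty on r)
contractive = float(np.sum((r(x) - x) ** 2) + sigma**2 * np.sum(A**2))

print(denoising, contractive)
```

The two numbers agree up to Monte Carlo error, which is the sense in which "handling noise ∼ contractive property."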
![Page 74: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/74.jpg)
Representational Power, Layer Size and Depth
Deeper autoencoders tend to generalize better and train more
efficiently than shallow ones.
• Common strategy: greedily pre-train layers and stack them.
• For contractive autoencoders, calculating the Jacobian for deep
networks is expensive, so it is a good idea to work layer by layer.
30
![Page 77: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/77.jpg)
Applications of Autoencoders
• Dimensionality Reduction: Make a high-quality,
low-dimensional representation of the data
• Information Retrieval: Locate a value in a database whose key
is just the autoencoded query.
• If you need binary codes for a hash table, use a sigmoid in the final layer.
31
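The retrieval point can be sketched as semantic hashing in a few lines. The encoder here is a hypothetical, untrained sigmoid layer standing in for a trained one: its outputs in (0, 1) are thresholded into a binary code, which serves directly as a hash-table key:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
W = rng.normal(size=(8, 16))           # untrained toy encoder weights

def binary_code(x):
    """Threshold a sigmoid final layer into an 8-bit hashable key."""
    bits = sigmoid(W @ x) > 0.5
    return bits.tobytes()              # bytes object, usable as a dict key

# index the "database": bucket each item under its binary code
database = {}
docs = rng.normal(size=(100, 16))
for i, doc in enumerate(docs):
    database.setdefault(binary_code(doc), []).append(i)

# retrieval: hash the query and look up its bucket in O(1)
query = docs[42]
print(database[binary_code(query)])
```

With a trained encoder, semantically similar items would share (or nearly share) codes, so a bucket lookup returns semantically related results.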
![Page 80: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/80.jpg)
Representation Learning
![Page 81: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/81.jpg)
The Power of Representations
Representations matter: try doing long division with Roman
numerals.
Other examples: variables in algebra, the Cartesian grid for analytic
geometry, binary encodings for information theory and electronics
32
![Page 83: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/83.jpg)
Representations in Deep Learning
A good representation of data makes subsequent tasks easier:
more tractable, less expensive.
• Feedforward nets: Hidden layers build the representation used by
the output layer (a linear classifier)
• Conv nets: Maintain the topology of the input, transformed through
3-D feature volumes via convolutions, pooling, etc.
• Autoencoders: Representation is the entire mission of the architecture
33
![Page 86: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/86.jpg)
Symbolic Representations
A vector ∈ ℝⁿ, each slot symbolizing exactly one category.
• Example: Bag-of-words (one-hot or n-grams) in NLP.
• All words / n-grams are equally distant from one another.
• The representation does not capture features!
Fundamentally limited: ∼ O(n) possible representations.
34
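The "equally distant" claim can be verified directly: every pair of distinct one-hot vectors lies at Euclidean distance √2, so the representation carries no similarity information at all:

```python
import numpy as np

n = 5
one_hot = np.eye(n)                    # n symbolic (one-hot) representations

# pairwise Euclidean distances between distinct categories
dists = [np.linalg.norm(one_hot[i] - one_hot[j])
         for i in range(n) for j in range(n) if i != j]

print(sorted(set(np.round(dists, 10))))   # a single value: sqrt(2)
```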
![Page 92: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/92.jpg)
Distributed Representations
A vector ∈ ℝⁿ, with each possible vector symbolizing one category.
• Example: Word embeddings in NLP.
• Can encode similarity and meaningful distance in the embedding
space.
• Slots can encode features: e.g. number of legs vs. is a dog
Pretty much always preferred: ∼ O(kⁿ) possible representations,
where k is the number of values a feature can take on.
35
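Contrast this with the one-hot case: in a distributed representation, distances become meaningful. The feature vectors below are hand-picked illustrative values (not learned embeddings), with slots playing the role of features like "has legs" or "is an animal":

```python
import numpy as np

# toy hand-picked feature vectors (values are illustrative only);
# slots ~ [has_legs, is_animal, is_furniture, size]
emb = {
    "dog":   np.array([1.0, 1.0, 0.0, 0.4]),
    "cat":   np.array([1.0, 1.0, 0.0, 0.3]),
    "table": np.array([1.0, 0.0, 1.0, 0.6]),
}

def cos(a, b):
    """Cosine similarity, the usual distance proxy in embedding space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(emb["dog"], emb["cat"]), cos(emb["dog"], emb["table"]))
```

Unlike one-hot vectors, "dog" is measurably closer to "cat" than to "table", because the shared feature slots overlap.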
![Page 98: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/98.jpg)
Benefits of Distributed Representations
Distributed representation → data manifold
Also: lower dimensionality, faster training
36
![Page 100: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/100.jpg)
Representation Learning Techniques
![Page 101: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/101.jpg)
Greedy Layer-Wise Unsupervised Pretraining
A pivotal technique that allowed training of deep nets without
specialized properties (convolution, recurrence, etc.)
• Key Idea: Leverage representations learned for one task to
solve another.
• Train each layer of the feedforward net greedily as a
representation learning algorithm, e.g. an autoencoder.
• Continue stacking layers. The output of all prior layers is the
input for the next one.
• Fine-tune, i.e. jointly train, all layers once each has learned its
representations.
37
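The stacking loop above can be sketched compactly. The sketch assumes each layer is a *linear* autoencoder, whose optimum is given in closed form by the top-k principal directions of its input; real implementations instead train nonlinear autoencoders by gradient descent, and the joint fine-tuning pass is omitted here:

```python
import numpy as np

def train_linear_autoencoder(H, k):
    """Closed-form optimum of a linear autoencoder: the top-k
    principal directions of the (centered) inputs."""
    Hc = H - H.mean(axis=0)
    _, _, Vt = np.linalg.svd(Hc, full_matrices=False)
    return Vt[:k]                       # encoder weights, shape (k, d)

def greedy_pretrain(X, layer_sizes):
    """Train layers one at a time; each layer's codes feed the next."""
    H, weights = X, []
    for k in layer_sizes:
        W = train_linear_autoencoder(H, k)
        weights.append(W)
        H = (H - H.mean(axis=0)) @ W.T  # representation passed upward
    return weights, H                   # joint fine-tuning would follow

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))
weights, codes = greedy_pretrain(X, [8, 4])
print([W.shape for W in weights], codes.shape)
```

Each iteration only ever sees the previous layer's codes — that is what makes the procedure greedy, and why the final joint fine-tuning step matters.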
![Page 106: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/106.jpg)
Greedy Layer-Wise Unsupervised Pretraining (Contd.)
Works on two assumptions:
• Picking good initial parameters has a regularizing effect and
improves generalization and optimization. (Not well understood)
• Learning properties of the input distribution can help in mapping
inputs to outputs. (Better understood)
Sometimes helpful, sometimes not:
• Effective for word embeddings - replaces one-hot vectors. Also for
very complex functions shaped by the input data distribution
• Useful when there are few labeled but many unlabeled examples -
semi-supervised learning
• Less effective for images - the topology is already present.
38
![Page 113: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/113.jpg)
39
![Page 114: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/114.jpg)
Multi-Task Learning
Given two similar learning tasks and their labeled data, D1 and D2,
where D2 has few examples compared to D1:
• Idea: (Pre-)Train the network on D1, then train on D2.
• Hopefully, low-level features from D1 are useful for D2, and
fine-tuning is enough for D2.
40
![Page 117: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/117.jpg)
Transfer Learning
Inputs are similar, while labels are different between D1 and D2.
Ex: Classify images as dog or cat (D1), then classify images as horse
or cow (D2).
• Low-level features of the inputs are the same: lighting, animal
orientations, edges, faces.
• The labels are fundamentally different.
• Learning on D1 establishes a latent space where the distributions
are separated; then adjust to assign D2 labels to D2 inputs
transformed by the pre-trained network.
41
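A minimal sketch of this recipe, under the assumption that the pretrained first-layer weights W1 (random stand-ins below, where a real run would load weights trained on D1) are frozen as a feature extractor, and only a new linear head is fit — here by least squares — on the small D2:

```python
import numpy as np

rng = np.random.default_rng(3)
W1 = rng.normal(size=(16, 32)) * 0.3   # stand-in for weights pretrained on D1

def features(X):
    """Frozen low-level feature extractor learned on D1."""
    return np.maximum(0.0, X @ W1.T)   # ReLU hidden layer

# tiny labeled D2: fit only a new output layer on top of frozen features
X2 = rng.normal(size=(20, 32))
Y2 = rng.normal(size=(20, 2))          # e.g. one column per new class score

F = features(X2)
W_head, *_ = np.linalg.lstsq(F, Y2, rcond=None)

def predict(X):
    return features(X) @ W_head

print(predict(X2).shape)
```

Freezing W1 is the strictest form of transfer; in practice one often unfreezes it and fine-tunes the whole network at a small learning rate once the head has converged.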
![Page 121: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/121.jpg)
Domain Adaptation
Labels are similar, while the inputs are different.
Ex: A speech-to-text system for person 1 (D1), then trained to also
work for person 2 (D2).
• For both, the text must be valid English sentences, so the labels
are similar.
• Speakers may have different pitch, accents, etc. → different
inputs.
• Training on D1 gives the model the power to map noisy audio to
English in general; just adjust it to assign D2 inputs to D2 labels.
42
![Page 125: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/125.jpg)
43
![Page 126: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/126.jpg)
More About Multi-Task Learning
So far: supervised, but it also works with unsupervised learning and RL.
Deeper networks make a significant impact in multi-task learning.
• One-Shot Learning uses only one labeled example from D2.
Training on D1 gives clean separations in the latent space, so
whole-cluster labels can then be inferred.
• Zero-Shot Learning is able to work with zero labeled training
examples. Learn p(y | x, T), with T being a context variable.
• For example: T is sentences (cats have four legs, pointy ears,
fur, etc.), x is images, and y is the label of cat or not.
44
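One common way to realize p(y | x, T) is attribute matching: the class descriptions T become attribute vectors, and an unseen class is predicted by matching the input's predicted attributes against each description. The sketch below is purely illustrative — the class names, attribute slots, and the hard-coded attribute predictions all stand in for a trained attribute predictor:

```python
import numpy as np

# attribute descriptions T for classes never seen during training;
# slots ~ [four_legs, pointy_ears, fur, barks]
descriptions = {
    "cat": np.array([1.0, 1.0, 1.0, 0.0]),
    "dog": np.array([1.0, 0.0, 1.0, 1.0]),
}

def predict_class(attr_pred):
    """Pick the class whose description best matches predicted attributes."""
    return max(descriptions, key=lambda c: float(attr_pred @ descriptions[c]))

# pretend an attribute predictor (trained on *other* classes) returned this
attr_pred = np.array([0.9, 0.8, 0.9, 0.1])
print(predict_class(attr_pred))
```

No labeled cat or dog image was ever needed: the label comes entirely from matching predicted attributes to the textual descriptions.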
![Page 131: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/131.jpg)
Isolating Causal Factors
Two desirable properties of representations. They often coincide:
• Disentangled Causes: for a representation of p(x), we want to know p(y|x), i.e., does y cause x.
• If x and y are correlated, then p(x) and p(y|x) will be strongly tied.
We want this relation to be clear, hence disentangled.
• Easy Modeling: representations with sparse feature
vectors, which imply independent features
45
![Page 135: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/135.jpg)
Ideal Latent Variables
Assume y is a causal factor of x and h represents all of those
factors.
• Joint distribution of the model is: p(x, h) = p(x|h)p(h)
• Marginal probability of x is
p(x) = Σ_h p(h) p(x|h) = E_h[p(x|h)]
Thus, the best latent variable h (w.r.t. p(x)) explains x from a causal
point of view.
• p(y|x) depends on p(x), hence h being causal is valuable.
46
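The marginal above is easy to compute concretely for a discrete latent. A minimal sketch with illustrative numbers (a 2-state h and a 3-valued x, not from any trained model):

```python
import numpy as np

# Marginal p(x) = sum_h p(h) p(x|h) = E_h[p(x|h)] for a discrete latent.

p_h = np.array([0.4, 0.6])                 # prior p(h)
p_x_given_h = np.array([[0.7, 0.2, 0.1],   # p(x|h=0)
                        [0.1, 0.3, 0.6]])  # p(x|h=1)

p_x = p_h @ p_x_given_h                    # weighted sum over h
print(p_x)                                 # marginal distribution over x
assert np.isclose(p_x.sum(), 1.0)          # still a valid distribution
```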
![Page 139: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/139.jpg)
The Real World
Real world data often has more causes than can/should be
encoded.
• Humans fail to detect changes in the environment that are
unimportant to the current task.
• Must establish learnable measures of saliency to attach to
features.
• Example: Autoencoders trained on images often fail to
register important small objects like ping pong balls.
47
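The ping pong ball failure can be made concrete: under per-pixel MSE, erasing a small but salient object barely moves the loss. A toy sketch with synthetic data:

```python
import numpy as np

# Why per-pixel MSE under-weights small but salient objects: a 100x100
# image contains a tiny 3x3 bright object (the "ping pong ball"); a
# reconstruction that erases it entirely still achieves a near-zero MSE,
# because only 9 of the 10,000 pixels differ.

image = np.zeros((100, 100))
image[50:53, 50:53] = 1.0              # the small bright object

reconstruction = np.zeros_like(image)  # misses the object completely

mse = np.mean((image - reconstruction) ** 2)
print(mse)  # 9 / 10000 = 0.0009 -- tiny loss despite losing the object
```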
![Page 143: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/143.jpg)
Failure of Traditional Loss Functions
48
![Page 144: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/144.jpg)
The Adversarial Approach to Saliency
MSE: assumes salient content affects pixel intensity across a large
number of pixels.
Adversarial: learn saliency by tricking a discriminator network
• Discriminator is trained to distinguish ground truth from
generated data
• Discriminator can attach high saliency to a small number of
pixels
• Framework of Generative Adversarial Networks (more later
in course).
49
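The discriminator side of this idea can be sketched with a simple logistic classifier on synthetic features (a stand-in for a real GAN discriminator, which would be a deep network trained jointly with a generator):

```python
import numpy as np

# A logistic classifier is trained to tell "real" features from
# "generated" ones. Unlike MSE, its gradient concentrates on whichever
# features separate the two distributions, even if they are few.

rng = np.random.default_rng(0)
real = rng.normal(loc=1.0, size=(200, 5))   # stand-in for ground truth
fake = rng.normal(loc=-1.0, size=(200, 5))  # stand-in for generated data

X = np.vstack([real, fake])
y = np.concatenate([np.ones(200), np.zeros(200)])

w = np.zeros(5)
for _ in range(200):                        # gradient ascent on log-likelihood
    p = 1.0 / (1.0 + np.exp(-X @ w))        # D(x): probability "real"
    w += 0.1 * X.T @ (y - p) / len(y)

accuracy = np.mean((1.0 / (1.0 + np.exp(-X @ w)) > 0.5) == (y == 1))
print(accuracy)  # the discriminator cleanly separates real from generated
```

In the full GAN framework (covered later in the course), the generator is then updated to fool this discriminator, which is what turns the discriminator's notion of saliency into a training signal.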
![Page 149: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/149.jpg)
Comparing Traditional to Adversarial
50
![Page 150: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/150.jpg)
Conclusion
• The crux of autoencoders is representation learning.
• The crux of deep learning is representation learning.
• The crux of intelligence is probably representation learning.
51
![Page 154: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/154.jpg)
Questions
![Page 155: Autoencoders and Representation Learning](https://reader034.vdocuments.net/reader034/viewer/2022052013/62860bd5abe43f397667edca/html5/thumbnails/155.jpg)
Questions?
52