TRANSCRIPT
Deep Networks
CSE Course on Artificial Neural Networks &
CogSci Course on Cognitive Modeling and AI
BraSci Course on Computational Neuroscience
Fall 2011
November 3, 2011
Byoung-Tak Zhang
Computer Science and Engineering (CSE) &
Cognitive Science and Brain Science Programs
http://bi.snu.ac.kr/
Overview
• Introduction to Deep Network
• Restricted Boltzmann Machine (RBM)
• Learning RBM: Contrastive Divergence
• Deep Belief Networks (DBN)
• Examples
– Image Reconstruction
– Motion generation
USING A VAST, VAST MAJORITY OF SLIDES ORIGINALLY FROM: Geoffrey Hinton, Sue Becker, Yann Le Cun, Yoshua Bengio, Frank Wood, Honglak Lee, George Taylor
Motivation: why go deep?
• Deep Architectures can be representationally efficient
– Fewer computational units for same function
• Deep representations might allow for a hierarchy of representations
– Allows non-local generalization
– Comprehensibility
• Multiple levels of latent variables allow combinatorial sharing of statistical strength
• Deep architectures work well (vision, audio, NLP, etc.)!
Learning to extract the orientation of a face patch (Salakhutdinov & Hinton, NIPS 2007)
The training and test sets for predicting face orientation: the training set contains 11,000 unlabeled cases plus 100, 500, or 1000 labeled cases; the test set consists of face patches from new people.
The root mean squared error in the orientation when combining GPs with deep belief nets:

              GP on the   GP on top-level   GP on top-level features
              pixels      features          with fine-tuning
100 labels    22.2        17.9              15.2
500 labels    17.2        12.7               7.2
1000 labels   16.3        11.2               6.4
Conclusion: The deep features are much better
than the pixels. Fine-tuning helps a lot.
Deep Autoencoders (Hinton & Salakhutdinov, 2006)
• They always looked like a really
nice way to do non-linear
dimensionality reduction:
– But it is very difficult to
optimize deep autoencoders
using backpropagation.
• We now have a much better way
to optimize them:
– First train a stack of 4 RBM’s
– Then “unroll” them.
– Then fine-tune with backprop.
[Figure: the deep autoencoder. Encoder: 28x28 pixels → 1000 neurons → 500 neurons → 250 neurons → 30 linear units, with weights W1–W4. Decoder: the same stack unrolled with transposed weights W4^T–W1^T, going 30 → 250 → 500 → 1000 → 28x28.]
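To make the "stack, then unroll" idea concrete, here is a minimal NumPy sketch of the forward pass of the unrolled autoencoder. It is an illustration only: the variable names, the list of per-layer weights, and the use of mean activations rather than samples are assumptions, not Hinton & Salakhutdinov's code.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unrolled_autoencoder(x, rbm_params):
    """Forward pass of the unrolled deep autoencoder (sketch).
    rbm_params: [(W1, bv1, bh1), ..., (W4, bv4, bh4)] from the 4 trained RBMs.
    The encoder uses each W; the decoder reuses the transposes W.T."""
    # Encoder: 28x28 pixels -> 1000 -> 500 -> 250 -> 30 linear units.
    code = x
    for W, _, bh in rbm_params[:-1]:
        code = sigmoid(bh + code @ W)
    W_top, _, bh_top = rbm_params[-1]
    code = bh_top + code @ W_top          # central code layer uses linear units
    # Decoder: mirror the encoder with transposed weights back to 28x28.
    recon = code
    for W, bv, _ in reversed(rbm_params):
        recon = sigmoid(bv + recon @ W.T)
    return code, recon

After this unrolling, the whole encoder-decoder is fine-tuned with backprop on the reconstruction error, as described above.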
A comparison of methods for compressing
digit images to 30 real numbers.
[Figure: real data and its reconstructions from a 30-D deep autoencoder, 30-D logistic PCA, and 30-D PCA.]
Retrieving documents that are similar
to a query document
• We can use an autoencoder to find low-
dimensional codes for documents that allow
fast and accurate retrieval of similar
documents from a large set.
• We start by converting each document into a "bag of words". This is a 2000-dimensional vector that contains the counts for each of the 2000 commonest words.
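As a small illustration of the bag-of-words step (the function name and the vocabulary list are hypothetical, just to show the shape of the data):

from collections import Counter

def bag_of_words(tokens, vocab):
    """Convert a tokenized document into a count vector over a fixed vocabulary.
    vocab: a precomputed list of the 2000 commonest words (assumed given)."""
    counts = Counter(tokens)
    return [counts.get(word, 0) for word in vocab]

# Example: a 2000-dimensional vector of word counts for one document.
# vec = bag_of_words("stocks rose as the market rallied ...".split(), vocab)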
How to compress the count vector
• We train the neural
network to reproduce its
input vector as its output
• This forces it to
compress as much
information as possible
into the 10 numbers in
the central bottleneck.
• These 10 numbers are
then a good way to
compare documents.
[Figure: the autoencoder for word counts. Input vector of 2000 word counts → 500 neurons → 250 neurons → 10 (central bottleneck) → 250 neurons → 500 neurons → output vector of 2000 reconstructed counts.]
Performance of the autoencoder at
document retrieval
• Train on bags of 2000 words for 400,000 training cases of business documents.
– First train a stack of RBM’s. Then fine-tune with backprop.
• Test on a separate 400,000 documents.
– Pick one test document as a query. Rank order all the other test documents by using the cosine of the angle between codes.
– Repeat this using each of the 400,000 test documents as the query (requires 0.16 trillion comparisons).
• Plot the number of retrieved documents against the proportion that are in the same hand-labeled class as the query document.
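A minimal sketch of the ranking step, assuming the 10-dimensional codes have already been produced by the autoencoder (array names and shapes are illustrative):

import numpy as np

def rank_by_cosine(query_code, all_codes):
    """Rank documents by the cosine of the angle between their codes.
    query_code: (10,) code of the query document.
    all_codes: (N, 10) codes of the other test documents.
    Returns document indices, most similar first."""
    q = query_code / np.linalg.norm(query_code)
    c = all_codes / np.linalg.norm(all_codes, axis=1, keepdims=True)
    return np.argsort(-(c @ q))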
[Figure: proportion of retrieved documents in the same class as the query, plotted against the number of documents retrieved.]
[Figure: all documents compressed to 2 numbers, in one panel using a type of PCA, with different colors for different document categories.]
RESTRICTED BOLTZMANN
MACHINE (RBM)
Restricted Boltzmann Machines
• Restrict the connectivity to make learning
easier.
– Only one layer of hidden units.
– No connections between hidden units.
• In an RBM, the hidden units are
conditionally independent given the
visible states.
– So can quickly get an unbiased
sample from the posterior distribution
when given a data-vector.
– This is a big advantage over directed
belief nets
[Figure: an RBM with a layer of visible units and a layer of hidden units (units labeled i and j), with connections only between the two layers.]
Stochastic binary neurons
• These have a state of 1 or 0 which is a stochastic
function of the neuron’s bias, b, and the input it receives
from other neurons.
p(s_i = 1) = 1 / (1 + exp(−b_i − Σ_j s_j w_ji))

[Plot: p(s_i = 1) as a logistic function of the total input b_i + Σ_j s_j w_ji, rising from 0 to 1 and passing through 0.5 at zero input.]
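A minimal NumPy sketch of this rule (the array names and the convention that W[j, i] connects unit j to unit i are assumptions):

import numpy as np

def sample_binary_units(s_other, W, b, rng=np.random.default_rng()):
    """Sample a layer of stochastic binary units.
    p(s_i = 1) = 1 / (1 + exp(-(b_i + sum_j s_j * W[j, i])))."""
    p_on = 1.0 / (1.0 + np.exp(-(b + s_other @ W)))
    return (rng.random(p_on.shape) < p_on).astype(float)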
Stochastic units
• Replace the binary threshold units by binary stochastic
units that make biased random decisions.
– The temperature controls the amount of noise.
– Decreasing all the energy gaps between configurations
is equivalent to raising the noise level.
Energy gap: ΔE_i = E(s_i = 0) − E(s_i = 1)

p(s_i = 1) = 1 / (1 + exp(−ΔE_i / T)),  where T is the temperature. At T = 1 this is the logistic rule above, with ΔE_i = b_i + Σ_j s_j w_ji.
The Energy of a joint configuration
E(v, h) = − Σ_i b_i s_i^{v,h} − Σ_{i<j} s_i^{v,h} s_j^{v,h} w_ij

E(v, h) is the energy with configuration v on the visible units and h on the hidden units; b_i is the bias of unit i; w_ij is the weight between units i and j; s_i^{v,h} is the binary state of unit i in joint configuration (v, h); the second sum indexes every non-identical pair of i and j once.
Weights → Energies → Probabilities
• Each possible joint configuration of the visible and hidden units has an energy – The energy is determined by the weights and
biases (as in a Hopfield net).
• The energy of a joint configuration of the visible and hidden units determines its probability:
• The probability of a configuration over the visible units is found by summing the probabilities of all the joint configurations that contain it.
p(v, h) ∝ e^{−E(v, h)}
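A small sketch of these two quantities for an RBM, where the energy involves only the biases and the visible-hidden weights (names and shapes are assumptions):

import numpy as np

def rbm_energy(v, h, W, b_v, b_h):
    """E(v, h) = -b_v.v - b_h.h - v.W.h for binary state vectors v and h."""
    return -(b_v @ v) - (b_h @ h) - (v @ W @ h)

def unnormalized_prob(v, h, W, b_v, b_h):
    """p(v, h) is proportional to exp(-E(v, h)); the partition function
    (a sum over all joint configurations) is omitted here."""
    return np.exp(-rbm_energy(v, h, W, b_v, b_h))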
LEARNING RBM:
CONTRASTIVE DIVERGENCE
Maximizing the training data log likelihood
• We want the parameters that maximize the training data likelihood.
• Differentiate w.r.t. to all parameters and
perform gradient ascent to find optimal
parameters.
• The derivation is nasty.
Θ̂_1, …, Θ̂_M = argmax_{Θ_1, …, Θ_M} log p(D | Θ_1, …, Θ_M)
             = argmax_{Θ_1, …, Θ_M} Σ_{d ∈ D} log [ Π_m f_m(d | Θ_m) / Σ_c Π_m f_m(c | Θ_m) ]

Assuming the d's are drawn independently from p(·). The expression inside the log is the standard PoE (product-of-experts) form, and the sum runs over all training data.
Equilibrium Is Hard to Achieve
• With the gradient

  ∂ log p(D | Θ_1, …, Θ_M) / ∂Θ_m = ⟨ ∂ log f_m(d | Θ_m) / ∂Θ_m ⟩_{P^0} − ⟨ ∂ log f_m(c | Θ_m) / ∂Θ_m ⟩_{P^∞}

  we can now train our PoE model.
• But… there's a problem:
  – The second expectation, taken under the model's equilibrium distribution P^∞, is computationally infeasible to obtain (especially in an inner gradient ascent loop).
  – The sampling Markov chain must converge to the target distribution. Often this takes a very long time!
A very surprising fact
• Everything that one weight needs to know about
the other weights and the data in order to do
maximum likelihood learning is contained in the
difference of two correlations.
∂ log p(v) / ∂w_ij = ⟨s_i s_j⟩_v − ⟨s_i s_j⟩_free

This is the derivative of the log probability of one training vector v. The first term is the expected value of the product of states at thermal equilibrium when the training vector is clamped on the visible units; the second is the expected value at thermal equilibrium when nothing is clamped.
The (theoretical) batch learning algorithm
• Positive phase
  – Clamp a data vector on the visible units.
  – Let the hidden units reach thermal equilibrium at a temperature of 1.
  – Sample ⟨s_i s_j⟩ for all pairs of units.
  – Repeat for all data vectors in the training set.
• Negative phase
  – Do not clamp any of the units.
  – Let the whole network reach thermal equilibrium at a temperature of 1 (where do we start?).
  – Sample ⟨s_i s_j⟩ for all pairs of units.
  – Repeat many times to get good estimates.
• Weight updates
  – Update each weight by an amount proportional to the difference in ⟨s_i s_j⟩ between the two phases.
Solution: Contrastive Divergence!
• Now we don't have to run the sampling Markov chain to convergence; instead, we can stop after 1 iteration (or perhaps a few iterations, more typically).
• Why does this work? – Attempts to minimize the ways that the model
distorts the data.
∂ log p(D | Θ_1, …, Θ_M) / ∂Θ_m ≈ ⟨ ∂ log f_m(d | Θ_m) / ∂Θ_m ⟩_{P^0} − ⟨ ∂ log f_m(c | Θ_m) / ∂Θ_m ⟩_{P^1}
Contrastive Divergence
• Maximum likelihood gradient: pull down energy
surface at the examples and pull it up
everywhere else, with more emphasis where
model puts more probability mass
• Contrastive divergence updates: pull down
energy surface at the examples and pull it up in
their neighborhood, with more emphasis where
model puts more probability mass
A picture of the Boltzmann machine learning
algorithm for an RBM
[Figure: alternating Gibbs sampling in an RBM, measuring ⟨s_i s_j⟩^0 at t = 0, ⟨s_i s_j⟩^1 at t = 1, and ⟨s_i s_j⟩^∞ at t = ∞, where the chain produces a "fantasy".]

Start with a training vector on the visible units. Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.

Δw_ij = ε (⟨s_i s_j⟩^0 − ⟨s_i s_j⟩^∞)
Contrastive divergence learning:
A quick way to learn an RBM
[Figure: one full step of alternating Gibbs sampling, from the data at t = 0 to a reconstruction at t = 1.]

Start with a training vector on the visible units. Update all the hidden units in parallel to measure ⟨s_i s_j⟩^0. Update all the visible units in parallel to get a "reconstruction". Update the hidden units again to measure ⟨s_i s_j⟩^1.

Δw_ij = ε (⟨s_i s_j⟩^0 − ⟨s_i s_j⟩^1)

This is not following the gradient of the log likelihood. But it works well. When we consider infinite directed nets it will be easy to see why it works.
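A rough NumPy sketch of one CD-1 update for a binary RBM. It illustrates the rule above but is not the exact code behind these slides; the learning rate, the use of probabilities rather than samples in the final statistics, and all names are assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_v, b_h, lr=0.1, rng=np.random.default_rng()):
    """One contrastive-divergence (CD-1) step for a binary RBM."""
    # Positive phase: infer hidden states from the data vector v0.
    p_h0 = sigmoid(b_h + v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # One step of alternating Gibbs sampling: reconstruct, then re-infer.
    p_v1 = sigmoid(b_v + h0 @ W.T)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(b_h + v1 @ W)
    # Update: difference of the data and reconstruction correlations.
    W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b_v += lr * (v0 - v1)
    b_h += lr * (p_h0 - p_h1)
    return W, b_v, b_h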
DEEP BELIEF NETWORKS
Deep Belief Networks(DBNs)
• Probabilistic generative model
• Deep architecture – multiple layers
• Unsupervised pre-learning provides a good
initialization of the network
– maximizing the lower-bound of the log-
likelihood of the data
• Supervised fine-tuning
– Generative: Up-down algorithm
– Discriminative: backpropagation
DBN structure
Deep Network Training
• Use unsupervised learning (greedy layer-wise training)
– Allows abstraction to develop naturally from one layer to another
– Helps the network initialize with good parameters
• Perform supervised top-down training as final step
– Refine the features (intermediate layers) so that they become more relevant for the task
DBN Greedy training
• First step:
– Construct an RBM
with an input layer v
and a hidden layer h
– Train the RBM
DBN Greedy training
• Second step:
– Stack another hidden
layer on top of the
RBM to form a new
RBM
– Fix W^1, sample h^1 from Q(h^1 | v) as input.
– Train W^2 as an RBM.
DBN Greedy training
• Third step:
– Continue to stack layers on top of the network, training each new layer as in the previous step, with samples drawn from Q(h^2 | h^1).
• And so on…
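A minimal sketch of this greedy stacking loop, reusing the hypothetical cd1_update and sigmoid helpers from the CD-1 sketch above. The layer sizes, epoch count, and random stand-in data are placeholders, not values from the slides.

import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [784, 500, 500, 2000]                   # illustrative sizes only
data = (rng.random((100, 784)) < 0.5).astype(float)   # stand-in for real training vectors

weights = []
inputs = data
for n_vis, n_hid in zip(layer_sizes[:-1], layer_sizes[1:]):
    W = 0.01 * rng.standard_normal((n_vis, n_hid))
    b_v, b_h = np.zeros(n_vis), np.zeros(n_hid)
    for epoch in range(10):
        for v0 in inputs:
            W, b_v, b_h = cd1_update(v0, W, b_v, b_h, rng=rng)
    weights.append((W, b_v, b_h))
    # Mean hidden activations Q(h | v) become the "data" for the next RBM.
    inputs = sigmoid(b_h + inputs @ W)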
Why greedy learning works
• Each time we learn a new layer, the inference at the layer below becomes incorrect, but the variational bound on the log prob of the data improves (only true in theory).
• Since the bound starts as an equality, learning a new
layer never decreases the log prob of the data, provided
we start the learning from the tied weights that
implement the complementary prior.
• Now that we have a guarantee we can loosen the
restrictions and still feel confident.
– Allow layers to vary in size.
– Do not start the learning at each layer from the
weights in the layer below.
The generative model after learning 3 layers
• To generate data:
1. Get an equilibrium sample
from the top-level RBM by
performing alternating Gibbs
sampling for a long time.
2. Perform a top-down pass to
get states for all the other
layers.
So the lower level bottom-up
connections are not part of
the generative model. They
are just used for inference.
[Figure: the generative model after learning 3 layers: data → h1 → h2 → h3 with weights W^1, W^2, W^3; the top two layers (h2 and h3) form the top-level RBM.]
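A sketch of this generation procedure, assuming the weights list and the sigmoid helper from the greedy-stacking sketch above. The number of Gibbs iterations and the use of mean activations on the down-pass are simplifications, not the slides' exact recipe.

import numpy as np

def generate_sample(weights, n_gibbs=200, rng=np.random.default_rng()):
    """Generate one data vector from a DBN whose layers were trained greedily."""
    W_top, b_v_top, b_h_top = weights[-1]
    # 1. Alternating Gibbs sampling in the top-level RBM for a long time.
    h_below = (rng.random(W_top.shape[0]) < 0.5).astype(float)
    for _ in range(n_gibbs):
        p_top = sigmoid(b_h_top + h_below @ W_top)
        h_top = (rng.random(p_top.shape) < p_top).astype(float)
        p_below = sigmoid(b_v_top + h_top @ W_top.T)
        h_below = (rng.random(p_below.shape) < p_below).astype(float)
    # 2. A single top-down pass through the lower, directed layers.
    state = h_below
    for W, b_v, _ in reversed(weights[:-1]):
        state = sigmoid(b_v + state @ W.T)   # mean activations of the layer below
    return state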
EXAMPLE:
IMAGE REGENERATION
A model of digit recognition
[Figure: the digit model. A 28 x 28 pixel image feeds 500 neurons, then another 500 neurons; 2000 top-level neurons sit above, connected both to the upper 500-neuron layer and to 10 label neurons.]
The model learns to generate combinations of
labels and images.
To perform recognition we start with a neutral
state of the label units and do an up-pass from
the image followed by a few iterations of the
top-level associative memory.
The top two layers form an
associative memory whose
energy landscape models the low
dimensional manifolds of the
digits.
The energy valleys have names
Demo: http://www.cs.toronto.edu/~hinton/adi/index.htm
Samples generated by letting the associative
memory run with one label clamped. There are
1000 iterations of alternating Gibbs sampling
between samples.
Examples of correctly recognized handwritten digits
that the neural network had never seen before
It's very good.
How well does it discriminate on MNIST test set with
no extra information about geometric distortions?
• Generative model based on RBM’s 1.25%
• Support Vector Machine (Decoste et al.) 1.4%
• Backprop with 1000 hiddens (Platt) ~1.6%
• Backprop with 500 --> 300 hiddens ~1.6%
• K-Nearest Neighbor ~3.3%
• See Le Cun et al. 1998 for more results
• It's better than backprop and much more neurally plausible
because the neurons only need to send one kind of signal,
and the teacher can be another sensory input.
EXAMPLE:
MOTION REGENERATION
The conditional RBM model (Sutskever & Hinton 2007)
• Given the data and the previous hidden
state, the hidden units at time t are
conditionally independent.
– So online inference is very easy
• Learning can be done by using
contrastive divergence.
– Reconstruct the data at time t from
the inferred states of the hidden units
and the earlier states of the visibles.
– The temporal connections can be
learned as if they were additional
biases
[Figure: a conditional RBM: visible frames at t−2 and t−1 feed both the hidden units (j) and the current visible units (i) at time t.]

Δw_ij = ε s_i (⟨s_j⟩_data − ⟨s_j⟩_recon)
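A minimal sketch of the "additional biases" idea: the earlier visible frames contribute a data-dependent offset to the static biases, after which the same CD-1 step as in the earlier sketch can be run. All names and shapes here are assumptions.

import numpy as np

def dynamic_biases(history, A, B, b_v, b_h):
    """Dynamic biases for a conditional RBM (sketch).
    history: concatenated visible frames at t-2 and t-1.
    A: weights from past frames to the current visible units.
    B: weights from past frames to the hidden units."""
    return b_v + history @ A, b_h + history @ B

# The temporal weights themselves can be learned with the same
# delta_w_ij = eps * s_i * (<s_j>_data - <s_j>_recon) rule, treating the
# fixed past states s_i as inputs to these extra biases.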
Generating from a learned model
• The inputs from the earlier states
of the visible units create
dynamic biases for the hidden
and current visible units.
• Perform alternating Gibbs
sampling for a few iterations
between the hidden units and the
current visible units.
– This picks new hidden and
visible states that are
compatible with each other
and with the recent history.
Conditional Deep Belief Network
cDBN as Bayesian filtering
Application for human motion
– Generating realistic human motion sequence
by successively sampling the t-th frame given
the previously sampled k frames
[Demo: generated walking styles such as "drunk", "graceful", and "sexy".]
(Taylor et al., 2007)