TRANSCRIPT
Deep Learning
Why?
[Figure: Word error rate (log scale, 1.0%–100.0%) of speech recognition systems for read and conversational speech, 1992–2012. Source: Huang et al., Communications ACM 01/2014]
[Figure: Error rates (0%–35%) at the Large Scale Visual Recognition Challenge 2012 for ISI, OXFORD_VGG, XRCE/INRIA, University of Amsterdam, LEAR-XRCE, and SuperVision]
• the 2013 International Conference on Learning Representations
• the 2013 ICASSP special session on New Types of Deep Neural Network Learning for Speech Recognition and Related Applications
• the 2013 ICML Workshop for Audio, Speech, and Language Processing
• the 2013 ICML Workshop on Representation Learning Challenges
• the 2012 ICML Workshop on Representation Learning
• the 2012, 2011, and 2010 NIPS Workshops on Deep Learning and Unsupervised Feature Learning
• the 2011 ICML Workshop on Learning Architectures, Representations, and Optimization for Speech and Visual Information Processing
• the 2009 ICML Workshop on Learning Feature Hierarchies
• the 2009 NIPS Workshop on Deep Learning for Speech Recognition and Related Applications
• the 2012 ICASSP deep learning tutorial
• the special section on Deep Learning for Speech and Language Processing in IEEE Trans. Audio, Speech, and Language Processing (January 2012)
• the special issue on Learning Deep Architectures in IEEE Trans. Pattern Analysis and Machine Intelligence (2013)
"A fast learning algorithm for deep belief nets"
-- Hinton et al., 2006
"Reducing the dimensionality of data with neural networks"
-- Hinton & Salakhutdinov, 2006
Geoffrey Hinton, University of Toronto
How?
Shallow learning
• SVM
• Linear & Kernel Regression
• Hidden Markov Models (HMM)
• Gaussian Mixture Models (GMM)
• Single hidden layer MLP
• ...
• Limited modeling capability of concepts
• Cannot make use of unlabeled data
Neural Networks
• Machine Learning
• Knowledge from high-dimensional data
• Classification
• Input: features of data
• Supervised vs. unsupervised: labeled data
• Neurons
[Diagram: feed-forward network with input units i, hidden units j, and output units k; weights v_ij (input → hidden) and w_jk (hidden → output); input [X1, X2, X3], output [Y1, Y2]; each unit applies the activation a = 1 / (1 + e^(−z)) to its summed input z]
• Multiple Layers
• Feed Forward
• Connected Weights
• 1-of-N Output

z_j = Σ_i w_ij · x_i
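The weighted sum and sigmoid activation above can be sketched in a few lines of Python. Layer sizes, weight values, and names are illustrative assumptions, not from the talk:

```python
import numpy as np

def sigmoid(z):
    # a = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, V, W):
    """One feed-forward pass: input -> hidden -> output.
    V[i, j] connects input i to hidden j; W[j, k] connects hidden j to output k."""
    z_hidden = x @ V       # z_j = sum_i v_ij * x_i
    h = sigmoid(z_hidden)
    z_out = h @ W          # z_k = sum_j w_jk * h_j
    return sigmoid(z_out)

x = np.array([1.0, 0.0, 1.0])   # [X1, X2, X3]
V = np.full((3, 4), 0.1)        # 3 inputs -> 4 hidden units (toy weights)
W = np.full((4, 2), 0.1)        # 4 hidden -> 2 outputs [Y1, Y2]
y = forward(x, V, W)
print(y.shape)  # (2,)
```

Every unit computes the same two steps: a weighted sum of its inputs, then the squashing activation.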
Multi-Layer Perceptron
[Diagram: input units i, hidden units j, output units k with weights v_ij and w_jk]
Backpropagation
• Minimize error of calculated output
• Adjust weights: Gradient Descent
• Procedure: Forward Phase, then Backpropagation of errors
• For each sample, multiple epochs
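The procedure above (forward phase, backpropagation of errors, gradient-descent weight adjustment over many epochs) can be sketched as follows. This is a minimal illustration, not the talk's code; the XOR task, network size, and learning rate are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
V = rng.normal(0, 0.1, (2, 4))   # input -> hidden weights, random start
W = rng.normal(0, 0.1, (4, 1))   # hidden -> output weights
lr = 0.5

# Toy task: XOR, which a single-layer model cannot represent.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

def mse(V, W):
    y = sigmoid(sigmoid(X @ V) @ W)
    return float(np.mean((y - T) ** 2))

start_loss = mse(V, W)
for epoch in range(5000):                       # multiple epochs
    # forward phase
    h = sigmoid(X @ V)
    y = sigmoid(h @ W)
    # backpropagation of errors (squared error, sigmoid derivative a * (1 - a))
    delta_out = (y - T) * y * (1 - y)
    delta_hid = (delta_out @ W.T) * h * (1 - h)
    # adjust weights by gradient descent
    W -= lr * h.T @ delta_out
    V -= lr * X.T @ delta_hid
end_loss = mse(V, W)
print(start_loss, end_loss)
```

The error on the training set shrinks as the weights descend the error surface, though (as the deck notes later) convergence can be slow and may stop in a local optimum.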
Best Practice
• Normalization: prevent very high weights and oscillation
• Overfitting/Generalization: validation set, early stopping
• Mini-Batch Learning: update weights with multiple input vectors combined
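Mini-batch learning can be sketched as below: the gradient is averaged over several input vectors and the weights are updated once per batch. The single-layer logistic model, data, batch size, and learning rate are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                  # 100 input vectors
T = (X.sum(axis=1, keepdims=True) > 0) * 1.0   # toy labels (linearly separable)
w = np.zeros((3, 1))
lr, batch_size = 0.1, 10

for epoch in range(20):
    perm = rng.permutation(len(X))             # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, tb = X[idx], T[idx]
        y = sigmoid(xb @ w)
        grad = xb.T @ (y - tb) / batch_size    # gradient of the combined batch...
        w -= lr * grad                         # ...one weight update per batch
```

Averaging over a batch smooths the updates compared with per-sample steps, which helps against oscillation.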
Problems with Backpropagation
• Multiple hidden layers: gets stuck in local optima (weights start from random positions)
• Slow convergence to the optimum: large training set needed
• Uses only labeled data, but most data is unlabeled
Generative Approach
Restricted Boltzmann Machines
[Diagram: bipartite graph of visible units i and hidden units j, connected by weights w_ij]

p(h_j = 1) = 1 / (1 + e^(−Σ_i w_ij · v_i))

• Unsupervised: find complex regularities in training data
• Bipartite graph: visible layer, hidden layer
• Binary stochastic units: on/off with a probability
• 1 iteration: update hidden units, reconstruct visible units
• Maximum likelihood of training data
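The binary stochastic units can be sketched directly from the activation formula: each hidden unit turns on with probability p(h_j = 1). The sizes and weights below are toy assumptions, and biases are omitted as in the formula above:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hidden(v, W):
    """Binary stochastic hidden units: on with probability
    p(h_j = 1) = 1 / (1 + exp(-sum_i w_ij * v_i))."""
    p = 1.0 / (1.0 + np.exp(-(v @ W)))
    h = (rng.random(p.shape) < p).astype(float)   # stochastic on/off
    return h, p

v = np.array([1.0, 0.0, 1.0, 1.0])    # visible layer (binary states)
W = rng.normal(0, 0.5, (4, 2))        # weights w_ij: 4 visible x 2 hidden
h, p = sample_hidden(v, W)
print(h, p)
```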
Restricted Boltzmann Machines
• Training goal: most probable reproduction of the input
• Unsupervised data
• Find latent factors of the data set
• Adjust weights to get maximum probability of the input data
Training: Contrastive Divergence
[Diagram: visible and hidden states at t = 0 (data) and t = 1 (reconstruction)]
• Start with a training vector on the visible units.
• Update all the hidden units in parallel.
• Update all the visible units in parallel to get a "reconstruction".
• Update the hidden units again.
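The four steps above amount to one CD-1 update. A minimal sketch (toy data and sizes are assumptions; biases omitted; the weight change follows the standard rule Δw_ij ∝ ⟨v_i h_j⟩_data − ⟨v_i h_j⟩_reconstruction):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

def cd1_step(v0, W, lr=0.1):
    """One contrastive-divergence update on a batch of visible vectors (rows)."""
    ph0 = sigmoid(v0 @ W)          # t = 0: update all hidden units in parallel
    h0 = sample(ph0)
    pv1 = sigmoid(h0 @ W.T)        # t = 1: update all visible units -> "reconstruction"
    v1 = sample(pv1)
    ph1 = sigmoid(v1 @ W)          # update the hidden units again
    # increment <v h> on data, decrement <v h> on reconstruction
    return W + lr * (v0.T @ ph0 - v1.T @ ph1) / len(v0)

data = rng.integers(0, 2, size=(16, 6)).astype(float)   # toy binary data
W = rng.normal(0, 0.1, (6, 3))                          # 6 visible x 3 hidden
for _ in range(100):
    W = cd1_step(data, W)
```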
Example: Handwritten 2s
[Diagram: a 16 × 16 pixel image feeding 50 binary neurons that learn features]
• Data (reality): increment weights between an active pixel and an active feature
• Reconstruction: decrement weights between an active pixel and an active feature
• The final 50 × 256 weights: each unit grabs a different feature

Example: Reconstruction
• Reconstruction from activated binary features
• New test image from the digit class that the model was trained on
• Image from an unfamiliar digit class: the network tries to see every image as a 2
Deep Architecture
• Backpropagation, RBM as building blocks
• Multiple hidden layers
• Motivation (why go deep?)
• Approximate complex decision boundaries
• Fewer computational units for the same functional mapping
• Hierarchical learning: increasingly complex features
• Works well in different domains: vision, audio, ...
Hierarchical Learning
• Natural progression from low-level to high-level structure, as seen in natural complexity
• Easier to monitor what is being learnt and to guide the machine to better subspaces
Stacked RBMs
[Diagram: an RBM v ↔ h1 with weights W1 and an RBM h1 ↔ h2 with weights W2; the binary state of h1 is copied for each v, and the two RBMs are composed into a single DBN model]
• Train the first RBM (v, h1) first
• Then train the second RBM (h1, h2)
• First learn one layer at a time by stacking RBMs.
• Treat this as “pre-training” that finds a good initial set of weights which can then be fine-tuned by a local search procedure.
• Backpropagation can be used to fine-tune the model to be better at discrimination.
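The greedy layer-by-layer pre-training above can be sketched as follows: train the first RBM on the data, copy the binary hidden states for each input, and train the next RBM on those states. Data, layer sizes, and the tiny CD-1 trainer are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, steps=200, lr=0.1):
    """Tiny CD-1 trainer (no biases); returns the learned weight matrix."""
    W = rng.normal(0, 0.1, (data.shape[1], n_hidden))
    for _ in range(steps):
        ph0 = sigmoid(data @ W)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        v1 = (rng.random(data.shape) < sigmoid(h0 @ W.T)).astype(float)
        ph1 = sigmoid(v1 @ W)
        W += lr * (data.T @ ph0 - v1.T @ ph1) / len(data)
    return W

v = rng.integers(0, 2, size=(32, 8)).astype(float)   # toy binary training data

W1 = train_rbm(v, n_hidden=6)                        # train this RBM first: v <-> h1
h1 = (rng.random((len(v), 6)) < sigmoid(v @ W1)).astype(float)  # copy binary state of h1
W2 = train_rbm(h1, n_hidden=4)                       # then train this RBM: h1 <-> h2

# W1 and W2 together form the pre-trained two-layer DBN,
# a starting point that backpropagation can then fine-tune.
```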
Uses: Dimensionality reduction
Dimensionality reduction
• Use a stacked RBM as a deep auto-encoder
1. Train the network with images as both input and output
2. Limit one layer to a few dimensions
→ Information has to pass through the middle layer
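The bottleneck idea can be sketched with a tiny auto-encoder trained by plain backpropagation (a simplification of the RBM-pretrained version in the talk); the data, layer sizes, and learning rate are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy "images": 64 binary vectors of 16 pixels each
X = rng.integers(0, 2, size=(64, 16)).astype(float)

# Encoder 16 -> 2 and decoder 2 -> 16: the 2-unit middle layer is the bottleneck
W_enc = rng.normal(0, 0.1, (16, 2))
W_dec = rng.normal(0, 0.1, (2, 16))
lr = 0.5

def mse():
    recon = sigmoid(sigmoid(X @ W_enc) @ W_dec)
    return float(np.mean((recon - X) ** 2))

start_err = mse()
for epoch in range(2000):
    code = sigmoid(X @ W_enc)        # low-dimensional code (all info passes here)
    recon = sigmoid(code @ W_dec)    # reconstruction of the input
    d_out = (recon - X) * recon * (1 - recon)
    d_code = (d_out @ W_dec.T) * code * (1 - code)
    W_dec -= lr * code.T @ d_out / len(X)
    W_enc -= lr * X.T @ d_code / len(X)
end_err = mse()
print(start_err, end_err)
```

Because the input must be reproduced from the 2-unit code, training forces the middle layer into a compact representation of the data.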
Dimensionality reduction
[Figure: Olivetti face data, 25 × 25 pixel images reconstructed from 30 dimensions (625 → 30): original vs. deep RBM auto-encoder vs. PCA]
Dimensionality reduction
[Figure: 804,414 Reuters news stories reduced to 2 dimensions: PCA vs. deep RBM auto-encoder]
Uses: Classification
Unlabeled data
• Unlabeled data is readily available
Example: Images from the web
1. Download 10,000,000 images
2. Train a 9-layer DNN
3. Concepts are formed by the DNN
→ 70% better than the previous state of the art
"Building High-level Features Using Large Scale Unsupervised Learning" -- Quoc V. Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeffrey Dean, and Andrew Y. Ng
Uses: AI
Artificial intelligence
Enduro, Atari 2600
• Expert player: 368 points
• Deep Learning: 661 points
"Playing Atari with Deep Reinforcement Learning" -- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller
Uses: Generative (Demo)
How to use it
• Home page of Geoffrey Hinton: https://www.cs.toronto.edu/~hinton/
• Portal: http://deeplearning.net/
• Accord.NET: http://accord-framework.net/