
  • Deep Learning I

    Junhong Kim

    School of Industrial Management Engineering

    Korea University

    2016. 02. 01.


  • Deep Learning I

    Deep learning is research on learning models with multilayer representations

    1. Multilayer (feed-forward) Neural Network

    2. Multilayer graphical model (Deep Belief Network, Deep Boltzmann machine)

    Each layer corresponds to a distributed representation

    1. Units in a layer are not mutually exclusive
    (1) Each unit is a separate feature of the input
    (2) Two units can be active at the same time

    2. They do not correspond to a partitioning (clustering) of the input
    (1) In clustering, an input can only belong to a single cluster

    Figure 1. Multilayer Feed Forward Neural Network

  • Deep Learning I

    Figure 1. Multilayer Feed Forward Neural Network

  • Deep Learning I

    Figure 1. Multilayer Feed Forward Neural Network Figure 2. Inspiration from visual cortex

    The decision class is chosen based on the output probability of each class

    Visual comparison of the multilayer FFNN with the visual cortex

    Applications

  • Deep Learning I

    Success Stories in Deep Learning Area

    Figure 3. Image Reconstruction Figure 4. Image Question Answering Figure 5. Application for Baduk Figure 6. Image Identification

    Figure 7. Speech Recognition

    And so on..

    Facebook.com/AIkorea (Facebook group)

  • Deep Learning I

    Why training is hard ?

    Figure 8. Weight update path in FFNN

    [One] Optimization is harder (underfitting)
    1. Vanishing gradient problem
    2. Saturated units block gradient propagation

    In the early layers the rate of change of the weights converges to zero, while the last layer still has a high rate of change.

    This is a well-known problem in RNNs (a small numeric sketch follows below).
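    A minimal numeric sketch of the vanishing gradient problem listed above, assuming a chain of sigmoid units connected by weights equal to 1.0 (the depth, weight value, and starting activation are illustrative assumptions): the per-layer chain-rule factor sigmoid'(z) * w is at most 0.25, so the back-propagated gradient factor shrinks toward zero as it reaches the early layers.

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    w, a, grad = 1.0, 0.5, 1.0          # weight, activation, accumulated gradient factor
    for layer in range(10):             # a 10-layer chain of single sigmoid units
        z = w * a                       # pre-activation of the next layer
        a = sigmoid(z)
        grad *= sigmoid(z) * (1 - sigmoid(z)) * w   # sigmoid'(z) <= 0.25 per layer
        print(f"after layer {layer + 1}: gradient factor = {grad:.2e}")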

  • Deep Learning I

    Why training is hard ?

    [Two] Overfitting
    1. We are exploring a space of complex functions
    2. Deep nets usually have lots of parameters

    We already know that a neural network can be in a high-variance / low-bias situation

    Figure 9. Visualization of Variance and Bias Problem

  • Deep Learning I

    Why training is hard ?

    Solution for [One] (underfitting): use better optimization methods
    -> this is an active area of research

    Solution for [Two] (overfitting): unsupervised learning (unsupervised pre-training), stochastic dropout training

    Figure 10. Visualization of pre-training

    Figure 11. Visualization of Dropout learning

  • Deep Learning I

    Unsupervised Pre-training

    Initialize the hidden layers with unsupervised learning (RBM, autoencoder)

    We will use a greedy, layer-wise procedure
    -> train one layer at a time, from first to last, with an unsupervised criterion
    -> fix the parameters of the previous hidden layers
    -> previous layers are viewed as feature extraction

    Figure 9. Visualization of pre-training. It resembles other feature extraction methods (PCA, MDS, and so on).

    Number of next-layer units | Number of present-layer units
    10                         | 10
    10                         | 15
    15                         | 10

  • Deep Learning I

    Unsupervised Pre-training

    We call this procedure unsupervised pre-training

    First layer: find hidden-unit features that are more common in training inputs than in random inputs
    Second layer: find combinations of hidden-unit features that are more common than random hidden-unit features
    Third layer: find combinations of combinations of hidden-unit features, and so on

    Advantage: pre-training initializes the parameters in a region where the nearby local optima overfit the data less

    Figure 9. Visualization of pre-training. It resembles other feature extraction methods (PCA, MDS, and so on).

    Number of next-layer units | Number of present-layer units
    10 (same)                  | 10
    10 (smaller than)          | 15
    15 (bigger than)           | 10

  • Deep Learning I

    Fine Tuning

    Once all layers are pre-trained
    -> add an output layer
    -> train the whole network using supervised learning

    Supervised learning is performed as in a regular feed-forward network
    -> forward/back propagation and update

    We call this last phase fine-tuning
    -> all parameters are tuned for the supervised task at hand

    Figure 10. Multi Layer FFNN

    Q. When we use fine-tuning, if we run a large number of update iterations before computing the error rate, could the vanishing gradient problem occur?

  • Deep Learning I

    Pseudo code

    For l = 1 to L (L = number of layers)
    -> build an unsupervised training set from the previous layer's representations h^(l-1)(x) (with h^(0)(x) = x)
    -> train a greedy module (RBM, autoencoder) on that set
    -> use the hidden-layer weights and biases of the greedy module to initialize the deep network parameters W^(l), b^(l)
    [pre-training]

    Initialize W^(L+1), b^(L+1) randomly (as usual)

    Train the whole neural network using (supervised) stochastic gradient descent (with backprop)
    [fine-tuning]

    A minimal code sketch of this procedure is given below.
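    The following is a minimal NumPy sketch of the pseudo code above, using tied-weight sigmoid autoencoders as the greedy module and a softmax output layer for fine-tuning. The layer sizes, learning rates, epoch counts, and the random toy data are illustrative assumptions, not values from the slides.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def pretrain_layer(H, n_hidden, lr=0.1, epochs=10):
        """Train one tied-weight autoencoder on representations H; return (W, b)."""
        n_in = H.shape[1]
        W = rng.normal(0.0, 0.1, (n_in, n_hidden))
        b = np.zeros(n_hidden)              # hidden (encoder) bias
        c = np.zeros(n_in)                  # reconstruction (decoder) bias
        for _ in range(epochs):
            for x in H:
                h = sigmoid(x @ W + b)               # encode
                x_hat = sigmoid(h @ W.T + c)         # decode with tied weights
                d_xhat = (x_hat - x) * x_hat * (1 - x_hat)   # squared-error gradient
                d_h = (d_xhat @ W) * h * (1 - h)
                W -= lr * (np.outer(x, d_h) + np.outer(d_xhat, h))
                b -= lr * d_h
                c -= lr * d_xhat
        return W, b

    def fine_tune(X, y, weights, biases, n_classes, lr=0.1, epochs=10):
        """Add a softmax output layer and train the whole stack with backprop."""
        W_out = rng.normal(0.0, 0.1, (weights[-1].shape[1], n_classes))
        b_out = np.zeros(n_classes)
        for _ in range(epochs):
            for x, t in zip(X, y):
                hs = [x]                                    # forward pass
                for W, b in zip(weights, biases):
                    hs.append(sigmoid(hs[-1] @ W + b))
                logits = hs[-1] @ W_out + b_out
                p = np.exp(logits - logits.max())
                p /= p.sum()
                d = p.copy()                                # cross-entropy gradient
                d[t] -= 1.0
                d_hidden = (d @ W_out.T) * hs[-1] * (1 - hs[-1])
                W_out -= lr * np.outer(hs[-1], d)
                b_out -= lr * d
                for i in reversed(range(len(weights))):     # back-propagate
                    d_prev = (d_hidden @ weights[i].T) * hs[i] * (1 - hs[i])
                    weights[i] -= lr * np.outer(hs[i], d_hidden)
                    biases[i] -= lr * d_hidden
                    d_hidden = d_prev
        return weights, biases, W_out, b_out

    # toy usage: greedy layer-wise pre-training, then supervised fine-tuning
    X = rng.random((50, 20))
    y = rng.integers(0, 3, size=50)
    weights, biases, H = [], [], X
    for n_hidden in [15, 10]:                # one greedy module per hidden layer
        W, b = pretrain_layer(H, n_hidden)
        weights.append(W)
        biases.append(b)
        H = sigmoid(H @ W + b)               # fix this layer, feed representations forward
    fine_tune(X, y, weights, biases, n_classes=3)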

  • Deep Learning I

    Stacked restricted Boltzmann machines: Hinton, Osindero and Teh suggested this procedure with RBMs
    - A fast learning algorithm for deep belief nets. Hinton, Osindero and Teh, 2006.
    - To recognize shapes, first learn to generate images. Hinton, 2006.

    Stacked autoencoders: Bengio, Lamblin, Popovici and Larochelle studied and generalized the procedure to autoencoders
    - Greedy Layer-Wise Training of Deep Networks. Bengio, Lamblin, Popovici and Larochelle, 2007.

    Ranzato, Poultney, Chopra and LeCun also generalized it to sparse autoencoders
    - Efficient Learning of Sparse Representations with an Energy-Based Model. Ranzato, Poultney, Chopra and LeCun, 2007.

    Kinds of stacked unsupervised learning methods

  • Deep Learning I

    Kinds of stacked unsupervised learning methods

    Stacked denoising autoencoders: proposed by Vincent, Larochelle, Bengio and Manzagol
    - Extracting and Composing Robust Features with Denoising Autoencoders. Vincent, Larochelle, Bengio and Manzagol, 2008.

    And more:

    Stacked semi-supervised embeddings
    - Deep Learning via Semi-Supervised Embedding. Weston, Ratle and Collobert, 2008.

    Stacked kernel PCA
    - Kernel Methods for Deep Learning. Cho and Saul, 2009.

    Stacked independent subspace analysis
    - Learning Hierarchical Invariant Spatio-Temporal Features for Action Recognition with Independent Subspace Analysis. Le, Zou, Yeung and Ng, 2011.

  • Deep Learning I

    Impact of initialization

    Figure 11. Impact of initialization: classification error vs. number of hidden layers, comparing no pre-training, pre-training with a regular autoencoder, and pre-training with a restricted Boltzmann machine, on the original MNIST and on a variant harder than the original MNIST problem.

    An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation. Larochelle, Erhan, Courville, 2007.

  • Deep Learning I

    Impact of initialization

    Why Does Unsupervised Pre-training Help Deep Learning? Erhan, Bengio, Courville, Manzagol, Vincent and Bengio, 2010.
    http://www.jmlr.org/papers/volume11/erhan10a/erhan10a.pdf

    1. When we increase the number of hidden layers for classification:
    (1) Without pre-training -> the test error rate increases
    (2) With pre-training -> the test error rate decreases
    This tendency holds for comparatively smaller numbers of hidden units rather than for large ones.

    2. When we increase the number of hidden units in networks with 1, 2 and 3 hidden layers:
    (1) Without pre-training -> at some point the error rate increases
    (2) With pre-training -> the error rate always decreases

    3. Beyond some point, we find that the pre-training method is better than no pre-training.

    All results are based on the best hyperparameters.


  • Deep Learning I

    Choice of hidden layer size

    MNIST

    Rotated MNIST

    Summary: except in some cases, the test error rate decreases as the total number of hidden units increases.

  • Deep Learning I

    Performance on different datasets

    Extracting and Composing Robust Features with Denoising Autoencoders, Vincent, Larochelle, Bengio and Manzagol, 2008.

    In numerous instances (though not always):
    1. The stacked denoising autoencoder achieves the best result
    2. Neural networks with pre-training perform better than an SVM with a radial basis function (RBF) kernel

  • Deep Learning I

    Dropout

    Dropout: A Simple Way to Prevent Neural Networks from Overfitting https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf

    Idea: cripple the neural network by removing hidden units stochastically
    -> each hidden unit is set to 0 with probability 0.5
    -> hidden units cannot co-adapt to other units

    A different dropout probability could be used, but 0.5 usually works well


  • Deep Learning I

    Dropout

  • Deep Learning I

    Dropout back-propagation

    If we add a mask-vector constraint to the general back-propagation gradient descent method, we get dropout back-propagation (a minimal sketch follows below).
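    The following is a minimal NumPy sketch of that statement, assuming a single hidden layer, sigmoid units, and squared error (the sizes, data, and random weights are illustrative assumptions): the same binary mask multiplies the hidden activations in the forward pass and their gradients in the backward pass.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # one hidden layer: 3 inputs, 4 hidden units, 1 output (illustrative sizes)
    n_in, n_hidden, n_out, lr = 3, 4, 1, 0.9
    W1 = rng.normal(0.0, 0.5, (n_in, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.5, (n_hidden, n_out))
    b2 = np.zeros(n_out)

    x = np.array([1.0, 0.0, 1.0])                 # one training example
    t = np.array([1.0])                           # its target

    # forward pass with a dropout mask on the hidden layer
    mask = (rng.random(n_hidden) < 0.5).astype(float)
    h = sigmoid(x @ W1 + b1) * mask               # dropped units output 0
    y = sigmoid(h @ W2 + b2)

    # backward pass: the same mask blocks the gradient of the dropped units
    d_out = (y - t) * y * (1 - y)                 # squared-error output delta
    d_hidden = (d_out @ W2.T) * h * (1 - h) * mask

    W2 -= lr * np.outer(h, d_out)
    b2 -= lr * d_out
    W1 -= lr * np.outer(x, d_hidden)
    b1 -= lr * d_hidden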

  • Deep Learning I

    General back propagation simple example

    Network: input units 1, 2, 3; hidden units 4, 5; output unit 6

    x1 = 1, x2 = 0, x3 = 1
    w14 = 0.2, w15 = -0.3, w24 = 0.4, w25 = 0.1, w34 = -0.5, w35 = 0.2
    w46 = -0.3, w56 = -0.2
    θ4 = -0.4, θ5 = 0.2, θ6 = 0.1

  • Deep Learning I

    General back propagation simple example

    1. Learning rate = 0.9
    2. Δw_ij = 0.9 · Err_j · O_i,  w_ij = w_ij + Δw_ij
    3. Δθ_j = 0.9 · Err_j,  θ_j = θ_j + Δθ_j

    A numeric check of this example is given below.
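    The following sketch recomputes the example with the values from the table above. The error terms Err6 = O6(1 - O6)(T - O6) for the output unit and Err_j = O_j(1 - O_j) · Err6 · w_j6 for the hidden units are the standard sigmoid-unit formulas, and the target value T = 1 is an assumption (it does not appear in the transcript).

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    # inputs, weights, and biases from the table
    x1, x2, x3 = 1, 0, 1
    w14, w15, w24, w25, w34, w35 = 0.2, -0.3, 0.4, 0.1, -0.5, 0.2
    w46, w56 = -0.3, -0.2
    theta4, theta5, theta6 = -0.4, 0.2, 0.1
    lr, T = 0.9, 1                       # learning rate from the slide; target T assumed

    # forward pass
    o4 = sigmoid(x1 * w14 + x2 * w24 + x3 * w34 + theta4)   # ~0.332
    o5 = sigmoid(x1 * w15 + x2 * w25 + x3 * w35 + theta5)   # ~0.525
    o6 = sigmoid(o4 * w46 + o5 * w56 + theta6)              # ~0.474

    # error terms
    err6 = o6 * (1 - o6) * (T - o6)                         # ~0.1311
    err4 = o4 * (1 - o4) * err6 * w46
    err5 = o5 * (1 - o5) * err6 * w56

    # first update: w_ij += lr * err_j * o_i and theta_j += lr * err_j (o_i = x_i for inputs)
    w46 += lr * err6 * o4
    w56 += lr * err6 * o5
    w14 += lr * err4 * x1
    w15 += lr * err5 * x1
    theta6 += lr * err6
    theta4 += lr * err4
    print(round(o6, 3), round(err6, 4))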

  • Deep Learning I

    Dropout back propagation simple example

    Network: input units 1, 2, 3; hidden units 4, 5; output unit 6 (hidden unit 5 is dropped, so its incoming weights w15, w25, w35 are removed)

    x1 = 1, x2 = 0, x3 = 1
    w14 = 0.2, w24 = 0.4, w34 = -0.5
    w46 = -0.3, w56 = -0.2
    θ4 = -0.4, θ5 = 0.2, θ6 = 0.1

  • Deep Learning I

    Dropout back propagation simple example

    1. Learning rate = 0.9
    2. Δw_ij = 0.9 · Err_j · O_i,  w_ij = w_ij + Δw_ij  (for the remaining connections only)
    3. Δθ_j = 0.9 · Err_j,  θ_j = θ_j + Δθ_j

    A numeric sketch of this dropout variant is given below.
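    The following reuses the sketch from the general example, with hidden unit 5 dropped as in the table above; the target value T = 1 is again an assumption.

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    x1, x2, x3 = 1, 0, 1
    w14, w24, w34, w46 = 0.2, 0.4, -0.5, -0.3
    theta4, theta6 = -0.4, 0.1
    lr, T = 0.9, 1                                # target T assumed, as before

    o4 = sigmoid(x1 * w14 + x2 * w24 + x3 * w34 + theta4)
    o6 = sigmoid(o4 * w46 + theta6)               # dropped unit 5 contributes nothing
    err6 = o6 * (1 - o6) * (T - o6)
    err4 = o4 * (1 - o4) * err6 * w46
    w46 += lr * err6 * o4                         # same update rule on the masked network
    w14 += lr * err4 * x1
    theta6 += lr * err6
    theta4 += lr * err4
    print(round(o6, 3), round(err6, 4))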

  • Deep Learning I

    Conclusion, Test time classification

    Whether we use general back-propagation or dropout back-propagation, the weights selected by the dropout mask receive the same first weight update.

    In conclusion, the dropout method introduces diversity: the result depends on the sequence of hidden units selected at each update.

    How do we combine the results?

    At test time, we replace the masks by their expectation
    -> this is simply the constant vector 0.5 if the dropout probability is 0.5
    -> for a single hidden layer, this can be shown to be equivalent to taking the geometric average of all neural networks with all possible binary masks (a minimal sketch follows below)
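    A minimal NumPy sketch of this test-time rule, assuming a dropout probability of 0.5 and an illustrative hidden-activation vector: during training each unit is multiplied by a Bernoulli mask, and at test time the mask is replaced by its expectation, the constant 0.5.

    import numpy as np

    rng = np.random.default_rng(0)
    P_KEEP = 0.5                                   # probability of keeping a hidden unit

    def apply_dropout(h, training):
        if training:
            mask = (rng.random(h.shape) < P_KEEP).astype(float)
            return h * mask                        # random binary mask during training
        return h * P_KEEP                          # test time: expectation of the mask

    h = np.array([0.2, 0.7, 0.5, 0.9])             # an illustrative hidden activation
    print(apply_dropout(h, training=True))         # some units are zeroed
    print(apply_dropout(h, training=False))        # every unit scaled by 0.5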

  • Thank you

