
  • Deep Learning I

    Junhong Kim

    School of Industrial Management Engineering

    Korea University

    2016. 02. 01.


  • Deep Learning I

    Deep learning is research on learning models with multilayer representations

    1. Multilayer (feed-forward) Neural Network

    2. Multilayer graphical model (Deep Belief Network, Deep Boltzmann machine)

    Each layer corresponds to a distributed representation

    1. Units in a layer are not mutually exclusive
    (1) Each unit is a separate feature of the input
    (2) Two units can be active at the same time

    2. They do not correspond to a partitioning (clustering) of the input
    (1) In clustering, an input can only belong to a single cluster

    Figure 1. Multilayer Feed Forward Neural Network

  • Deep Learning I

    Figure 1. Multilayer Feed Forward Neural Network

  • Deep Learning I

    Figure 1. Multilayer Feed Forward Neural Network Figure 2. Inspiration from visual cortex

    The decision class is chosen based on the output probability of each class

    Visual comparison of the multilayer FFNN with the visual cortex

    Applications

  • Deep Learning I

    Success Stories in Deep Learning Area

    Figure 3. Image Reconstruction Figure 4. Image Question Answering Figure 5. Application for Baduk Figure 6. Image Identification

    Figure 7. Speech Recognition

    And so on..

    Facebook.com/AIkorea (Facebook group)

  • Deep Learning I

    Why training is hard ?

    Figure 8. Weight update path in FFNN

    [One] Optimization is harder (underfitting)
    1. Vanishing gradient problem
    2. Saturated units block gradient propagation

    In the early layers the rate of change of the weights converges to zero, while the last layer still has a high rate of change.

    This is a well-known problem in RNNs (a small numeric sketch follows below).
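    A minimal numeric sketch of the vanishing gradient problem listed above, assuming a chain of sigmoid units connected by weights equal to 1.0 (the depth, weight value, and starting activation are illustrative assumptions): the per-layer chain-rule factor sigmoid'(z) * w is at most 0.25, so the back-propagated gradient factor shrinks toward zero as it reaches the early layers.

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    w, a, grad = 1.0, 0.5, 1.0          # weight, activation, accumulated gradient factor
    for layer in range(10):             # a 10-layer chain of single sigmoid units
        z = w * a                       # pre-activation of the next layer
        a = sigmoid(z)
        grad *= sigmoid(z) * (1 - sigmoid(z)) * w   # sigmoid'(z) <= 0.25 per layer
        print(f"after layer {layer + 1}: gradient factor = {grad:.2e}")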

  • Deep Learning I

    Why training is hard ?

    [Two] Overfitting
    1. We are exploring a space of complex functions
    2. Deep nets usually have lots of parameters

    We already know that a neural network can be in a high-variance / low-bias situation

    Figure 9. Visualization of Variance and Bias Problem

  • Deep Learning I

    Why training is hard ?

    Solution for [One] (underfitting): use better optimization methods
    -> this is an active area of research

    Solution for [Two] (overfitting): unsupervised learning (unsupervised pre-training), stochastic dropout training

    Figure 10. Visualization of pre-training

    Figure 11. Visualization of Dropout learning

  • Deep Learning I

    Unsupervised Pre-training

    Initialize the hidden layers with unsupervised learning (RBM, autoencoder)

    We will use a greedy, layer-wise procedure
    -> train one layer at a time, from first to last, with an unsupervised criterion
    -> fix the parameters of the previous hidden layers
    -> previous layers are viewed as feature extraction

    Figure 9. Visualization of pre-training. It resembles other feature extraction methods (PCA, MDS, and so on).

    Number of next-layer units | Number of present-layer units
    10                         | 10
    10                         | 15
    15                         | 10

  • Deep Learning I

    Unsupervised Pre-training

    We call this procedure unsupervised pre-training

    First layer: find hidden-unit features that are more common in training inputs than in random inputs
    Second layer: find combinations of hidden-unit features that are more common than random hidden-unit features
    Third layer: find combinations of combinations of hidden-unit features, and so on

    Advantage: pre-training initializes the parameters in a region where the nearby local optima overfit the data less

    Figure 9. Visualization of pre-training. It resembles other feature extraction methods (PCA, MDS, and so on).

    Number of next-layer units | Number of present-layer units
    10 (same)                  | 10
    10 (smaller than)          | 15
    15 (bigger than)           | 10

  • Deep Learning I

    Fine Tuning

    Once all layers are pre-trained
    -> add an output layer
    -> train the whole network using supervised learning

    Supervised learning is performed as in a regular feed-forward network
    -> forward/back propagation and update

    We call this last phase fine-tuning
    -> all parameters are tuned for the supervised task at hand

    Figure 10. Multi Layer FFNN

    Q. When we use fine-tuning, if we run a large number of update iterations before computing the error rate, could the vanishing gradient problem occur?

  • Deep Learning I

    Pseudo code

    For l = 1 to L (L = number of layers)
    -> build an unsupervised training set from the previous layer's representations h^(l-1)(x) (with h^(0)(x) = x)
    -> train a greedy module (RBM, autoencoder) on that set
    -> use the hidden-layer weights and biases of the greedy module to initialize the deep network parameters W^(l), b^(l)
    [pre-training]

    Initialize W^(L+1), b^(L+1) randomly (as usual)

    Train the whole neural network using (supervised) stochastic gradient descent (with backprop)
    [fine-tuning]

    A minimal code sketch of this procedure is given below.
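    The following is a minimal NumPy sketch of the pseudo code above, using tied-weight sigmoid autoencoders as the greedy module and a softmax output layer for fine-tuning. The layer sizes, learning rates, epoch counts, and the random toy data are illustrative assumptions, not values from the slides.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def pretrain_layer(H, n_hidden, lr=0.1, epochs=10):
        """Train one tied-weight autoencoder on representations H; return (W, b)."""
        n_in = H.shape[1]
        W = rng.normal(0.0, 0.1, (n_in, n_hidden))
        b = np.zeros(n_hidden)              # hidden (encoder) bias
        c = np.zeros(n_in)                  # reconstruction (decoder) bias
        for _ in range(epochs):
            for x in H:
                h = sigmoid(x @ W + b)               # encode
                x_hat = sigmoid(h @ W.T + c)         # decode with tied weights
                d_xhat = (x_hat - x) * x_hat * (1 - x_hat)   # squared-error gradient
                d_h = (d_xhat @ W) * h * (1 - h)
                W -= lr * (np.outer(x, d_h) + np.outer(d_xhat, h))
                b -= lr * d_h
                c -= lr * d_xhat
        return W, b

    def fine_tune(X, y, weights, biases, n_classes, lr=0.1, epochs=10):
        """Add a softmax output layer and train the whole stack with backprop."""
        W_out = rng.normal(0.0, 0.1, (weights[-1].shape[1], n_classes))
        b_out = np.zeros(n_classes)
        for _ in range(epochs):
            for x, t in zip(X, y):
                hs = [x]                                    # forward pass
                for W, b in zip(weights, biases):
                    hs.append(sigmoid(hs[-1] @ W + b))
                logits = hs[-1] @ W_out + b_out
                p = np.exp(logits - logits.max())
                p /= p.sum()
                d = p.copy()                                # cross-entropy gradient
                d[t] -= 1.0
                d_hidden = (d @ W_out.T) * hs[-1] * (1 - hs[-1])
                W_out -= lr * np.outer(hs[-1], d)
                b_out -= lr * d
                for i in reversed(range(len(weights))):     # back-propagate
                    d_prev = (d_hidden @ weights[i].T) * hs[i] * (1 - hs[i])
                    weights[i] -= lr * np.outer(hs[i], d_hidden)
                    biases[i] -= lr * d_hidden
                    d_hidden = d_prev
        return weights, biases, W_out, b_out

    # toy usage: greedy layer-wise pre-training, then supervised fine-tuning
    X = rng.random((50, 20))
    y = rng.integers(0, 3, size=50)
    weights, biases, H = [], [], X
    for n_hidden in [15, 10]:                # one greedy module per hidden layer
        W, b = pretrain_layer(H, n_hidden)
        weights.append(W)
        biases.append(b)
        H = sigmoid(H @ W + b)               # fix this layer, feed representations forward
    fine_tune(X, y, weights, biases, n_classes=3)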

  • Deep Learning I

    Stacked restricted Boltzmann machines: Hinton, Osindero and Teh suggested this procedure with RBMs
    - A fast learning algorithm for deep belief nets. Hinton, Osindero and Teh, 2006.
    - To recognize shapes, first learn to generate images. Hinton, 2006.

    Stacked autoencoders: Bengio, Lamblin, Popovici and Larochelle studied and generalized the procedure to autoencoders
    - Greedy Layer-Wise Training of Deep Networks. Bengio, Lamblin, Popovici and Larochelle, 2007.

    Ranzato, Poultney, Chopra and LeCun also generalized it to sparse autoencoders
    - Efficient Learning of Sparse Representations with an Energy-Based Model. Ranzato, Poultney, Chopra and LeCun, 2007.

    Kinds of stacked unsupervised learning methods

  • Deep Learning I

    Kinds of stacked unsupervised learning methods

    Stacked denoising autoencoders: proposed by Vincent, Larochelle, Bengio and Manzagol
    - Extracting and Composing Robust Features with Denoising Autoencoders. Vincent, Larochelle, Bengio and Manzagol, 2008.

    And more:

    Stacked semi-supervised embeddings
    - Deep Learning via Semi-Supervised Embedding. Weston, Ratle and Collobert, 2008.

    Stacked kernel PCA
    - Kernel Methods for Deep Learning. Cho and Saul, 2009.

    Stacked independent subspace analysis
    - Learning Hierarchical Invariant Spatio-Temporal Features for Action Recognition with Independent Subspace Analysis. Le, Zou, Yeung and Ng, 2011.

  • Deep Learning I

    Impact of initialization

    Figure 11. Impact of initialization: classification error vs. number of hidden layers, comparing no pre-training, pre-training with a regular autoencoder, and pre-training with a restricted Boltzmann machine, on the original MNIST and on a variant harder than the original MNIST problem.

    An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation. Larochelle, Erhan, Courville, 2007.

  • Deep Learning I

    Impact of initialization

    Why Does Unsupervised Pre-training Help Deep Learning? Erhan, Bengio, Courville, Manzagol, Vincent and Bengio, 2010.
    http://www.jmlr.org/papers/volume11/erhan10a/erhan10a.pdf

    1. When we increase the number of hidden layers for classification:
    (1) Without pre-training -> the test error rate increases
    (2) With pre-training -> the test error rate decreases
    This tendency holds for comparatively smaller numbers of hidden units rather than for large ones.

    2. When we increase the number of hidden units in networks with 1, 2 and 3 hidden layers:
    (1) Without pre-training -> at some point the error rate increases
    (2) With pre-training -> the error rate always decreases

    3. Beyond some point, we find that the pre-training method is better than no pre-training.

    All results are based on the best hyperparameters.


  • Deep Learning I

    Choice of hidden layer size

    MNIST

    Rotated MNIST

    Summary: except in some cases, the test error rate decreases as the total number of hidden units increases.

  • Deep Learning I

    Performance on different datasets

    Extracting and Composing Robust Features with Denoising Autoencoders, Vincent, Larochelle, Bengio and Manzagol, 2008.

    In numerous instances (though not always):
    1. The stacked denoising autoencoder achieves the best result
    2. Neural networks with pre-training perform better than an SVM with a radial basis function (RBF) kernel

  • Deep Learning I

    Dropout

    Dropout: A Simple Way to Prevent Neural Networks from Overfitting https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf

    Idea: cripple the neural network by removing hidden units stochastically
    -> each hidden unit is set to 0 with probability 0.5
    -> hidden units cannot co-adapt to other units

    A different dropout probability could be used, but 0.5 usually works well


  • Deep Learning I

    Dropout

  • Deep Learning I

    Dropout back-propagation

    If we add a mask-vector constraint to the general back-propagation gradient descent method, we get dropout back-propagation (a minimal sketch follows below).
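    The following is a minimal NumPy sketch of that statement, assuming a single hidden layer, sigmoid units, and squared error (the sizes, data, and random weights are illustrative assumptions): the same binary mask multiplies the hidden activations in the forward pass and their gradients in the backward pass.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # one hidden layer: 3 inputs, 4 hidden units, 1 output (illustrative sizes)
    n_in, n_hidden, n_out, lr = 3, 4, 1, 0.9
    W1 = rng.normal(0.0, 0.5, (n_in, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.5, (n_hidden, n_out))
    b2 = np.zeros(n_out)

    x = np.array([1.0, 0.0, 1.0])                 # one training example
    t = np.array([1.0])                           # its target

    # forward pass with a dropout mask on the hidden layer
    mask = (rng.random(n_hidden) < 0.5).astype(float)
    h = sigmoid(x @ W1 + b1) * mask               # dropped units output 0
    y = sigmoid(h @ W2 + b2)

    # backward pass: the same mask blocks the gradient of the dropped units
    d_out = (y - t) * y * (1 - y)                 # squared-error output delta
    d_hidden = (d_out @ W2.T) * h * (1 - h) * mask

    W2 -= lr * np.outer(h, d_out)
    b2 -= lr * d_out
    W1 -= lr * np.outer(x, d_hidden)
    b1 -= lr * d_hidden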

  • Deep Learning I

    General back propagation simple example

    Network: input units 1, 2, 3; hidden units 4, 5; output unit 6

    x1 = 1, x2 = 0, x3 = 1
    w14 = 0.2, w15 = -0.3, w24 = 0.4, w25 = 0.1, w34 = -0.5, w35 = 0.2
    w46 = -0.3, w56 = -0.2
    θ4 = -0.4, θ5 = 0.2, θ6 = 0.1

  • Deep Learning I

    General back propagation simple example

    1. Learning rate = 0.9
    2. Δw_ij = 0.9 · Err_j · O_i,  w_ij = w_ij + Δw_ij
    3. Δθ_j = 0.9 · Err_j,  θ_j = θ_j + Δθ_j

    A numeric check of this example is given below.
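    The following sketch recomputes the example with the values from the table above. The error terms Err6 = O6(1 - O6)(T - O6) for the output unit and Err_j = O_j(1 - O_j) · Err6 · w_j6 for the hidden units are the standard sigmoid-unit formulas, and the target value T = 1 is an assumption (it does not appear in the transcript).

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    # inputs, weights, and biases from the table
    x1, x2, x3 = 1, 0, 1
    w14, w15, w24, w25, w34, w35 = 0.2, -0.3, 0.4, 0.1, -0.5, 0.2
    w46, w56 = -0.3, -0.2
    theta4, theta5, theta6 = -0.4, 0.2, 0.1
    lr, T = 0.9, 1                       # learning rate from the slide; target T assumed

    # forward pass
    o4 = sigmoid(x1 * w14 + x2 * w24 + x3 * w34 + theta4)   # ~0.332
    o5 = sigmoid(x1 * w15 + x2 * w25 + x3 * w35 + theta5)   # ~0.525
    o6 = sigmoid(o4 * w46 + o5 * w56 + theta6)              # ~0.474

    # error terms
    err6 = o6 * (1 - o6) * (T - o6)                         # ~0.1311
    err4 = o4 * (1 - o4) * err6 * w46
    err5 = o5 * (1 - o5) * err6 * w56

    # first update: w_ij += lr * err_j * o_i and theta_j += lr * err_j (o_i = x_i for inputs)
    w46 += lr * err6 * o4
    w56 += lr * err6 * o5
    w14 += lr * err4 * x1
    w15 += lr * err5 * x1
    theta6 += lr * err6
    theta4 += lr * err4
    print(round(o6, 3), round(err6, 4))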

  • Deep Learning I

    Dropout back propagation simple example

    Network: input units 1, 2, 3; hidden units 4, 5; output unit 6 (hidden unit 5 is dropped, so its incoming weights w15, w25, w35 are removed)

    x1 = 1, x2 = 0, x3 = 1
    w14 = 0.2, w24 = 0.4, w34 = -0.5
    w46 = -0.3, w56 = -0.2
    θ4 = -0.4, θ5 = 0.2, θ6 = 0.1

  • Deep Learning I

    Dropout back propagation simple example

    1. Learning rate = 0.9
    2. Δw_ij = 0.9 · Err_j · O_i,  w_ij = w_ij + Δw_ij  (for the remaining connections only)
    3. Δθ_j = 0.9 · Err_j,  θ_j = θ_j + Δθ_j

    A numeric sketch of this dropout variant is given below.
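    The following reuses the sketch from the general example, with hidden unit 5 dropped as in the table above; the target value T = 1 is again an assumption.

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    x1, x2, x3 = 1, 0, 1
    w14, w24, w34, w46 = 0.2, 0.4, -0.5, -0.3
    theta4, theta6 = -0.4, 0.1
    lr, T = 0.9, 1                                # target T assumed, as before

    o4 = sigmoid(x1 * w14 + x2 * w24 + x3 * w34 + theta4)
    o6 = sigmoid(o4 * w46 + theta6)               # dropped unit 5 contributes nothing
    err6 = o6 * (1 - o6) * (T - o6)
    err4 = o4 * (1 - o4) * err6 * w46
    w46 += lr * err6 * o4                         # same update rule on the masked network
    w14 += lr * err4 * x1
    theta6 += lr * err6
    theta4 += lr * err4
    print(round(o6, 3), round(err6, 4))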

  • Deep Learning I

    Conclusion, Test time classification

    Whether we use general back-propagation or dropout back-propagation, the weights selected by the dropout mask receive the same first weight update.

    In conclusion, the dropout method introduces diversity: the result depends on the sequence of hidden units selected at each update.

    How do we combine the results?

    At test time, we replace the masks by their expectation
    -> this is simply the constant vector 0.5 if the dropout probability is 0.5
    -> for a single hidden layer, this can be shown to be equivalent to taking the geometric average of all neural networks with all possible binary masks (a minimal sketch follows below)
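    A minimal NumPy sketch of this test-time rule, assuming a dropout probability of 0.5 and an illustrative hidden-activation vector: during training each unit is multiplied by a Bernoulli mask, and at test time the mask is replaced by its expectation, the constant 0.5.

    import numpy as np

    rng = np.random.default_rng(0)
    P_KEEP = 0.5                                   # probability of keeping a hidden unit

    def apply_dropout(h, training):
        if training:
            mask = (rng.random(h.shape) < P_KEEP).astype(float)
            return h * mask                        # random binary mask during training
        return h * P_KEEP                          # test time: expectation of the mask

    h = np.array([0.2, 0.7, 0.5, 0.9])             # an illustrative hidden activation
    print(apply_dropout(h, training=True))         # some units are zeroed
    print(apply_dropout(h, training=False))        # every unit scaled by 0.5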

  • Thank you

