

Deep Learning I
Junhong Kim
School of Industrial Management Engineering
Korea University
2016. 02. 01.

What is deep learning?
- Deep learning is research on learning models with multilayer representations:
  1. Multilayer (feed-forward) neural networks
  2. Multilayer graphical models (Deep Belief Network, Deep Boltzmann Machine)
- Each layer corresponds to a distributed representation:
  1. Units in a layer are not mutually exclusive
     (1) Each unit is a separate feature of the input
     (2) Two units can be active at the same time
  2. Units do not correspond to a partitioning (clustering) of the input
     (1) In clustering, an input can only belong to a single cluster
Figure 1. Multilayer feed-forward neural network

Multilayer FFNN and the visual cortex
- The decision class is chosen from the per-class output probabilities.
- The figures compare a multilayer FFNN with the visual cortex and show how applications use it.
Figure 1. Multilayer feed-forward neural network / Figure 2. Inspiration from the visual cortex

Success stories in the deep learning area
- Image reconstruction, image question answering, an application to Baduk (Go), image identification, speech recognition, and so on.
- See also the Facebook group Facebook.com/AIkorea.
Figures 3-7. Image reconstruction, image question answering, application to Baduk, image identification, speech recognition

Why is training hard?
- [One] Optimization is harder (underfitting)
  1. Vanishing gradient problem
  2. Saturated units block gradient propagation
- In the lower layers the rate of change is low and converges to zero, while in the upper layers it is high.
- This is a well-known problem in recurrent neural networks.
Figure 8. Weight update path in a FFNN

Why is training hard?
- [Two] Overfitting
  1. We are exploring a space of complex functions.
  2. Deep nets usually have lots of parameters.
- We already know that a neural network can end up in a high-variance / low-bias situation.
Figure 9. Visualization of the variance and bias problem

Why is training hard?
- Solution for [One] (underfitting): use a better optimization method
  -> this is an active area of research
- Solution for [Two] (overfitting): unsupervised learning (unsupervised pre-training) and stochastic dropout training
Figure 10. Visualization of pre-training / Figure 11. Visualization of dropout learning

Unsupervised pre-training
- Initialize the hidden layers with unsupervised learning (RBM, autoencoder).
- We will use a greedy, layer-wise procedure:
  -> train one layer at a time, from first to last, with an unsupervised criterion
  -> fix the parameters of the previous hidden layers
  -> previous layers are viewed as feature extraction
- It resembles other feature-extraction methods (PCA, MDS, and so on).
- The number of units in the next layer can be the same as in the present layer (10 -> 10), larger (10 -> 15), or smaller (15 -> 10).
Figure 9. Visualization of pre-training

Unsupervised pre-training
- We call this procedure unsupervised pre-training.
  First layer: find hidden-unit features that are more common in training inputs than in random inputs.
  Second layer: find combinations of hidden-unit features that are more common than random hidden-unit features.
  Third layer: find combinations of combinations of ...
- Advantage: pre-training initializes the parameters in a region such that the nearby local optima overfit the data less.
Figure 9. Visualization of pre-training

Fine-tuning
- Once all layers are pre-trained:
  -> add an output layer
  -> train the whole network using supervised learning
- Supervised learning is performed as in a regular feed-forward network:
  -> forward propagation, back-propagation and update
- We call this last phase fine-tuning:
  -> all parameters are tuned for the supervised task at hand
- Q. During fine-tuning, before we compute the error rate, could the vanishing gradient problem reappear if we run a large number of update iterations?
Figure 10. Multilayer FFNN

Pseudo code (see the sketch below)
- Pre-training: for l = 1 to L (L = number of hidden layers)
  -> build the unsupervised training set from the previous layer's representation of each training input (the 0-th representation being the raw input)
  -> train a greedy module (RBM, autoencoder) on it
  -> use the hidden-layer weights and biases of the greedy module to initialize the parameters of layer l of the deep network
- Fine-tuning: initialize the output-layer weights and biases randomly (as usual),
  then train the whole neural network using (supervised) stochastic gradient descent (with backprop).
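As a concrete illustration of the pseudocode above, here is a minimal NumPy sketch of greedy layer-wise pre-training with tied-weight sigmoid autoencoders. The slides give no code, so every name (pretrain_layer, pretrain_stack, layer_sizes) and every hyperparameter below is an illustrative assumption. Each layer is trained on the fixed output of the previous one, and the learned weights would then initialize the deep network before supervised fine-tuning with ordinary backprop (not shown).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(X, n_hidden, lr=0.1, epochs=50):
    """Train a tied-weight sigmoid autoencoder on X; return its encoder weights and bias."""
    n_in = X.shape[1]
    W = rng.normal(0, 0.1, (n_in, n_hidden))    # encoder weights (decoder uses W.T)
    b = np.zeros(n_hidden)                      # encoder bias
    c = np.zeros(n_in)                          # decoder bias
    for _ in range(epochs):
        H = sigmoid(X @ W + b)                  # hidden representation
        X_hat = sigmoid(H @ W.T + c)            # reconstruction of the input
        # gradients of the squared reconstruction error
        d_out = (X_hat - X) * X_hat * (1 - X_hat)
        d_hid = (d_out @ W) * H * (1 - H)
        grad_W = X.T @ d_hid + (H.T @ d_out).T  # encoder + decoder path (tied weights)
        W -= lr * grad_W / len(X)
        b -= lr * d_hid.mean(axis=0)
        c -= lr * d_out.mean(axis=0)
    return W, b

def pretrain_stack(X, layer_sizes):
    """Greedy layer-wise pre-training: train one autoencoder per layer, from first to last."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = pretrain_layer(H, n_hidden)
        params.append((W, b))
        H = sigmoid(H @ W + b)                  # fixed features fed to the next layer
    return params

# Usage example on random binary data; the returned (W, b) pairs would initialize
# the deep network before supervised fine-tuning.
X = rng.integers(0, 2, (256, 20)).astype(float)
stack = pretrain_stack(X, layer_sizes=[15, 15, 10])
print([W.shape for W, _ in stack])
```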
Kinds of stacked unsupervised learning methods
- Stacked restricted Boltzmann machines: Hinton, Osindero and Teh suggested this procedure with RBMs.
  - A fast learning algorithm for deep belief nets. Hinton, Osindero and Teh, 2006.
  - To recognize shapes, first learn to generate images. Hinton, 2006.
- Stacked autoencoders: Bengio, Lamblin, Popovici and Larochelle studied and generalized the procedure to autoencoders.
  - Greedy Layer-Wise Training of Deep Networks. Bengio, Lamblin, Popovici and Larochelle, 2007.
- Ranzato, Poultney, Chopra and LeCun also generalized it to sparse autoencoders.
  - Efficient Learning of Sparse Representations with an Energy-Based Model. Ranzato, Poultney, Chopra and LeCun, 2007.
- Stacked denoising autoencoders: proposed by Vincent, Larochelle, Bengio and Manzagol.
  - Extracting and Composing Robust Features with Denoising Autoencoders. Vincent, Larochelle, Bengio and Manzagol, 2008.
- And more:
  - Stacked semi-supervised embeddings: Deep Learning via Semi-Supervised Embedding. Weston, Ratle and Collobert, 2008.
  - Stacked kernel PCA: Kernel Methods for Deep Learning. Cho and Saul, 2009.
  - Stacked independent subspace analysis: Learning Hierarchical Invariant Spatio-Temporal Features for Action Recognition with Independent Subspace Analysis. Le, Zou, Yeung and Ng, 2011.

Impact of initialization
- Classification error as a function of the number of hidden layers, for three initializations: without pre-training, pre-training with a regular autoencoder, and pre-training with a restricted Boltzmann machine.
- Evaluated on the original MNIST and on problems harder than the original MNIST.
- An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation. Larochelle, Erhan, Courville, Bergstra and Bengio, 2007.
Figure 11. Impact of initialization

Impact of initialization
- Why Does Unsupervised Pre-training Help Deep Learning? Erhan, Bengio, Courville, Manzagol, Vincent and Bengio, 2010.
  http://www.jmlr.org/papers/volume11/erhan10a/erhan10a.pdf
1. When we increase the number of hidden layers for classification:
   (1) without pre-training -> the test error rate increases
   (2) with pre-training -> the test error rate decreases
   (this tendency is clearer with comparatively few hidden units than with many)
2. When we increase the number of hidden units in the 1st, 2nd and 3rd hidden layers:
   (1) without pre-training -> at some point, the error rate starts to increase
   (2) with pre-training -> the error rate keeps decreasing
3. Beyond some point, we find that pre-training performs better than no pre-training.
- All results are based on the best hyperparameter settings.

Choice of hidden layer size
- Evaluated on MNIST and rotated MNIST.
- Summary: with some exceptions, increasing the total number of hidden units decreases the test error rate.

Performance on different datasets
- Extracting and Composing Robust Features with Denoising Autoencoders. Vincent, Larochelle, Bengio and Manzagol, 2008.
- In numerous instances (not always):
  1. The stacked denoising autoencoder gives the best result.
  2. A neural network with pre-training performs better than an SVM with a radial basis function kernel.

Dropout (see the sketch below)
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting.
  https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
- Idea: cripple the neural network by removing hidden units stochastically
  -> each hidden unit is set to 0 with probability 0.5
  -> hidden units cannot co-adapt to other units
- A different dropout probability could be used, but 0.5 usually works well.
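To make the dropout idea concrete, here is a minimal NumPy sketch of a single hidden layer with dropout: at training time each hidden unit is zeroed with probability 0.5, and at test time the mask is replaced by its expectation, as discussed on the test-time slide at the end of the deck. The function name hidden_forward, the layer sizes, and the parameter p_drop are illustrative assumptions, not names from the slides or the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_forward(x, W, b, p_drop=0.5, train=True):
    """One hidden layer with dropout on its units.

    train=True : each hidden unit is set to 0 with probability p_drop (here 0.5).
    train=False: the mask is replaced by its expectation (1 - p_drop), i.e. the
                 activations are simply scaled by 0.5 when p_drop is 0.5.
    """
    h = sigmoid(x @ W + b)
    if train:
        mask = rng.random(h.shape) >= p_drop    # 1 with probability 1 - p_drop
        return h * mask, mask                   # the mask is reused during backprop
    return h * (1.0 - p_drop), None

# Tiny usage example with made-up sizes: 4 inputs with 3 features, 5 hidden units.
x = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 5)) * 0.1
b = np.zeros(5)
h_train, mask = hidden_forward(x, W, b, train=True)
h_test, _ = hidden_forward(x, W, b, train=False)
```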
Dropout back-propagation
- If we add a mask-vector constraint to the general back-propagation (gradient descent) method, we get dropout back-propagation.

General back-propagation: simple example (see the sketch below)
- Network: input units 1, 2, 3; hidden units 4, 5; output unit 6.

  x1  x2  x3 | w14   w15   w24   w25   w34   w35   w46   w56 | θ4    θ5    θ6
   1   0   1 | 0.2  -0.3   0.4   0.1  -0.5   0.2  -0.3  -0.2 | -0.4   0.2   0.1

- 1. Learning rate l = 0.9
  2. Weight update:  Δw_ij = l · Err_j · O_i,  w_ij = w_ij + Δw_ij
  3. Bias update:    Δθ_j = l · Err_j,  θ_j = θ_j + Δθ_j

Dropout back-propagation: simple example
- The same network, but hidden unit 5 is dropped, so its incoming weights w15, w25 and w35 no longer appear in the update.

  x1  x2  x3 | w14   w24   w34   w46   w56 | θ4    θ5    θ6
   1   0   1 | 0.2   0.4  -0.5  -0.3  -0.2 | -0.4   0.2   0.1

- 1. Learning rate l = 0.9
  2. Weight update:  Δw_ij = l · Err_j · O_i,  w_ij = w_ij + Δw_ij
  3. Bias update:    Δθ_j = l · Err_j,  θ_j = θ_j + Δθ_j
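The slides leave the arithmetic to the reader, so here is a small NumPy sketch of one forward/backward pass on the example network above, using the tabulated weights and the update rules Δw_ij = l·Err_j·O_i and Δθ_j = l·Err_j. The error terms follow the standard sigmoid-unit formulas (Err_6 = O_6(1-O_6)(t-O_6) for the output, Err_j = O_j(1-O_j)·Err_6·w_j6 for a hidden unit), which the slides assume but do not spell out; the target value t = 1 and the optional dropout mask argument are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tabulated example: inputs 1-3, hidden units 4-5, output unit 6.
x = np.array([1.0, 0.0, 1.0])                  # x1, x2, x3
W_hid = np.array([[0.2, -0.3],                 # w14, w15
                  [0.4,  0.1],                 # w24, w25
                  [-0.5, 0.2]])                # w34, w35
b_hid = np.array([-0.4, 0.2])                  # theta4, theta5
w_out = np.array([-0.3, -0.2])                 # w46, w56
b_out = 0.1                                    # theta6
lr, t = 0.9, 1.0                               # learning rate; t is an assumed target

def one_update(mask=np.array([1.0, 1.0])):
    """One forward/backward pass; the mask zeroes dropped hidden units (dropout variant)."""
    o_hid = sigmoid(x @ W_hid + b_hid) * mask          # O4, O5 (possibly masked)
    o_out = sigmoid(o_hid @ w_out + b_out)             # O6
    err_out = o_out * (1 - o_out) * (t - o_out)        # Err6
    err_hid = o_hid * (1 - o_hid) * err_out * w_out * mask   # Err4, Err5
    # Updates: delta_w_ij = lr * Err_j * O_i, delta_theta_j = lr * Err_j
    new_w_out = w_out + lr * err_out * o_hid
    new_b_out = b_out + lr * err_out
    new_W_hid = W_hid + lr * np.outer(x, err_hid)
    new_b_hid = b_hid + lr * err_hid
    return o_out, new_W_hid, new_b_hid, new_w_out, new_b_out

print(one_update())                                    # general back-propagation
print(one_update(mask=np.array([1.0, 0.0])))           # dropout: hidden unit 5 removed
```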
Conclusion: test-time classification
- Regardless of whether general back-propagation or dropout back-propagation is used, the weights selected by the dropout mask are the same at the first weight update.
- In conclusion, the dropout method introduces diversity: the result depends on which hidden units are selected at each update.
- How do we combine the results?
  -> At test time, we replace the masks by their expectation.
  -> This is simply the constant vector 0.5 if the dropout probability is 0.5.
  -> For a single hidden layer, one can show this is equivalent to taking the geometric average of all neural networks with all possible binary masks.

Thank you
