DL1: Deep Learning Algorithms

Deep Learning: Algorithms and Applications. Bernardete Ribeiro, [email protected], University of Coimbra, Portugal. INIT/AERFAI Summer School on Machine Learning, Benicassim, 22-26 June 2015.


TRANSCRIPT

Page 1: Dl1 deep learning_algorithms

deep learning

Algorithms and Applications

Bernardete Ribeiro, [email protected]

University of Coimbra, Portugal

INIT/AERFAI Summer School on Machine Learning, Benicassim 22-26 June 2015

Page 2: Dl1 deep learning_algorithms

III - Deep Learning Algorithms

1

Page 3: Dl1 deep learning_algorithms

elements 3: deep neural networks

Page 4: Dl1 deep learning_algorithms

outline

∙ Learning in Deep Neural Networks
∙ Deep Learning: Evolution Timeline
∙ Deep Architectures
∙ Restricted Boltzmann Machines (RBMs)
∙ Deep Belief Networks (DBNs)
∙ Deep Models Overall Characteristics

3

Page 5: Dl1 deep learning_algorithms

learning in deep neural networks

Page 6: Dl1 deep learning_algorithms

learning in deep neural networks

1. No general learning algorithm (no-free-lunch theorem by Wolpert, 1996)

2. Learning algorithms for specific tasks: perception, control, prediction, planning, reasoning, language understanding

3. Limitations of backpropagation (BP): local minima, optimization challenges for non-convex objective functions

4. Hinton's deep belief networks (DBNs) as a stack of RBMs

5. LeCun's energy-based learning for DBNs

5

Page 7: Dl1 deep learning_algorithms

deep learning: evolution timeline

1. Perceptron [Frank Rosenblatt, 1959]
2. Neocognitron [Kunihiko Fukushima, 1980]
3. Convolutional Neural Network (CNN) [LeCun, 1989]
4. Multi-level Hierarchy Networks [Jürgen Schmidhuber, 1992]
5. Deep Belief Networks (DBNs) as a stack of RBMs [Geoffrey Hinton, 2006]

6

Page 8: Dl1 deep learning_algorithms

deep architectures

Page 9: Dl1 deep learning_algorithms

from brain-like computing to deep learning

∙ New empirical and theoretical results have brought deep architectures into the focus of Machine Learning (ML) researchers [Larochelle et al., 2007].

∙ Theoretical results suggest that deep architectures are fundamental for learning the kind of complicated, brain-like functions that can represent high-level abstractions (e.g. vision, speech, language) [Bengio, 2009].

8

Page 10: Dl1 deep learning_algorithms

deep concepts main idea

9

Page 11: Dl1 deep learning_algorithms

deep neural networks

∙ Convolutional Neural Networks (CNNs) [LeCun et al., 1989]
∙ Deep Belief Networks (DBNs) [Hinton et al., 2006]
∙ AutoEncoders (AEs) [Bengio et al., NIPS 2006]
∙ Sparse Autoencoders [Ranzato et al., NIPS 2006]

10

Page 12: Dl1 deep learning_algorithms

convolutional neural networks (cnns)

∙ A Convolutional Neural Network consists of two basic operations:
  ∙ convolution
  ∙ pooling

∙ Convolutional and pooling layers are arranged alternately until high-level features are obtained

∙ Several feature maps in each convolutional layer

∙ Weights in the same map are shared

[Figure: CNN architecture: input, alternating convolutional (C1, C3) and subsampling (S2, S4) layers, followed by a fully connected NN]

¹ I. Arel, D. Rose and T. Karnowski, Deep Machine Learning: A New Frontier in Artificial Intelligence Research, IEEE Computational Intelligence Magazine, 2010.

11

Page 13: Dl1 deep learning_algorithms

convolutional neural networks (cnns)

∙ Convolution: suppose the size of the layer is $d \times d$ and the size of the receptive fields is $r \times r$, and let $\gamma$ and $x$ denote the values of the convolutional layer and of the previous layer, respectively:

$$\gamma_{ij} = g\left(\sum_{m=1}^{r}\sum_{n=1}^{r} x_{i+m-1,\,j+n-1}\, w_{m,n} + b\right), \qquad i,j = 1,\dots,(d-r+1)$$

where $g$ is a nonlinear function.

∙ Pooling follows convolution to reduce the dimensionality of the features and to introduce translational invariance into the CNN.

12
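As a concrete illustration, here is a minimal NumPy sketch of the convolution formula above followed by max-pooling; the input size, weights, bias, sigmoid nonlinearity and pooling window are illustrative assumptions, not values taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv_layer(x, w, b, g=sigmoid):
    """Valid convolution plus nonlinearity g, as in the formula above.

    x: (d, d) previous layer, w: (r, r) shared weights, b: scalar bias.
    Returns gamma of shape (d - r + 1, d - r + 1).
    """
    d, r = x.shape[0], w.shape[0]
    out = np.empty((d - r + 1, d - r + 1))
    for i in range(d - r + 1):
        for j in range(d - r + 1):
            # gamma_ij = g(sum_{m,n} x_{i+m-1, j+n-1} w_{m,n} + b), 0-based indices here
            out[i, j] = g(np.sum(x[i:i + r, j:j + r] * w) + b)
    return out

def max_pool(gamma, p=2):
    """Non-overlapping p x p max-pooling to reduce dimensionality."""
    d = (gamma.shape[0] // p) * p
    g = gamma[:d, :d]
    return g.reshape(d // p, p, d // p, p).max(axis=(1, 3))

# Toy usage: an 8x8 input, one 3x3 feature map, then 2x2 pooling.
rng = np.random.default_rng(0)
x = rng.random((8, 8))
w = rng.normal(scale=0.1, size=(3, 3))
feature_map = conv_layer(x, w, b=0.0)   # shape (6, 6)
pooled = max_pool(feature_map)          # shape (3, 3)
print(feature_map.shape, pooled.shape)
```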

Page 14: Dl1 deep learning_algorithms

deep belief networks (dbns)

∙ Probabilistic generative models, in contrast with the discriminative nature of other NNs

∙ Generative models provide a joint probability distribution of data and labels

∙ Unsupervised greedy layer-wise pre-training followed by fine-tuning

[Figure: DBN on a 28 x 28 pixel image: a stack of RBM layers, each with visible and hidden units, topped by a layer of top-level units that joins the hidden units with the labels (detection layer)]

² Based on I. Arel, D. Rose and T. Karnowski, Deep Machine Learning: A New Frontier in Artificial Intelligence Research, IEEE Computational Intelligence Magazine, 2010.

13

Page 15: Dl1 deep learning_algorithms

autoencoders (aes)

∙ The auto-encoder has two components:
  ∙ the encoder f (mapping x to h) and
  ∙ the decoder g (mapping h to r)

∙ An auto-encoder is a neural network that tries to reconstruct its input at its output

[Figure: auto-encoder: input x → encoder f → code h → decoder g → reconstruction r]

³ Based on Y. Bengio, I. Goodfellow and A. Courville, Deep Learning, An MIT Press book (in preparation), www.iro.umontreal.ca/~bengioy/dlbook

14
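To make the two mappings concrete, here is a minimal NumPy sketch of an auto-encoder's forward pass; the single hidden layer, sigmoid units and tied weights are illustrative assumptions (no training loop is shown).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TiedAutoencoder:
    """x --encoder f--> h --decoder g--> r, with tied weights W and W.T."""

    def __init__(self, n_visible, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
        self.b = np.zeros(n_hidden)   # encoder bias
        self.c = np.zeros(n_visible)  # decoder bias

    def encode(self, x):              # f: x -> h
        return sigmoid(x @ self.W + self.b)

    def decode(self, h):              # g: h -> r
        return sigmoid(h @ self.W.T + self.c)

    def reconstruct(self, x):
        return self.decode(self.encode(x))

# Usage: reconstruction error on random data (untrained weights).
rng = np.random.default_rng(1)
x = rng.random((5, 16))
ae = TiedAutoencoder(n_visible=16, n_hidden=4)
r = ae.reconstruct(x)
print("mean squared reconstruction error:", np.mean((x - r) ** 2))
```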

Page 16: Dl1 deep learning_algorithms

deep architectures versus shallow architectures

∙ Deep architectures can be exponentially more efficient than shallow architectures [Roux and Bengio, 2010].

∙ Functions that can be compactly represented with a Neural Network (NN) of depth d may require an exponential number of computational elements for a network of depth d − 1 [Bengio, 2009].

∙ Since the number of computational elements depends on the number of training samples available, using shallow architectures may result in models with poor generalization [Bengio, 2009].

∙ As a result, deep architecture models tend to outperform shallow models such as Support Vector Machines (SVMs) [Larochelle et al., 2007].

15


Page 18: Dl1 deep learning_algorithms

Restricted Boltzmann Machines

Deep Belief Networks

16

Page 19: Dl1 deep learning_algorithms

restricted boltzmann machines

Page 20: Dl1 deep learning_algorithms

restricted boltzmann machines (rbms)

[Figure: RBM as a bipartite graph: visible units v1 ... vI plus a bias unit, fully connected to hidden units h1 ... hJ plus a bias unit; the upward pass acts as an encoder and the downward pass as a decoder]

18

Page 21: Dl1 deep learning_algorithms

restricted boltzmann machines (rbms)

∙ Unsupervised
  ∙ finds complex regularities in the training data
∙ Bipartite graph
  ∙ visible and hidden layers
∙ Binary stochastic units
  ∙ on/off with a given probability
∙ One iteration
  ∙ update the hidden units
  ∙ reconstruct the visible units
∙ Maximum likelihood of the training data

[Figure: RBM bipartite graph, as on the previous slide]

19

Page 22: Dl1 deep learning_algorithms

restricted boltzmann machines (rbms)

∙ Training goal: best probable reproduction
  ∙ unsupervised data
  ∙ find the latent factors of the dataset
∙ Adjust the weights to get maximum probability of the input data

[Figure: RBM bipartite graph, as on the previous slides]

20

Page 23: Dl1 deep learning_algorithms

restricted boltzmann machines (rbms)

Given an observed state, the energy of the joint configuration of the visible and hidden units (v, h) is given by:

$$E(\mathbf{v},\mathbf{h}) = -\sum_{i=1}^{I} c_i v_i - \sum_{j=1}^{J} b_j h_j - \sum_{j=1}^{J}\sum_{i=1}^{I} W_{ji} v_i h_j \qquad (1)$$

where W is the matrix of weights, and b and c are the biases of the hidden and visible layers, respectively.

[Figure: RBM bipartite graph, as on the previous slides]

21
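To make Eq. (1) concrete, the snippet below evaluates the energy of one (v, h) configuration with NumPy; the layer sizes and the random parameters are illustrative assumptions.

```python
import numpy as np

def rbm_energy(v, h, W, b, c):
    """E(v, h) = -c.v - b.h - sum_{j,i} W[j, i] * v[i] * h[j], as in Eq. (1)."""
    return -np.dot(c, v) - np.dot(b, h) - h @ W @ v

# Toy configuration: I = 4 visible units, J = 3 hidden units.
rng = np.random.default_rng(0)
I, J = 4, 3
W = rng.normal(scale=0.1, size=(J, I))   # weight matrix W[j, i]
b = np.zeros(J)                          # hidden biases
c = np.zeros(I)                          # visible biases
v = rng.integers(0, 2, size=I)           # binary visible state
h = rng.integers(0, 2, size=J)           # binary hidden state
print("E(v, h) =", rbm_energy(v, h, W, b, c))
```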

Page 24: Dl1 deep learning_algorithms

restricted boltzmann machines (rbms)

The Restricted Boltzmann Machine (RBM) assigns a probability to each configuration (v, h), using:

$$p(\mathbf{v},\mathbf{h}) = \frac{e^{-E(\mathbf{v},\mathbf{h})}}{Z} \qquad (2)$$

where Z is a normalization constant called the partition function, obtained by summing $e^{-E(\mathbf{v},\mathbf{h})}$ over all possible (v, h) configurations [Bengio, 2009, Hinton, 2010, Carreira-Perpiñán and Hinton, 2005]:

$$Z = \sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})} \qquad (3)$$

22

Page 25: Dl1 deep learning_algorithms

restricted boltzmann machines (rbms)

Since there are no connections between any two units within the same layer, given a particular random input configuration v, all the hidden units are independent of each other and the probability of h given v becomes:

$$p(\mathbf{h} \mid \mathbf{v}) = \prod_j p(h_j \mid \mathbf{v}) \qquad (4)$$

where

$$p(h_j = 1 \mid \mathbf{v}) = \sigma\Big(b_j + \sum_{i=1}^{I} v_i W_{ji}\Big) \qquad (5)$$

23

Page 26: Dl1 deep learning_algorithms

restricted boltzmann machines (rbms)

Similarly, given a specific hidden state h, the probability of v given h is obtained by (6):

$$p(\mathbf{v} \mid \mathbf{h}) = \prod_i p(v_i \mid \mathbf{h}) \qquad (6)$$

where:

$$p(v_i = 1 \mid \mathbf{h}) = \sigma\Big(c_i + \sum_{j=1}^{J} h_j W_{ji}\Big) \qquad (7)$$

24

Page 27: Dl1 deep learning_algorithms

restricted boltzmann machines (rbms)

Given a random training vector v, the state of a given hidden unit j is set to 1 with probability:

$$p(h_j = 1 \mid \mathbf{v}) = \sigma\Big(b_j + \sum_{i} v_i W_{ji}\Big)$$

Similarly:

$$p(v_i = 1 \mid \mathbf{h}) = \sigma\Big(c_i + \sum_{j} h_j W_{ji}\Big)$$

where $\sigma(x)$ is the sigmoid squashing function $\frac{1}{1+e^{-x}}$.

25
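These two conditionals are all that is needed to sample an RBM layer by layer; below is a minimal NumPy sketch of the sampling step (toy layer sizes and random parameters are assumptions for illustration).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h_given_v(v, W, b, rng):
    """p(h_j = 1 | v) = sigma(b_j + sum_i v_i W_ji); returns probabilities and a binary sample."""
    p = sigmoid(b + W @ v)
    return p, (rng.random(p.shape) < p).astype(float)

def sample_v_given_h(h, W, c, rng):
    """p(v_i = 1 | h) = sigma(c_i + sum_j h_j W_ji); returns probabilities and a binary sample."""
    p = sigmoid(c + W.T @ h)
    return p, (rng.random(p.shape) < p).astype(float)

# Usage on a toy RBM with I = 6 visible and J = 3 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 6))   # W[j, i]
b, c = np.zeros(3), np.zeros(6)
v = rng.integers(0, 2, size=6).astype(float)
p_h, h = sample_h_given_v(v, W, b, rng)
p_v, v_recon = sample_v_given_h(h, W, c, rng)
print(p_h, p_v)
```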

Page 28: Dl1 deep learning_algorithms

restricted boltzmann machines (rbms)

The marginal probability assigned to a visible vector v is given by (8):

$$p(\mathbf{v}) = \sum_{\mathbf{h}} p(\mathbf{v},\mathbf{h}) = \frac{1}{Z}\sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})} \qquad (8)$$

Hence, given a specific training vector v, its probability can be raised by adjusting the weights and the biases in order to lower the energy of that particular vector while raising the energy of all the others.

26
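For a very small RBM, Z in Eq. (3) and p(v) in Eq. (8) can be computed exactly by enumerating every binary configuration; the brute-force sketch below (with arbitrary toy parameters) is only feasible for a handful of units, which is precisely why training relies on the gradient approximations described next.

```python
import itertools
import numpy as np

def energy(v, h, W, b, c):
    return -np.dot(c, v) - np.dot(b, h) - h @ W @ v

def partition_function(W, b, c):
    """Z = sum over all binary (v, h) of exp(-E(v, h)); exponential cost, toy sizes only."""
    J, I = W.shape
    Z = 0.0
    for v in itertools.product([0, 1], repeat=I):
        for h in itertools.product([0, 1], repeat=J):
            Z += np.exp(-energy(np.array(v), np.array(h), W, b, c))
    return Z

def marginal_p_v(v, W, b, c, Z):
    """p(v) = (1/Z) * sum over h of exp(-E(v, h)), as in Eq. (8)."""
    J = W.shape[0]
    s = sum(np.exp(-energy(v, np.array(h), W, b, c))
            for h in itertools.product([0, 1], repeat=J))
    return s / Z

# Toy RBM: 3 visible and 2 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(2, 3))
b, c = np.zeros(2), np.zeros(3)
Z = partition_function(W, b, c)
print("p(v = [1, 0, 1]) =", marginal_p_v(np.array([1, 0, 1]), W, b, c, Z))
```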

Page 29: Dl1 deep learning_algorithms

restricted boltzmann machines (rbms)

To this end, we can perform a stochastic gradient ascent procedure on the log-likelihood of the training data vectors, using (9):

$$\frac{\partial \log p(\mathbf{v})}{\partial \theta} = \underbrace{-\sum_{\mathbf{h}} p(\mathbf{h} \mid \mathbf{v})\, \frac{\partial E(\mathbf{v},\mathbf{h})}{\partial \theta}}_{\text{positive phase}} + \underbrace{\sum_{\mathbf{v},\mathbf{h}} p(\mathbf{v},\mathbf{h})\, \frac{\partial E(\mathbf{v},\mathbf{h})}{\partial \theta}}_{\text{negative phase}} \qquad (9)$$

27

Page 30: Dl1 deep learning_algorithms

training an rbm

Page 31: Dl1 deep learning_algorithms

training an rbm

The learning rule for performing stochastic steepest ascent in the log probability of the training data:

$$\frac{\partial \log p(\mathbf{v})}{\partial \theta} = \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_\infty \qquad (10)$$

where $\langle\cdot\rangle_0$ denotes expectations under the data distribution ($p_0 = p(\mathbf{h} \mid \mathbf{v})$) and $\langle\cdot\rangle_\infty$ denotes expectations under the model distribution $p_\infty(\mathbf{v},\mathbf{h}) = p(\mathbf{v},\mathbf{h})$ [Roux and Bengio, 2008].

[Figure: RBM bipartite graph, as on the previous slides]

29

Page 32: Dl1 deep learning_algorithms

mcmc using alternating gibbs sampling

[Figure: alternating Gibbs sampling chain. Starting from a training vector, v(0) = x, the hidden units are sampled with p(h_j = 1 | v) = σ(b_j + Σ_i v_i W_ji), giving h(0) and the data statistic ⟨v_i h_j⟩_0. The visible units are then reconstructed with p(v_i = 1 | h) = σ(c_i + Σ_j h_j W_ji), giving v(1), and the alternation continues: v(1) → h(1) → v(2) → h(2) → ... → v(∞), h(∞). In the limit the chain samples from the model distribution, yielding ⟨v_i h_j⟩_∞.]
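The chain in the figure can be written directly from the two conditionals; the NumPy sketch below (toy sizes and random parameters assumed) runs k alternations and returns the data statistic ⟨v_i h_j⟩_0 and the statistic after k steps.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_chain(x, W, b, c, k, rng):
    """Alternate h ~ p(h|v) and v ~ p(v|h), starting from v(0) = x."""
    v = x.copy()
    p_h = sigmoid(b + W @ v)
    stats_0 = np.outer(p_h, v)                # <v_i h_j>_0 at the data
    h = (rng.random(p_h.shape) < p_h).astype(float)
    for _ in range(k):
        p_v = sigmoid(c + W.T @ h)            # reconstruct the visible units
        v = (rng.random(p_v.shape) < p_v).astype(float)
        p_h = sigmoid(b + W @ v)              # resample the hidden units
        h = (rng.random(p_h.shape) < p_h).astype(float)
    stats_k = np.outer(p_h, v)                # <v_i h_j> after k alternations
    return stats_0, stats_k

# Toy usage: J = 3 hidden units, I = 6 visible units, k = 5 alternations.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 6))
b, c = np.zeros(3), np.zeros(6)
x = rng.integers(0, 2, size=6).astype(float)
s0, sk = gibbs_chain(x, W, b, c, k=5, rng=rng)
print(s0.shape, sk.shape)   # both (3, 6), indexed [j, i]
```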

Page 37: Dl1 deep learning_algorithms

contrastive divergence algorithm

Page 38: Dl1 deep learning_algorithms

contrastive divergence (cd–k)

∙ To solve this problem, Hinton proposed the Contrastive Divergence algorithm.

∙ CD–k replaces $\langle\cdot\rangle_\infty$ by $\langle\cdot\rangle_k$ for small values of k.

$$\Delta W_{ji} = \eta\left(\langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k\right) \qquad (11)$$

36

Page 39: Dl1 deep learning_algorithms

contrastive divergence (cd–k)

∙ v(0) ← x
∙ Compute the binary (feature) states of the hidden units, h(0), using v(0)
∙ for n ← 1 to k
  ∙ Compute the "reconstruction" states for the visible units, v(n), using h(n−1)
  ∙ Compute the "reconstruction" states for the hidden units, h(n), using v(n)
∙ end for
∙ Update the weights and biases according to:

$$\Delta W_{ji} = \eta\left(\langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k\right) \qquad (12)$$

$$\Delta b_j = \eta\left(\langle h_j \rangle_0 - \langle h_j \rangle_k\right) \qquad (13)$$

$$\Delta c_i = \eta\left(\langle v_i \rangle_0 - \langle v_i \rangle_k\right) \qquad (14)$$

37
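Putting these steps together, here is a minimal NumPy sketch of one CD–k update for a single training vector, following Eqs. (12)–(14); the value of k, the learning rate η and the toy layer sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd_k_update(x, W, b, c, k=1, eta=0.1, rng=None):
    """One CD-k update of (W, b, c) for one training vector x, per Eqs. (12)-(14)."""
    if rng is None:
        rng = np.random.default_rng()
    v0 = x.astype(float)
    p_h0 = sigmoid(b + W @ v0)                          # h(0) probabilities
    h = (rng.random(p_h0.shape) < p_h0).astype(float)   # binary h(0)
    for _ in range(k):                                  # k reconstruction steps
        p_v = sigmoid(c + W.T @ h)
        v = (rng.random(p_v.shape) < p_v).astype(float)
        p_h = sigmoid(b + W @ v)
        h = (rng.random(p_h.shape) < p_h).astype(float)
    # Positive minus negative statistics.
    W += eta * (np.outer(p_h0, v0) - np.outer(p_h, v))
    b += eta * (p_h0 - p_h)
    c += eta * (v0 - v)
    return W, b, c

# Toy usage: one CD-1 update on a random binary vector (J = 4, I = 8).
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(4, 8))
b, c = np.zeros(4), np.zeros(8)
x = rng.integers(0, 2, size=8)
W, b, c = cd_k_update(x, W, b, c, k=1, eta=0.1, rng=rng)
```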

Page 40: Dl1 deep learning_algorithms

deep belief networks (dbns)

Page 41: Dl1 deep learning_algorithms

deep belief networks (dbns)

[Figure: greedy stacking of RBMs: x ↔ h1 (with p(h1|x) upward and p(x|h1) downward); then h1 ↔ h2 (p(h2|h1), p(h1|h2)); then h2 ↔ h3 (p(h3|h2), p(h2|h3))]

39

Page 42: Dl1 deep learning_algorithms

deep belief networks (dbns)

∙ Start with a training vector on the visible units

∙ Update all the hidden units in parallel

∙ Update all the visible units in parallel to get a "reconstruction"

∙ Update the hidden units again

[Figure: stack of RBMs, as on the previous slide]

40

Page 43: Dl1 deep learning_algorithms

pre-training and fine tuning

[Figure: DBN pre-training and fine-tuning. Left: greedy layer-wise RBM pre-training of the stack data → 500 hidden units → 300 hidden units → 100 hidden units → 10 hidden units. Right: the resulting DBN model is fine-tuned with backpropagation (BP), updating the weights until the error falls below 0.001.]

41
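A sketch of the left-hand part of the figure, greedy layer-wise pre-training with CD–1, is given below; the layer sizes are scaled down from the 500-300-100-10 stack in the figure, the data, number of epochs and learning rate are illustrative assumptions, and the supervised BP fine-tuning stage is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_epoch(data, W, b, c, eta, rng):
    """One epoch of CD-1 over the rows of `data`, using mean-field reconstructions."""
    for v0 in data:
        p_h0 = sigmoid(b + W @ v0)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        p_v1 = sigmoid(c + W.T @ h0)              # reconstruction of the visible units
        p_h1 = sigmoid(b + W @ p_v1)
        W += eta * (np.outer(p_h0, v0) - np.outer(p_h1, p_v1))
        b += eta * (p_h0 - p_h1)
        c += eta * (v0 - p_v1)
    return W, b, c

def pretrain_stack(data, layer_sizes, epochs=10, eta=0.05, seed=0):
    """Greedy layer-wise pre-training: one RBM per layer, trained on the activities below."""
    rng = np.random.default_rng(seed)
    weights, x = [], data
    for n_hidden in layer_sizes:
        n_visible = x.shape[1]
        W = rng.normal(scale=0.01, size=(n_hidden, n_visible))
        b, c = np.zeros(n_hidden), np.zeros(n_visible)
        for _ in range(epochs):
            W, b, c = cd1_epoch(x, W, b, c, eta, rng)
        weights.append((W, b, c))
        x = sigmoid(x @ W.T + b)                  # mean activities feed the next RBM
    return weights

# Toy usage: random binary "data" and a scaled-down 50-30-10 stack.
rng = np.random.default_rng(1)
data = (rng.random((20, 64)) > 0.5).astype(float)
stack = pretrain_stack(data, layer_sizes=[50, 30, 10], epochs=2)
print([W.shape for W, _, _ in stack])
```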

Page 44: Dl1 deep learning_algorithms

deep belief networks (dbns)

42

Page 45: Dl1 deep learning_algorithms

practical considerations

Page 46: Dl1 deep learning_algorithms

weights initialization

44

Page 47: Dl1 deep learning_algorithms

deep belief networks (dbns) - adaptive learning rate size

$$\eta_{ji} = \begin{cases} u\,\eta_{ji}^{(old)} & \text{if } \left(\langle v_i h_j\rangle_0 - \langle v_i h_j\rangle_k\right)\left(\langle v_i h_j\rangle_0^{(old)} - \langle v_i h_j\rangle_k^{(old)}\right) > 0 \\ d\,\eta_{ji}^{(old)} & \text{if } \left(\langle v_i h_j\rangle_0 - \langle v_i h_j\rangle_k\right)\left(\langle v_i h_j\rangle_0^{(old)} - \langle v_i h_j\rangle_k^{(old)}\right) < 0 \end{cases}$$

⁴ Lopes et al., Towards Adaptive Learning with Improved Convergence of DBNs on GPUs, Pattern Recognition, 2014.

45
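A minimal sketch of this per-weight adaptive rule is given below; the growth and shrink factors u > 1 and d < 1 and the initial rate are illustrative assumptions (the slide does not give the values used by Lopes et al.).

```python
import numpy as np

def adapt_learning_rates(eta, stat, stat_old, u=1.2, d=0.8):
    """Per-weight adaptive step size for CD training.

    eta, stat and stat_old are (J, I) arrays; `stat` is the current CD statistic
    <v_i h_j>_0 - <v_i h_j>_k and `stat_old` is the one from the previous update.
    Each rate grows by u when consecutive updates agree in sign and shrinks by d
    when they disagree; it is left unchanged when the product is exactly zero.
    """
    agreement = stat * stat_old
    eta = np.where(agreement > 0, u * eta, eta)
    eta = np.where(agreement < 0, d * eta, eta)
    return eta

# Toy usage.
rng = np.random.default_rng(0)
eta = np.full((3, 6), 0.1)
stat, stat_old = rng.normal(size=(3, 6)), rng.normal(size=(3, 6))
eta = adapt_learning_rates(eta, stat, stat_old)
print(eta)
```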

Page 48: Dl1 deep learning_algorithms

adaptive step size

[Figure: average reconstruction error (RMSE) over 1000 training epochs, in three panels for α = 0.1, 0.4 and 0.7, comparing the adaptive step size against fixed learning rates γ = 0.1, 0.4 and 0.7.]

46

Page 49: Dl1 deep learning_algorithms

convergence results (α = 0.1)

[Figure: training images and their reconstructions after 50, 100, 250, 500, 750 and 1000 epochs, comparing the adaptive step size with a fixed (optimized) learning rate η = 0.4.]

47

Page 50: Dl1 deep learning_algorithms

deep models characteristics

Page 51: Dl1 deep learning_algorithms

deep models characteristics

∙ Biological plausibility

∙ DBNs are effective in a wide range of ML problems.

∙ Creating a Deep Belief Network (DBN) model is a time-consuming and computationally expensive task that involves training several Restricted Boltzmann Machines (RBMs), demanding considerable effort.

∙ The adaptive step-size procedure for tuning the learning rate has been incorporated in the learning model with excellent results.

∙ Graphics Processing Units (GPUs) can significantly reduce the convergence time of the data-intensive tasks in DBNs.

49


Page 56: Dl1 deep learning_algorithms

references

Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127.

Carreira-Perpiñán, M. A. and Hinton, G. E. (2005). On contrastive divergence learning. In Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics (AISTATS 2005), pages 33–40.

Hinton, G. E. (2010). A practical guide to training restricted Boltzmann machines. Technical report, Department of Computer Science, University of Toronto.

Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning (ICML 2007), pages 473–480. ACM.

Roux, N. L. and Bengio, Y. (2008). Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, 20(6):1631–1649.

Roux, N. L. and Bengio, Y. (2010). Deep belief networks are compact universal approximators. Neural Computation, 22(8):2192–2207.

50

Page 58: Dl1 deep learning_algorithms

Questions?

50

Page 59: Dl1 deep learning_algorithms

deep learning

Algorithms and Applications

Bernardete Ribeiro, [email protected], June 24, 2015

University of Coimbra, Portugal

INIT/AERFAI Summer School on Machine Learning, Benicassim 22-26 June 2015