Learning Deep Architectures for AI - Yoshua Bengio Part II - Vijay Chakilam

Page 1:

Learning Deep Architectures for AI

- Yoshua Bengio

Part II - Vijay Chakilam

Page 2:

Limitations of Perceptron

•  There is no setting of W and b for which the model produces the right target for every example.

[Figure: a single-layer perceptron with inputs x1, x2, parameters W, b, and output y, alongside a graphical view of the XOR problem: the four points (0,0), (1,1) (output = 0) and (0,1), (1,0) (output = 1) cannot be separated by a single line in the plane.]

Page 3:

Learning representations

[Figure: a one-hidden-layer network x = (x1, x2) → h = (h1, h2) → y, with first-layer parameters W, b followed by the nonlinearity φ, and output parameters V, c. In the learned hidden representation the XOR inputs map to the points (0,0), (1,0), and (2,1), which are linearly separable.]

A set of parameters that solves XOR:

W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad b = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \quad V = \begin{bmatrix} 1 \\ -2 \end{bmatrix}, \quad \text{and } c = 0
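As a sanity check, these parameters do solve XOR when φ is taken to be the rectifier (ReLU), which is consistent with the hidden-unit coordinates (0,0), (1,0), (2,1) shown in the figure. The short NumPy sketch below is illustrative and not part of the original slides:

```python
import numpy as np

# Parameters from the slide; phi is assumed to be the ReLU, consistent with
# the hidden-unit coordinates (0,0), (1,0), (2,1) in the figure.
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
b = np.array([0.0, -1.0])
V = np.array([1.0, -2.0])
c = 0.0

def forward(x):
    h = np.maximum(0.0, W @ x + b)   # h = phi(Wx + b)
    return V @ h + c                 # y = V^T h + c

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, forward(np.array(x, dtype=float)))   # prints 0, 1, 1, 0 -> XOR
```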

Page 4:

Backpropagation

E = \frac{1}{2} \sum_{j \in \text{output}} (t_j - y_j)^2, \qquad \frac{\partial E}{\partial y_j} = -(t_j - y_j)

For a logistic output unit y_j = \sigma(z_j):

\frac{\partial E}{\partial z_j} = \frac{d y_j}{d z_j} \frac{\partial E}{\partial y_j} = y_j (1 - y_j) \frac{\partial E}{\partial y_j}

For a unit i in the layer below, connected to unit j by weight w_{ij}:

\frac{\partial E}{\partial y_i} = \sum_j \frac{d z_j}{d y_i} \frac{\partial E}{\partial z_j} = \sum_j w_{ij} \frac{\partial E}{\partial z_j}, \qquad \frac{\partial E}{\partial w_{ij}} = \frac{\partial z_j}{\partial w_{ij}} \frac{\partial E}{\partial z_j} = y_i \frac{\partial E}{\partial z_j}
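Written out for a single logistic layer with squared-error loss, these four quantities are only a few lines of NumPy. The sketch below is illustrative (shapes and values are made up), not code from the slides:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# One logistic layer: inputs y_i -> z_j = sum_i w_ij y_i -> y_j = sigmoid(z_j)
y_i = np.array([0.2, 0.9, 0.5])          # activities of the layer below
w   = 0.1 * np.random.randn(3, 2)        # w[i, j] connects unit i to unit j
t_j = np.array([1.0, 0.0])               # targets for the output units

z_j = y_i @ w
y_j = sigmoid(z_j)

E       = 0.5 * np.sum((t_j - y_j) ** 2)     # E = 1/2 sum_j (t_j - y_j)^2
dE_dy_j = -(t_j - y_j)                       # dE/dy_j
dE_dz_j = y_j * (1.0 - y_j) * dE_dy_j        # dE/dz_j = y_j (1 - y_j) dE/dy_j
dE_dy_i = w @ dE_dz_j                        # dE/dy_i = sum_j w_ij dE/dz_j
dE_dw   = np.outer(y_i, dE_dz_j)             # dE/dw_ij = y_i dE/dz_j
```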

Page 5:

Deep vs Shallow

•  Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts.

•  For a deep model with n_0 inputs and k hidden layers of width n, the maximal number of response regions per parameter behaves as

\Omega\!\left( \left\lfloor \frac{n}{n_0} \right\rfloor^{k-1} \frac{n^{n_0 - 2}}{k} \right)

•  For a shallow model with n_0 inputs and nk hidden units, the maximal number of response regions per parameter behaves as

O\!\left( k^{\,n_0 - 1} \, n^{\,n_0 - 1} \right)
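To get a rough feel for how the two rates compare, the snippet below plugs example sizes into the expressions as reconstructed above. It evaluates asymptotic orders only (constants dropped) and is purely illustrative, not from the slides:

```python
from math import floor

def deep_regions_per_param(n0, n, k):
    # Omega( floor(n/n0)^(k-1) * n^(n0-2) / k ), up to constant factors
    return floor(n / n0) ** (k - 1) * n ** (n0 - 2) / k

def shallow_regions_per_param(n0, n, k):
    # O( k^(n0-1) * n^(n0-1) ), up to constant factors; the shallow net has nk hidden units
    return k ** (n0 - 1) * n ** (n0 - 1)

# Example: n0 = 4 inputs, layers of width n = 32, increasing depth k
for k in (1, 2, 4, 8):
    print(k, deep_regions_per_param(4, 32, k), shallow_regions_per_param(4, 32, k))
```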

Page 6:

Deep vs Shallow

•  Pascanu et al. (2013) showed that the deep model can generate exponentially more regions per parameter in terms of the number of hidden layers k, and at least polynomially more regions per parameter in terms of the layer width n.

•  Montufar et al. (2014) gave a significantly improved lower bound on the maximal number of linear regions.

Page 7:

Unsupervised Pre-training

•  Scarcity of labeled examples and availability of many unlabeled examples.

•  Unknown future tasks.

•  Once a good high-level representation is learned, other learning tasks can be much easier.
   •  For example, kernel machines can be very powerful when using an appropriate kernel, i.e. an appropriate feature space.

Page 8:

DNN with unsupervised pre-training using RBMs

[Figure: an RBM is first trained on the input X, learning weights W1 and hidden layer Z1; a second RBM trained on Z1 learns W2 and Z2; a third trained on Z2 learns W3 and Z3. The stacked weights W1, W2, W3 then initialize the deep network X → Z1 → Z2 → Z3 → Y, whose output weights W4 are learned during supervised fine-tuning.]

Page 9:

DNN with unsupervised pre-training using Autoencoders

[Figure: an autoencoder with tied weights (W1, W1ᵀ) is trained to reconstruct X, giving the representation Z1; a second autoencoder (W2, W2ᵀ) reconstructs Z1, giving Z2; a third (W3, W3ᵀ) reconstructs Z2, giving Z3. The encoder weights W1, W2, W3 then initialize the deep network X → Z1 → Z2 → Z3 → Y, with output weights W4 learned during supervised fine-tuning.]

Page 10:

Restricted Boltzmann Machines

Page 11:

Restricted Boltzmann Machine

•  Energy function:

E(x, h) = -h^T W x - c^T x - b^T h = -\sum_{j,k} W_{j,k} h_j x_k - \sum_k c_k x_k - \sum_j b_j h_j

•  Distribution:

p(x, h) = \exp(-E(x, h)) / Z

[Figure: the bipartite graph of an RBM — a layer of visible units x and a layer of hidden units h, connected by the weight matrix W.]
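For a tiny RBM the energy and the induced distribution can be computed exactly by enumerating every (x, h) configuration, in the same spirit as the worked example on the next slide. The NumPy sketch below uses arbitrary illustrative weights and is not taken from the slides:

```python
import numpy as np
from itertools import product

# A tiny RBM: 2 visible units x, 2 hidden units h (illustrative weights).
W = np.array([[2.0, 0.0],
              [0.0, 1.0]])   # W[j, k] connects h_j and x_k
b = np.zeros(2)              # hidden biases
c = np.zeros(2)              # visible biases

def energy(x, h):
    # E(x, h) = -h^T W x - c^T x - b^T h
    return -(h @ W @ x + c @ x + b @ h)

configs = [(np.array(x, float), np.array(h, float))
           for x in product([0, 1], repeat=2)
           for h in product([0, 1], repeat=2)]

Z = sum(np.exp(-energy(x, h)) for x, h in configs)    # partition function
for x, h in configs:
    p_xh = np.exp(-energy(x, h)) / Z                  # p(x, h) = exp(-E) / Z
    print(x, h, round(-energy(x, h), 2), round(p_xh, 3))
```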

Page 12:

An example of how weights define a distribution

[Figure: a small network over visible units x1, x2 and hidden units h1, h2, with connection weights +2, +1, and −1.]

  x1 x2 h1 h2    -E    e^{-E}   p(x,h)   p(x)
  1  1  1  1      2    7.39     .186
  1  1  1  0      2    7.39     .186
  1  1  0  1      1    2.72     .069     .466
  1  1  0  0      0    1        .025
  1  0  1  1      1    2.72     .069
  1  0  1  0      2    7.39     .186
  1  0  0  1      0    1        .025     .305
  1  0  0  0      0    1        .025
  0  1  1  1      0    1        .025
  0  1  1  0      0    1        .025
  0  1  0  1      1    2.72     .069     .144
  0  1  0  0      0    1        .025
  0  0  1  1     -1    0.37     .009
  0  0  1  0      0    1        .025     .084
  0  0  0  1      0    1        .025
  0  0  0  0      0    1        .025

  Z = Σ e^{-E} = 39.70

Page 13:

Log likelihood

•  To train an RBM, we'd like to minimize the average negative log-likelihood:

\frac{1}{T} \sum_t \ell(f(x^{(t)})) = \frac{1}{T} \sum_t -\log p(x^{(t)})

•  Its gradient splits into an expectation conditioned on the data and an expectation under the model:

\frac{\partial\, {-\log p(x^{(t)})}}{\partial \theta} = \mathbb{E}_h\!\left[ \frac{\partial E(x^{(t)}, h)}{\partial \theta} \,\middle|\, x^{(t)} \right] - \mathbb{E}_{x,h}\!\left[ \frac{\partial E(x, h)}{\partial \theta} \right]

Page 14:

Contrastive Divergence

•  Idea:
   •  Replace the expectation under the model by a point estimate at x̃
   •  Obtain the point x̃ by Gibbs sampling
   •  Start the sampling chain at x^{(t)}

[Gibbs chain: starting from x^{(t)}, alternately sample h ~ p(h|x) and x ~ p(x|h); after k steps, x^k = x̃.]

Page 15:

Conditional probabilities

p(h \mid x) = \prod_j p(h_j \mid x), \qquad p(h_j = 1 \mid x) = \frac{1}{1 + \exp(-(b_j + W_{j \cdot}\, x))} = \operatorname{sigm}(b_j + W_{j \cdot}\, x)

p(x \mid h) = \prod_k p(x_k \mid h), \qquad p(x_k = 1 \mid h) = \frac{1}{1 + \exp(-(c_k + h^T W_{\cdot k}))} = \operatorname{sigm}(c_k + h^T W_{\cdot k})
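These conditionals are exactly what the Gibbs chain in contrastive divergence alternates between. A minimal NumPy sketch of one block-Gibbs step is shown below (illustrative; it assumes W has shape hidden × visible, matching the notation above):

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_h_given_x(x, W, b, rng):
    # p(h_j = 1 | x) = sigm(b_j + W_{j.} x)
    p = sigm(b + W @ x)
    return (rng.uniform(size=p.shape) < p).astype(float), p

def sample_x_given_h(h, W, c, rng):
    # p(x_k = 1 | h) = sigm(c_k + h^T W_{.k})
    p = sigm(c + W.T @ h)
    return (rng.uniform(size=p.shape) < p).astype(float), p
```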

Page 16:

Conditional probabilities

Page 17:

Contrastive Divergence – Parameter Updates

•  Derivation of \frac{\partial E(x,h)}{\partial \theta} for \theta = W_{jk}:

\frac{\partial E(x,h)}{\partial W_{jk}} = \frac{\partial}{\partial W_{jk}} \left( -\sum_{j,k} W_{jk} h_j x_k - \sum_k c_k x_k - \sum_j b_j h_j \right) = -\frac{\partial}{\partial W_{jk}} \sum_{j,k} W_{jk} h_j x_k = -h_j x_k

In matrix form: \nabla_W E(x,h) = -h x^T

Page 18:

Contrastive Divergence – Parameter Updates

•  Derivation of \mathbb{E}_h\!\left[ \frac{\partial E(x,h)}{\partial \theta} \,\middle|\, x \right] for \theta = W_{jk}:

\mathbb{E}_h\!\left[ \frac{\partial E(x,h)}{\partial W_{jk}} \,\middle|\, x \right] = \mathbb{E}_h\!\left[ -h_j x_k \mid x \right] = \sum_{h_j \in \{0,1\}} -h_j x_k \, p(h_j \mid x) = -x_k \, p(h_j = 1 \mid x)

In matrix form: \mathbb{E}_h\!\left[ \nabla_W E(x,h) \mid x \right] = -h(x)\, x^T, where h(x) denotes the vector with entries h(x)_j = p(h_j = 1 \mid x).

Page 19:

Contrastive Divergence – Parameter Updates

•  Given x^{(t)} and \tilde{x}, the learning rule for \theta = W becomes:

W \Leftarrow W - \alpha \, \nabla_W \left( -\log p(x^{(t)}) \right)
  \Leftarrow W - \alpha \left( \mathbb{E}_h\!\left[ \nabla_W E(x^{(t)}, h) \mid x^{(t)} \right] - \mathbb{E}_{x,h}\!\left[ \nabla_W E(x, h) \right] \right)
  \Leftarrow W - \alpha \left( \mathbb{E}_h\!\left[ \nabla_W E(x^{(t)}, h) \mid x^{(t)} \right] - \mathbb{E}_h\!\left[ \nabla_W E(\tilde{x}, h) \mid \tilde{x} \right] \right)
  \Leftarrow W + \alpha \left( h(x^{(t)}) \, x^{(t)T} - h(\tilde{x}) \, \tilde{x}^T \right)

Page 20:

Contrastive Divergence - Algorithm

•  For each training example x^{(t)}:
   •  Generate a negative sample \tilde{x} using k steps of Gibbs sampling, starting at x^{(t)}
   •  Update the parameters:

W \Leftarrow W + \alpha \left( h(x^{(t)}) \, x^{(t)T} - h(\tilde{x}) \, \tilde{x}^T \right)
b \Leftarrow b + \alpha \left( h(x^{(t)}) - h(\tilde{x}) \right)
c \Leftarrow c + \alpha \left( x^{(t)} - \tilde{x} \right)

•  Repeat until the stopping criterion is met.
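Putting the conditionals and these update rules together gives a compact CD-1 trainer. The sketch below is an illustrative NumPy implementation on toy binary data, not code from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, alpha = 6, 4, 0.1
W = 0.01 * rng.standard_normal((n_hidden, n_visible))
b = np.zeros(n_hidden)     # hidden biases
c = np.zeros(n_visible)    # visible biases

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_update(x_t):
    global W, b, c
    h_pos = sigm(b + W @ x_t)                              # h(x^(t)) = p(h = 1 | x^(t))
    h_samp = (rng.uniform(size=n_hidden) < h_pos) * 1.0    # sample h ~ p(h | x^(t))
    x_tilde = (rng.uniform(size=n_visible) < sigm(c + W.T @ h_samp)) * 1.0  # negative sample
    h_neg = sigm(b + W @ x_tilde)                          # h(x~)
    W += alpha * (np.outer(h_pos, x_t) - np.outer(h_neg, x_tilde))
    b += alpha * (h_pos - h_neg)
    c += alpha * (x_t - x_tilde)

# Toy binary data; loop until a stopping criterion (here, a fixed number of epochs)
data = (rng.uniform(size=(50, n_visible)) < 0.5).astype(float)
for epoch in range(10):
    for x_t in data:
        cd1_update(x_t)
```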

Page 21:

Autoencoders

Page 22:

Autoencoder

[Figure: an autoencoder with input x, hidden representation h(x), reconstruction x̂, encoder bias b, decoder bias c, and tied weights W* = Wᵀ.]

Encoder: h(x) = g(a(x)) = \operatorname{sigm}(b + W x)

Decoder: \hat{x} = o(\hat{a}(x)) = \operatorname{sigm}(c + W^* h(x)) for binary inputs, with W^* = W^T (tied weights)

Page 23:

Autoencoder: Loss Function

•  Loss function: use \ell(f(x)) = -\log p(x \mid \mu) as the loss.

•  Example: by choosing a Gaussian distribution with mean \mu and identity covariance,

p(x \mid \mu) = \frac{1}{(2\pi)^{D/2}} \exp\!\left( -\tfrac{1}{2} \sum_k (x_k - \mu_k)^2 \right),

and choosing \mu = c + W^* h(x), we get the squared error loss

\ell(f(x)) = \tfrac{1}{2} \sum_k (\hat{x}_k - x_k)^2

•  For binary inputs, the cross-entropy loss is used instead:

\ell(f(x)) = -\sum_k \left( x_k \log(\hat{x}_k) + (1 - x_k) \log(1 - \hat{x}_k) \right)

Page 24:

Autoencoder: Gradient

•  For both binary and real-valued inputs, the gradient of the loss with respect to the output pre-activation \hat{a}(x^{(t)}) has a very simple form:

\nabla_{\hat{a}(x^{(t)})} \, \ell(f(x^{(t)})) = \hat{x}^{(t)} - x^{(t)}

•  Parameter gradients are then obtained by backpropagating this gradient as in a regular network.
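The encoder, decoder, cross-entropy loss, and the simple output gradient from Pages 22–24 fit in a few lines. The NumPy sketch below is illustrative (tied weights, binary inputs, made-up sizes), not code from the slides:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
D, H = 8, 3
W = 0.1 * rng.standard_normal((H, D))    # tied weights: the decoder uses W.T
b = np.zeros(H)                           # encoder bias
c = np.zeros(D)                           # decoder bias

x = (rng.uniform(size=D) < 0.5).astype(float)   # a binary input

h     = sigm(b + W @ x)         # encoder: h(x) = sigm(b + W x)
a_hat = c + W.T @ h             # decoder pre-activation a_hat(x)
x_hat = sigm(a_hat)             # reconstruction x_hat = sigm(c + W^T h(x))

# Cross-entropy loss for binary inputs
loss = -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

# Gradient of the loss w.r.t. the decoder pre-activation is simply x_hat - x
grad_a_hat = x_hat - x
```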

Page 25:

Denoising Autoencoder

[Figure: the input x is corrupted by a noise process p(x̃ | x) (some inputs set to 0), the hidden representation h(x̃) is computed from the corrupted input with weights W and bias b, and the reconstruction x̂ is computed with tied weights W* = Wᵀ and bias c.]

•  The representation should be robust to the introduction of noise:
   •  random assignment of a subset of the inputs to 0, with probability v
   •  Gaussian additive noise

•  The reconstruction x̂ is computed from the corrupted input x̃.

•  The loss function compares the reconstruction x̂ with the original, noiseless input x.
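A masking-noise corruption step is a one-liner: zero each input with probability v, encode the corrupted x̃, and compare the reconstruction against the clean x. The NumPy sketch below is illustrative (sizes and weights are made up), not slide code:

```python
import numpy as np

rng = np.random.default_rng(0)
sigm = lambda a: 1.0 / (1.0 + np.exp(-a))
D, H, v = 8, 3, 0.3
W, b, c = 0.1 * rng.standard_normal((H, D)), np.zeros(H), np.zeros(D)
x = (rng.uniform(size=D) < 0.5).astype(float)    # clean binary input

x_tilde = x * (rng.uniform(size=D) >= v)         # masking noise: zero each input w.p. v
h       = sigm(b + W @ x_tilde)                  # encode the corrupted input
x_hat   = sigm(c + W.T @ h)                      # reconstruction from the corrupted input
loss    = -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))   # compared to the clean x
```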

Page 26:

Dataset

•  MNIST ("Modified National Institute of Standards and Technology") is the de facto "hello world" dataset of computer vision.

•  It contains gray-scale images of hand-drawn digits, from zero through nine.

•  Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel-value indicating its lightness or darkness, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.

Page 27:

Page 28:

Page 29:

Softmax

The output units in a softmax group use a non-local non-linearity:

y_i = \frac{e^{z_i}}{\sum_{j \in \text{group}} e^{z_j}}, \qquad \frac{\partial y_i}{\partial z_i} = y_i (1 - y_i)
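A numerically stable way to compute the softmax (and the diagonal terms y_i(1 − y_i) of its Jacobian) is sketched below; this is an illustrative NumPy version, not slide code:

```python
import numpy as np

def softmax(z):
    # Subtracting the max leaves the result unchanged but avoids overflow in exp.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
y = softmax(z)
dy_dz_diag = y * (1.0 - y)    # partial y_i / partial z_i = y_i (1 - y_i)
```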

Page 30:

Theano

•  Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.

•  Tight integration with NumPy.

•  Transparent use of a GPU: perform data-intensive computations much faster than on a CPU.

•  Efficient symbolic differentiation: Theano does your derivatives for functions with one or many inputs.

•  Speed and stability optimizations.

Page 31:

Theano

•  Installation and more information: http://deeplearning.net/software/theano/

•  Latest reference: Theano: A Python framework for fast computation of mathematical expressions.

Page 32:

Theano.Tensor

•  Tensor: a Theano module for symbolic NumPy types and operations.

•  There are many types of symbolic expressions. Basic tensor functionality includes the creation of tensor variables such as scalar, vector, matrix, etc.

Page 33:

•  Create algebraic expressions on symbolic variables.

•  Compile them into Theano functions, as in the sketch below.
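The code shown on this slide is not reproduced in the transcript; the following is a minimal illustrative example of declaring symbolic variables, building an algebraic expression, and compiling it into a Theano function:

```python
import theano
import theano.tensor as T

# Symbolic variables: a scalar and a matrix
x = T.dscalar('x')
A = T.dmatrix('A')

# An algebraic expression built from the symbolic variables
y = x * A + T.exp(A)

# Compile the expression graph into a callable function
f = theano.function(inputs=[x, A], outputs=y)

print(f(2.0, [[1.0, 2.0], [3.0, 4.0]]))
```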

Page 34:

•  Create a shared variable so we can do gradient descent.

•  Define a cost function that has a minimum value.

•  Compute the gradient (a sketch of these steps follows).
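The slide's code is not in the transcript; a minimal illustrative sketch of a shared variable, a cost with a minimum, and gradient descent via T.grad might look like this (the particular cost and learning rate are made up):

```python
import theano
import theano.tensor as T

# A shared variable holds state that persists across function calls
w = theano.shared(20.0, name='w')

# A cost function with a minimum (at w = 1)
cost = (w - 1.0) ** 2

# Symbolic gradient of the cost w.r.t. the shared variable
gw = T.grad(cost, w)

# Each call updates w in place: w <- w - 0.3 * dcost/dw
train = theano.function(inputs=[], outputs=cost, updates=[(w, w - 0.3 * gw)])

for _ in range(25):
    train()
print(w.get_value())   # close to 1.0
```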

Page 35:

•  Get data and initialize network (the steps on Pages 35–38 are sketched together in the code after Page 38)

Page 36:

•  Define Theano variables and expressions

Page 37:

•  Define training expressions and functions

Page 38:

•  Loop through the data in batches
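The code on Pages 35–38 appears only as images in the original slides. The sketch below is an illustrative stand-in that walks through the same four steps for a softmax classifier in Theano, with synthetic data in place of MNIST (the data shapes, learning rate, and batch size are assumptions, not the slides' values):

```python
import numpy as np
import theano
import theano.tensor as T

# 1. Get data and initialize the network (synthetic data stands in for MNIST here)
rng = np.random.default_rng(0)
Xdata = rng.standard_normal((1000, 784)).astype('float64')
Ydata = rng.integers(0, 10, size=1000).astype('int32')
W_init = 0.01 * rng.standard_normal((784, 10))
b_init = np.zeros(10)

# 2. Define Theano variables and expressions
X = T.dmatrix('X')
Y = T.ivector('Y')
W = theano.shared(W_init, name='W')
b = theano.shared(b_init, name='b')
p_y = T.nnet.softmax(T.dot(X, W) + b)                    # class probabilities
cost = T.mean(T.nnet.categorical_crossentropy(p_y, Y))
prediction = T.argmax(p_y, axis=1)

# 3. Define training expressions and functions
lr = 0.1
gW, gb = T.grad(cost, W), T.grad(cost, b)
train = theano.function(inputs=[X, Y], outputs=cost,
                        updates=[(W, W - lr * gW), (b, b - lr * gb)])
predict = theano.function(inputs=[X], outputs=prediction)

# 4. Loop through the data in batches
batch_size = 100
n_batches = len(Xdata) // batch_size
for epoch in range(5):
    for i in range(n_batches):
        xb = Xdata[i * batch_size:(i + 1) * batch_size]
        yb = Ydata[i * batch_size:(i + 1) * batch_size]
        batch_cost = train(xb, yb)
    print('epoch', epoch, 'last batch cost', float(batch_cost))
```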

Page 39:

DNN without unsupervised pre-training

Page 40:

DNN with unsupervised pre-training using RBMs

[Figure: the same stacked-RBM pre-training diagram as on Page 8 — RBMs trained layer by layer on X, Z1, and Z2 give W1, W2, W3, which initialize the network X → Z1 → Z2 → Z3 → Y with output weights W4.]

Page 41:

DNN unsupervised pre-training with RBMs

Page 42:

DNN with unsupervised pre-training using Autoencoders

[Figure: the same stacked-autoencoder pre-training diagram as on Page 9 — autoencoders with tied weights trained layer by layer on X, Z1, and Z2 give W1, W2, W3, which initialize the network X → Z1 → Z2 → Z3 → Y with output weights W4.]

Page 43:

DNN unsupervised pre-training with AutoEncoders

Page 44:

Unsupervised Pre-training

•  Layer-wise unsupervised learning:
   •  The author believes that greedy layer-wise unsupervised pre-training overcomes the challenges of deep learning by introducing a useful prior into the supervised fine-tuning procedure.
   •  Gradients of a criterion defined at the output layer can become less useful as they are propagated backwards to lower layers.
   •  The author states it is reasonable to believe that an unsupervised learning criterion defined at the level of a single layer could be used to move its parameters in a favorable direction.

Page 45:

Unsupervised Pre-training

•  The author claims unsupervised pre-training acts as a regularizer by establishing an initialization point for the fine-tuning procedure inside a region of parameter space to which the parameters are henceforth restricted.

•  Another motivation is that greedy layer-wise unsupervised pre-training could be a way to naturally decompose the problem into sub-problems associated with different levels of abstraction, extracting salient information about the input distribution and capturing it in a distributed representation.

Page 46:

References

•  Yoshua Bengio, Learning Deep Architectures for AI: http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf

•  Hugo Larochelle, Neural Networks: http://info.usherbrooke.ca/hlarochelle/neural_networks/content.html

•  Geoff Hinton, Neural Networks for Machine Learning: https://www.coursera.org/learn/neural-networks

•  Goodfellow et al., Deep Learning: http://www.deeplearningbook.org/

•  Montufar et al., On the Number of Linear Regions of Deep Neural Networks: https://arxiv.org/pdf/1402.1869.pdf

Page 47:

References

•  Pascanu et al., On the Number of Response Regions of Deep Feed Forward Networks with Piece-wise Linear Activations: https://arxiv.org/pdf/1312.6098.pdf

•  Perceptron weight update graphics: https://www.cs.utah.edu/~piyush/teaching/8-9-print.pdf

•  Proof of the perceptron convergence theorem (Block, 1962; Novikoff, 1962): http://cs229.stanford.edu/notes/cs229-notes6.pdf

•  Python and Theano code reference: https://deeplearningcourses.com/

Page 48:

References

•  Erhan et al., Why Does Unsupervised Pre-training Help Deep Learning?: http://www.jmlr.org/papers/volume11/erhan10a/erhan10a.pdf

•  Theano installation and more information: http://deeplearning.net/software/theano/

•  Theano latest reference: Theano: A Python framework for fast computation of mathematical expressions.