Deep Learning
Ning Xiong, Mälardalen University
FUDIPO Course, November 20, 2018
What is Deep Learning
• A narrow understanding: Deep Learning = learning of deep neural networks with many layers
• In the future, deep learning may come to include more …
• This lecture focuses on DL in its narrow meaning
Why Deep Learning
• Biological Plausibility – e.g. Visual Cortex
• A deep architecture can provide intermediate features that can be reused in other tasks, and it also makes the decision process more transparent
• A deep network is more powerful at approximating complex functions than a shallow network
Shallow or Deep
[Figure: a shallow network and a deep network, both with inputs x1, x2, …, xN and the same number of neurons; the deep one is better.]
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
Agenda
• Basic learning method for artificial neural networks
• Tips for improving the learning of deep neural networks
• Three well-known deep neural networks (convolutional neural network, recurrent neural network, deep belief network)
Multilayer Neural Network
A multilayer neural network can be used to represent very complex input-output relations. The nodes in one layer are fully linked to the next layer. Traditionally each node is implemented as a sigmoid unit.
[Figure: multilayer network mapping input data to output data]
Sigmoid Unit
σ(x) is the sigmoid function:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Nice property: smooth, similar to the step function, and its derivative can be expressed in terms of itself:
$$\frac{d\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x))$$
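As a small illustration (not part of the slides), the sigmoid and its self-referential derivative can be written in a few lines of Python with numpy:

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    """Derivative expressed through the function itself: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0))             # 0.5
print(sigmoid_derivative(0.0))  # 0.25, the largest value the derivative can take
```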
Gradient Descent for Learning
We incrementally update weights in terms of individual instances. Given a training example d, we define the error function as
$$E_d(\mathbf{w}) = \frac{1}{2}\sum_{k \in outputs}(t_k - o_k)^2$$
The idea is to modify the weights according to the negative of the gradient of the error function to get a fast reduction of error on this example, so we revise the weights according to the gradient information as
$$w_{ji} \leftarrow w_{ji} + \Delta w_{ji}, \qquad \Delta w_{ji} = -\eta\,\frac{\partial E_d}{\partial w_{ji}}$$
η is the learning rate specifying the step size in the gradient search
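A minimal numerical sketch of this update for a single sigmoid output unit; the values and variable names are illustrative, not from the lecture:

```python
import numpy as np

eta = 0.1                      # learning rate
x = np.array([1.0, 0.5])       # inputs of training example d
w = np.array([0.2, -0.3])      # current weights
t = 1.0                        # target output

o = 1.0 / (1.0 + np.exp(-w @ x))      # unit output
E = 0.5 * (t - o) ** 2                # error on this example
grad = -(t - o) * o * (1 - o) * x     # dE/dw for a sigmoid unit
w = w - eta * grad                    # move against the gradient
```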
Updating of Weights
Let
$$\delta_j = -\frac{\partial E_d}{\partial net_j}$$
be referred to as the error term of unit j; the weight update rule is then formulated as
$$\Delta w_{ji} = \eta\,\delta_j\,x_{ji}$$
The key issue is how to get the error term
$$\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j}\,\frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j}\,x_{ji}$$
Error Term at the Output Layer
$$\delta_j = -\frac{\partial E_d}{\partial net_j} = -\frac{\partial E_d}{\partial o_j}\,\frac{\partial o_j}{\partial net_j}$$
$$\frac{\partial E_d}{\partial o_j} = \frac{\partial}{\partial o_j}\,\frac{1}{2}\sum_{k \in outputs}(t_k - o_k)^2 = -(t_j - o_j), \qquad \frac{\partial o_j}{\partial net_j} = o_j\,(1 - o_j)$$
$$\delta_j = (t_j - o_j)\,o_j\,(1 - o_j) \quad \text{for output units}$$
[Figure: sigmoid unit with inputs x_{ji}, weights w_{ji}, net input net_j and output o_j]
Error Term at a Hidden Layer
[Figure: sigmoid unit j with inputs x_{ji}, weights w_{ji}, net input net_j and output o_j, feeding downstream units k (with net inputs net_k) through weights w_{kj}]
$$\delta_j = -\frac{\partial E_d}{\partial net_j} = -\sum_{k \in Downstream(j)} \frac{\partial E_d}{\partial net_k}\,\frac{\partial net_k}{\partial net_j} = \sum_{k \in Downstream(j)} \delta_k\,\frac{\partial net_k}{\partial o_j}\,\frac{\partial o_j}{\partial net_j} = \sum_{k \in Downstream(j)} \delta_k\,w_{kj}\,o_j\,(1 - o_j)$$
Error Term at a Hidden Layer
[Figure: hidden unit j with output o_j connected to downstream units k through weights w_{kj}, each carrying an error term δ_k]
$$\delta_j = o_j\,(1 - o_j)\sum_{k \in Downstream(j)} \delta_k\,w_{kj} \quad \text{for hidden unit } j$$
Backpropagation
[Figure: network with inputs x_i, weights w_{ki} and w_{jk}, activations o_j and error terms δ_j, δ_k]
Forward step: propagate activations from the input layer to the output layer
Backward step: propagate error terms from the output layer back to the hidden layers
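Putting the forward and backward steps together, here is a minimal numpy sketch of backpropagation for one training example in a network with a single hidden layer (all sizes, seeds and names are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.3, 0.7])            # input example
t = np.array([1.0])                 # target output
W1 = rng.normal(size=(3, 2))        # hidden layer weights (3 hidden units)
W2 = rng.normal(size=(1, 3))        # output layer weights
eta = 0.5

# Forward step: propagate activations from input to output
h = sigmoid(W1 @ x)                 # hidden activations o_j
o = sigmoid(W2 @ h)                 # output activations o_k

# Backward step: propagate error terms from output to hidden layer
delta_out = (t - o) * o * (1 - o)               # delta_k for output units
delta_hid = h * (1 - h) * (W2.T @ delta_out)    # delta_j for hidden units

# Weight updates: Delta w_ji = eta * delta_j * x_ji
W2 += eta * np.outer(delta_out, h)
W1 += eta * np.outer(delta_hid, x)
```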
Challenges with Learning Deep Networks
• An extremely high-dimensional search and optimization problem
• High likelihood of getting stuck in a local optimum
• High risk of overfitting
• The gradient can vanish when backpropagating the error term:
$$\delta_j = o_j\,(1 - o_j)\sum_{k \in Downstream(j)} \delta_k\,w_{kj}$$
Since the factor $o_j(1 - o_j)$ is at most 0.25, the error terms tend to shrink from layer to layer on the way back.
Derivative of Sigmoid
[Figure: plots of the sigmoid function and its derivative]
Using ReLU rather than Sigmoid Unit
• Rectified Linear Unit (ReLU)
Reasons:
1. The derivative of ReLU can be a constant one, thus avoiding the gradient vanishing problem
2. Piece-wise linear function, fast to calculate
$$a = \begin{cases} z, & z > 0 \\ 0, & z \le 0 \end{cases} \qquad \text{(compare with the sigmoid } \sigma(z)\text{)}$$
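For comparison with the sigmoid sketch earlier, a minimal ReLU and its derivative (illustrative only):

```python
import numpy as np

def relu(z):
    """a = z for z > 0, a = 0 otherwise."""
    return np.maximum(0.0, z)

def relu_derivative(z):
    """Constant 1 for z > 0 and 0 otherwise; it does not shrink toward zero like the sigmoid derivative."""
    return (z > 0).astype(float)
```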
Adaptation of the Learning Rate
$$\Delta w_{ji} = \eta\,\delta_j\,x_{ji}$$
• Learning rate decides the magnitude of the move in gradient search
• If the learning rate is too large, the error will not decrease; if it is too small, the search becomes too slow.
• The learning rate should get smaller and smaller as the search proceeds
• For parameters with large derivatives, use a small learning rate, and vice versa
Parameter dependent learning rate
$$w_{t+1} \leftarrow w_t - \eta_w\,g_t, \qquad g_t = \frac{\partial E(w)}{\partial w}$$
$$\eta_w = \frac{\eta}{\sqrt{\sum_{i=0}^{t} g_i^2}}$$
Here η is a constant and the denominator accumulates the squares of the previous derivatives.
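A runnable sketch of this parameter-dependent learning rate (an AdaGrad-style update on a toy quadratic error; the target, step count and the small eps term are my additions):

```python
import numpy as np

eta = 1.0
w = np.zeros(3)                      # parameters
target = np.array([1.0, -2.0, 0.5])  # minimum of the toy error
g_sq_sum = np.zeros(3)               # running sum of squared derivatives, one per parameter
eps = 1e-8                           # small safety term to avoid division by zero (not on the slide)

for _ in range(100):
    g = w - target                   # gradient of the toy error E(w) = 0.5 * ||w - target||^2
    g_sq_sum += g ** 2
    w -= eta / np.sqrt(g_sq_sum + eps) * g   # each parameter gets its own, shrinking learning rate
```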
Dropout
Training:
• Each time, before calculating the gradient, each neuron has a probability of p% to drop out
• Use the resulting thinned network for training
• This makes the network more robust to changes, hence less chance of overfitting
Dropout
Testing:
• No dropout; the original network structure is used
• If the dropout rate at training is p%, all the weights are multiplied by (1 - p%)
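A minimal sketch of dropout as described above, applied to one layer's activations at training time and to the weights at test time (numpy; here p is the dropout rate as a fraction rather than a percentage):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(activations, p, rng):
    """Training: each neuron is dropped (set to zero) with probability p before the gradient step."""
    mask = rng.random(activations.shape) >= p   # keep a neuron with probability 1 - p
    return activations * mask

def dropout_test(weights, p):
    """Testing: no dropout; instead, scale the trained weights by (1 - p)."""
    return weights * (1.0 - p)

h = np.array([0.2, 0.9, 0.5, 0.7])
print(dropout_train(h, p=0.5, rng=rng))   # some activations zeroed out
```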
Convolutional Neural Network
• A special multilayer network
• Not fully connected: a hidden neuron extracts a local feature, so it connects only to a subset of the input neurons
• Weight sharing: if a set of hidden neurons detect the same feature, they should share the weights
• Usually used for image processing and recognition
Feature from Local Receptive Field
• Consider a 28X28 input image; there are 28X28 input neurons
• A neuron in the first hidden layer is connected to a 5X5 local receptive field in the image for detecting a feature from there
• The neurons detecting that feature across the image constitute a feature map: 24X24 neurons
• All neurons in the feature map share weights
Multiple Feature Maps
[Figure: several feature maps forming the convolutional layer]
Pooling Layer
• Summarize a square (say 2X2 neurons) in the convolutional layer
• A common approach for pooling is max-pooling
[Figure: condensed feature map of 12X12 neurons]
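A minimal numpy sketch of one feature map computed from 5X5 local receptive fields with shared weights, followed by 2X2 max-pooling into the 12X12 condensed map (illustrative only):

```python
import numpy as np

image = np.random.rand(28, 28)      # input image
kernel = np.random.rand(5, 5)       # shared weights of one feature map
bias = 0.1

# Convolutional layer: one 24x24 feature map, every neuron uses the same weights
feature_map = np.zeros((24, 24))
for i in range(24):
    for j in range(24):
        receptive_field = image[i:i+5, j:j+5]             # 5x5 local receptive field
        feature_map[i, j] = np.sum(receptive_field * kernel) + bias

# Max-pooling layer: summarize each 2x2 square -> 12x12 condensed feature map
pooled = feature_map.reshape(12, 2, 12, 2).max(axis=(1, 3))
```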
A Complete Convolutional Network
[Figure: Input layer → Convolutional layer → Pooling layer → Fully connected → Fully connected → Softmax]
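As an illustration, such a pipeline could be written with PyTorch roughly as follows; the library choice, layer sizes and channel counts are my own, loosely matching the 28X28 example above:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=5),   # convolutional layer: 8 feature maps, 5x5 receptive fields -> 24x24
    nn.ReLU(),
    nn.MaxPool2d(2),                  # pooling layer: 2x2 max-pooling -> 12x12
    nn.Flatten(),
    nn.Linear(8 * 12 * 12, 100),      # fully connected
    nn.ReLU(),
    nn.Linear(100, 10),               # fully connected
    nn.Softmax(dim=1),                # softmax over the 10 classes
)

x = torch.randn(1, 1, 28, 28)         # one 28x28 grayscale image
print(model(x).shape)                 # torch.Size([1, 10])
```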
Recurrent Neural Network
• The hidden units are affected by not only the inputs but also their own outputs in the preceding time step
• Well suited to dealing with sequential information
Unrolled Recurrent Neural Network
Learning of Recurrent Neural Network
Backpropagation through time
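A minimal numpy sketch of unrolling the recurrent forward pass over a sequence; backpropagation through time would then apply the chain rule backwards through these same steps (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
Wx = rng.normal(size=(4, 3))    # input-to-hidden weights (3 inputs, 4 hidden units)
Wh = rng.normal(size=(4, 4))    # hidden-to-hidden weights (recurrent connection)
b = np.zeros(4)

sequence = rng.normal(size=(10, 3))   # 10 time steps of 3-dimensional inputs
h = np.zeros(4)                       # hidden state at time 0
states = []

for x_t in sequence:
    # Hidden units depend on the current input and on their own outputs from the previous step
    h = np.tanh(Wx @ x_t + Wh @ h + b)
    states.append(h)
```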
Application of Recurrent Net in Dynamic Process Modeling
[Figure: a recurrent network with inputs u1(t), u2(t) and outputs y1(t), y2(t), set up in parallel with the dynamic process it models]
How to do real-time process modeling with a recurrent network is an interesting research question
Deep Belief Networks
• Deep architecture allowing for different levels of features
• Greedy layer-by-layer pre-training
• No labels of examples required in hidden layer training
• Supervised fine tuning with the whole network
• Both supervised and unsupervised learning
• Helps avoid local minima and over-fitting
[Figure: train layer 1, then layer 2, then layer 3 (from low-level to high-level features), followed by fine training of the whole network]
Greedy Layer-Wise Training
1. Train the first hidden layer using data without the labels (unsupervised).
• More abundant unlabeled data not in the training set can be used as well.
2. Then freeze the first-layer parameters and start training the second hidden layer, using the output of the first layer as the unsupervised input to the second layer.
3. Repeat this for each hidden layer successively.
4. Use the output of the final hidden layer and train the output layer in a supervised fashion (with the early weights frozen).
5. Unfreeze all weights and fine-tune the full network by supervised learning, given the pre-trained weights.
Stack of Restricted Boltzmann Machines
[Figure: a stack of RBMs (RBM 1, RBM 2, …)]
• A deep belief net consists of a stack of Restricted Boltzmann Machines (RBMs)
• Learning at a hidden layer reduces to the problem of learning an RBM
What Is a Restricted Boltzmann Machine
• An RBM consists of a visible layer and a hidden layer
• Nodes from one layer are fully connected to nodes in the other layer
• There is no connection between nodes in the same layer
• A simple RBM contains binary variables
• Information can go in both directions
$$p(x_i = 1 \mid h) = \mathrm{sigmoid}\Big(\sum_{j=1}^{N} w_{ij}\,h_j + b_i\Big), \qquad p(h_j = 1 \mid x) = \mathrm{sigmoid}\Big(\sum_{i=1}^{M} w_{ij}\,x_i + a_j\Big)$$
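A minimal numpy sketch of these two conditionals and of sampling binary states from them (parameter names W, a, b and the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 6, 4                      # number of visible and hidden units
W = rng.normal(scale=0.1, size=(M, N))
a = np.zeros(N)                  # hidden biases
b = np.zeros(M)                  # visible biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_hidden(x):
    """p(h_j = 1 | x) = sigmoid(sum_i w_ij x_i + a_j), then draw a binary h."""
    p = sigmoid(x @ W + a)
    return (rng.random(N) < p).astype(float), p

def sample_visible(h):
    """p(x_i = 1 | h) = sigmoid(sum_j w_ij h_j + b_i), then draw a binary x."""
    p = sigmoid(W @ h + b)
    return (rng.random(M) < p).astype(float), p
```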
The Learning of RBM
[Figure: visible units x1, x2, x3 and the reconstruction of the input data via the hidden layer]
Contrastive Divergence (for RBM Learning)
Basic idea: revise the weights based on the difference between the original x and h values and those values derived from the model.
[Figure: sampling chain: x1 (input vector of an example) → sample h1 → sample x2 → sample h2]
Learning rule:
$$\Delta w_{ij} = \eta\,(h_j^1 x_i^1 - h_j^2 x_i^2)$$
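Reusing the sample_hidden and sample_visible sketch above, one contrastive-divergence (CD-1) weight update for a single example might look like this:

```python
import numpy as np

def cd1_update(x1, eta=0.1):
    """One CD-1 step for a single binary training example x1 (length M)."""
    h1, _ = sample_hidden(x1)      # h values sampled from the original data
    x2, _ = sample_visible(h1)     # reconstructed visible vector
    h2, _ = sample_hidden(x2)      # h values of the reconstruction
    # Learning rule from the slide: Delta w_ij = eta * (h_j^1 x_i^1 - h_j^2 x_i^2)
    return eta * (np.outer(x1, h1) - np.outer(x2, h2))
```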
Auto-Encoder: Simpler Alternative to RBM
Encoder: h = sigmoid(W X + a)
Decoder: X’ = sigmoid(W’ h + d)
This two-layer network can be trained with the BP algorithm, where the input data are also used as the target output
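A minimal numpy sketch of such a two-layer auto-encoder trained with BP, where each input vector also serves as its own target (sizes, learning rate and seeds are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(data, n_hidden, epochs=20, eta=0.5, seed=0):
    """Train encoder h = sigmoid(W x + a) and decoder x' = sigmoid(W' h + d) with BP."""
    rng = np.random.default_rng(seed)
    n_in = data.shape[1]
    W = rng.normal(scale=0.1, size=(n_hidden, n_in)); a = np.zeros(n_hidden)    # encoder
    Wp = rng.normal(scale=0.1, size=(n_in, n_hidden)); d = np.zeros(n_in)       # decoder
    for _ in range(epochs):
        for x in data:
            h = sigmoid(W @ x + a)                      # encode
            x_rec = sigmoid(Wp @ h + d)                 # decode (reconstruction)
            # Backpropagation with the input x itself as the target output
            delta_out = (x - x_rec) * x_rec * (1 - x_rec)
            delta_hid = h * (1 - h) * (Wp.T @ delta_out)
            Wp += eta * np.outer(delta_out, h); d += eta * delta_out
            W  += eta * np.outer(delta_hid, x); a += eta * delta_hid
    return W, a                                         # keep only the encoder parameters

data = np.random.default_rng(1).random((200, 8))        # unlabeled training vectors
W, a = train_autoencoder(data, n_hidden=3)
```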
Stacked Auto-Encoders
The auto-encoders can be stacked together to build a deep network, while excluding the decoders.
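Reusing the train_autoencoder and sigmoid sketch above, greedy layer-wise stacking of the encoders (discarding the decoders) might look like this; the layer sizes are illustrative:

```python
layers = []                      # list of (W, a) encoder parameters, one per hidden layer
layer_input = data               # the unlabeled training data from the sketch above
for n_hidden in (6, 4, 3):
    W, a = train_autoencoder(layer_input, n_hidden)     # unsupervised training of this layer
    layers.append((W, a))
    layer_input = sigmoid(layer_input @ W.T + a)        # frozen encoder feeds the next layer
# The stacked encoders can then be topped with an output layer and fine-tuned with supervised BP.
```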
Questions?