
Page 1

Deep Learning

Ning Xiong, Mälardalen University

FUDIPO Course, November 20, 2018

Page 2

What is Deep Learning

• A narrow understanding: Deep Learning = learning of deep neural networks with many layers

• In the future, deep learning may come to include more …

• This lecture focuses on DL in its narrow meaning

Page 3

Why Deep Learning

• Biological Plausibility – e.g. Visual Cortex

• A deep architecture can provide intermediate features that can be reused in other tasks and that also make the decision process more transparent

• A deep network has more powerful approximation and modeling capability than a shallow network

Page 4

Shallow or Deep

[Figure: a shallow network with one wide hidden layer and a deep network with several stacked layers, both built from the same number of neurons and both taking inputs x1, x2, …, xN. The deep one is better.]

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.

Page 5

Agenda

• Basic learning method for artificial neural networks

• Tips for improving the learning of deep neural networks

• Three well-known deep neural networks (convolutional neural network, recurrent neural network, deep belief network)

Page 6

Multilayer Neural Network

A multilayer neural network can be used to represent a very complex input-output relation. The nodes in one layer are fully connected to the nodes in the next layer. Traditionally, each node is implemented as a sigmoid unit.

[Figure: a fully connected multilayer network mapping input data to output data.]

Page 7

Sigmoid Unit

σ(x) is the sigmoid function:

σ(x) = 1 / (1 + e^(−x))

Nice properties: smooth, similar to the step function, and its derivative can be expressed in terms of the function itself:

dσ(x)/dx = σ(x) (1 − σ(x))
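As a quick illustration (added here, not part of the original slides), a minimal NumPy sketch of the sigmoid and its derivative, checking the self-referential form of the derivative against a finite-difference estimate:

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    """Derivative expressed through the function itself: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-5, 5, 11)
eps = 1e-6
# Compare the analytic derivative with a central finite difference.
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(np.allclose(sigmoid_derivative(x), numeric, atol=1e-6))  # True
```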

Page 8

Gradient Descent for Learning

We incrementally update the weights in terms of individual training instances. Given a training example d, we define the error function as

E_d(w) = (1/2) Σ_{k ∈ outputs} (t_k − o_k)²

The idea is to modify the weights in the direction of the negative gradient of this error function, so as to get a fast reduction of the error on this example. We therefore revise the weights according to the gradient information as

w_ji ← w_ji + Δw_ji,   with   Δw_ji = −η ∂E_d/∂w_ji

η is the learning rate, specifying the step size of the gradient search.
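A small sketch of this update idea (my own illustration, not from the slides), using a toy error function and a finite-difference gradient so that only the rule w ← w − η ∂E/∂w is on display:

```python
import numpy as np

def numerical_gradient(E, w, eps=1e-6):
    """Finite-difference estimate of dE/dw, for illustration only."""
    g = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus.flat[i] += eps
        w_minus.flat[i] -= eps
        g.flat[i] = (E(w_plus) - E(w_minus)) / (2 * eps)
    return g

# Toy quadratic error standing in for E_d(w); eta is the learning rate.
target = np.array([1.0, 0.0])
E = lambda w: 0.5 * np.sum((target - w) ** 2)
w, eta = np.array([0.3, 0.8]), 0.5
for _ in range(20):
    w = w - eta * numerical_gradient(E, w)   # w <- w + delta_w, delta_w = -eta dE/dw
print(w)  # approaches the target [1, 0]
```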

Page 9

Updating of Weights

Let δ_j = −∂E_d/∂net_j be referred to as the error term of unit j. The weight update rule is then formulated as

Δw_ji = η δ_j x_ji

The key issue is how to get the error term. By the chain rule,

∂E_d/∂w_ji = (∂E_d/∂net_j) (∂net_j/∂w_ji) = (∂E_d/∂net_j) x_ji

Page 10

Error Term at the Output Layer

For an output unit j with sigmoid activation (x_ji → w_ji → net_j → o_j):

∂E_d/∂net_j = (∂E_d/∂o_j) (∂o_j/∂net_j)
            = ∂/∂o_j [ (1/2) Σ_{k ∈ outputs} (t_k − o_k)² ] · o_j (1 − o_j)
            = −(t_j − o_j) · o_j (1 − o_j)

Hence, for output units:

δ_j = (t_j − o_j) o_j (1 − o_j) = e_j o_j (1 − o_j)

Page 11

Error Term at a Hidden Layer

For a hidden unit j with sigmoid activation (x_ji → w_ji → net_j → o_j), the error is propagated back from every unit k in Downstream(j), i.e. the units that receive o_j through weight w_kj:

∂E_d/∂net_j = Σ_{k ∈ Downstream(j)} (∂E_d/∂net_k) (∂net_k/∂net_j)
            = Σ_{k ∈ Downstream(j)} (−δ_k) (∂net_k/∂o_j) (∂o_j/∂net_j)
            = Σ_{k ∈ Downstream(j)} (−δ_k) w_kj · o_j (1 − o_j)

so that

δ_j = o_j (1 − o_j) Σ_{k ∈ Downstream(j)} δ_k w_kj

Page 12

Error Term at Hidden Layer

[Figure: hidden unit j (x_ji → w_ji → net_j → o_j) feeding the units k ∈ Downstream(j) through weights w_kj, each with error term δ_k.]

δ_j = o_j (1 − o_j) Σ_{k ∈ Downstream(j)} δ_k w_kj   for hidden unit j

Page 13

Backpropagation

[Figure: a network with inputs x_i, weights w_ki and w_jk, activations o_j, and error terms δ_j, δ_k.]

Forward step: propagate activations from the input layer to the output layer.

Backward step: propagate error terms from the output layer back to the hidden layers.
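To make the two steps concrete, here is a minimal NumPy sketch (my own illustration, not from the slides) of one backpropagation update for a network with a single hidden layer of sigmoid units; the layer sizes, learning rate, and toy data are arbitrary placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(W1, W2, x, t, eta=0.5):
    """One forward + backward pass for a two-layer sigmoid network.

    Implements the update rules from the slides:
      output units:  delta_k = (t_k - o_k) * o_k * (1 - o_k)
      hidden units:  delta_j = o_j * (1 - o_j) * sum_k delta_k * w_kj
      weight update: delta_w = eta * delta * input
    """
    # Forward step: propagate activation from input to output layer.
    h = sigmoid(W1 @ x)          # hidden activations o_j
    o = sigmoid(W2 @ h)          # output activations o_k

    # Backward step: propagate error terms from output to hidden layer.
    delta_out = (t - o) * o * (1 - o)
    delta_hidden = h * (1 - h) * (W2.T @ delta_out)

    # Gradient-descent weight updates.
    W2 = W2 + eta * np.outer(delta_out, h)
    W1 = W1 + eta * np.outer(delta_hidden, x)
    return W1, W2

# Toy usage: repeatedly fit a single input/target pair.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 2))
W2 = rng.normal(scale=0.5, size=(1, 3))
x, t = np.array([1.0, 0.0]), np.array([1.0])
for _ in range(200):
    W1, W2 = backprop_step(W1, W2, x, t)
```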

Page 14

Challenges with Learning Deep Networks

• An extremely hard search and optimization problem
• High likelihood of getting stuck in a local optimum
• High risk of overfitting
• The gradient can vanish when back-propagating the error term:

δ_j = o_j (1 − o_j) Σ_{k ∈ Downstream(j)} δ_k w_kj

[Figure: the sigmoid function and its derivative; the derivative is at most 0.25, so repeated multiplication by o_j(1 − o_j) shrinks the error term layer by layer.]

Page 15

Using ReLU rather than Sigmoid Unit

• Rectified Linear Unit (ReLU)

Reasons:
1. The derivative of ReLU can be a constant one, thus avoiding the gradient-vanishing problem.
2. It is a piecewise linear function, fast to compute.

[Figure: the ReLU activation, a = z for z > 0 and a = 0 for z ≤ 0, compared with the sigmoid σ(z).]
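As a small added illustration (not on the original slide), a NumPy sketch of ReLU and its derivative; unlike the sigmoid, the derivative is exactly 1 wherever the unit is active.

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: max(0, z)."""
    return np.maximum(0.0, z)

def relu_derivative(z):
    """1 for z > 0, 0 otherwise: a constant gradient on the active side."""
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))             # [0.  0.  0.  0.5 2. ]
print(relu_derivative(z))  # [0. 0. 0. 1. 1.]
```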

Page 16

Adaptation of the Learning Rate

Δw_ji = η δ_j x_ji

• The learning rate decides the magnitude of each move in the gradient search.

• With a too-large learning rate, the error will not decrease; in contrast, a too-small learning rate makes the search too slow.

• The learning rate should get smaller and smaller during the learning process.

• Large derivatives call for a small learning rate, and vice versa.

Page 17

Parameter-Dependent Learning Rate (Adaptation of the Learning Rate)

w^(t+1) ← w^t − η_w g^t,   where   g^t = ∂E(w)/∂w evaluated at w^t

η_w = η / √( Σ_{i=0}^{t} (g^i)² )

η is a constant; the denominator accumulates the squares of the previous derivatives of that parameter.
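A minimal sketch of this per-parameter rule (essentially an AdaGrad-style update; the code and the toy problem are my own illustration, not from the slides), where each weight keeps its own accumulated squared gradient:

```python
import numpy as np

def adagrad_step(w, grad, accum, eta=0.1, eps=1e-8):
    """Per-parameter learning rate: eta_w = eta / sqrt(sum of squared past gradients)."""
    accum = accum + grad ** 2                      # accumulate (g^i)^2 for every parameter
    w = w - eta * grad / (np.sqrt(accum) + eps)    # eps avoids division by zero at t = 0
    return w, accum

# Toy quadratic error E(w) = 0.5 * ||w - target||^2, so grad = w - target.
target = np.array([1.0, -2.0])
w, accum = np.zeros(2), np.zeros(2)
for _ in range(500):
    grad = w - target
    w, accum = adagrad_step(w, grad, accum)
print(w)  # moves toward [1, -2]
```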

Page 18

Dropout

Training:

Each time, before calculating the gradient, each neuron has a probability p% of being dropped out.

Page 19

Dropout

Training:

Each time, before calculating the gradient, each neuron has a probability p% of being dropped out.

Use the resulting thinned network for training.

This makes the network more robust to changes, hence less chance of overfitting.

Page 20

Dropout

Testing:

No dropout: use the original network structure.

If the dropout rate during training is p%, multiply all the weights by (1 − p%).
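A minimal NumPy sketch of this train/test behaviour (my own illustration; a single weight layer stands in for the full network, and dropout is applied to that layer's input neurons):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward_train(x, W, p=0.5):
    """Training: each neuron feeding this layer is dropped with probability p."""
    mask = rng.random(x.shape) >= p      # keep a neuron with probability 1 - p
    return W @ (x * mask)

def dropout_forward_test(x, W, p=0.5):
    """Testing: no dropout, but weights scaled by (1 - p) to match expected activations."""
    return ((1.0 - p) * W) @ x

x = rng.normal(size=5)
W = rng.normal(size=(3, 5))
print(dropout_forward_train(x, W))  # uses a randomly thinned input
print(dropout_forward_test(x, W))   # deterministic, weights scaled by (1 - p)
```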

Page 21

Convolutional Neural Network

• A special multilayer network

• Not fully connected: a hidden neuron extracts a local feature, so it connects only to a subset of the input neurons.

• Weight sharing: hidden neurons that detect the same feature share the same weights.

• Usually used for image processing and recognition

Page 22

Feature from a Local Receptive Field

Consider a 28×28 input image; there are then 28×28 input neurons.

• A neuron in the first hidden layer is connected to a 5×5 local receptive field in the image, for detecting a feature there.

• The neurons detecting that feature across the whole image constitute a feature map of 24×24 neurons (a convolution sketch follows this slide).

• All neurons in the feature map share weights
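To illustrate (my own sketch, not from the slides): a single 5×5 shared filter slid over a 28×28 image produces a 24×24 feature map, and every position reuses the same 25 weights plus one bias.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feature_map(image, weights, bias):
    """Slide one shared 5x5 filter over a 28x28 image -> 24x24 feature map."""
    H, W = image.shape            # 28, 28
    k = weights.shape[0]          # 5
    out = np.zeros((H - k + 1, W - k + 1))   # 24 x 24
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + k, j:j + k]          # 5x5 local receptive field
            out[i, j] = sigmoid(np.sum(patch * weights) + bias)
    return out

rng = np.random.default_rng(0)
image = rng.random((28, 28))
weights, bias = rng.normal(scale=0.1, size=(5, 5)), 0.0
print(feature_map(image, weights, bias).shape)  # (24, 24)
```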

Page 23

Multiple Feature Maps

Convolutional layer

Page 24

Pooling Layer

• Summarize a square (say 2×2 neurons) in the convolutional layer.
• A common approach for pooling is max-pooling.

[Figure: condensed feature map of 12×12 neurons.]

Page 25

A Complete Convolutional Network

[Figure: input layer → convolutional layer → pooling layer → fully connected layer → fully connected layer → softmax output.]
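As an illustration of such a pipeline, here is a hedged Keras sketch (Keras is my choice here and is not mentioned on the slides; the number of feature maps, hidden units, and classes are arbitrary placeholders) with the 28×28 input, 5×5 convolution, 2×2 max-pooling, fully connected layers, and softmax output described above:

```python
import tensorflow as tf

# A small CNN in the spirit of the slide: conv -> pool -> fully connected -> softmax.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),                           # 28x28 grayscale image
    tf.keras.layers.Conv2D(20, kernel_size=(5, 5), activation="relu"),  # 24x24 feature maps
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),                     # condensed 12x12 maps
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(100, activation="relu"),                      # fully connected
    tf.keras.layers.Dense(10, activation="softmax"),                    # softmax over 10 classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```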

Page 26

Recurrent Neural Network

• The hidden units are affected not only by the inputs but also by their own outputs from the preceding time step

• Well suited to dealing with sequential information

Page 27

Unrolled Recurrent Neural Network

Page 28

Learning of Recurrent Neural Network

Backpropagation through time
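A minimal NumPy sketch of what gets unrolled (my own illustration; the layer sizes and the tanh activation are assumptions, a sigmoid would work similarly): the hidden state at time t depends on the current input and on the hidden state from the preceding step, and backpropagation through time applies the usual backward step along this unrolled chain.

```python
import numpy as np

def rnn_forward(inputs, W_in, W_rec, W_out):
    """Unroll a simple recurrent network over a sequence of input vectors."""
    h = np.zeros(W_rec.shape[0])             # initial hidden state
    states, outputs = [], []
    for x in inputs:                         # one step per element of the sequence
        h = np.tanh(W_in @ x + W_rec @ h)    # hidden units see the input AND the previous h
        states.append(h)
        outputs.append(W_out @ h)
    return states, outputs

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.3, size=(4, 2))    # input -> hidden
W_rec = rng.normal(scale=0.3, size=(4, 4))   # hidden -> hidden (recurrent)
W_out = rng.normal(scale=0.3, size=(1, 4))   # hidden -> output
sequence = [rng.normal(size=2) for _ in range(5)]
states, outputs = rnn_forward(sequence, W_in, W_rec, W_out)
```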

Page 29

Application of Recurrent Net in Dynamic Process Modeling

[Figure: a dynamic process with inputs u1(t), u2(t) and outputs y1(t), y2(t), modeled in parallel by a recurrent network that receives the same inputs u1(t), u2(t) and produces estimates of y1(t), y2(t).]

How to do real-time process modeling with a recurrent network is an interesting research question.

Page 30

Deep Belief Networks

• Deep architecture allowing for different levels of features

• Greedy layer-by-layer pre-training

• No example labels required for hidden-layer training

• Supervised fine tuning of the whole network

• Combines supervised and unsupervised learning

• Helps avoid local minima and over-fitting

[Figure: train layer 1, then layer 2, then layer 3 (from low-level to high-level features), followed by fine training of the whole network.]

Page 31

Greedy Layer-Wise Training

1. Train the first hidden layer using the data without labels (unsupervised).
   • More abundant unlabeled data not in the training set can be used as well.
2. Then freeze the first-layer parameters and start training the second hidden layer, using the output of the first layer as the unsupervised input to the second layer.
3. Repeat this for each hidden layer successively.
4. Use the output of the final hidden layer and train the output layer in a supervised fashion (with the earlier weights frozen).
5. Unfreeze all weights and fine-tune the full network with a supervised learning approach, starting from the pre-trained weights.

(A code sketch of this procedure follows below.)
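A hedged NumPy sketch of steps 1-3 (my own illustration): each hidden layer is pre-trained here as a tied-weight auto-encoder, which stands in for the RBM learning introduced on the next slides, and the frozen layer's output becomes the unsupervised input of the next layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_layer_unsupervised(X, n_hidden, eta=0.1, epochs=50):
    """Pre-train one layer as a tied-weight auto-encoder (a stand-in for RBM learning)."""
    W = rng.normal(scale=0.1, size=(n_hidden, X.shape[1]))
    for _ in range(epochs):
        for x in X:
            h = sigmoid(W @ x)                            # encode
            x_rec = sigmoid(W.T @ h)                      # decode with tied weights
            d_rec = (x_rec - x) * x_rec * (1 - x_rec)     # error term at the reconstruction
            d_hid = (W @ d_rec) * h * (1 - h)             # error term at the hidden units
            W -= eta * (np.outer(h, d_rec) + np.outer(d_hid, x))
    return W

def greedy_pretrain(X, layer_sizes):
    """Train hidden layers one at a time; lower layers stay frozen while the next one trains."""
    weights, data = [], X
    for n_hidden in layer_sizes:
        W = train_layer_unsupervised(data, n_hidden)
        weights.append(W)
        data = sigmoid(data @ W.T)     # frozen layer output feeds the next layer
    return weights

X = rng.random((20, 8))                # unlabeled data: no labels needed at this stage
stack = greedy_pretrain(X, layer_sizes=[6, 4])
# Afterwards: train an output layer on top (supervised), then fine-tune everything with BP.
```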

Page 32

Stack of Restricted Boltzmann Machines

[Figure: a stack of RBMs, with RBM 1 below RBM 2.]

• A deep belief net consists of a stack of Restricted Boltzmann Machines (RBMs)

• Learning at a hidden layer → the problem of learning an RBM

Page 33

What Is a Restricted Boltzmann Machine

[Figure: a visible layer and a hidden layer.]

• An RBM consists of a visible and a hidden layer
• Nodes in one layer are fully connected to the nodes in the other layer
• There are no connections between nodes in the same layer
• A simple RBM contains binary variables
• Information can go in both directions:

p(x_i = 1 | h) = sigmoid( Σ_{j=1}^{N} w_ij h_j + b_i )

p(h_j = 1 | x) = sigmoid( Σ_{i=1}^{M} w_ij x_i + a_j )
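A minimal NumPy sketch of these two conditionals (my own illustration): sampling the hidden units given the visible units and vice versa, with weights w_ij, hidden biases a, and visible biases b.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_hidden(x, W, a):
    """p(h_j = 1 | x) = sigmoid(sum_i w_ij x_i + a_j); then sample binary h."""
    p_h = sigmoid(W.T @ x + a)
    return (rng.random(p_h.shape) < p_h).astype(float), p_h

def sample_visible(h, W, b):
    """p(x_i = 1 | h) = sigmoid(sum_j w_ij h_j + b_i); then sample binary x."""
    p_x = sigmoid(W @ h + b)
    return (rng.random(p_x.shape) < p_x).astype(float), p_x

M, N = 6, 4                                   # visible units, hidden units
W = rng.normal(scale=0.1, size=(M, N))        # W[i, j] = w_ij
a, b = np.zeros(N), np.zeros(M)
x = (rng.random(M) < 0.5).astype(float)
h, _ = sample_hidden(x, W, a)
x_new, _ = sample_visible(h, W, b)            # information flows in both directions
```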

Page 34

The Learning of RBM

[Figure: X1, X2, X3: reconstruction of the input data by passing back and forth through the RBM.]

Page 35

Contrastive Divergence (for RBM Learning)

Basic idea: revise the weights based on the difference between the original x and h values and the values derived from the model.

X1: input vector of an example

[Figure: X1 → (sampling) → h1 → (sampling) → X2 → (sampling) → h2.]

Learning rule:

Δw_ij = η ( h_j^1 x_i^1 − h_j^2 x_i^2 )
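Building on the sampling sketch above, a hedged CD-1 update (my own illustration; the layer sizes and learning rate are placeholders) that draws h1 from X1, reconstructs X2, draws h2, and applies Δw_ij = η(h_j¹ x_i¹ − h_j² x_i²):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(x1, W, a, b, eta=0.1):
    """One contrastive-divergence (CD-1) weight update for a binary RBM."""
    p_h1 = sigmoid(W.T @ x1 + a)                       # p(h = 1 | x1)
    h1 = (rng.random(p_h1.shape) < p_h1).astype(float)
    p_x2 = sigmoid(W @ h1 + b)                         # reconstruction p(x = 1 | h1)
    x2 = (rng.random(p_x2.shape) < p_x2).astype(float)
    p_h2 = sigmoid(W.T @ x2 + a)                       # p(h = 1 | x2)
    h2 = (rng.random(p_h2.shape) < p_h2).astype(float)
    # delta_w_ij = eta * (h_j^1 * x_i^1 - h_j^2 * x_i^2)
    W += eta * (np.outer(x1, h1) - np.outer(x2, h2))
    return W

M, N = 6, 4
W = rng.normal(scale=0.1, size=(M, N))
a, b = np.zeros(N), np.zeros(M)
x1 = (rng.random(M) < 0.5).astype(float)
for _ in range(100):
    W = cd1_step(x1, W, a, b)
```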

Page 36

Auto-Encoder: Simpler Alternative to RBM

Encoder: h = sigmoid(W X + a)

Decoder: X′ = sigmoid(W′ h + d)

This two-layer network can be trained with the BP algorithm, with the input data also used as the target output.
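A compact NumPy sketch of one such BP step (my own illustration), directly instantiating the encoder and decoder above with untied weights (W′ is called W_dec here) and with the input used as the target:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_step(x, W, a, W_dec, d, eta=0.5):
    """One backpropagation step where the input x is also the target output."""
    h = sigmoid(W @ x + a)             # encoder: h = sigmoid(W X + a)
    x_rec = sigmoid(W_dec @ h + d)     # decoder: X' = sigmoid(W' h + d)
    delta_out = (x - x_rec) * x_rec * (1 - x_rec)      # output error term (target = x)
    delta_hid = h * (1 - h) * (W_dec.T @ delta_out)    # hidden error term
    W_dec += eta * np.outer(delta_out, h)
    d += eta * delta_out
    W += eta * np.outer(delta_hid, x)
    a += eta * delta_hid
    return W, a, W_dec, d

n_in, n_hid = 8, 3
W, a = rng.normal(scale=0.1, size=(n_hid, n_in)), np.zeros(n_hid)
W_dec, d = rng.normal(scale=0.1, size=(n_in, n_hid)), np.zeros(n_in)
x = rng.random(n_in)
for _ in range(500):
    W, a, W_dec, d = autoencoder_step(x, W, a, W_dec, d)
```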

Page 37

Stacked Auto-Encoders

The auto-encoders can be stacked together to build a deep network, with the decoders excluded.

Page 38

Page 39

Questions?