Deep Learning
Ning Xiong, Mälardalen University
FUDIPO Course, November 20, 2018
What is Deep Learning
• A narrow understanding: Deep Learning = learning of deep neural networks with many layers
• In the future, deep learning may come to include more …
• This lecture focuses on DL in its narrow meaning
Why Deep Learning
• Biological Plausibility – e.g. Visual Cortex
• A deep architecture can provide intermediate features that can be reused in other tasks, and it also makes the decision process more transparent
• A deep network is more powerful at approximating complex functions than a shallow network
Shallow or Deep
[Figure: a shallow network and a deep network, both with inputs x1, x2, …, xN and the same number of neurons; the deep one is better.]
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
Agenda
• Basic learning method for artificial neural networks
• Tips for improving the learning of deep neural networks
• Three well-known deep neural networks (convolutional neural network, recurrent neural network, deep belief network)
Multilayer Neural Network
A multilayer neural network can be used to represent very complex input-output relations. The nodes in one layer are fully linked to the next layer. Traditionally each node is implemented as a sigmoid unit.
[Figure: multilayer network mapping input data to output data]
Sigmoid Unit
σ(x) is the sigmoid function:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Nice property: smooth, similar to the step function, and its derivative can be expressed in terms of itself:
$$\frac{d\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x))$$
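As a small illustration (not part of the slides), the sigmoid and its self-referential derivative can be written in a few lines of Python with numpy:

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    """Derivative expressed through the function itself: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0))             # 0.5
print(sigmoid_derivative(0.0))  # 0.25, the largest value the derivative can take
```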
Gradient Descent for Learning
We incrementally update weights in terms of individual instances. Given a training example d, we define the error function as
$$E_d(\mathbf{w}) = \frac{1}{2}\sum_{k \in outputs}(t_k - o_k)^2$$
The idea is to modify the weights according to the negative of the gradient of the error function to get a fast reduction of error on this example, so we revise the weights according to the gradient information as
$$w_{ji} \leftarrow w_{ji} + \Delta w_{ji}, \qquad \Delta w_{ji} = -\eta\,\frac{\partial E_d}{\partial w_{ji}}$$
η is the learning rate specifying the step size in the gradient search
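A minimal numerical sketch of this update for a single sigmoid output unit; the values and variable names are illustrative, not from the lecture:

```python
import numpy as np

eta = 0.1                      # learning rate
x = np.array([1.0, 0.5])       # inputs of training example d
w = np.array([0.2, -0.3])      # current weights
t = 1.0                        # target output

o = 1.0 / (1.0 + np.exp(-w @ x))      # unit output
E = 0.5 * (t - o) ** 2                # error on this example
grad = -(t - o) * o * (1 - o) * x     # dE/dw for a sigmoid unit
w = w - eta * grad                    # move against the gradient
```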
Updating of Weights
Let
$$\delta_j = -\frac{\partial E_d}{\partial net_j}$$
be referred to as the error term of unit j; the weight update rule is then formulated as
$$\Delta w_{ji} = \eta\,\delta_j\,x_{ji}$$
The key issue is how to get the error term
$$\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j}\,\frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j}\,x_{ji}$$
Error Term at the Output Layer
$$\delta_j = -\frac{\partial E_d}{\partial net_j} = -\frac{\partial E_d}{\partial o_j}\,\frac{\partial o_j}{\partial net_j}$$
$$\frac{\partial E_d}{\partial o_j} = \frac{\partial}{\partial o_j}\,\frac{1}{2}\sum_{k \in outputs}(t_k - o_k)^2 = -(t_j - o_j), \qquad \frac{\partial o_j}{\partial net_j} = o_j\,(1 - o_j)$$
$$\delta_j = (t_j - o_j)\,o_j\,(1 - o_j) \quad \text{for output units}$$
[Figure: sigmoid unit with inputs x_{ji}, weights w_{ji}, net input net_j and output o_j]
Error Term at a Hidden Layer
[Figure: sigmoid unit j with inputs x_{ji}, weights w_{ji}, net input net_j and output o_j, feeding downstream units k (with net inputs net_k) through weights w_{kj}]
$$\delta_j = -\frac{\partial E_d}{\partial net_j} = -\sum_{k \in Downstream(j)} \frac{\partial E_d}{\partial net_k}\,\frac{\partial net_k}{\partial net_j} = \sum_{k \in Downstream(j)} \delta_k\,\frac{\partial net_k}{\partial o_j}\,\frac{\partial o_j}{\partial net_j} = \sum_{k \in Downstream(j)} \delta_k\,w_{kj}\,o_j\,(1 - o_j)$$
Error Term at a Hidden Layer
[Figure: hidden unit j with output o_j connected to downstream units k through weights w_{kj}, each carrying an error term δ_k]
$$\delta_j = o_j\,(1 - o_j)\sum_{k \in Downstream(j)} \delta_k\,w_{kj} \quad \text{for hidden unit } j$$
Backpropagation
[Figure: network with inputs x_i, weights w_{ki} and w_{jk}, activations o_j and error terms δ_j, δ_k]
Forward step: propagate activations from the input layer to the output layer
Backward step: propagate error terms from the output layer back to the hidden layers
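Putting the forward and backward steps together, here is a minimal numpy sketch of backpropagation for one training example in a network with a single hidden layer (all sizes, seeds and names are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.3, 0.7])            # input example
t = np.array([1.0])                 # target output
W1 = rng.normal(size=(3, 2))        # hidden layer weights (3 hidden units)
W2 = rng.normal(size=(1, 3))        # output layer weights
eta = 0.5

# Forward step: propagate activations from input to output
h = sigmoid(W1 @ x)                 # hidden activations o_j
o = sigmoid(W2 @ h)                 # output activations o_k

# Backward step: propagate error terms from output to hidden layer
delta_out = (t - o) * o * (1 - o)               # delta_k for output units
delta_hid = h * (1 - h) * (W2.T @ delta_out)    # delta_j for hidden units

# Weight updates: Delta w_ji = eta * delta_j * x_ji
W2 += eta * np.outer(delta_out, h)
W1 += eta * np.outer(delta_hid, x)
```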
Challenges with Learning Deep Networks
• An extremely high-dimensional search and optimization problem
• High likelihood of getting stuck in a local optimum
• High risk of overfitting
• The gradient can vanish when backpropagating the error term:
$$\delta_j = o_j\,(1 - o_j)\sum_{k \in Downstream(j)} \delta_k\,w_{kj}$$
Since the factor $o_j(1 - o_j)$ is at most 0.25, the error terms tend to shrink from layer to layer on the way back.
Derivative of Sigmoid
[Figure: plots of the sigmoid function and its derivative]
Using ReLU rather than Sigmoid Unit
• Rectified Linear Unit (ReLU)
Reasons:
1. The derivative of ReLU can be a constant one, thus avoiding the gradient vanishing problem
2. Piece-wise linear function, fast to calculate
$$a = \begin{cases} z, & z > 0 \\ 0, & z \le 0 \end{cases} \qquad \text{(compare with the sigmoid } \sigma(z)\text{)}$$
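For comparison with the sigmoid sketch earlier, a minimal ReLU and its derivative (illustrative only):

```python
import numpy as np

def relu(z):
    """a = z for z > 0, a = 0 otherwise."""
    return np.maximum(0.0, z)

def relu_derivative(z):
    """Constant 1 for z > 0 and 0 otherwise; it does not shrink toward zero like the sigmoid derivative."""
    return (z > 0).astype(float)
```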
Adaptation of the Learning Rate
$$\Delta w_{ji} = \eta\,\delta_j\,x_{ji}$$
• Learning rate decides the magnitude of the move in gradient search
• If the learning rate is too large, the error will not decrease; if it is too small, the search becomes too slow.
• The learning rate should get smaller and smaller as the search proceeds
• For parameters with large derivatives, use a small learning rate, and vice versa
Parameter dependent learning rate
$$w_{t+1} \leftarrow w_t - \eta_w\,g_t, \qquad g_t = \frac{\partial E(w)}{\partial w}$$
$$\eta_w = \frac{\eta}{\sqrt{\sum_{i=0}^{t} g_i^2}}$$
Here η is a constant and the denominator accumulates the squares of the previous derivatives.
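A runnable sketch of this parameter-dependent learning rate (an AdaGrad-style update on a toy quadratic error; the target, step count and the small eps term are my additions):

```python
import numpy as np

eta = 1.0
w = np.zeros(3)                      # parameters
target = np.array([1.0, -2.0, 0.5])  # minimum of the toy error
g_sq_sum = np.zeros(3)               # running sum of squared derivatives, one per parameter
eps = 1e-8                           # small safety term to avoid division by zero (not on the slide)

for _ in range(100):
    g = w - target                   # gradient of the toy error E(w) = 0.5 * ||w - target||^2
    g_sq_sum += g ** 2
    w -= eta / np.sqrt(g_sq_sum + eps) * g   # each parameter gets its own, shrinking learning rate
```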
Dropout
Training:
• Each time, before calculating the gradient, each neuron has a probability of p% to drop out
• Use the resulting thinned network for training
• This makes the network more robust to changes, hence less chance of overfitting
Dropout
Testing:
• No dropout; the original network structure is used
• If the dropout rate at training is p%, all the weights are multiplied by (1 - p%)
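A minimal sketch of dropout as described above, applied to one layer's activations at training time and to the weights at test time (numpy; here p is the dropout rate as a fraction rather than a percentage):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(activations, p, rng):
    """Training: each neuron is dropped (set to zero) with probability p before the gradient step."""
    mask = rng.random(activations.shape) >= p   # keep a neuron with probability 1 - p
    return activations * mask

def dropout_test(weights, p):
    """Testing: no dropout; instead, scale the trained weights by (1 - p)."""
    return weights * (1.0 - p)

h = np.array([0.2, 0.9, 0.5, 0.7])
print(dropout_train(h, p=0.5, rng=rng))   # some activations zeroed out
```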
Convolutional Neural Network
• A special multilayer network
• Not fully connected: a hidden neuron extracts a local feature, so it connects only to a subset of the input neurons
• Weight sharing: if a set of hidden neurons detect the same feature, they should share the weights
• Usually used for image processing and recognition
Feature from Local Receptive Field
• Consider a 28X28 input image; there are 28X28 input neurons
• A neuron in the first hidden layer is connected to a 5X5 local receptive field in the image for detecting a feature from there
• The neurons detecting that feature across the image constitute a feature map: 24X24 neurons
• All neurons in the feature map share weights
Multiple Feature Maps
[Figure: several feature maps forming the convolutional layer]
Pooling Layer
• Summarize a square (say 2X2 neurons) in the convolutional layer
• A common approach for pooling is max-pooling
[Figure: condensed feature map of 12X12 neurons]
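A minimal numpy sketch of one feature map computed from 5X5 local receptive fields with shared weights, followed by 2X2 max-pooling into the 12X12 condensed map (illustrative only):

```python
import numpy as np

image = np.random.rand(28, 28)      # input image
kernel = np.random.rand(5, 5)       # shared weights of one feature map
bias = 0.1

# Convolutional layer: one 24x24 feature map, every neuron uses the same weights
feature_map = np.zeros((24, 24))
for i in range(24):
    for j in range(24):
        receptive_field = image[i:i+5, j:j+5]             # 5x5 local receptive field
        feature_map[i, j] = np.sum(receptive_field * kernel) + bias

# Max-pooling layer: summarize each 2x2 square -> 12x12 condensed feature map
pooled = feature_map.reshape(12, 2, 12, 2).max(axis=(1, 3))
```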
A Complete Convolutional Network
[Figure: Input layer → Convolutional layer → Pooling layer → Fully connected → Fully connected → Softmax]
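As an illustration, such a pipeline could be written with PyTorch roughly as follows; the library choice, layer sizes and channel counts are my own, loosely matching the 28X28 example above:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=5),   # convolutional layer: 8 feature maps, 5x5 receptive fields -> 24x24
    nn.ReLU(),
    nn.MaxPool2d(2),                  # pooling layer: 2x2 max-pooling -> 12x12
    nn.Flatten(),
    nn.Linear(8 * 12 * 12, 100),      # fully connected
    nn.ReLU(),
    nn.Linear(100, 10),               # fully connected
    nn.Softmax(dim=1),                # softmax over the 10 classes
)

x = torch.randn(1, 1, 28, 28)         # one 28x28 grayscale image
print(model(x).shape)                 # torch.Size([1, 10])
```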
Recurrent Neural Network
• The hidden units are affected by not only the inputs but also their own outputs in the preceding time step
• Well suited to dealing with sequential information
Unrolled Recurrent Neural Network
Learning of Recurrent Neural Network
Backpropagation through time
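A minimal numpy sketch of unrolling the recurrent forward pass over a sequence; backpropagation through time would then apply the chain rule backwards through these same steps (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
Wx = rng.normal(size=(4, 3))    # input-to-hidden weights (3 inputs, 4 hidden units)
Wh = rng.normal(size=(4, 4))    # hidden-to-hidden weights (recurrent connection)
b = np.zeros(4)

sequence = rng.normal(size=(10, 3))   # 10 time steps of 3-dimensional inputs
h = np.zeros(4)                       # hidden state at time 0
states = []

for x_t in sequence:
    # Hidden units depend on the current input and on their own outputs from the previous step
    h = np.tanh(Wx @ x_t + Wh @ h + b)
    states.append(h)
```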
Application of Recurrent Net in Dynamic Process Modeling
[Figure: a recurrent network with inputs u1(t), u2(t) and outputs y1(t), y2(t), set up in parallel with the dynamic process it models]
How to do real-time process modeling with a recurrent network is an interesting research question
Deep Belief Networks
• Deep architecture allowing for different levels of features
• Greedy layer-by-layer pre-training
• No labels of examples required in hidden layer training
• Supervised fine tuning with the whole network
• Both supervised and unsupervised learning
• Helps avoid local minima and over-fitting
[Figure: train layer 1, then layer 2, then layer 3 (from low-level to high-level features), followed by fine training of the whole network]
Greedy Layer-Wise Training
1. Train the first hidden layer using data without the labels (unsupervised).
• More abundant unlabeled data not in the training set can be used as well.
2. Then freeze the first-layer parameters and start training the second hidden layer, using the output of the first layer as the unsupervised input to the second layer.
3. Repeat this for each hidden layer successively.
4. Use the output of the final hidden layer and train the output layer in a supervised fashion (with the early weights frozen).
5. Unfreeze all weights and fine-tune the full network by supervised learning, given the pre-trained weights.
Stack of Restricted Boltzmann Machines
[Figure: a stack of RBMs (RBM 1, RBM 2, …)]
• A deep belief net consists of a stack of Restricted Boltzmann Machines (RBMs)
• Learning at a hidden layer reduces to the problem of learning an RBM
What Is a Restricted Boltzmann Machine
• An RBM consists of a visible layer and a hidden layer
• Nodes from one layer are fully connected to nodes in the other layer
• There is no connection between nodes in the same layer
• A simple RBM contains binary variables
• Information can go in both directions
$$p(x_i = 1 \mid h) = \mathrm{sigmoid}\Big(\sum_{j=1}^{N} w_{ij}\,h_j + b_i\Big), \qquad p(h_j = 1 \mid x) = \mathrm{sigmoid}\Big(\sum_{i=1}^{M} w_{ij}\,x_i + a_j\Big)$$
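A minimal numpy sketch of these two conditionals and of sampling binary states from them (parameter names W, a, b and the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 6, 4                      # number of visible and hidden units
W = rng.normal(scale=0.1, size=(M, N))
a = np.zeros(N)                  # hidden biases
b = np.zeros(M)                  # visible biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_hidden(x):
    """p(h_j = 1 | x) = sigmoid(sum_i w_ij x_i + a_j), then draw a binary h."""
    p = sigmoid(x @ W + a)
    return (rng.random(N) < p).astype(float), p

def sample_visible(h):
    """p(x_i = 1 | h) = sigmoid(sum_j w_ij h_j + b_i), then draw a binary x."""
    p = sigmoid(W @ h + b)
    return (rng.random(M) < p).astype(float), p
```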
The Learning of RBM
[Figure: visible units x1, x2, x3 and the reconstruction of the input data via the hidden layer]
Contrastive Divergence (for RBM Learning)
Basic idea: revise the weights based on the difference between the original x and h values and those values derived from the model.
[Figure: sampling chain: x1 (input vector of an example) → sample h1 → sample x2 → sample h2]
Learning rule:
$$\Delta w_{ij} = \eta\,(h_j^1 x_i^1 - h_j^2 x_i^2)$$
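Reusing the sample_hidden and sample_visible sketch above, one contrastive-divergence (CD-1) weight update for a single example might look like this:

```python
import numpy as np

def cd1_update(x1, eta=0.1):
    """One CD-1 step for a single binary training example x1 (length M)."""
    h1, _ = sample_hidden(x1)      # h values sampled from the original data
    x2, _ = sample_visible(h1)     # reconstructed visible vector
    h2, _ = sample_hidden(x2)      # h values of the reconstruction
    # Learning rule from the slide: Delta w_ij = eta * (h_j^1 x_i^1 - h_j^2 x_i^2)
    return eta * (np.outer(x1, h1) - np.outer(x2, h2))
```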
Auto-Encoder: Simpler Alternative to RBM
Encoder: h = sigmoid(W X + a)
Decoder: X’ = sigmoid(W’ h + d)
This two-layer network can be trained with the BP algorithm, where the input data are also used as the target output
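A minimal numpy sketch of such a two-layer auto-encoder trained with BP, where each input vector also serves as its own target (sizes, learning rate and seeds are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(data, n_hidden, epochs=20, eta=0.5, seed=0):
    """Train encoder h = sigmoid(W x + a) and decoder x' = sigmoid(W' h + d) with BP."""
    rng = np.random.default_rng(seed)
    n_in = data.shape[1]
    W = rng.normal(scale=0.1, size=(n_hidden, n_in)); a = np.zeros(n_hidden)    # encoder
    Wp = rng.normal(scale=0.1, size=(n_in, n_hidden)); d = np.zeros(n_in)       # decoder
    for _ in range(epochs):
        for x in data:
            h = sigmoid(W @ x + a)                      # encode
            x_rec = sigmoid(Wp @ h + d)                 # decode (reconstruction)
            # Backpropagation with the input x itself as the target output
            delta_out = (x - x_rec) * x_rec * (1 - x_rec)
            delta_hid = h * (1 - h) * (Wp.T @ delta_out)
            Wp += eta * np.outer(delta_out, h); d += eta * delta_out
            W  += eta * np.outer(delta_hid, x); a += eta * delta_hid
    return W, a                                         # keep only the encoder parameters

data = np.random.default_rng(1).random((200, 8))        # unlabeled training vectors
W, a = train_autoencoder(data, n_hidden=3)
```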
Stacked Auto-Encoders
The auto-encoders can be stacked together to build a deep network, while excluding the decoders.
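Reusing the train_autoencoder and sigmoid sketch above, greedy layer-wise stacking of the encoders (discarding the decoders) might look like this; the layer sizes are illustrative:

```python
layers = []                      # list of (W, a) encoder parameters, one per hidden layer
layer_input = data               # the unlabeled training data from the sketch above
for n_hidden in (6, 4, 3):
    W, a = train_autoencoder(layer_input, n_hidden)     # unsupervised training of this layer
    layers.append((W, a))
    layer_input = sigmoid(layer_input @ W.T + a)        # frozen encoder feeds the next layer
# The stacked encoders can then be topped with an output layer and fine-tuned with supervised BP.
```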
Questions?