-
Network Structure
Hung-yi Lee (李宏毅)
-
Three Steps for Deep Learning
Step 1: Neural Network. A neural network is a function composed of simple functions (neurons). Usually we design the network structure, and let the machine find the parameters from data.
Step 2: Cost Function. The cost function evaluates how good a set of parameters is. We design the cost function based on the task.
Step 3: Optimization. Find the best function in the function set (e.g. by gradient descent).
-
Outline
• Basic structure (3/03)
• Fully Connected Layer
• Recurrent Structure
• Convolutional/Pooling Layer
• Special Structure (3/17)
• Spatial Transformation Layer
• Highway Network / Grid LSTM
• Recursive Structure
• External Memory
• Batch Normalization
• Sequence-to-sequence / Attention (3/24)
-
Prerequisite
• Brief Introduction of Deep Learning
• https://youtu.be/Dr-WRlEFefw?list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49
• Convolutional Neural Network
• https://youtu.be/FrKWiRv254g?list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49
• Recurrent Neural Network (Part I)
• https://youtu.be/xCGidAeyS4M?list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49
• Recurrent Neural Network (Part II)
• https://www.youtube.com/watch?v=rTqmWlnwz_0&list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49&index=25
-
Basic Structure: Fully Connected Layer
-
Fully Connected Layer

[Figure: layer l−1 (N_{l−1} nodes, outputs a_1^{l−1}, a_2^{l−1}, …, a_j^{l−1}, …) fully connected to layer l (N_l nodes, outputs a_1^l, a_2^l, …, a_i^l, …)]

Output of a neuron: a_i^l is the output of neuron i at layer l
Output of one layer: a^l = [a_1^l, a_2^l, …]^T, a vector
-
Fully Connected Layer

w_{ij}^l: the weight from layer l−1 to layer l, connecting neuron j (in layer l−1) to neuron i (in layer l)

W^l = [ w_{11}^l  w_{12}^l  … ; w_{21}^l  w_{22}^l  … ; … ]: the N_l × N_{l−1} matrix collecting all the weights between the two layers
-
Fully Connected Layer

b_i^l: bias for neuron i at layer l
b^l = [b_1^l, b_2^l, …]^T: biases for all the neurons in layer l
-
Fully Connected Layer

z_i^l: input of the activation function for neuron i at layer l

z_i^l = w_{i1}^l a_1^{l−1} + w_{i2}^l a_2^{l−1} + … + b_i^l = Σ_{j=1}^{N_{l−1}} w_{ij}^l a_j^{l−1} + b_i^l

z^l = [z_1^l, z_2^l, …]^T: inputs of the activation functions of all the neurons in layer l
-
Relations between Layer Outputs

[Figure: layer l−1 outputs a^{l−1} = [a_1^{l−1}, a_2^{l−1}, …], activation inputs z^l = [z_1^l, z_2^l, …], and layer l outputs a^l = [a_1^l, a_2^l, …]]

How is a^l computed from a^{l−1}?
-
Relations between Layer Outputs

z_1^l = w_{11}^l a_1^{l−1} + w_{12}^l a_2^{l−1} + … + b_1^l
z_2^l = w_{21}^l a_1^{l−1} + w_{22}^l a_2^{l−1} + … + b_2^l
……
z_i^l = w_{i1}^l a_1^{l−1} + w_{i2}^l a_2^{l−1} + … + b_i^l

In matrix form: z^l = W^l a^{l−1} + b^l
-
Relations between Layer Outputs

a_i^l = σ(z_i^l)

Applying σ element-wise: a^l = σ(z^l)
-
Relations between Layer Outputs

z^l = W^l a^{l−1} + b^l
a^l = σ(z^l)

Putting the two together: a^l = σ(W^l a^{l−1} + b^l)
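The layer computation above can be sketched in NumPy. This is a minimal illustration with made-up toy numbers; the sigmoid is just one possible choice for the activation σ:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fc_forward(W, b, a_prev):
    """One fully connected layer: z^l = W^l a^{l-1} + b^l, then a^l = sigma(z^l)."""
    z = W @ a_prev + b
    return sigmoid(z)

# Toy dimensions: N_{l-1} = 3 inputs, N_l = 2 neurons
W = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])     # row i holds w_{i1}, w_{i2}, w_{i3}
b = np.array([0.0, 0.0])            # one bias per neuron in layer l
a_prev = np.array([1.0, 2.0, 3.0])  # a^{l-1}
a = fc_forward(W, b, a_prev)        # a^l, shape (2,)
```

Stacking several such layers, each one consuming the previous layer's output, gives the whole feedforward network.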
-
Basic Structure: Recurrent Structure
Simplify the network by using the same function again and again
-
Reference
• https://www.cs.toronto.edu/~graves/preprint.pdf
• K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, J. Schmidhuber, "LSTM: A Search Space Odyssey," IEEE Transactions on Neural Networks and Learning Systems, 2016
• Rafal Józefowicz, Wojciech Zaremba, Ilya Sutskever, "An Empirical Exploration of Recurrent Network Architectures," ICML, 2015
-
Recurrent Neural Network
• Given function f: h′, y = f(h, x)

h1, y1 = f(h0, x1);  h2, y2 = f(h1, x2);  h3, y3 = f(h2, x3);  ……

No matter how long the input/output sequence is, we only need one function f.
h and h′ are vectors with the same dimension.
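The unrolling above can be sketched in Python. This is a toy sketch with random weights; the tanh step function is one simple choice of f, and the weight names are assumptions:

```python
import numpy as np

def step(h, x, Wh, Wi, Wo):
    """One application of f: h', y = f(h, x)  (bias omitted)."""
    h_new = np.tanh(Wh @ h + Wi @ x)
    y = Wo @ h_new
    return h_new, y

rng = np.random.default_rng(0)
d_h, d_x, d_y = 4, 3, 2
Wh = rng.normal(size=(d_h, d_h))
Wi = rng.normal(size=(d_h, d_x))
Wo = rng.normal(size=(d_y, d_h))

# The same parameters (the same f) are reused at every time step,
# so the loop works for any sequence length.
h = np.zeros(d_h)                       # h0
xs = [rng.normal(size=d_x) for _ in range(5)]
ys = []
for x in xs:                            # h1, y1 = f(h0, x1); h2, y2 = f(h1, x2); ...
    h, y = step(h, x, Wh, Wi, Wo)
    ys.append(y)
```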
-
Deep RNN

ℎ′, 𝑦 = 𝑓1(ℎ, 𝑥);  𝑏′, 𝑐 = 𝑓2(𝑏, 𝑦);  ……

The first layer applies f1 along the sequence (h1, y1 = f1(h0, x1), …); the second layer applies f2 to the first layer's outputs (b1, c1 = f2(b0, y1), …), and so on for deeper layers.
-
Bidirectional RNN

ℎ′, 𝑎 = 𝑓1(ℎ, 𝑥)   (f1 runs over x1, x2, x3, … in the forward direction)
𝑏′, 𝑐 = 𝑓2(𝑏, 𝑥)   (f2 runs over the same inputs in the reverse direction)
𝑦 = 𝑓3(𝑎, 𝑐)       (f3 combines the two directions at each time step)
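The two passes can be sketched as follows. This is a toy sketch with random weights; using a shared tanh step for f1 and f2 and emitting the hidden state as the output are simplifying assumptions:

```python
import numpy as np

def f(h, x, W):
    """Toy recurrent step: new state is also the output."""
    h_new = np.tanh(W @ np.concatenate([h, x]))
    return h_new, h_new

d = 3
rng = np.random.default_rng(1)
W1 = rng.normal(size=(d, 2 * d))   # forward RNN f1
W2 = rng.normal(size=(d, 2 * d))   # backward RNN f2
W3 = rng.normal(size=(d, 2 * d))   # combiner f3

xs = [rng.normal(size=d) for _ in range(4)]

# Forward pass over x1..xT produces a1..aT
h = np.zeros(d)
a = []
for x in xs:
    h, out = f(h, x, W1)
    a.append(out)

# Reverse pass over xT..x1 produces c1..cT
b = np.zeros(d)
c = [None] * len(xs)
for t in reversed(range(len(xs))):
    b, out = f(b, xs[t], W2)
    c[t] = out

# y_t = f3(a_t, c_t): every y_t has seen the whole input sequence
ys = [np.tanh(W3 @ np.concatenate([a[t], c[t]])) for t in range(len(xs))]
```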
-
Pyramidal RNN
• Reducing the number of time steps
W. Chan, N. Jaitly, Q. Le and O. Vinyals, “Listen, attend and spell: A neural
network for large vocabulary conversational speech recognition,” ICASSP, 2016
-
Naïve RNN
• Given function f: ℎ′, 𝑦 = 𝑓 ℎ, 𝑥

h′ = σ(Wʰ h + Wⁱ x)
y = softmax(Wᵒ h′)
(Ignore bias here)
-
LSTM

c changes slowly: ct is ct−1 added by something
h changes faster: ht and ht−1 can be very different

Naive RNN: (ht−1, xt) → (ht, yt)
LSTM: (ct−1, ht−1, xt) → (ct, ht, yt)
-
z  = tanh(W [xt; ht−1])
zⁱ = σ(Wⁱ [xt; ht−1])
zᶠ = σ(Wᶠ [xt; ht−1])
zᵒ = σ(Wᵒ [xt; ht−1])
-
"Peephole" variant: ct−1 is also fed into the gates:
z = tanh(W [xt; ht−1; ct−1]), where the block of W acting on ct−1 is diagonal
zⁱ, zᶠ, zᵒ are obtained in the same way
-
𝑐𝑡 = 𝑧𝑓 ⨀ 𝑐𝑡−1 + 𝑧𝑖 ⨀ 𝑧
ℎ𝑡 = 𝑧𝑜 ⨀ 𝑡𝑎𝑛ℎ(𝑐𝑡)
𝑦𝑡 = 𝜎(𝑊′ℎ𝑡)
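The cell update above can be sketched in NumPy. This is a minimal sketch with random toy weights; the weight shapes and names are assumptions, biases are omitted as in the slides, and the peephole connections are left out:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h_prev, c_prev, W, Wi, Wf, Wo):
    """One LSTM step following the slide's equations (biases omitted)."""
    u = np.concatenate([x, h_prev])
    z  = np.tanh(W  @ u)        # candidate update
    zi = sigmoid(Wi @ u)        # input gate
    zf = sigmoid(Wf @ u)        # forget gate
    zo = sigmoid(Wo @ u)        # output gate
    c = zf * c_prev + zi * z    # c changes slowly: c_{t-1} plus something
    h = zo * np.tanh(c)         # h can change a lot between steps
    return h, c

d_x, d_h = 3, 4
rng = np.random.default_rng(2)
W, Wi, Wf, Wo = (rng.normal(size=(d_h, d_x + d_h)) for _ in range(4))
h = np.zeros(d_h)
c = np.zeros(d_h)
for x in [rng.normal(size=d_x) for _ in range(5)]:
    h, c = lstm_step(x, h, c, W, Wi, Wf, Wo)
```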
-
LSTM

[Figure: two unrolled LSTM steps; the (ct, ht) produced at step t are fed into step t+1 together with xt+1]
-
GRU

r: reset gate, z: update gate (both computed from xt and ht−1 with a sigmoid)
ℎ′ = tanh(W [xt; r ⨀ ht−1])
ℎ𝑡 = 𝑧 ⨀ ℎ𝑡−1 + (1 − 𝑧) ⨀ ℎ′
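The GRU update can be sketched the same way. This is a toy sketch with random weights following the slides' convention (z keeps the old state, 1−z takes the candidate); weight names and shapes are assumptions, biases omitted:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h_prev, Wr, Wz, W):
    """One GRU step (biases omitted)."""
    u = np.concatenate([x, h_prev])
    r = sigmoid(Wr @ u)                                     # reset gate
    z = sigmoid(Wz @ u)                                     # update gate
    h_cand = np.tanh(W @ np.concatenate([x, r * h_prev]))   # h'
    return z * h_prev + (1.0 - z) * h_cand                  # h_t

d_x, d_h = 3, 4
rng = np.random.default_rng(3)
Wr, Wz, W = (rng.normal(size=(d_h, d_x + d_h)) for _ in range(3))
h = np.zeros(d_h)
for x in [rng.normal(size=d_x) for _ in range(5)]:
    h = gru_step(x, h, Wr, Wz, W)
```

Compared with the LSTM, there is no separate cell state c and one fewer gate, so the GRU has fewer parameters per unit.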
-
Example Task
• (Simplified) Speech Recognition: Frame classification on TIMIT

Utterance 1: frames x1 x2 x3 x4 …… with per-frame labels y1 y2 y3 y4 …… = TSI TSI TSI I I N N N
Utterance 2: frames x1 x2 x3 x4 …… with per-frame labels y1 y2 y3 y4 …… = S S @ @ @ @
-
Target Delay
• Only for unidirectional RNN

True labels:   TSI TSI TSI I I N N N
Delay 3 steps: x x x TSI TSI TSI I I N N N
-
LSTM > RNN > feedforward
Bi-direction > uni-direction
-
[Figure: results for the forward direction and the reverse direction]
-
In training, LSTM converges faster than a simple RNN
-
LSTM: A Search Space Odyssey
• Standard LSTM works well
• Simplified LSTM (coupling the input and forget gates, removing peephole) does not hurt much
• Forget gate is critical for performance
• Output gate activation function is critical
-
An Empirical Exploration of Recurrent Network Architectures
• Gate importance: forget > input > output
• A large bias for the forget gate is helpful
• LSTM-f/i/o: LSTM with the forget/input/output gate removed
• LSTM-b: LSTM with a large bias on the forget gate
-
Basic Structure: Convolutional / Pooling Layer
Simplify the neural network (based on prior knowledge about the task)
-
Convolutional Layer

Sparse Connectivity: each neuron connects only to part of the output of the previous layer.

Receptive Field: different neurons have different, but overlapping, receptive fields.
-
Convolutional Layer

Sparse Connectivity: each neuron connects only to part of the output of the previous layer.

Parameter Sharing: neurons with different receptive fields can use the same set of parameters, so there are fewer parameters than in a fully connected layer.
-
Convolutional Layer

Consider neurons 1 and 3 as "filter 1" (kernel 1), and neurons 2 and 4 as "filter 2" (kernel 2).
Filter (kernel) size: the size of a neuron's receptive field. Here the stride is 2.
Kernel size, number of filters, and stride are all designed by the developers.
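Sparse connectivity and parameter sharing can be made concrete with a tiny 1D convolution. This is a minimal sketch (one filter, no padding); the kernel values are made up:

```python
import numpy as np

def conv1d(x, kernel, stride):
    """Slide one filter across a 1D input.
    Each output sees only len(kernel) inputs (sparse connectivity /
    receptive field) and every position reuses the same kernel
    (parameter sharing)."""
    k = len(kernel)
    return np.array([np.dot(kernel, x[i:i + k])
                     for i in range(0, len(x) - k + 1, stride)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([1.0, -1.0])          # kernel size 2
out1 = conv1d(x, kernel, stride=1)      # 4 outputs: [-1., -1., -1., -1.]
out2 = conv1d(x, kernel, stride=2)      # stride 2 halves the outputs: [-1., -1.]
```

Only two parameters produce the whole output, whereas a fully connected layer with the same shapes would need a separate weight for every input/output pair.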
-
Example – 1D Signal + Single Channel

Input x1 x2 x3 x4 x5: audio signal, stock values, …
Tasks: classification, predicting the future, …
-
Example – 1D Signal + Multiple Channels

A document: each word is a vector ("I like this movie very much ……" → x1 x2 x3 x4 x5 x6 x7)
Does this kind of receptive field make sense?
-
Example – 2D Signal + Single Channel

6 x 6 black & white image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Size of receptive field is 3x3, stride is 1. (Only 1 filter shown here; each neuron sees one 3x3 patch of the image.)
-
Example – 2D Signal + Multiple Channels

6 x 6 colorful image: three 6 x 6 channels (one per color), each shaped like the matrix above.

Size of receptive field is 3x3x3, stride is 1. (Only 1 filter shown here; each neuron sees the same 3x3 patch across all three channels.)
-
Zero Padding

[Figure: output sizes of convolution with and without zero padding]
-
Pooling Layer

Layer l−1 has N nodes; layer l has N/k nodes.
k outputs in layer 𝑙 − 1 are grouped together; each output in layer 𝑙 "summarizes" its k inputs.

Average Pooling: a_1^l = (1/k) Σ_{j=1}^{k} a_j^{l−1}
Max Pooling: a_1^l = max(a_1^{l−1}, a_2^{l−1}, …, a_k^{l−1})
L2 Pooling: a_1^l = ((1/k) Σ_{j=1}^{k} (a_j^{l−1})²)^{1/2}
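The three pooling rules above can be sketched in NumPy. A minimal sketch assuming N is divisible by k and that consecutive outputs are grouped:

```python
import numpy as np

def pool(a, k, mode="max"):
    """Group every k consecutive outputs and summarize each group."""
    groups = a.reshape(-1, k)                # requires len(a) % k == 0
    if mode == "max":                        # max pooling
        return groups.max(axis=1)
    if mode == "avg":                        # average pooling
        return groups.mean(axis=1)
    if mode == "l2":                         # L2 pooling
        return np.sqrt((groups ** 2).mean(axis=1))
    raise ValueError(mode)

a = np.array([1.0, 3.0, 2.0, 2.0])
pool(a, 2, "max")   # -> [3., 2.]
pool(a, 2, "avg")   # -> [2., 2.]
```

Each variant maps k inputs to one output, reducing N outputs to N/k.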
-
Pooling Layer: which outputs should be grouped together?

Group the neurons corresponding to the same filter with nearby receptive fields → subsampling.
-
Pooling Layer: which outputs should be grouped together?

Group the neurons with the same receptive field → Maxout Network.
How do you know which neurons detect the same pattern?
-
Tara N. Sainath, Ron J. Weiss, Andrew Senior, Kevin W. Wilson, Oriol Vinyals, "Learning the Speech Front-end With Raw Waveform CLDNNs," INTERSPEECH, 2015
Combination of Different Basic Layers
-
Next Time
• 3/10: TAs will teach TensorFlow
• TensorFlow for regression
• TensorFlow for word vector
• word vector: https://www.youtube.com/watch?v=X7PH3NuYW0Q
• TensorFlow for CNN
• If you want to learn Theano
• http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.ecm.mp4/index.html
• http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20RNN.ecm.mp4/index.html