basic structure: fully connected feedforward...

Network Structure
Hung-yi Lee 李宏毅


  • Network Structure

    Hung-yi Lee 李宏毅

  • Three Steps for Deep Learning

    Step 1: Neural Network

    Step 2: Cost Function

    Step 3: Optimization

    Step 1. A neural network is a function composed of simple functions (neurons). Usually we design the network structure and let the machine find the parameters from data.

    Step 2. The cost function evaluates how good a set of parameters is. We design the cost function based on the task.

    Step 3. Optimization finds the best set of parameters (e.g. by gradient descent).

  • Outline

    • Basic structure (3/03)

    • Fully Connected Layer

    • Recurrent Structure

    • Convolutional/Pooling Layer

    • Special Structure (3/17)

    • Spatial Transformation Layer

    • Highway Network / Grid LSTM

    • Recursive Structure

    • External Memory

    • Batch Normalization

    • Sequence-to-sequence / Attention (3/24)

  • Prerequisite

    • Brief Introduction of Deep Learning

    • https://youtu.be/Dr-WRlEFefw?list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49

    • Convolutional Neural Network

    • https://youtu.be/FrKWiRv254g?list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49

    • Recurrent Neural Network (Part I)

    • https://youtu.be/xCGidAeyS4M?list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49

    • Recurrent Neural Network (Part II)

    • https://www.youtube.com/watch?v=rTqmWlnwz_0&list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49&index=25


  • Basic Structure: Fully Connected Layer

  • Fully Connected Layer

    Layer l-1 has N_{l-1} nodes; Layer l has N_l nodes.

    Output of a neuron: a_i^l is the output of neuron i at layer l.

    Output of one layer: a^l is the vector (a_1^l, a_2^l, ..., a_{N_l}^l).

  • Fully Connected Layer

    w_ij^l: weight from layer l-1 to layer l, connecting neuron j (in layer l-1) to neuron i (in layer l).

    W^l = [ w_11^l  w_12^l  ...
            w_21^l  w_22^l  ...
            ...                ]

    W^l is an N_l x N_{l-1} matrix collecting all the weights between the two layers.

  • Fully Connected Layer

    b_i^l: bias for neuron i at layer l.

    b^l = (b_1^l, b_2^l, ..., b_{N_l}^l): the biases for all neurons in layer l, collected as a vector.

  • Fully Connected Layer

    z_i^l: input of the activation function for neuron i at layer l.

    z_i^l = w_i1^l a_1^{l-1} + w_i2^l a_2^{l-1} + ... + b_i^l = Σ_j w_ij^l a_j^{l-1} + b_i^l

    z^l = (z_1^l, z_2^l, ..., z_{N_l}^l): input of the activation function for all the neurons in layer l.

  • Relations between Layer Outputs

    Between layer l-1 (N_{l-1} nodes) and layer l (N_l nodes): the outputs a^{l-1} = (a_1^{l-1}, a_2^{l-1}, ..., a_j^{l-1}, ...) determine z^l = (z_1^l, z_2^l, ..., z_i^l, ...). What is the relation between a^{l-1} and z^l?

  • Relations between Layer Outputs

    z_1^l = w_11^l a_1^{l-1} + w_12^l a_2^{l-1} + ... + b_1^l

    z_2^l = w_21^l a_1^{l-1} + w_22^l a_2^{l-1} + ... + b_2^l

    ......

    z_i^l = w_i1^l a_1^{l-1} + w_i2^l a_2^{l-1} + ... + b_i^l

    In matrix form:

    z^l = W^l a^{l-1} + b^l

  • Relations between Layer Outputs

    Each output is the activation function applied to its z:

    a_i^l = σ(z_i^l)

    In vector form (σ applied element-wise):

    a^l = σ(z^l)

  • Relations between Layer Outputs

    Putting the two relations together:

    z^l = W^l a^{l-1} + b^l

    a^l = σ(z^l)

    a^l = σ(W^l a^{l-1} + b^l)
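    To make the relation concrete, here is a minimal NumPy sketch of one fully connected layer computing a^l = σ(W^l a^{l-1} + b^l); the layer sizes and the choice of a sigmoid activation are illustrative assumptions, not anything fixed by the slides.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fully_connected_forward(a_prev, W, b):
        """One fully connected layer: a^l = sigmoid(W^l a^{l-1} + b^l).

        a_prev: output of layer l-1, shape (N_{l-1},)
        W:      weight matrix, shape (N_l, N_{l-1}); W[i, j] = w_ij^l
        b:      bias vector, shape (N_l,)
        """
        z = W @ a_prev + b        # z^l = W^l a^{l-1} + b^l
        return sigmoid(z)         # a^l = sigmoid(z^l)

    # Tiny usage example with made-up sizes (N_{l-1} = 3, N_l = 2).
    a_prev = np.array([0.5, -1.0, 2.0])
    W = np.random.randn(2, 3)
    b = np.zeros(2)
    a_l = fully_connected_forward(a_prev, W, b)   # shape (2,)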

  • Basic Structure: Recurrent Structure

    Simplify the network by using the same function again and again

  • Reference

    https://www.cs.toronto.edu/~graves/preprint.pdf

    K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, J. Schmidhuber, "LSTM: A Search Space Odyssey," IEEE Transactions on Neural Networks and Learning Systems, 2016

    R. Józefowicz, W. Zaremba, I. Sutskever, "An Empirical Exploration of Recurrent Network Architectures," ICML, 2015

  • Recurrent Neural Network

    • Given function f: h', y = f(h, x)

    The same f is applied at every time step:

    h1, y1 = f(h0, x1);  h2, y2 = f(h1, x2);  h3, y3 = f(h2, x3);  ......

    No matter how long the input/output sequence is, we only need one function f.

    h and h' are vectors with the same dimension.
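    The claim that one function f handles any sequence length can be written directly as a loop; a minimal sketch, assuming only that f has the signature (h, x) -> (h', y):

    def run_rnn(f, h0, xs):
        """Unroll a recurrent network: apply the same step function f at every time step.

        f:  step function, (h, x) -> (h_next, y)
        h0: initial hidden state
        xs: inputs x1, x2, ..., xT (any length)
        """
        h, ys = h0, []
        for x in xs:              # sequence length does not matter:
            h, y = f(h, x)        # the same f (same parameters) is reused at each step
            ys.append(y)
        return h, ys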

  • Deep RNN

    Stack recurrent layers: the outputs of one recurrent layer become the inputs of the next.

    h', y = f1(h, x)    b', c = f2(b, y)    ......
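    A sketch of the stacking idea, reusing run_rnn from the sketch above (two layers only, for illustration): the second recurrence f2 consumes the outputs y of the first recurrence f1 as its inputs.

    def run_deep_rnn(f1, f2, h0, b0, xs):
        """Two stacked recurrent layers: h', y = f1(h, x) and b', c = f2(b, y)."""
        _, ys = run_rnn(f1, h0, xs)   # first layer runs over the input sequence
        _, cs = run_rnn(f2, b0, ys)   # second layer runs over the first layer's outputs
        return cs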

  • Bidirectional RNN

    One RNN f1 reads the input sequence in the forward direction, another RNN f2 reads it in the reverse direction, and f3 combines their outputs at each time step:

    h', a = f1(h, x)    b', c = f2(b, x)

    y = f3(a, c)
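    A sketch of the bidirectional combination, again reusing run_rnn; running f2 over the reversed sequence and re-reversing its outputs is an assumption about how the two directions are aligned, since the slide only gives the three equations.

    def run_bidirectional_rnn(f1, f2, f3, h0, b0, xs):
        """y_t = f3(a_t, c_t): a_t from a forward pass, c_t from a backward pass."""
        _, a_seq = run_rnn(f1, h0, xs)                  # forward direction
        _, c_rev = run_rnn(f2, b0, list(reversed(xs)))  # reverse direction
        c_seq = list(reversed(c_rev))                   # align with the forward time steps
        return [f3(a, c) for a, c in zip(a_seq, c_seq)]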

  • Pyramidal RNN

    • Reducing the number of time steps

    W. Chan, N. Jaitly, Q. Le and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," ICASSP, 2016

  • Naïve RNN

    • Given function f: h', y = f(h, x)

    Ignore bias here:

    h' = σ(W^h h + W^i x)

    y = softmax(W^o h')
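    A minimal NumPy sketch of this naive RNN cell; the bias is left out as on the slide, and the matrix shapes are whatever makes the products well defined.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def softmax(v):
        e = np.exp(v - v.max())
        return e / e.sum()

    def naive_rnn_cell(h, x, Wh, Wi, Wo):
        """One step of the naive RNN: h', y = f(h, x)."""
        h_next = sigmoid(Wh @ h + Wi @ x)   # h' = σ(W^h h + W^i x)
        y = softmax(Wo @ h_next)            # y  = softmax(W^o h')
        return h_next, y

    Fixing the weights, e.g. f = lambda h, x: naive_rnn_cell(h, x, Wh, Wi, Wo), gives exactly the kind of step function that run_rnn above can unroll.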

  • LSTM

    Compared with the naive RNN (one recurrent state h), the LSTM carries two recurrent states: c and h.

    c changes slowly: c^t is c^{t-1} plus something.

    h changes faster: h^t and h^{t-1} can be very different.

  • LSTM

    From x^t and h^{t-1} (stacked into one vector), four vectors are computed:

    z   = tanh(W [x^t; h^{t-1}])

    z^i = σ(W^i [x^t; h^{t-1}])      (input gate)

    z^f = σ(W^f [x^t; h^{t-1}])      (forget gate)

    z^o = σ(W^o [x^t; h^{t-1}])      (output gate)

  • LSTM ("peephole")

    The "peephole" variant also feeds c^{t-1} into these computations:

    z = tanh(W [x^t; h^{t-1}; c^{t-1}])

    where the block of W acting on c^{t-1} is diagonal; z^i, z^f, z^o are obtained in the same way.

  • LSTM

    c^t = z^f ⊙ c^{t-1} + z^i ⊙ z

    h^t = z^o ⊙ tanh(c^t)

    y^t = σ(W' h^t)
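    A minimal NumPy sketch of one LSTM step built from exactly these equations (without the peephole); stacking x^t and h^{t-1} into one vector and the particular matrix shapes are the only assumptions added here.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def lstm_step(x, h_prev, c_prev, W, Wi, Wf, Wo, Wy):
        """One LSTM step: returns (h_t, c_t, y_t).

        W, Wi, Wf, Wo: shape (hidden, input + hidden), applied to [x^t; h^{t-1}]
        Wy:            shape (output, hidden), the W' of the slide
        """
        xh = np.concatenate([x, h_prev])   # [x^t; h^{t-1}]
        z  = np.tanh(W  @ xh)
        zi = sigmoid(Wi @ xh)              # input gate
        zf = sigmoid(Wf @ xh)              # forget gate
        zo = sigmoid(Wo @ xh)              # output gate
        c  = zf * c_prev + zi * z          # c^t = z^f ⊙ c^{t-1} + z^i ⊙ z
        h  = zo * np.tanh(c)               # h^t = z^o ⊙ tanh(c^t)
        y  = sigmoid(Wy @ h)               # y^t = σ(W' h^t)
        return h, c, y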

  • LSTM

    The same cell is applied at every time step: the c^t and h^t produced at step t are the c^{t-1} and h^{t-1} consumed at step t+1, which produces y^{t+1} from x^{t+1}.

  • GRU

    Two gates are computed from h^{t-1} and x^t: a reset gate r and an update gate z.

    A candidate state h' is computed from x^t and the reset-gated previous state r ⊙ h^{t-1}.

    h^t = z ⊙ h^{t-1} + (1 - z) ⊙ h'

    y^t is computed from h^t.
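    The slide gives only the diagram and the interpolation equation, so in the sketch below the gate and candidate formulas follow the usual GRU definition (sigmoid gates, tanh candidate), and the output projection mirrors the LSTM slide; all of that is an assumption.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def gru_step(x, h_prev, Wr, Wz, Wc, Wy):
        """One GRU step: returns (h_t, y_t)."""
        xh = np.concatenate([x, h_prev])
        r = sigmoid(Wr @ xh)                                    # reset gate
        z = sigmoid(Wz @ xh)                                    # update gate
        h_cand = np.tanh(Wc @ np.concatenate([x, r * h_prev]))  # candidate h'
        h = z * h_prev + (1.0 - z) * h_cand                     # h^t = z ⊙ h^{t-1} + (1 - z) ⊙ h'
        y = sigmoid(Wy @ h)                                     # output (assumed)
        return h, y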

  • Example Task

    • (Simplified) Speech Recognition: Frame classification on TIMIT

    Each acoustic frame x_1, x_2, x_3, x_4, ... of an utterance is assigned a phone label y_1, y_2, y_3, y_4, ...

    Utterance 1 labels: TSI TSI TSI I I N N N ......

    Utterance 2 labels: S S @ @ @ @ ......

  • Target Delay

    • Only for unidirectional RNN

    True labels: TSI TSI TSI I I N N N

    Delay 3 steps: the target sequence is shifted by three frames (the extra positions get a dummy label x), so the network sees a few future frames before it has to emit each label.

  • LSTM > RNN > feedforward

    Bi-directional > uni-directional

  • Forward direction vs. reverse direction (results figure)

  • Training LSTM is faster than RNN

  • LSTM: A Search Space Odyssey

    Standard LSTM works well.

    Simplified LSTM: coupling the input and forget gates, removing the peephole.

    The forget gate is critical for performance.

    The output gate activation function is critical.

  • An Empirical Exploration of Recurrent Network Architectures

    Importance of the gates: forget > input > output

    A large bias for the forget gate is helpful.

    LSTM-f/i/o: LSTM with the forget/input/output gate removed.

    LSTM-b: LSTM with a large forget-gate bias.


  • Basic Structure: Convolutional / Pooling Layer

    Simplify the neural network (based on prior knowledge about the task)

  • Convolutional Layer

    Sparse Connectivity: each neuron only connects to part of the output of the previous layer.

    Receptive Field: different neurons have different, but overlapping, receptive fields.

  • Convolutional Layer

    Sparse Connectivity: each neuron only connects to part of the output of the previous layer.

    Parameter Sharing: neurons with different receptive fields can use the same set of parameters.

    Fewer parameters than a fully connected layer.

  • Convolutional Layer

    Consider neurons 1 and 3 as "filter 1" (kernel 1), and neurons 2 and 4 as "filter 2" (kernel 2).

    Filter (kernel) size: the size of the receptive field of a neuron.

    Stride = 2 in this example.

    Kernel size, number of filters, and stride are all designed by the developers.
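    A minimal NumPy sketch of a 1D convolutional layer that makes these three design choices explicit; the input length, kernel values, and stride in the example are arbitrary.

    import numpy as np

    def conv1d(x, kernel, stride=1):
        """1D convolutional layer, single filter, no padding.

        Every output neuron looks only at len(kernel) consecutive inputs
        (sparse connectivity), and all outputs reuse the same kernel weights
        (parameter sharing).
        """
        k = len(kernel)
        out = []
        for start in range(0, len(x) - k + 1, stride):   # stride controls how far
            window = x[start:start + k]                  # the receptive field moves
            out.append(float(np.dot(window, kernel)))
        return np.array(out)

    # Example: 5 inputs, kernel size 3, stride 2 -> 2 outputs (like neurons 1 and 3).
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    kernel = np.array([1.0, 0.0, -1.0])   # arbitrary filter weights
    print(conv1d(x, kernel, stride=2))    # [-2. -2.]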

  • Example – 1D Signal + Single Channel

    Input: a 1D signal x_1, x_2, x_3, x_4, x_5, ... (e.g. an audio signal, stock values, ...)

    Tasks: classification, predicting the future, ...

  • Example – 1D Signal + Multiple Channel

    A document: each word is a vector, so each position x_1, x_2, ..., x_7 has multiple channels.

    Example: "I like this movie very much ..."

    Does this kind of receptive field make sense?

  • Example – 2D Signal + Single Channel

    A 6 x 6 black & white image:

    1 0 0 0 0 1
    0 1 0 0 1 0
    0 0 1 1 0 0
    1 0 0 0 1 0
    0 1 0 0 1 0
    0 0 1 0 1 0

    Only 1 filter is shown here. Size of the receptive field is 3x3, stride is 1.
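    To make the 2D case concrete, here is a small sketch that slides a 3x3 receptive field over this 6x6 image with stride 1, giving a 4x4 output map; the identity-diagonal filter weights are made up for illustration, since the slide's actual filter values are not recoverable from the transcript.

    import numpy as np

    image = np.array([[1,0,0,0,0,1],
                      [0,1,0,0,1,0],
                      [0,0,1,1,0,0],
                      [1,0,0,0,1,0],
                      [0,1,0,0,1,0],
                      [0,0,1,0,1,0]], dtype=float)

    kernel = np.eye(3)   # hypothetical 3x3 filter, not the slide's values

    def conv2d(img, kernel, stride=1):
        """Single-channel 2D convolution, no padding: 6x6 input -> 4x4 output."""
        kh, kw = kernel.shape
        oh = (img.shape[0] - kh) // stride + 1
        ow = (img.shape[1] - kw) // stride + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                patch = img[i*stride:i*stride+kh, j*stride:j*stride+kw]
                out[i, j] = np.sum(patch * kernel)   # same weights at every position
        return out

    print(conv2d(image, kernel))   # 4x4 feature map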

  • Example – 2D Signal + Multiple Channel

    A 6 x 6 colorful image: three 6 x 6 channels (e.g. R, G, B), each channel a matrix like the one above.

    Only 1 filter is shown here. Size of the receptive field is 3x3x3, stride is 1.

  • Zero Padding

    With zero padding, the input is padded with zeros around the border so the output keeps the same size; without zero padding, the output shrinks.

  • Pooling Layer

    Layer l-1 has N nodes; layer l has N/k nodes.

    k outputs in layer l-1 are grouped together. Each output in layer l "summarizes" k inputs.

    Average Pooling:  a_1^l = (1/k) Σ_{j=1}^{k} a_j^{l-1}

    Max Pooling:      a_1^l = max(a_1^{l-1}, a_2^{l-1}, ..., a_k^{l-1})

    L2 Pooling:       a_1^l = sqrt( (1/k) Σ_{j=1}^{k} (a_j^{l-1})^2 )
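    A small NumPy sketch of these three pooling operations over groups of k consecutive inputs; the group size and example values are arbitrary.

    import numpy as np

    def pool(a_prev, k, mode="max"):
        """Pool groups of k consecutive outputs of layer l-1 into one output each."""
        groups = a_prev.reshape(-1, k)            # N inputs -> N/k groups of size k
        if mode == "average":
            return groups.mean(axis=1)            # (1/k) * sum_j a_j
        if mode == "max":
            return groups.max(axis=1)             # max(a_1, ..., a_k)
        if mode == "l2":
            return np.sqrt((groups ** 2).mean(axis=1))   # sqrt of the mean of squares
        raise ValueError(mode)

    a_prev = np.array([1.0, -2.0, 3.0, 0.0])      # N = 4, k = 2 -> 2 outputs
    print(pool(a_prev, 2, "max"))                 # [1. 3.]
    print(pool(a_prev, 2, "average"))             # [-0.5  1.5]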

  • Pooling Layer: which outputs should be grouped together?

    Group the neurons corresponding to the same filter with nearby receptive fields.

    Convolutional layer followed by pooling layer: this is subsampling.

  • Pooling Layer: which outputs should be grouped together?

    Group the neurons with the same receptive field (but different filters): this is a Maxout Network.

    How do you know which neurons detect the same pattern?

  • Combination of Different Basic Layers

    Tara N. Sainath, Ron J. Weiss, Andrew Senior, Kevin W. Wilson, Oriol Vinyals, "Learning the Speech Front-end With Raw Waveform CLDNNs," INTERSPEECH, 2015

    CLDNNs combine convolutional, LSTM, and fully connected (DNN) layers in a single network.

  • Next Time

    • 3/10: TAs will teach TensorFlow

    • TensorFlow for regression

    • TensorFlow for word vector

    • word vector: https://www.youtube.com/watch?v=X7PH3NuYW0Q

    • TensorFlow for CNN

    • If you want to learn Theano

    • http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.ecm.mp4/index.html

    • http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20RNN.ecm.mp4/index.html
