Deep Learning: Recurrent Networks (21/Feb/2018)

Page 1:

Deep Learning

Recurrent Networks

21/Feb/2018

Page 2:

Which open source project?

Page 3:

Related math. What is it talking about?

Page 4:

And a Wikipedia page explaining it all

Page 5:

The unreasonable effectiveness of recurrent neural networks..

• All previous examples were generated blindly by a recurrent neural network..

• http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Page 6:

Modelling Series

• In many situations one must consider a series of inputs to produce an output

– The output may itself be a series

• Examples: ..

Page 7:

Should I invest..

• Stock market

– Must consider the series of stock values over the past several days to decide whether it is wise to invest today

• Ideally, consider all of history

• Note: Inputs are vectors. The output may be a scalar or a vector

– "Should I invest?" vs. "Should I invest in X?"

[Figure: stock value plotted against dates 7/03 through 15/03, captioned "To invest or not to invest?"]

Page 8:

Representational shortcut

• Input at each time is a vector

• Each layer has many neurons

– The output layer too may have many neurons

• But we will represent everything with simple boxes

– Each box actually represents an entire layer with many units


Page 11:

The stock predictor

• The sliding predictor

– Look at the last few days

– This is just a convolutional neural net applied to series data

• Also called a Time-Delay neural network

[Figure: stock vectors X(t) … X(t+7) over time; a window over the last few days produces Y(t+3)]

Pages 12–15:

The stock predictor

• The same sliding predictor, advanced one step per slide: successive windows produce Y(t+4), Y(t+5), and Y(t+6)

Page 16:

Finite-response model

• This is a finite-response system

– Something that happens today only affects the output of the system for $N$ days into the future

• $N$ is the width of the system

$$Y_t = f(X_t, X_{t-1}, \ldots, X_{t-N})$$
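This maps directly onto code. A minimal sketch of such a sliding-window predictor, with a one-layer net standing in for $f()$ (all names and sizes here are illustrative):

```python
import numpy as np

def sliding_predictor(X, N, predict):
    # Finite response: Y_t depends only on the window X_t, X_{t-1}, ..., X_{t-N}
    return [predict(np.concatenate(X[t - N : t + 1]))
            for t in range(N, len(X))]

rng = np.random.default_rng(0)
d, N = 4, 3                                    # input vector size, window width
W = rng.standard_normal((1, d * (N + 1)))      # a one-layer "network" standing in for f()
X = list(rng.standard_normal((30, d)))         # 30 days of stock vectors
Y = sliding_predictor(X, N, predict=lambda w: np.tanh(W @ w))
```

An event at day $t$ drops out of the window after $N$ steps, which is exactly the finite-response property.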

Page 17:

The stock predictor

[Figure: the sliding window over the stock vectors produces Y(t+2)]

• This is a finite-response system

– Something that happens today only affects the output of the system for $N$ days into the future

• $N$ is the width of the system

$$Y_t = f(X_t, X_{t-1}, \ldots, X_{t-N})$$

Pages 18–22:

The stock predictor

• The same finite-response view, with the window advanced one step per slide to produce Y(t+3) through Y(t+7)

Page 23:

Finite-response model

• This is a finite-response system

– Something that happens today only affects the output of the system for $N$ days into the future

• $N$ is the width of the system

$$Y_t = f(X_t, X_{t-1}, \ldots, X_{t-N})$$

[Figure: the sliding-window predictor over X(t) … X(t+7), producing Y(t+6)]

Page 24:

Finite-response

• Problem: Increasing the "history" makes the network more complex

– No worries, we have the CPU and memory

• Or do we?

[Figure: the sliding-window predictor with a wide window over X(t) … X(t+7), producing Y(t+6)]

Page 25:

Systems often have long-term dependencies

• Longer-term trends:

– Weekly trends in the market

– Monthly trends in the market

– Annual trends

– Though longer history tends to affect us less than more recent events..

Page 26:

We want infinite memory

• Required: Infinite-response systems

– What happens today can continue to affect the output forever

• Possibly with weaker and weaker influence

$$Y_t = f(X_t, X_{t-1}, \ldots, X_{t-\infty})$$

Page 27:

Examples of infinite-response systems

$$Y_t = f(X_t, Y_{t-1})$$

– Required: Define an initial state $Y_{-1}$ for $t = 0$

– An input $X_0$ at $t = 0$ produces $Y_0$

– $Y_0$ produces $Y_1$, which produces $Y_2$, and so on until $Y_\infty$, even if $X_1 \ldots X_\infty$ are 0

• i.e. even if there are no further inputs!

• This is an instance of a NARX network

– "nonlinear autoregressive network with exogenous inputs"

– $Y_t = f(X_{0:t}, Y_{0:t-1})$

• The output contains information about the entire past
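A minimal sketch of the one-tap recursion $Y_t = f(X_t, Y_{t-1})$, with a small two-layer net standing in for $f$ (all names and sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 8, 2
W_x = rng.standard_normal((d_h, d_in))     # input-to-hidden weights
W_y = rng.standard_normal((d_h, d_out))    # fed-back-output-to-hidden weights
W_o = rng.standard_normal((d_out, d_h))    # hidden-to-output weights

y = np.zeros(d_out)                        # the required initial value Y(-1)
for x in rng.standard_normal((10, d_in)):  # inputs X(0) ... X(9)
    h = np.tanh(W_x @ x + W_y @ y)         # f sees the current input and the previous output
    y = W_o @ h                            # Y(t), fed back at the next step
# Even if all later inputs are zero, y keeps evolving through the feedback:
# an input at t=0 influences the output forever.
```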

Page 28:

A one-tap NARX network

• A NARX net with recursion from the output

[Figure: the network unrolled over time, with inputs X(t) and each output Y(t) fed back to the next step]

Pages 29–35:

A one-tap NARX network

• The same network unrolled step by step: each output Y(t) is fed back as an input at the next time step

Page 36:

A more complete representation

• A NARX net with recursion from the output

• Showing all computations

• All columns are identical

• An input at t=0 affects outputs forever

[Figure: unrolled network over time with inputs X(t) and fed-back outputs Y(t-1). Brown boxes show output nodes; yellow boxes are outputs]

Page 37:

Same figure redrawn

• A NARX net with recursion from the output

• Showing all computations

• All columns are identical

• An input at t=0 affects outputs forever

[Figure: unrolled network with inputs X(t) and outputs Y(t). Brown boxes show output nodes; all outgoing arrows are the same output]

Page 38:

A more generic NARX network

• The output $Y_t$ at time $t$ is computed from the past $K$ outputs $Y_{t-1}, \ldots, Y_{t-K}$ and the current and past $L$ inputs $X_t, \ldots, X_{t-L}$

[Figure: unrolled network with recursion from several past outputs]

Page 39:

A "complete" NARX network

• The output $Y_t$ at time $t$ is computed from all past outputs and all inputs up to time $t$

– Not really a practical model

[Figure: unrolled network with connections from all past outputs to the current step]

Page 40:

NARX Networks

• Very popular for time-series prediction

– Weather

– Stock markets

– As alternate system models in tracking systems

• Any phenomenon with distinct "innovations" that "drive" an output

• Note: here the "memory" of the past is in the output itself, not in the network

Page 41:

Let's make memory more explicit

• The task is to "remember" the past

• Introduce an explicit memory variable whose job it is to remember

$$m_t = r(y_{t-1}, h_{t-1}, m_{t-1})$$

$$h_t = f(x_t, m_t)$$

$$y_t = g(h_t)$$

• $m_t$ is a "memory" variable

– Generally stored in a "memory" unit

– Used to "remember" the past
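The three equations translate directly into a loop. A minimal sketch (weights random and untrained; tanh and a linear readout are illustrative choices for $r$, $f$, and $g$):

```python
import numpy as np

rng = np.random.default_rng(0)
dx, dh, dm, dy = 3, 5, 5, 2
Wy, Wh, Wm = (rng.standard_normal((dm, d)) for d in (dy, dh, dm))
Wx, Wmh = rng.standard_normal((dh, dx)), rng.standard_normal((dh, dm))
Wo = rng.standard_normal((dy, dh))

y, h, m = np.zeros(dy), np.zeros(dh), np.zeros(dm)  # values for t = -1
for x in rng.standard_normal((10, dx)):             # a length-10 input series
    m = np.tanh(Wy @ y + Wh @ h + Wm @ m)           # m_t = r(y_{t-1}, h_{t-1}, m_{t-1})
    h = np.tanh(Wx @ x + Wmh @ m)                   # h_t = f(x_t, m_t)
    y = Wo @ h                                      # y_t = g(h_t)
```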

Page 42:

Jordan Network

• The memory unit simply retains a running average of past outputs

– "Serial order: A parallel distributed processing approach", M. I. Jordan, 1986

• The input is constant (called a "plan")

• The objective is to train the net to produce a specific output, given an input plan

– The memory has a fixed structure; it does not "learn" to remember

• The running average of outputs considers the entire past, rather than just the immediate past

[Figure: outputs Y(t), Y(t+1) feed a memory unit through fixed weights 1 and μ; inputs X(t), X(t+1)]
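The fixed memory is then just a leaky running sum of the outputs; a one-line sketch (the averaging constant μ here is illustrative, and fixed rather than learned):

```python
def jordan_memory(m_prev, y_prev, mu=0.9):
    # Fixed weights mu and 1, as in the figure: a running average of past outputs.
    # The memory structure is fixed; only the rest of the net is trained.
    return mu * m_prev + y_prev
```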

Page 43:

Elman Networks

• Separate the memory state from the output

– "Context" units carry the historical state

– "Finding structure in time", Jeffrey Elman, Cognitive Science, 1990

• For the purpose of training, this was approximated as a set of T independent 1-step-history nets

• Only the weight from the memory unit to the hidden unit is learned

[Figure: the hidden state is cloned into a context unit (fixed weight 1) and fed back at the next step; inputs X(t), X(t+1), outputs Y(t), Y(t+1)]
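A minimal sketch of the Elman recurrence, where the context units are a verbatim copy of the previous hidden state (sizes and weights illustrative):

```python
import numpy as np

def elman_forward(X, W_xh, W_ch, W_hy, h0):
    # The context unit holds a clone of h(t-1); W_ch (context-to-hidden) is learned.
    h, Y = h0, []
    for x in X:
        h = np.tanh(W_xh @ x + W_ch @ h)
        Y.append(W_hy @ h)
    return np.array(Y)

rng = np.random.default_rng(1)
Y = elman_forward(rng.standard_normal((6, 3)),   # a length-6 input series
                  rng.standard_normal((4, 3)),   # W_xh: input-to-hidden
                  rng.standard_normal((4, 4)),   # W_ch: context-to-hidden
                  rng.standard_normal((2, 4)),   # W_hy: hidden-to-output
                  np.zeros(4))                   # initial context
```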

Page 44:

An alternate model for infinite-response systems: the state-space model

$$h_t = f(x_t, h_{t-1})$$

$$y_t = g(h_t)$$

• $h_t$ is the state of the network

– The model directly embeds the memory in the state

• Need to define the initial state $h_{-1}$

• This is a fully recurrent neural network

– Or simply a recurrent neural network

• The state summarizes information about the entire past
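The whole model is a single loop carrying the state $h$; a minimal sketch with illustrative (untrained) choices for $f$ and $g$:

```python
import numpy as np

def state_space(X, f, g, h_init):
    # h_t = f(x_t, h_{t-1}); y_t = g(h_t): all memory of the past lives in h.
    h, Y = h_init, []
    for x in X:
        h = f(x, h)
        Y.append(g(h))
    return Y

rng = np.random.default_rng(0)
W, U, V = rng.standard_normal((4, 3)), rng.standard_normal((4, 4)), rng.standard_normal((2, 4))
Y = state_space(rng.standard_normal((10, 3)),
                f=lambda x, h: np.tanh(W @ x + U @ h),  # a typical choice of f
                g=lambda h: V @ h,                      # a linear readout for g
                h_init=np.zeros(4))                     # the required initial state h_{-1}
```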

Page 45:

The simple state-space model

• The state (green) at any time is determined by the input at that time and the state at the previous time

• An input at t=0 affects outputs forever

• Also known as a recurrent neural net

[Figure: network unrolled from t=0 with initial state h(-1), inputs X(t), outputs Y(t)]

Page 46:

An alternate model for infinite-response systems: the state-space model

$$h_t = f(x_t, h_{t-1})$$

$$y_t = g(h_t)$$

• $h_t$ is the state of the network

• Need to define the initial state $h_{-1}$

• The state can be arbitrarily complex

Page 47:

Single hidden layer RNN

• Recurrent neural network

• All columns are identical

• An input at t=0 affects outputs forever

[Figure: single-hidden-layer RNN unrolled from t=0 with initial state h(-1)]

Page 48:

Multiple recurrent layer RNN

• Recurrent neural network

• All columns are identical

• An input at t=0 affects outputs forever

[Figure: RNN with multiple recurrent layers, unrolled from t=0]

Page 49:

A more complex state

• All columns are identical

• An input at t=0 affects outputs forever

[Figure: a recurrent state with more internal structure]

Page 50:

Or the network may be even more complicated

• Shades of NARX

• All columns are identical

• An input at t=0 affects outputs forever

[Figure: a more elaborate recurrent architecture, with recurrence from the output as well]

Page 51:

Generalization with other recurrences

• All columns (including incoming edges) are identical

[Figure: unrolled network from t=0 with additional recurrent edges]

Page 52:

State dependencies may be simpler

• Recurrent neural network

• All columns are identical

• An input at t=0 affects outputs forever

[Figure: unrolled network from t=0 with simpler state-to-state connections]

Page 53:

Multiple recurrent layer RNN

• We can also have skips..

[Figure: multi-layer recurrent net with skip connections]

Page 54:

A Recurrent Neural Network

• Simplified models are often drawn

• The loops imply recurrence

Page 55:

The detailed version of the simplified representation

[Figure: the loop diagram expanded into the network unrolled from t=0, with initial state h(-1), inputs X(t), outputs Y(t)]

Pages 56–57:

Multiple recurrent layer RNN

[Figure: a multi-recurrent-layer RNN unrolled over time from t=0, inputs X(t), outputs Y(t)]

Page 58:

Equations

$$Y_k(t) = f_2\Big(\sum_j w_{jk}^{(1)}\, h_j^{(1)}(t) + b_k^{(1)}\Big), \quad k = 1 \ldots M$$

$$h_i^{(1)}(t) = f_1\Big(\sum_j w_{ji}^{(0)} X_j(t) + \sum_j w_{ji}^{(11)}\, h_j^{(1)}(t-1) + b_i^{(1)}\Big)$$

$$h_i^{(1)}(-1) = \text{part of network parameters}$$

• Note the superscript in the indexing, which indicates the layer of the network from which the inputs are obtained

• Assuming a vector activation function at the output, e.g. softmax

• The state node activation $f_1()$ is typically tanh()

• Every neuron also has a bias input

[Figure: recurrent layer $h^{(1)}$ between input $X$ and output $Y$]
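These equations transcribe directly into code. A sketch (sizes illustrative; softmax output and tanh state, as the bullets suggest; note that $h(-1)$ is itself a trainable parameter):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_layer_forward(X, W0, W11, W1, b1, bY, h_init):
    # h(t) = tanh(W0 X(t) + W11 h(t-1) + b1);  Y(t) = softmax(W1 h(t) + bY)
    # h_init plays the role of h(-1), part of the network parameters.
    h, Y = h_init, []
    for x in X:
        h = np.tanh(W0 @ x + W11 @ h + b1)
        Y.append(softmax(W1 @ h + bY))
    return np.array(Y)

rng = np.random.default_rng(0)
dx, dh, M, T = 3, 6, 4, 10
Y = rnn_layer_forward(rng.standard_normal((T, dx)),
                      rng.standard_normal((dh, dx)), rng.standard_normal((dh, dh)),
                      rng.standard_normal((M, dh)), np.zeros(dh), np.zeros(M),
                      rng.standard_normal(dh))      # h(-1), a trainable parameter
```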

Page 59:

Equations

$$Y_k(t) = f_3\Big(\sum_j w_{jk}^{(2)}\, h_j^{(2)}(t) + b_k^{(3)}\Big), \quad k = 1 \ldots M$$

$$h_i^{(2)}(t) = f_2\Big(\sum_j w_{ji}^{(1)}\, h_j^{(1)}(t) + \sum_j w_{ji}^{(22)}\, h_j^{(2)}(t-1) + b_i^{(2)}\Big)$$

$$h_i^{(1)}(t) = f_1\Big(\sum_j w_{ji}^{(0)} X_j(t) + \sum_j w_{ji}^{(11)}\, h_j^{(1)}(t-1) + b_i^{(1)}\Big)$$

$$h_i^{(1)}(-1),\; h_i^{(2)}(-1) = \text{part of network parameters}$$

• Assuming a vector activation function at the output, e.g. softmax $f_3()$

• The state node activations $f_k()$ are typically tanh()

• Every neuron also has a bias input

[Figure: recurrent layers $h^{(1)}$ and $h^{(2)}$ between input $X$ and output $Y$]

Page 60:

Equations

$$Y_k(t) = f_3\Big(\sum_j w_{jk}^{(2)}\, h_j^{(2)}(t) + \sum_j w_{jk}^{(1,3)}\, h_j^{(1)}(t) + b_k^{(3)}\Big), \quad k = 1 \ldots M$$

$$h_i^{(2)}(t) = f_2\Big(\sum_j w_{ji}^{(1,2)}\, h_j^{(1)}(t) + \sum_j w_{ji}^{(0,2)} X_j(t) + \sum_j w_{ji}^{(2,2)}\, h_j^{(2)}(t-1) + b_i^{(2)}\Big)$$

$$h_i^{(1)}(t) = f_1\Big(\sum_j w_{ji}^{(0,1)} X_j(t) + \sum_j w_{ji}^{(1,1)}\, h_j^{(1)}(t-1) + b_i^{(1)}\Big)$$

$$h_i^{(1)}(-1),\; h_i^{(2)}(-1) = \text{part of network parameters}$$

Page 61:

Variants on recurrent nets

• 1: Conventional MLP

• 2: Sequence generation, e.g. image to caption

• 3: Sequence-based prediction or classification, e.g. speech recognition, text classification

[Images from Karpathy]

Page 62:

Variants

• 1: Delayed sequence to sequence

• 2: Sequence to sequence, e.g. the stock problem, label prediction

• Etc..

[Images from Karpathy]

Page 63:

How do we train the network

• Back propagation through time (BPTT)

• Given a collection of sequence inputs $(\mathbf{X}_i, \mathbf{D}_i)$, where

– $\mathbf{X}_i = X_{i,0}, \ldots, X_{i,T}$

– $\mathbf{D}_i = D_{i,0}, \ldots, D_{i,T}$

• Train the network parameters to minimize the error between the output of the network $\mathbf{Y}_i = Y_{i,0}, \ldots, Y_{i,T}$ and the desired outputs

– This is the most generic setting. In other settings we just "remove" some of the input or output entries

[Figure: network unrolled over t = 0..T with inputs X(0)..X(T), outputs Y(0)..Y(T), initial state h(-1)]

Page 64:

Training: Forward pass

• For each training input:

• Forward pass: pass the entire data sequence through the network and generate the outputs

[Figure: network unrolled over t = 0..T with initial state h(-1)]

Page 65:

Training: Computing gradients

• For each training input:

• Backward pass: Compute gradients via backpropagation

– Back Propagation Through Time

[Figure: network unrolled over t = 0..T with initial state h(-1)]
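In practice, these two passes are what frameworks automate. A sketch of one training step using PyTorch autograd (illustrative; the following slides derive the same gradients by hand):

```python
import torch
import torch.nn as nn

T, dx, dh, dy = 20, 3, 16, 2
rnn = nn.RNN(input_size=dx, hidden_size=dh)   # tanh state recurrence
readout = nn.Linear(dh, dy)                   # output layer
h0 = nn.Parameter(torch.zeros(1, 1, dh))      # trainable initial state h(-1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()) + [h0],
                      lr=1e-2)

X = torch.randn(T, 1, dx)                     # one input sequence (T, batch=1, dx)
D = torch.randn(T, 1, dy)                     # desired outputs

H, _ = rnn(X, h0)                             # forward pass over the whole sequence
loss = ((readout(H) - D) ** 2).mean()         # a squared-error divergence DIV
opt.zero_grad()
loss.backward()                               # backward pass: BPTT via autograd
opt.step()
```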

Page 66:

Back Propagation Through Time

[Figure: the network unrolled over t = 0..T, inputs X(0) … X(T), outputs Y(0) … Y(T), initial state h(-1)]

• We will only focus on one training instance

• All subscripts represent components, not the training-instance index

Page 67:

Back Propagation Through Time

[Figure: the unrolled network; outputs Y(0) … Y(T) are compared with the desired outputs D(1..T) through a divergence DIV]

• The divergence is computed between the sequence of outputs produced by the network and the desired sequence of outputs

• This is not just the sum of the divergences at individual times

– Unless we explicitly define it that way

Page 68:

Back Propagation Through Time

[Figure: the unrolled network with divergence DIV]

First step of backprop: Compute $\frac{dDIV}{dY_i(T)}$ for all $i$

In general we will be required to compute $\frac{dDIV}{dY_i(t)}$ for all $i$ and $t$, as we will see. This can be a source of significant difficulty in many scenarios.

Page 69:

Back Propagation Through Time

[Figure: per-time divergences Div(0) … Div(T) between each Y(t) and D(t), combining into the total DIV]

Must compute $\frac{dDIV}{dY_i(t)}$ for all $i$ and all $t$

Special case, when the overall divergence is a simple combination of local divergences at each time: will usually get

$$\frac{dDIV}{dY_i(t)} = \frac{dDiv(t)}{dY_i(t)}$$

Page 70:

Back Propagation Through Time

[Figure: the unrolled network with divergence DIV]

First step of backprop: Compute $\frac{dDIV}{dY_i(T)}$ for all $i$

$$\frac{dDIV}{dZ_i^{(1)}(T)} = \frac{dDIV}{dY_i(T)}\, \frac{dY_i(T)}{dZ_i^{(1)}(T)}$$

OR, for a vector output activation:

$$\frac{dDIV}{dZ_i^{(1)}(T)} = \sum_j \frac{dDIV}{dY_j(T)}\, \frac{dY_j(T)}{dZ_i^{(1)}(T)}$$

$$\nabla_{Z^{(1)}(T)} DIV = \nabla_{Y(T)} DIV \; \nabla_{Z^{(1)}(T)} Y(T)$$

Page 71:

Back Propagation Through Time

[Figure: the unrolled network with divergence DIV]

$$\frac{dDIV}{dZ_i^{(1)}(T)} = \frac{dDiv(T)}{dY_i(T)}\, \frac{dY_i(T)}{dZ_i^{(1)}(T)}$$

$$\frac{dDIV}{dh_i(T)} = \sum_j \frac{dDIV}{dZ_j^{(1)}(T)}\, \frac{dZ_j^{(1)}(T)}{dh_i(T)} = \sum_j w_{ij}^{(1)}\, \frac{dDIV}{dZ_j^{(1)}(T)}$$

$$\nabla_{h(T)} DIV = \nabla_{Z^{(1)}(T)} DIV \; W^{(1)}$$

Page 72:

Back Propagation Through Time

[Figure: the unrolled network with divergence DIV]

$$\frac{dDIV}{dZ_i^{(1)}(T)} = \frac{dDiv(T)}{dY_i(T)}\, \frac{dY_i(T)}{dZ_i^{(1)}(T)}$$

$$\frac{dDIV}{dh_i(T)} = \sum_j w_{ij}^{(1)}\, \frac{dDIV}{dZ_j^{(1)}(T)}$$

$$\frac{dDIV}{dw_{ij}^{(1)}} = \frac{dDIV}{dZ_j^{(1)}(T)}\, \frac{dZ_j^{(1)}(T)}{dw_{ij}^{(1)}} = \frac{dDIV}{dZ_j^{(1)}(T)}\, h_i(T)$$

$$\nabla_{W^{(1)}} DIV = h(T)\, \nabla_{Z^{(1)}(T)} DIV$$

Page 73:

Back Propagation Through Time

[Figure: the unrolled network with divergence DIV]

$$\frac{dDIV}{dZ_i^{(1)}(T)} = \frac{dDIV}{dY_i(T)}\, \frac{dY_i(T)}{dZ_i^{(1)}(T)}$$

$$\frac{dDIV}{dh_i(T)} = \sum_j w_{ij}^{(1)}\, \frac{dDIV}{dZ_j^{(1)}(T)}$$

$$\frac{dDIV}{dw_{ij}^{(1)}} = \frac{dDIV}{dZ_j^{(1)}(T)}\, h_i(T)$$

$$\frac{dDIV}{dZ_i^{(0)}(T)} = \frac{dDIV}{dh_i(T)}\, \frac{dh_i(T)}{dZ_i^{(0)}(T)}$$

$$\nabla_{Z^{(0)}(T)} DIV = \nabla_{h(T)} DIV \; \nabla_{Z^{(0)}(T)} h(T)$$

Page 74:

Back Propagation Through Time

[Figure: the unrolled network with divergence DIV]

$$\frac{dDIV}{dw_{ij}^{(0)}} = \frac{dDIV}{dZ_j^{(0)}(T)}\, X_i(T)$$

$$\nabla_{W^{(0)}} DIV = X(T)\, \nabla_{Z^{(0)}(T)} DIV$$

Page 75:

Back Propagation Through Time

[Figure: the unrolled network with divergence DIV]

$$\frac{dDIV}{dw_{ij}^{(11)}} = \frac{dDIV}{dZ_j^{(0)}(T)}\, h_i(T-1)$$

$$\frac{dDIV}{dw_{ij}^{(0)}} = \frac{dDIV}{dZ_j^{(0)}(T)}\, X_i(T)$$

$$\nabla_{W^{(11)}} DIV = h(T-1)\, \nabla_{Z^{(0)}(T)} DIV$$

Page 76:

Back Propagation Through Time

[Figure: the unrolled network with divergence DIV]

$$\frac{dDIV}{dZ_i^{(1)}(T-1)} = \frac{dDIV}{dY_i(T-1)}\, \frac{dY_i(T-1)}{dZ_i^{(1)}(T-1)}$$

OR, for a vector output activation:

$$\frac{dDIV}{dZ_i^{(1)}(T-1)} = \sum_j \frac{dDIV}{dY_j(T-1)}\, \frac{dY_j(T-1)}{dZ_i^{(1)}(T-1)}$$

$$\nabla_{Z^{(1)}(T-1)} DIV = \nabla_{Y(T-1)} DIV \; \nabla_{Z^{(1)}(T-1)} Y(T-1)$$

Page 77:

Back Propagation Through Time

[Figure: the unrolled network with divergence DIV]

$$\frac{dDIV}{dh_i(T-1)} = \sum_j w_{ij}^{(1)}\, \frac{dDIV}{dZ_j^{(1)}(T-1)} + \sum_j w_{ij}^{(11)}\, \frac{dDIV}{dZ_j^{(0)}(T)}$$

$$\nabla_{h(T-1)} DIV = \nabla_{Z^{(1)}(T-1)} DIV \; W^{(1)} + \nabla_{Z^{(0)}(T)} DIV \; W^{(11)}$$

Page 78:

Back Propagation Through Time

[Figure: the unrolled network with divergence DIV]

$$\frac{dDIV}{dh_i(T-1)} = \sum_j w_{ij}^{(1)}\, \frac{dDIV}{dZ_j^{(1)}(T-1)} + \sum_j w_{ij}^{(11)}\, \frac{dDIV}{dZ_j^{(0)}(T)}$$

$$\frac{dDIV}{dw_{ij}^{(1)}} \; += \; \frac{dDIV}{dZ_j^{(1)}(T-1)}\, h_i(T-1)$$

Note the addition

$$\nabla_{W^{(1)}} DIV \; += \; h(T-1)\, \nabla_{Z^{(1)}(T-1)} DIV$$

Page 79:

Back Propagation Through Time

[Figure: the unrolled network with divergence DIV]

$$\frac{dDIV}{dZ_i^{(0)}(T-1)} = \frac{dDIV}{dh_i(T-1)}\, \frac{dh_i(T-1)}{dZ_i^{(0)}(T-1)}$$

$$\nabla_{Z^{(0)}(T-1)} DIV = \nabla_{h(T-1)} DIV \; \nabla_{Z^{(0)}(T-1)} h(T-1)$$

Page 80:

Back Propagation Through Time

[Figure: the unrolled network with divergence DIV]

$$\frac{dDIV}{dZ_i^{(0)}(T-1)} = \frac{dDIV}{dh_i(T-1)}\, \frac{dh_i(T-1)}{dZ_i^{(0)}(T-1)}$$

$$\frac{dDIV}{dw_{ij}^{(0)}} \; += \; \frac{dDIV}{dZ_j^{(0)}(T-1)}\, X_i(T-1)$$

Note the addition

$$\nabla_{W^{(0)}} DIV \; += \; X(T-1)\, \nabla_{Z^{(0)}(T-1)} DIV$$

Page 81:

Back Propagation Through Time

[Figure: the unrolled network with divergence DIV]

$$\frac{dDIV}{dw_{ij}^{(0)}} \; += \; \frac{dDIV}{dZ_j^{(0)}(T-1)}\, X_i(T-1)$$

$$\frac{dDIV}{dw_{ij}^{(11)}} \; += \; \frac{dDIV}{dZ_j^{(0)}(T-1)}\, h_i(T-2)$$

Note the addition

$$\nabla_{W^{(11)}} DIV \; += \; h(T-2)\, \nabla_{Z^{(0)}(T-1)} DIV$$

Page 82:

Back Propagation Through Time

[Figure: the unrolled network with divergence DIV]

$$\frac{dDIV}{dh_{-1,i}} = \sum_j w_{ij}^{(11)}\, \frac{dDIV}{dZ_j^{(0)}(0)}$$

$$\nabla_{h_{-1}} DIV = \nabla_{Z^{(0)}(0)} DIV \; W^{(11)}$$

Page 83:

Back Propagation Through Time

[Figure: the unrolled network with divergence DIV]

$$\frac{dDIV}{dZ_i^{(k)}(t)} = \frac{dDIV}{dh_i^{(k)}(t)}\, f_k'\big(Z_i^{(k)}(t)\big)$$

$$\frac{dDIV}{dh_i^{(k)}(t)} = \sum_j w_{i,j}^{(k)}\, \frac{dDIV}{dZ_j^{(k+1)}(t)} + \sum_j w_{i,j}^{(k,k)}\, \frac{dDIV}{dZ_j^{(k)}(t+1)}$$

Not showing derivatives at the output neurons

Page 84:

Back Propagation Through Time

[Figure: the unrolled network with divergence DIV]

$$\frac{dDIV}{dw_{ij}^{(0)}} = \sum_t \frac{dDIV}{dZ_j^{(0)}(t)}\, X_i(t)$$

$$\frac{dDIV}{dw_{ij}^{(11)}} = \sum_t \frac{dDIV}{dZ_j^{(0)}(t)}\, h_i(t-1)$$

$$\frac{dDIV}{dh_{-1,i}} = \sum_j w_{ij}^{(11)}\, \frac{dDIV}{dZ_j^{(0)}(0)}$$
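Put together, the recursions above give a complete hand-written BPTT. A sketch for the single-hidden-layer net (biases dropped and a squared-error divergence assumed for brevity; $Z^{(0)}$ and $Z^{(1)}$ are the hidden and output pre-activations):

```python
import numpy as np

def bptt(X, D, W0, W11, W1, h_init):
    # Forward: h(t) = tanh(Z0(t)), Z0(t) = W0 X(t) + W11 h(t-1);  Y(t) = Z1(t) = W1 h(t)
    T = len(X)
    H, Y = [h_init], []
    for t in range(T):
        H.append(np.tanh(W0 @ X[t] + W11 @ H[-1]))
        Y.append(W1 @ H[-1])
    # Backward: accumulate weight gradients over all t (the "+=" of the slides)
    gW0, gW11, gW1 = np.zeros_like(W0), np.zeros_like(W11), np.zeros_like(W1)
    dZ0_next = np.zeros_like(h_init)          # dDIV/dZ0(t+1); zero beyond t = T
    for t in reversed(range(T)):
        dZ1 = Y[t] - D[t]                     # dDIV/dZ1(t) for DIV = 0.5*sum_t ||Y(t)-D(t)||^2
        dh = W1.T @ dZ1 + W11.T @ dZ0_next    # output term + next-time recurrence term
        dZ0 = dh * (1 - H[t + 1] ** 2)        # through tanh: f'(Z0) = 1 - h(t)^2
        gW1 += np.outer(dZ1, H[t + 1])
        gW0 += np.outer(dZ0, X[t])
        gW11 += np.outer(dZ0, H[t])           # H[t] is h(t-1)
        dZ0_next = dZ0
    gh_init = W11.T @ dZ0_next                # dDIV/dh(-1)
    return gW0, gW11, gW1, gh_init
```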

Page 85:

BPTT

• Can be generalized to any architecture

Page 86:

Extensions to the RNN: Bidirectional RNN

• RNN with both forward and backward recursion

– Explicitly models the fact that just as the future can be predicted from the past, the past can be deduced from the future

Page 87:

Bidirectional RNN

• A forward net processes the data from t=0 to t=T

• A backward net processes it backward, from t=T down to t=0

[Figure: a forward chain over X(0) … X(T) with initial state hf(-1), and a backward chain over the same inputs with initial state hb(inf), both feeding outputs Y(0) … Y(T)]
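A minimal sketch of the bidirectional forward computation (illustrative weights; the output here simply sums a linear readout of both states):

```python
import numpy as np

def rnn_states(X, W, U, h_init):
    # Hidden states of a simple tanh recurrence over the sequence X
    h, states = h_init, []
    for x in X:
        h = np.tanh(W @ x + U @ h)
        states.append(h)
    return states

def birnn(X, fwd, bwd, Vf, Vb):
    hf = rnn_states(X, *fwd)                # forward net: t = 0 .. T
    hb = rnn_states(X[::-1], *bwd)[::-1]    # backward net: t = T .. 0, re-aligned
    return [Vf @ f + Vb @ b for f, b in zip(hf, hb)]  # each Y(t) uses both states

rng = np.random.default_rng(0)
dx, dh, dy, T = 3, 5, 2, 8
fwd = (rng.standard_normal((dh, dx)), rng.standard_normal((dh, dh)), np.zeros(dh))  # hf(-1)
bwd = (rng.standard_normal((dh, dx)), rng.standard_normal((dh, dh)), np.zeros(dh))  # hb(inf)
Y = birnn(rng.standard_normal((T, dx)), fwd, bwd,
          rng.standard_normal((dy, dh)), rng.standard_normal((dy, dh)))
```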

Page 88:

Bidirectional RNN: Processing an input string

• The forward net processes the data from t=0 to t=T

– Only computing the hidden states, initially

• The backward net processes it backward, from t=T down to t=0

[Figure: the forward chain only, over inputs X(0) … X(T), initial state hf(-1)]

Page 89:

Bidirectional RNN: Processing an input string

• The backward net processes the input data in reverse time, end to beginning

– Initially, only the hidden state values are computed

– Clearly, this is not an online process; it requires the entire input data

– Note: this is not the backward pass of backprop

• The backward net processes it backward, from t=T down to t=0

[Figure: forward and backward chains over X(0) … X(T), initial states hf(-1) and hb(inf)]

Page 90:

Bidirectional RNN: Processing an input string

• The computed states of both networks are used to compute the final output at each time

[Figure: both chains feeding the outputs Y(0) … Y(T)]

Page 91:

Backpropagation in BRNNs

• Forward pass: Compute both the forward and backward networks and the final output

[Figure: both chains over X(0) … X(T) feeding outputs Y(0) … Y(T)]

Page 92:

Backpropagation in BRNNs

• Backward pass: Define a divergence from the desired output

• Separately perform back propagation on both nets

– From t=T down to t=0 for the forward net

– From t=0 up to t=T for the backward net

[Figure: divergence Div() between outputs Y(0) … Y(T) and targets d1..dT, backpropagated into both chains]

Pages 93–94:

Backpropagation in BRNNs

• The same backward pass, shown separately on the forward net and on the backward net

Page 95:

RNNs..

• Excellent models for time-series analysis tasks

– Time-series prediction

– Time-series classification

– Sequence prediction..

Page 96:

So how did this happen

Page 97:

So how did this happen

More on this later..

Page 98:

RNNs..

• Excellent models for time-series analysis tasks

– Time-series prediction

– Time-series classification

– Sequence prediction..

– They can even simplify some problems that are difficult for MLPs

• Next class..