
Long/Short-Term Memory

Mark Hasegawa-Johnson
All content CC-SA 4.0 unless otherwise specified.

University of Illinois

ECE 417: Multimedia Signal Processing, Fall 2020


1 Review: Recurrent Neural Networks

2 Vanishing/Exploding Gradient

3 Running Example: a Pocket Calculator

4 Regular RNN

5 Forget Gate

6 Long Short-Term Memory (LSTM)

7 Backprop for an LSTM

8 Conclusion



Recurrent Neural Net (RNN) = Nonlinear(IIR)

Image CC-SA-4.0 by Ixnay,

https://commons.wikimedia.org/wiki/File:Recurrent_neural_network_unfold.svg


Back-Propagation and Causal Graphs

[Figure: causal graph in which x feeds hidden nodes h_0 and h_1, which feed y.]

dy/dx = ∑_{i=0}^{N−1} (dy/dh_i)(∂h_i/∂x)

For each h_i, we find the total derivative of y w.r.t. h_i, multiplied by the partial derivative of h_i w.r.t. x.


Back-Propagation Through Time

Back-propagation through time computes the error gradient at each time step based on the error gradients at future time steps. If the forward-prop equation is

y[n] = g(e[n]),   e[n] = x[n] + ∑_{m=1}^{M−1} w[m] y[n−m],

then the BPTT equation is

δ[n] = dE/de[n] = ∂E/∂e[n] + ∑_{m=1}^{M−1} δ[n+m] w[m] ġ(e[n])

Weight update, for an RNN, multiplies the back-prop times the forward-prop:

dE/dw[m] = ∑_n δ[n] y[n−m]
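To make these equations concrete, here is a minimal NumPy sketch of forward-prop and BPTT for a single-node nonlinear-IIR RNN. The function names, the choice g = tanh, and the squared-error loss are my own assumptions, not from the lecture.

```python
import numpy as np

def forward(x, w, g=np.tanh):
    """y[n] = g(e[n]),  e[n] = x[n] + sum_{m=1}^{M-1} w[m]*y[n-m]  (w[0] is unused)."""
    N, M = len(x), len(w)
    e, y = np.zeros(N), np.zeros(N)
    for n in range(N):
        e[n] = x[n] + sum(w[m] * y[n - m] for m in range(1, M) if n - m >= 0)
        y[n] = g(e[n])
    return e, y

def bptt(x, d, w, g=np.tanh, gdot=lambda e: 1.0 - np.tanh(e) ** 2):
    """delta[n] = dE/de[n] and dE/dw[m] for E = 0.5*sum_n (y[n]-d[n])^2."""
    e, y = forward(x, w, g)
    N, M = len(x), len(w)
    delta = np.zeros(N)
    for n in reversed(range(N)):
        # future term: sum_m delta[n+m]*w[m]*g-dot(e[n])
        future = sum(delta[n + m] * w[m] for m in range(1, M) if n + m < N)
        delta[n] = (y[n] - d[n]) * gdot(e[n]) + future * gdot(e[n])
    dEdw = np.zeros(M)
    for m in range(1, M):
        dEdw[m] = sum(delta[n] * y[n - m] for n in range(N) if n - m >= 0)
    return delta, dEdw
```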



Vanishing/Exploding Gradient

The “vanishing gradient” problem refers to the tendency of dy[n+m]/de[n] to disappear, exponentially, when m is large.

The “exploding gradient” problem refers to the tendency of dy[n+m]/de[n] to explode toward infinity, exponentially, when m is large.

If the largest feedback coefficient is |w[m]| > 1, then you get exploding gradient. If |w[m]| < 1, you get vanishing gradient.


Example: A Memorizer Network

Suppose that we have a very simple RNN:

y[n] = wx[n] + uy[n−1]

Suppose that x[n] is only nonzero at time 0:

x[n] = x0 if n = 0, and x[n] = 0 if n ≠ 0.

Suppose that, instead of measuring x[0] directly, we are only allowed to measure the output of the RNN m time-steps later. Our goal is to learn w and u so that y[m] remembers x0, thus:

E = (1/2)(y[m] − x0)²


Example: A Memorizer Network

Now, how do we perform gradient update of the weights? If

y[n] = wx[n] + uy[n−1]

then

dE/dw = ∑_n (dE/dy[n]) (∂y[n]/∂w) = ∑_n (dE/dy[n]) x[n] = (dE/dy[0]) x0

But the error is defined as

E = (1/2)(y[m] − x0)²

so

dE/dy[0] = u (dE/dy[1]) = u² (dE/dy[2]) = · · · = u^m (y[m] − x0)


Example: Vanishing Gradient

So we find out that the gradient, w.r.t. the coefficient w, is either exponentially small, or exponentially large, depending on whether |u| < 1 or |u| > 1:

dE/dw = x0 (y[m] − x0) u^m

In other words, if our application requires the neural net to wait m time steps before generating its output, then the gradient is exponentially smaller, and therefore training the neural net is exponentially harder.

Exponential Decay

Image CC-SA-4.0, PeterQ, Wikipedia
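A quick numerical check makes the exponential scaling visible; the 100-step delay and the particular values of u below are my own illustration, not from the lecture:

```python
# dE/dw = x0*(y[m]-x0)*u**m, so its magnitude scales as |u|**m.
for u in (0.9, 0.99, 1.0, 1.01, 1.1):
    print(f"u = {u:4.2f}:  u**100 = {u ** 100:.3e}")
# u = 0.90:  u**100 = 2.656e-05   (gradient vanishes)
# u = 1.10:  u**100 = 1.378e+04   (gradient explodes)
```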



Notation

Today’s lecture will try to use notation similar to the Wikipedia page for LSTM.

x [t] = input at time t

y [t] = target/desired output

c[t] = excitation at time t OR LSTM cell

h[t] = activation at time t OR LSTM output

u = feedback coefficient

w = feedforward coefficient

b = bias


Running Example: a Pocket Calculator

The rest of this lecture will refer to a toy application called “pocket calculator.”

Pocket Calculator

When x[t] > 0, add it to the current tally: c[t] = c[t−1] + x[t].

When x[t] = 0,
1 Print out the current tally, h[t] = c[t−1], and then
2 Reset the tally to zero, c[t] = 0.

Example Signals

Input: x[t] = 1, 2, 1, 0, 1, 1, 1, 0

Target Output: y[t] = 0, 0, 0, 4, 0, 0, 0, 3
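For reference, here is a direct (non-neural) implementation of this behavior; the function name is mine, and the printed result matches the target output above:

```python
def pocket_calculator(x):
    """Accumulate positive inputs; when the input is zero, print the tally and reset."""
    tally, outputs = 0, []
    for xt in x:
        if xt > 0:
            tally += xt            # c[t] = c[t-1] + x[t]
            outputs.append(0)      # nothing is printed yet
        else:
            outputs.append(tally)  # h[t] = c[t-1]
            tally = 0              # c[t] = 0
    return outputs

print(pocket_calculator([1, 2, 1, 0, 1, 1, 1, 0]))  # [0, 0, 0, 4, 0, 0, 0, 3]
```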




One-Node One-Tap Linear RNN

Suppose that we have a very simple RNN:

Excitation: c[t] = x[t] + u h[t−1]

Activation: h[t] = σh(c[t])

where σh() is some feedback nonlinearity. In this simple example, let’s just use σh(c[t]) = c[t], i.e., no nonlinearity.

GOAL: Find u so that h[t] ≈ y[t]. In order to make the problem easier, we will only score an “error” when y[t] ≠ 0:

E = (1/2) ∑_{t: y[t]>0} (h[t] − y[t])²


RNN: u = 1?

Obviously, if we want to just add numbers, we should just set u = 1. Then the RNN is computing

Excitation: c[t] = x[t] + h[t−1]

Activation: h[t] = σh(c[t])

That works until the first zero-valued input. But then it just keeps on adding.

RNN with u = 1


RNN: u = 0.5?

Can we get decent results using u = 0.5?

Advantage: by the time we reach x[t] = 0, the sum has kind of leaked away from us (c[t] ≈ 0), so a hard reset is not necessary.

Disadvantage: by the time we reach x[t] = 0, the sum has kind of leaked away from us (h[t] ≈ 0).

RNN with u = 0.5


Gradient Descent

c[t] = x[t] + u h[t−1]

h[t] = σh(c[t])

Let’s try initializing u = 0.5, and then performing gradient descent to improve it. Gradient descent has five steps:

1 Forward Propagation: c[t] = x[t] + u h[t−1], h[t] = c[t].

2 Synchronous Backprop: ε[t] = ∂E/∂c[t].

3 Back-Prop Through Time: δ[t] = dE/dc[t].

4 Weight Gradient: dE/du = ∑_t δ[t] h[t−1]

5 Gradient Descent: u ← u − η dE/du


Gradient Descent

Excitation: c[t] = x[t] + u h[t−1]

Activation: h[t] = σh(c[t])

Error: E = (1/2) ∑_{t: y[t]>0} (h[t] − y[t])²

So the back-prop stages are:

Synchronous Backprop: ε[t] = ∂E/∂c[t] = (h[t] − y[t]) if y[t] > 0, else 0

BPTT: δ[t] = dE/dc[t] = ε[t] + u δ[t+1]

Weight Gradient: dE/du = ∑_t δ[t] h[t−1]
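Putting the five steps together for this one-node linear RNN gives roughly the following sketch; the learning rate η = 0.01 and the function name are my own choices, not specified in the lecture:

```python
import numpy as np

def rnn_epoch(x, y, u, eta=0.01):
    """One epoch of the five steps for c[t] = x[t] + u*h[t-1], h[t] = c[t]."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    T = len(x)
    # 1. Forward propagation
    h = np.zeros(T)
    for t in range(T):
        h[t] = x[t] + u * (h[t - 1] if t > 0 else 0.0)
    # 2. Synchronous backprop: eps[t] = h[t]-y[t] where scored, else 0
    eps = np.where(y > 0, h - y, 0.0)
    # 3. Backprop through time: delta[t] = eps[t] + u*delta[t+1]
    delta = np.zeros(T)
    for t in reversed(range(T)):
        delta[t] = eps[t] + (u * delta[t + 1] if t + 1 < T else 0.0)
    # 4. Weight gradient: dE/du = sum_t delta[t]*h[t-1]
    dEdu = sum(delta[t] * h[t - 1] for t in range(1, T))
    # 5. Gradient descent
    return u - eta * dEdu

x = [1, 2, 1, 0, 1, 1, 1, 0]
y = [0, 0, 0, 4, 0, 0, 0, 3]
u = rnn_epoch(x, y, u=0.5)   # one update starting from the slide's u = 0.5
```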


Backprop Stages

ε[t] = (h[t] − y[t]) if y[t] > 0, else 0

δ[t] = ε[t] + u δ[t+1]

dE/du = ∑_t δ[t] h[t−1]

Backprop Stages, u = 0.5


Vanishing Gradient and Exploding Gradient

Notice that, with |u| < 1, δ[t] tends to vanish exponentially fast as we go backward in time. This is called the vanishing gradient problem. It is a big problem for RNNs with long time-dependency, and for deep neural nets with many layers.

If we set |u| > 1, we get an even worse problem, sometimes called the exploding gradient problem.


RNN, u = 1.7

c[t] = x[t] + u h[t−1]

RNN, u = 1.7

δ[t] = ε[t] + u δ[t+1]



Hochreiter and Schmidhuber’s Solution: The Forget Gate

Instead of multiplying by the same weight, u, at each time step, Hochreiter and Schmidhuber proposed: let’s make the feedback coefficient a function of the input!

Excitation: c[t] = x[t] + f[t] h[t−1]

Activation: h[t] = σh(c[t])

Forget Gate: f[t] = σg(wf x[t] + uf h[t−1] + bf)

where σh() and σg() might be different nonlinearities. In particular, it’s OK for σh() to be linear (σh(c) = c), but σg() should be clipped so that 0 ≤ f[t] ≤ 1, in order to avoid gradient explosion.
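As a minimal sketch, the forward pass of this forget-gate RNN might look as follows, assuming a linear σh and a logistic σg (the weights wf, uf, bf would normally be learned; this code just evaluates the equations):

```python
import math

def forget_gate_rnn(x, wf, uf, bf):
    """h[t] = c[t] = x[t] + f[t]*h[t-1], with f[t] = sigmoid(wf*x[t] + uf*h[t-1] + bf)."""
    h_prev, outputs = 0.0, []
    for xt in x:
        f = 1.0 / (1.0 + math.exp(-(wf * xt + uf * h_prev + bf)))  # forget gate in (0, 1)
        c = xt + f * h_prev        # excitation
        h_prev = c                 # activation: sigma_h(c) = c
        outputs.append(h_prev)
    return outputs
```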


The Forget-Gate Nonlinearity

The forget gate is

f[t] = σg(wf x[t] + uf h[t−1] + bf)

where σg() is some nonlinearity such that 0 ≤ σg() ≤ 1. Two such nonlinearities are worth knowing about.


Forget-Gate Nonlinearity #1: CReLU

The first useful nonlinearity is the CReLU (clipped rectified linear unit), defined as

σg(wf x + uf h + bf) = min(1, max(0, wf x + uf h + bf))

The CReLU is particularly useful for knowledge-based design. That’s because σ(1) = 1 and σ(0) = 0, so it is relatively easy to design the weights wf, uf, and bf to get the results you want.

The CReLU is not very useful, though, if you want to choose your weights using gradient descent. What usually happens is that wf grows larger and larger for the first 2-3 epochs of training, and then suddenly wf is so large that σ(wf x + uf h + bf) = 0 for all training tokens. At that point, the gradient is dE/dw = 0, so further gradient-descent training is useless.
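A tiny sketch of the CReLU and its (sub)gradient; the zero-gradient region outside 0 < z < 1 is exactly what stalls gradient descent as described above:

```python
def crelu(z):
    """Clipped ReLU: min(1, max(0, z))."""
    return min(1.0, max(0.0, z))

def crelu_grad(z):
    """Derivative is 1 only for 0 < z < 1; elsewhere it is 0, so once the
    excitation saturates, gradient descent on wf, uf, bf stops moving."""
    return 1.0 if 0.0 < z < 1.0 else 0.0
```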


Forget-Gate Nonlinearity #2: Logistic Sigmoid

The second useful nonlinearity is the logistic sigmoid, defined as:

σg(wf x + uf h + bf) = 1 / (1 + e^{−(wf x + uf h + bf)})

The logistic sigmoid is particularly useful for gradient descent. That’s because its gradient is defined for all values of wf. In fact, it has a really simple form, that can be written in terms of the output: σ̇ = σ(1 − σ).

The logistic sigmoid is not as useful for knowledge-based design. That’s because 0 < σ < 1: as x → −∞, σ(x) → 0, but it never quite reaches it. Likewise as x → ∞, σ(x) → 1, but it never quite reaches it.
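A matching sketch of the logistic sigmoid and its derivative σ̇ = σ(1 − σ), with a finite-difference check at an arbitrary point:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_dot(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # sigma-dot = sigma*(1 - sigma), nonzero for every z

z = 0.3
print(sigmoid_dot(z), (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6)  # the two agree
```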


Pocket Calculator

When x[t] > 0, accumulate the input, and print out nothing.

When x[t] = 0, print out the accumulator, then reset.

. . . but the “print out nothing” part is not scored, only the accumulation. Furthermore, nonzero input is always x[t] ≥ 1.


Pocket Calculator

With zero error, we can approximate the pocket calculator as

When x[t] ≥ 1, accumulate the input.

When x[t] = 0, print out the accumulator, then reset.

E = (1/2) ∑_{t: y[t]>0} (h[t] − y[t])² = 0


Forget-Gate Implementation of the Pocket Calculator

It seems like we can approximate the pocket calculator as:

When x[t] ≥ 1, accumulate the input: c[t] = x[t] + h[t−1].

When x[t] = 0, print out the accumulator, then reset: c[t] = x[t].

So it seems that we just want the forget gate set to

f[t] = 1 if x[t] ≥ 1, 0 if x[t] = 0

This can be accomplished as

f[t] = CReLU(x[t]) = max(0, min(1, x[t]))
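Running this knowledge-based design on the example signals (a minimal sketch; the function name is mine) already hints at the problem discussed below under “What Went Wrong?”:

```python
def crelu(z):
    return min(1.0, max(0.0, z))

def forget_gate_calculator(x):
    """Forget-gate RNN with the hand-designed gate f[t] = CReLU(x[t])."""
    h_prev, outputs = 0.0, []
    for xt in x:
        f = crelu(xt)             # 1 when x[t] >= 1, 0 when x[t] = 0
        c = xt + f * h_prev       # c[t] = x[t] + f[t]*h[t-1]
        h_prev = c                # h[t] = c[t]
        outputs.append(h_prev)
    return outputs

print(forget_gate_calculator([1, 2, 1, 0, 1, 1, 1, 0]))
# [1.0, 3.0, 4.0, 0.0, 1.0, 2.0, 3.0, 0.0] -- the tally is wiped out on the
# very step where it should have been printed (the target was 4 and 3 there).
```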


Forget Gate Implementation of the Pocket Calculator

c[t] = x[t] + f[t] h[t−1]

h[t] = c[t]

f[t] = CReLU(x[t])

Forward Prop


Forget Gate Implementation of the Pocket Calculator

c[t] = x[t] + f[t] h[t−1]

h[t] = c[t]

f[t] = CReLU(x[t])

Back Prop


What Went Wrong?

The forget gate correctly turned itself on (remember the past) when x[t] > 0, and turned itself off (forget the past) when x[t] = 0.

Unfortunately, we don’t want to forget the past when x[t] = 0. We want to forget the past on the next time step after x[t] = 0.

Coincidentally, we also don’t want any output when x[t] > 0. The error criterion doesn’t score those samples, but maybe it should.



Long Short-Term Memory (LSTM)

The LSTM solves those problems by defining two types of memory, and three types of gates. The two types of memory are

1 The “cell,” c[t], corresponds to the excitation in an RNN.

2 The “output” or “prediction,” h[t], corresponds to the activation in an RNN.

The three gates are:

1 The cell remembers the past only when the forget gate is on, f[t] = 1.

2 The cell accepts input only when the input gate is on, i[t] = 1.

3 The cell is output only when the output gate is on, o[t] = 1.


Long Short-Term Memory (LSTM)

The three gates are:

1 The cell remembers the past only when the forget gate is on, f[t] = 1.

2 The cell accepts input only when the input gate is on, i[t] = 1.

c[t] = f[t] c[t−1] + i[t] σh(wc x[t] + uc h[t−1] + bc)

3 The cell is output only when the output gate is on, o[t] = 1.

h[t] = o[t] c[t]


Characterizing Human Memory

[Diagram: blocks labeled PERCEPTION, INPUT GATE, SHORT TERM, LONG TERM, OUTPUT GATE, and ACTION.]

Pr{remember} = p_LTM e^{−t/T_LTM} + (1 − p_LTM) e^{−t/T_STM}


When Should You Remember?

c[t] = f[t] c[t−1] + i[t] σh(wc x[t] + uc h[t−1] + bc)

h[t] = o[t] c[t]

1 The forget gate is a function of current input and past output, f[t] = σg(wf x[t] + uf h[t−1] + bf)

2 The input gate is a function of current input and past output, i[t] = σg(wi x[t] + ui h[t−1] + bi)

3 The output gate is a function of current input and past output, o[t] = σg(wo x[t] + uo h[t−1] + bo)


Neural Network Model: LSTM

i[t] = input gate = σg(wi x[t] + ui h[t−1] + bi)

o[t] = output gate = σg(wo x[t] + uo h[t−1] + bo)

f[t] = forget gate = σg(wf x[t] + uf h[t−1] + bf)

c[t] = memory cell = f[t] c[t−1] + i[t] σh(wc x[t] + uc h[t−1] + bc)

h[t] = output = o[t] c[t]
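For concreteness, one time step of this scalar LSTM might be coded as below; σg is taken as the logistic sigmoid, σh is left linear as in the running example, and the parameter dictionary p and its key names are my own:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of the scalar LSTM; p holds wi, ui, bi, wo, uo, bo, wf, uf, bf, wc, uc, bc."""
    i = sigmoid(p['wi'] * x_t + p['ui'] * h_prev + p['bi'])            # input gate
    o = sigmoid(p['wo'] * x_t + p['uo'] * h_prev + p['bo'])            # output gate
    f = sigmoid(p['wf'] * x_t + p['uf'] * h_prev + p['bf'])            # forget gate
    c = f * c_prev + i * (p['wc'] * x_t + p['uc'] * h_prev + p['bc'])  # memory cell
    h = o * c                                                          # output
    return h, c
```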


Example: Pocket Calculator

i[t] = CReLU(1)

o[t] = CReLU(1 − x[t])

f[t] = CReLU(1 − h[t−1])

c[t] = f[t] c[t−1] + i[t] x[t]

h[t] = o[t] c[t]

Forward Prop
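Tracing these hand-designed gates on the example input reproduces the target output. This sketch assumes the cell input is simply x[t] (i.e. wc = 1, uc = bc = 0), as the c[t] equation above implies:

```python
def crelu(z):
    return min(1.0, max(0.0, z))

def lstm_calculator(x):
    """LSTM with i[t]=CReLU(1), o[t]=CReLU(1-x[t]), f[t]=CReLU(1-h[t-1])."""
    h_prev, c_prev, outputs = 0.0, 0.0, []
    for xt in x:
        i = crelu(1.0)             # always accept the input
        o = crelu(1.0 - xt)        # output only when x[t] = 0
        f = crelu(1.0 - h_prev)    # forget only on the step after an output
        c = f * c_prev + i * xt
        h = o * c
        h_prev, c_prev = h, c
        outputs.append(h)
    return outputs

print(lstm_calculator([1, 2, 1, 0, 1, 1, 1, 0]))
# [0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 3.0] -- matches the target y[t]
```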



Backprop for a normal RNN

In a normal RNN, each epoch of gradient descent has five steps:

1 Forward-prop: find the node excitation and activation, moving forward through time.

2 Synchronous backprop: find the partial derivative of error w.r.t. node excitation at each time, assuming all other time steps are constant.

3 Back-prop through time: find the total derivative of error w.r.t. node excitation at each time.

4 Weight gradient: find the total derivative of error w.r.t. each weight and each bias.

5 Gradient descent: adjust each weight and bias in the direction of the negative gradient.


Backprop for an LSTM

An LSTM differs from a normal RNN in that, instead of just one memory unit at each time step, we now have two memory units and three gates. Each of them depends on the previous time step. Since there are so many variables, let’s stop back-propagating to excitations. Instead, we’ll just back-prop to compute the derivative of the error w.r.t. each of the variables:

δh[t] = dE/dh[t],  δc[t] = dE/dc[t],  δi[t] = dE/di[t],  δo[t] = dE/do[t],  δf[t] = dE/df[t]

The partial derivatives are easy, though. Error can’t depend directly on any of the internal variables; it can only depend directly on the output, h[t]:

εh[t] = ∂E/∂h[t]


Backprop for an LSTM

In an LSTM, we’ll implement each epoch of gradient descent with five steps:

1 Forward-prop: find all five of the variables at each time step, moving forward through time.

2 Synchronous backprop: find the partial derivative of error w.r.t. h[t].

3 Back-prop through time: find the total derivative of error w.r.t. each of the five variables at each time, starting with h[t].

4 Weight gradient: find the total derivative of error w.r.t. each weight and each bias.

5 Gradient descent: adjust each weight and bias in the direction of the negative gradient.


Synchronous Back-Prop: the Output

Suppose the error term is

E = (1/2) ∑_{t=−∞}^{∞} (h[t] − y[t])²

Then the first step, in back-propagation, is to calculate the partial derivative w.r.t. the prediction term h[t]:

εh[t] = ∂E/∂h[t] = h[t] − y[t]


Synchronous Back-Prop: the other variables

Remember that the error is defined only in terms of the output, h[t]. So, actually, partial derivatives with respect to the other variables are all zero!

εi[t] = ∂E/∂i[t] = 0

εo[t] = ∂E/∂o[t] = 0

εf[t] = ∂E/∂f[t] = 0

εc[t] = ∂E/∂c[t] = 0


Back-Prop Through Time

Back-prop through time is really tricky in an LSTM, because four of the five variables depend on the previous time step, either on h[t−1] or on c[t−1]:

i[t] = σg(wi x[t] + ui h[t−1] + bi)

o[t] = σg(wo x[t] + uo h[t−1] + bo)

f[t] = σg(wf x[t] + uf h[t−1] + bf)

c[t] = f[t] c[t−1] + i[t] σh(wc x[t] + uc h[t−1] + bc)

h[t] = o[t] c[t]


Back-Prop Through Time

Taking the partial derivative of each variable at time t w.r.t. the variables at time t−1, we get

∂i[t]/∂h[t−1] = σ̇g(wi x[t] + ui h[t−1] + bi) ui

∂o[t]/∂h[t−1] = σ̇g(wo x[t] + uo h[t−1] + bo) uo

∂f[t]/∂h[t−1] = σ̇g(wf x[t] + uf h[t−1] + bf) uf

∂c[t]/∂h[t−1] = i[t] σ̇h(wc x[t] + uc h[t−1] + bc) uc

∂c[t]/∂c[t−1] = f[t]

where σ̇g and σ̇h denote the derivatives of σg and σh.


Back-Prop Through Time

Using the standard rule for partial and total derivatives, we get a really complicated rule for h[t]:

dE/dh[t] = ∂E/∂h[t] + (dE/di[t+1]) (∂i[t+1]/∂h[t]) + (dE/do[t+1]) (∂o[t+1]/∂h[t]) + (dE/df[t+1]) (∂f[t+1]/∂h[t]) + (dE/dc[t+1]) (∂c[t+1]/∂h[t])

The rule for c[t] is a bit simpler, because ∂E/∂c[t] = 0, so we don’t need to include it:

dE/dc[t] = (dE/dh[t]) (∂h[t]/∂c[t]) + (dE/dc[t+1]) (∂c[t+1]/∂c[t])


Back-Prop Through Time

If we define δh[t] = dE/dh[t], and so on, then we have

δh[t] = εh[t] + δi[t+1] σ̇g(wi x[t+1] + ui h[t] + bi) ui
      + δo[t+1] σ̇g(wo x[t+1] + uo h[t] + bo) uo
      + δf[t+1] σ̇g(wf x[t+1] + uf h[t] + bf) uf
      + δc[t+1] i[t+1] σ̇h(wc x[t+1] + uc h[t] + bc) uc

The rule for c[t] is a bit simpler:

δc[t] = δh[t] o[t] + δc[t+1] f[t+1]


Back-Prop Through Time

BPTT for the gates is easy, because nothing at time t+1 depends directly on o[t], i[t], or f[t]. The only dependence is indirect, by way of h[t] and c[t]:

δo[t] = dE/do[t] = (dE/dh[t]) (∂h[t]/∂o[t]) = δh[t] c[t]

δi[t] = dE/di[t] = (dE/dc[t]) (∂c[t]/∂i[t]) = δc[t] σh(wc x[t] + uc h[t−1] + bc)

δf[t] = dE/df[t] = (dE/dc[t]) (∂c[t]/∂f[t]) = δc[t] c[t−1]
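Collecting the whole backward pass, here is a minimal NumPy sketch of these BPTT recursions for a scalar LSTM, again assuming a logistic σg, a linear σh, squared error, and my own parameter names:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x, p):
    """Scalar LSTM forward pass; zc[t] = wc*x[t] + uc*h[t-1] + bc is kept for backprop."""
    T = len(x)
    i, o, f, c, h, zc = (np.zeros(T) for _ in range(6))
    h_prev, c_prev = 0.0, 0.0
    for t in range(T):
        i[t] = sigmoid(p['wi'] * x[t] + p['ui'] * h_prev + p['bi'])
        o[t] = sigmoid(p['wo'] * x[t] + p['uo'] * h_prev + p['bo'])
        f[t] = sigmoid(p['wf'] * x[t] + p['uf'] * h_prev + p['bf'])
        zc[t] = p['wc'] * x[t] + p['uc'] * h_prev + p['bc']
        c[t] = f[t] * c_prev + i[t] * zc[t]          # sigma_h(z) = z
        h[t] = o[t] * c[t]
        h_prev, c_prev = h[t], c[t]
    return i, o, f, c, h, zc

def lstm_bptt(x, y, p):
    """delta_h, delta_c, delta_i, delta_o, delta_f for E = 0.5*sum_t (h[t]-y[t])^2."""
    i, o, f, c, h, zc = lstm_forward(x, p)
    T = len(x)
    dh, dc, di, do_, df = (np.zeros(T) for _ in range(5))
    for t in reversed(range(T)):
        dh[t] = h[t] - y[t]                          # eps_h[t]
        if t + 1 < T:
            # h[t] feeds the three gates and the cell input at time t+1
            dh[t] += (di[t+1] * i[t+1] * (1 - i[t+1]) * p['ui']
                      + do_[t+1] * o[t+1] * (1 - o[t+1]) * p['uo']
                      + df[t+1] * f[t+1] * (1 - f[t+1]) * p['uf']
                      + dc[t+1] * i[t+1] * p['uc'])  # sigma_g-dot = s(1-s); sigma_h-dot = 1
        dc[t] = dh[t] * o[t] + (dc[t+1] * f[t+1] if t + 1 < T else 0.0)
        do_[t] = dh[t] * c[t]
        di[t] = dc[t] * zc[t]                        # sigma_h(zc[t]) = zc[t]
        df[t] = dc[t] * (c[t-1] if t > 0 else 0.0)
    return dh, dc, di, do_, df
```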



Conclusion

RNNs suffer from either exponentially decreasing memory (if |w| < 1) or exponentially increasing memory (if |w| > 1). This is one version of a more general problem sometimes called the gradient vanishing problem.

The forget gate solves that problem by making the feedback coefficient a function of the input.

LSTM defines two types of memory (cell = excitation = “long-term memory,” and output = activation = “short-term memory”), and three types of gates (input, output, forget).

Each epoch of LSTM training has the same steps as in a regular RNN:

1 Forward propagation: find h[t].
2 Synchronous backprop: find the time-synchronous partial derivatives ε[t].
3 BPTT: find the total derivatives δ[t].
4 Weight gradients
5 Gradient descent