Lecture 6: LSTMs


Page 1: Lecture 6: LSTMs

Harvard AC295/CS287r/CSCI E-115B, Chris Tanner

Contextualized, Token-based Representations

Lecture 6: LSTMs

Page 2: Lecture 6: LSTMs

Stayin' alive was no jive […] but it was just a dream for the teen, who was a fiend. Started [hustlin' at] 16, and runnin' up in [LSTM gates].

-- Raekwon of Wu-Tang (1994)

Page 3: Lecture 6: LSTMs

ANNOUNCEMENTS

• HW2 coming very soon. Due in 2 weeks.
• Research Proposals are due in 9 days, Sept 30.
• Office Hours:
  • This week, my OH will be pushed back 30 min: 3:30pm – 5:30pm
  • Please reserve your coding questions for the TFs and/or EdStem, as I hold office hours solo, and debugging code can easily bottleneck the queue.

Page 4: Lecture 6: LSTMs

RECAP: L4

Distributed Representations: dense vectors (aka embeddings) that aim to convey the meaning of tokens:

• "word embeddings" refer to when you have type-based representations
• "contextualized embeddings" refer to when you have token-based representations

Page 5: Lecture 6: LSTMs

RECAP: L4

An auto-regressive LM is one that only has access to the previous tokens (and the outputs become the inputs).

Evaluation: Perplexity

A masked LM can peek ahead, too. It "masks" a word within the context (i.e., the center word).

Evaluation: downstream NLP tasks that use the learned embeddings.

Both of these can produce useful word embeddings.
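To make the "Evaluation: Perplexity" line concrete, here is a minimal sketch (mine, not from the slides) of how perplexity is computed from a model's per-token log-probabilities; the toy numbers are made up.

import math

def perplexity(token_log_probs):
    # token_log_probs: natural-log probabilities log p(w_t | w_1 .. w_{t-1}), one per token
    avg_neg_log_prob = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_prob)

# Toy check: an LM that assigns probability 0.25 to every token has perplexity 4
# (it is "as confused as" a uniform choice among 4 words).
print(perplexity([math.log(0.25)] * 10))   # 4.0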

Page 6: Lecture 6: LSTMs

RECAP: L5

• RNNs help capture more context while avoiding sparsity, storage, and compute issues!
• The hidden layer is what we care about. It represents the word's "meaning".

[Diagram: a single RNN step, with input layer x_t, hidden layer, and output layer ŷ_t, connected by the weight matrices W (input to hidden), U (hidden to output), and the recurrent V.]

Page 7: Lecture 6: LSTMs

Outline

Recurrent Neural Nets (RNNs)
Long Short-Term Memory (LSTMs)
Bi-LSTM and ELMo


Page 9: Lecture 6: LSTMs

Training Process

[Diagram: the RNN at time step 1. The input x₁ = "She" feeds the hidden layer through W, and the hidden layer produces the output ŷ₁ through U. The error at this step is CE(y₁, ŷ₁).]

CE(y_t, ŷ_t) = −Σ_{w ∈ vocab} y_{t,w} log(ŷ_{t,w})

Page 10: Lecture 6: LSTMs

Training Process

[Diagram: the RNN unrolled over two time steps, "She went". The recurrent weight V carries the hidden state from step 1 to step 2; the errors are CE(y₁, ŷ₁) and CE(y₂, ŷ₂).]

CE(y_t, ŷ_t) = −Σ_{w ∈ vocab} y_{t,w} log(ŷ_{t,w})

Page 11: Lecture 6: LSTMs

Training Process

[Diagram: the RNN unrolled over three time steps, "She went to", with W and U shared at every step and V relaying the hidden state between steps; errors CE(y₁, ŷ₁), CE(y₂, ŷ₂), CE(y₃, ŷ₃).]

CE(y_t, ŷ_t) = −Σ_{w ∈ vocab} y_{t,w} log(ŷ_{t,w})

Page 12: Lecture 6: LSTMs

Training Process

[Diagram: the RNN unrolled over four time steps, "She went to class", with a per-step error CE(y_t, ŷ_t) at each of the four outputs.]

CE(y_t, ŷ_t) = −Σ_{w ∈ vocab} y_{t,w} log(ŷ_{t,w})

Page 13: Lecture 6: LSTMs

Training Process

[Diagram: the same unrolled RNN over "She went to class", with a per-step error at each output.]

During training, regardless of our output predictions, we feed in the correct inputs.

CE(y_t, ŷ_t) = −Σ_{w ∈ vocab} y_{t,w} log(ŷ_{t,w})

Page 14: Lecture 6: LSTMs

Training Process

[Diagram: the same unrolled RNN; above each output, the predicted next word is shown ("went? over? class? after?").]

CE(y_t, ŷ_t) = −Σ_{w ∈ vocab} y_{t,w} log(ŷ_{t,w})

Page 15: Lecture 6: LSTMs

Training Process

[Diagram: the same unrolled RNN with per-step predictions and errors.]

CE(y_t, ŷ_t) = −Σ_{w ∈ vocab} y_{t,w} log(ŷ_{t,w})

Our total loss is simply the average loss across all T time steps (sketched below).
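As a concrete, hedged illustration of that averaged loss, here is a minimal PyTorch sketch; it assumes a language model that emits one vocabulary-sized logit vector per time step under teacher forcing, and the names and shapes are illustrative, not the lecture's code.

import torch
import torch.nn.functional as F

def lm_loss(logits, targets):
    # logits:  (T, vocab_size) -- one ŷ_t per time step (teacher forcing: gold inputs were fed in)
    # targets: (T,)            -- index of the correct word y_t at each step
    per_step_ce = F.cross_entropy(logits, targets, reduction="none")  # CE(y_t, ŷ_t) for each t
    return per_step_ce.mean()                                         # average over all T time steps

# Toy usage with random numbers:
T, vocab_size = 4, 10
print(lm_loss(torch.randn(T, vocab_size), torch.randint(0, vocab_size, (T,))))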

Page 16: Lecture 6: LSTMs

Training Details

[Diagram: the same unrolled RNN over "She went to class", highlighting the loss at the final step, CE(y₄, ŷ₄).]

To update our weights (e.g. V), we calculate the gradient of our loss w.r.t. the repeated weight matrix (e.g., ∂L/∂V).

Using the chain rule, we trace the derivative all the way back to the beginning, while summing the results.

Page 17: Lecture 6: LSTMs

Training Details

[Diagram: the same network; the gradient ∂L/∂V starts at the final-step loss and flows back through the most recent application of the recurrent weight, labeled V₃.]

To update our weights (e.g. V), we calculate the gradient of our loss w.r.t. the repeated weight matrix (e.g., ∂L/∂V).

Using the chain rule, we trace the derivative all the way back to the beginning, while summing the results.

Page 18: Lecture 6: LSTMs

Training Details

[Diagram: the gradient path continues backward, through V₃ and then V₂.]

To update our weights (e.g. V), we calculate the gradient of our loss w.r.t. the repeated weight matrix (e.g., ∂L/∂V).

Using the chain rule, we trace the derivative all the way back to the beginning, while summing the results.

Page 19: Lecture 6: LSTMs

RNN Training Details

[Diagram: the gradient path runs back through every application of the recurrent weight: V₃, then V₂, then V₁.]

To update our weights (e.g. V), we calculate the gradient of our loss w.r.t. the repeated weight matrix (e.g., ∂L/∂V).

Using the chain rule, we trace the derivative all the way back to the beginning, while summing the results.

Page 20: Lecture 6: LSTMs

Training Details

• This backpropagation through time (BPTT) process is expensive
• Instead of updating after every time step, we tend to do so every T steps (e.g., every sentence or paragraph), as sketched below
• This isn't equivalent to using only a window of size T (à la n-grams) because we still have 'infinite memory'
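Here is a minimal truncated-BPTT sketch of that "update every T steps" idea, assuming PyTorch and an LSTM-style language model whose forward takes (inputs, hidden) and returns (logits, hidden); everything below is illustrative, not the lecture's code.

import torch

def train_truncated_bptt(model, criterion, optimizer, batches):
    # `batches` yields consecutive (inputs, targets) chunks of length T from one long sequence.
    hidden = None
    for inputs, targets in batches:
        if hidden is not None:
            # Detach so gradients stop at the chunk boundary, but the hidden *values*
            # still carry earlier context forward ("infinite memory").
            hidden = tuple(h.detach() for h in hidden)
        optimizer.zero_grad()
        logits, hidden = model(inputs, hidden)      # unroll T steps
        loss = criterion(logits.view(-1, logits.size(-1)), targets.view(-1))
        loss.backward()                             # BPTT over this chunk only
        optimizer.step()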

Page 21: Lecture 6: LSTMs

RNN: Generation

We can generate the most likely next event (e.g., word) by sampling from ŷ.

Continue until we generate the <EOS> symbol.
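A minimal sampling-loop sketch of this procedure (mine, not the lecture's); it assumes a model that maps (previous token id, hidden state) to next-token logits, plus hypothetical start/eos token ids.

import torch

def generate(model, start_id, eos_id, max_len=50):
    tokens, hidden = [start_id], None
    for _ in range(max_len):
        inp = torch.tensor([[tokens[-1]]])            # feed the previously generated token back in
        logits, hidden = model(inp, hidden)           # distribution ŷ over the vocabulary
        probs = torch.softmax(logits[0, -1], dim=-1)
        next_id = torch.multinomial(probs, 1).item()  # sample from ŷ (instead of taking the argmax)
        tokens.append(next_id)
        if next_id == eos_id:                         # stop once <EOS> is generated
            break
    return tokens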

Page 22: Lecture 6: LSTMs

RNN: Generation

We can generate the most likely next event (e.g., word) by sampling from ŷ. Continue until we generate the <EOS> symbol.

[Diagram: generation step 1. The input x₁ = <START>; the sampled output is "Sorry".]

Page 23: Lecture 6: LSTMs

RNN: Generation

[Diagram: generation step 2. The previous output "Sorry" is fed back in as x₂; the text so far is: "Sorry" Harry]

Page 24: Lecture 6: LSTMs

RNN: Generation

[Diagram: generation step 3. The text so far is: "Sorry" Harry shouted,]

Page 25: Lecture 6: LSTMs

RNN: Generation

[Diagram: generation step 4. The text so far is: "Sorry" Harry shouted, panicking]

Page 26: Lecture 6: LSTMs

RNN: Generation

NOTE: the same input (e.g., "Harry") can easily yield different outputs, depending on the context (unlike FFNNs and n-grams).

[Diagram: the same four-step generation example.]

Page 27: Lecture 6: LSTMs

RNN: Generation

When trained on Harry Potter text, it generates:

Source: https://medium.com/deep-writing/harry-potter-written-by-artificial-intelligence-8a9431803da6

Page 28: Lecture 6: LSTMs

RNN: Generation

When trained on recipes, it generates:

Source: https://gist.github.com/nylki/1efbaa36635956d35bcc

Page 29: Lecture 6: LSTMs

RNNs: Overview

RNN STRENGTHS?

• Can handle infinite-length sequences (not just a fixed window)
• Has a "memory" of the context (thanks to the hidden layer's recurrent loop)
• Same weights used for all inputs, so positionality isn't wonky/overwritten (like FFNN)

RNN ISSUES?

• Slow to train (BPTT)
• Due to the "infinite sequence", gradients can easily vanish or explode
• Has trouble actually making use of long-range context


Page 31: Lecture 6: LSTMs

RNNs: Vanishing and Exploding Gradients

[Diagram: the unrolled RNN over "She went to class", with the loss CE(y₄, ŷ₄) at the final step and the repeated recurrent weight labeled V₁, V₂, V₃ across the steps.]

∂L₄/∂V₁ = ?

Page 32: Lecture 6: LSTMs

RNNs: Vanishing and Exploding Gradients

[Same diagram.]

∂L₄/∂V₁ = ∂L₄/∂V₃ · …

Page 33: Lecture 6: LSTMs

RNNs: Vanishing and Exploding Gradients

[Same diagram.]

∂L₄/∂V₁ = ∂L₄/∂V₃ · ∂V₃/∂V₂ · …

Page 34: Lecture 6: LSTMs

RNNs: Vanishing and Exploding Gradients

[Same diagram.]

∂L₄/∂V₁ = ∂L₄/∂V₃ · ∂V₃/∂V₂ · ∂V₂/∂V₁

Page 35: Lecture 6: LSTMs

RNNs: Vanishing and Exploding Gradients

[Same diagram.]

∂L₄/∂V₁ = ∂L₄/∂V₃ · ∂V₃/∂V₂ · ∂V₂/∂V₁

This long path makes it easy for the gradients to become really small or large.

If small, the far-away context will be "forgotten."

If large, recency bias and no context.

Page 36: Lecture 6: LSTMs

Activation Functions

[Plots of the sigmoid, tanh, and ReLU activation functions.]

Page 37: Lecture 6: LSTMs

Derivatives of Activation Functions

[Plots of the derivatives of the sigmoid, tanh, and ReLU activations.]
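A small numeric sketch (mine) of why these derivatives matter for vanishing gradients: the sigmoid's derivative never exceeds 0.25 and tanh's never exceeds 1.0, so a long chain of such factors shrinks toward zero.

import numpy as np

x = np.linspace(-6, 6, 1001)
sigmoid = 1 / (1 + np.exp(-x))
d_sigmoid = sigmoid * (1 - sigmoid)      # maximum 0.25, at x = 0
d_tanh = 1 - np.tanh(x) ** 2             # maximum 1.0, at x = 0

print(d_sigmoid.max(), d_tanh.max())     # ~0.25, ~1.0
print(0.25 ** 20)                        # ~9.1e-13: twenty chained sigmoid-derivative factors vanish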

Page 38: Lecture 6: LSTMs

Chain Reactions

Tanh activations for all


Page 39: Lecture 6: LSTMs

Lipschitz

A real-valued function f: ℝ → ℝ is Lipschitz continuous if

∃ K s.t. ∀ x₁, x₂:  |f(x₁) − f(x₂)| ≤ K |x₁ − x₂|

Re-worded, if x₁ ≠ x₂:

|f(x₁) − f(x₂)| / |x₁ − x₂| ≤ K

Page 40: Lecture 6: LSTMs

Gradients

We assert our Neural Net's objective function f is well-behaved and Lipschitz continuous w/ a constant L:

|f(x) − f(y)| ≤ L ‖x − y‖

We update our parameters by η·g

⟹ |f(x) − f(x − η·g)| ≤ L·η·‖g‖

Page 41: Lecture 6: LSTMs

Gradients

We assert our Neural Net's objective function f is well-behaved and Lipschitz continuous w/ a constant L:

|f(x) − f(y)| ≤ L ‖x − y‖

We update our parameters by η·g

⟹ |f(x) − f(x − η·g)| ≤ L·η·‖g‖

This means we will never observe a change by more than L·η·‖g‖.

Page 42: Lecture 6: LSTMs

Gradients

We update our parameters by η·g ⟹ |f(x) − f(x − η·g)| ≤ L·η·‖g‖

PRO:

• Limits the extent to which things can go wrong if we move in the wrong direction

CONS:

• Limits the speed of making progress
• Gradients may still become quite large and the optimizer may not converge

How can we limit the gradients from being so large?

Page 43: Lecture 6: LSTMs

Exploding Gradients

Sources: https://www.deeplearningbook.org/contents/rnn.html ; Pascanu et al., 2013, http://proceedings.mlr.press/v28/pascanu13.pdf

Page 44: Lecture 6: LSTMs

Gradient Clipping

• Ensures the norm of the gradient never exceeds the threshold τ (see the sketch below)
• Only adjusts the magnitude of each gradient, not the direction (good)
• Helps w/ numerical stability of training; no general improvement in performance
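Here is a minimal sketch of that clipping rule (assuming PyTorch; the model and threshold are placeholders). PyTorch's built-in torch.nn.utils.clip_grad_norm_ implements the same idea.

import torch

def clip_gradients(parameters, tau):
    # Rescale all gradients so their global L2 norm never exceeds tau;
    # the direction is preserved, only the magnitude changes.
    params = [p for p in parameters if p.grad is not None]
    if not params:
        return
    total_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in params))
    if total_norm > tau:
        scale = tau / (total_norm + 1e-6)
        for p in params:
            p.grad.mul_(scale)

# Roughly equivalent built-in: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=tau)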

Page 45: Lecture 6: LSTMs

Outline

Recurrent Neural Nets (RNNs)
Long Short-Term Memory (LSTMs)
Bi-LSTM and ELMo


Page 47: Lecture 6: LSTMs

LSTM

• A type of RNN that is designed to better handle long-range dependencies
• In "vanilla" RNNs, the hidden state is perpetually being rewritten
• In addition to a traditional hidden state h, let's have a dedicated memory cell c for long-term events. More power to relay sequence info.

Page 48: Lecture 6: LSTMs

LSTM

At each time step t, we have a hidden state h_t and a cell state c_t:

• Both are vectors of length n
• The cell state c_t stores long-term info
• At each time step t, the LSTM erases, writes, and reads information from the cell c_t
• c_t never undergoes a nonlinear activation though, just – and +
• + of two things does not modify the gradient; it simply adds the gradients

Page 49: Lecture 6: LSTMs

LSTM

[Diagram: at each time step, the hidden layer receives the input x_t along with the previous step's C and H.]

C and H relay long- and short-term memory to the hidden layer, respectively. Inside the hidden layer, there are many weights.

Page 50: Lecture 6: LSTMs

LSTM

[Diagram: the LSTM cell at time step t, taking H_{t−1} and C_{t−1} from the previous step and passing H_t and C_t to the next.]

Diagram: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 51: Lecture 6: LSTMs

LSTM

[Diagram: the same cell.] Some old memories are "forgotten".

Diagram: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 52: Lecture 6: LSTMs

LSTM

[Diagram: the same cell.] Some old memories are "forgotten"; some new memories are made.

Diagram: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 53: Lecture 6: LSTMs

LSTM

[Diagram: the same cell.] Some old memories are "forgotten"; some new memories are made. A nonlinear, weighted version of the long-term memory becomes our short-term memory.

Diagram: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 54: Lecture 6: LSTMs

LSTM

[Diagram: the same cell.] Some old memories are "forgotten"; some new memories are made. A nonlinear, weighted version of the long-term memory becomes our short-term memory. Memory is written, erased, and read by three gates, which are influenced by x and h.

Diagram: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 55: Lecture 6: LSTMs

Forget Gate

Imagine the cell state currently has values related to a previous topic, but the conversation has shifted.

Page 56: Lecture 6: LSTMs

Input Gate

Decides which values to update (by scaling each piece of the new, incoming info).

Page 57: Lecture 6: LSTMs

Cell State

The cell state forgets some info, and is simultaneously updated by adding new content to it.

Page 58: Lecture 6: LSTMs

Hidden State

Decide what our new hidden representation will be, based on:

• a filtered version of the cell state, and
• a weighted version of the recurrent hidden layer.

(The four gates are pulled together in the sketch below.)
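Here is a minimal single-time-step sketch of the standard LSTM update (the equations behind the colah diagrams referenced above); the weight/bias layout and names are mine, not the lecture's.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W and b hold one weight matrix / bias vector per gate, applied to [h_prev, x_t].
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ z + b["f"])         # forget gate: what to erase from the cell state
    i = sigmoid(W["i"] @ z + b["i"])         # input gate: which new values to write
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate new memories
    o = sigmoid(W["o"] @ z + b["o"])         # output gate: what to expose as the hidden state
    c_t = f * c_prev + i * c_tilde           # forget, then add (no nonlinearity applied to c itself)
    h_t = o * np.tanh(c_t)                   # hidden state: a filtered view of the cell state
    return h_t, c_t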

Page 59: Lecture 6: LSTMs

LSTM

It's still possible for LSTMs to suffer from vanishing/exploding gradients, but it's way less likely than with vanilla RNNs:

• If an RNN wishes to preserve info over long contexts, it must delicately find a recurrent weight matrix W_h that isn't too large or small
• However, LSTMs have 3 separate mechanisms that adjust the flow of information (e.g., the forget gate, if turned off, will preserve all info)

Page 60: Lecture 6: LSTMs

The cell learns an operative time to "turn on".


Page 62: Lecture 6: LSTMs

LSTM

LSTM STRENGTHS?

• Almost always outperforms vanilla RNNs
• Captures long-range dependencies shockingly well

LSTM ISSUES?

• Has more weights to learn than vanilla RNNs; thus,
• Requires a moderate amount of training data (otherwise, vanilla RNNs are better)
• Can still suffer from vanishing/exploding gradients

Page 63: Lecture 6: LSTMs

Sequential Modelling

IMPORTANT

If your goal isn't to predict the next item in a sequence, and you'd rather do some other classification or regression task using the sequence, then you can:

• Train an aforementioned model (e.g., an LSTM) as a language model
• Use the hidden layers that correspond to each item in your sequence (a minimal sketch follows below)
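A minimal sketch (mine, not the lecture's) of the many-to-1 setup shown on the next slides: run an LSTM over the sequence and feed its final hidden state to a small classifier. All names and sizes are illustrative.

import torch
import torch.nn as nn

class SentimentClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.classify = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        x = self.embed(token_ids)
        outputs, (h_n, c_n) = self.lstm(x)        # h_n: (1, batch, hidden_dim)
        return self.classify(h_n[-1])             # use only the final hidden state

# e.g., logits = SentimentClassifier(10000)(torch.randint(0, 10000, (4, 12)))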

Page 64: Lecture 6: LSTMs

Sequential Modelling

[Diagram: two configurations side by side. Left: Language Modelling (auto-regressive), where each step's target output is the next word x_{t+1}. Right: 1-to-1 tagging/classification (non-auto-regressive), where each step outputs a prediction ŷ_t for its own input x_t.]

Page 65: Lecture 6: LSTMs

Sequential Modelling

Many-to-1 classification

[Diagram: an RNN/LSTM reads x₁ … x₄ and only the final output ŷ₄ is used, e.g. as a sentiment score.]


Page 67: Lecture 6: LSTMs

This concludes the foundation in sequential representation.

Most state-of-the-art advances are based on those core RNN/LSTM ideas. But, with tens of thousands of researchers and hackers exploring deep learning, there are many tweaks that have proven useful.

(This is where things get crazy.)

Page 68: Lecture 6: LSTMs

Outline

Recurrent Neural Nets (RNNs)
Long Short-Term Memory (LSTMs)
Bi-LSTM and ELMo


Page 70: Lecture 6: LSTMs

RNN Extensions: Bi-directional LSTMs

RNNs/LSTMs use the left-to-right context and sequentially process data.

If you have full access to the data at testing time, why not make use of the flow of information from right-to-left, also?

Page 71: Lecture 6: LSTMs

RNN Extensions: Bi-directional LSTMs

For brevity, let's use the following schematic to represent an RNN:

[Diagram: inputs x₁ … x₄ feeding a single row of forward hidden states h₁ … h₄.]

Page 72: Lecture 6: LSTMs

RNN Extensions: Bi-directional LSTMs

[Diagram: two copies of the schematic over the same inputs x₁ … x₄: a forward (left-to-right) RNN and a backward (right-to-left) RNN, each with its own hidden states.]

Page 73: Lecture 6: LSTMs

RNN Extensions: Bi-directional LSTMs

Concatenate the hidden layers.

[Diagram: at each position t, the forward and backward hidden states are concatenated into a single representation.]

Page 74: Lecture 6: LSTMs

RNN Extensions: Bi-directional LSTMs

Concatenate the hidden layers.

[Diagram: the concatenated forward/backward hidden states at each position feed an output layer, producing ŷ₁ … ŷ₄.]
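A minimal PyTorch sketch of this concatenation (not the lecture's code): with bidirectional=True, nn.LSTM already returns, at each position, the forward and backward hidden states concatenated.

import torch
import torch.nn as nn

emb_dim, hidden_dim, seq_len, batch = 50, 64, 7, 2
bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

x = torch.randn(batch, seq_len, emb_dim)   # an already-embedded input sequence
outputs, _ = bilstm(x)                     # shape: (batch, seq_len, 2 * hidden_dim)
# outputs[:, t, :hidden_dim] is the forward state at position t;
# outputs[:, t, hidden_dim:] is the backward state at the same position.
print(outputs.shape)                       # torch.Size([2, 7, 128])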

Page 75: Lecture 6: LSTMs

RNN Extensions: Bi-directional LSTMs

BI-LSTM STRENGTHS?

• Usually performs at least as well as uni-directional RNNs/LSTMs

BI-LSTM ISSUES?

• Slower to train
• Only possible if access to the full data is allowed

Page 76: Lecture 6: LSTMs

RNN Extensions: Stacked LSTMs

[Diagram: inputs x₁ … x₄ feeding hidden layer #1.]

Hidden layers provide an abstraction (holds "meaning"). Stacking hidden layers provides increased abstractions.

Page 77: Lecture 6: LSTMs

RNN Extensions: Stacked LSTMs

[Diagram: the same, with a second hidden layer (#2) stacked on top of hidden layer #1.]

Hidden layers provide an abstraction (holds "meaning"). Stacking hidden layers provides increased abstractions.

Page 78: Lecture 6: LSTMs

RNN Extensions: Stacked LSTMs

[Diagram: the same, with an output layer on top of hidden layer #2, producing ŷ₁ … ŷ₄.]

Hidden layers provide an abstraction (holds "meaning"). Stacking hidden layers provides increased abstractions.

Page 79: Lecture 6: LSTMs

ELMo: Stacked Bi-directional LSTMs

General Idea:

• Goal is to get highly rich embeddings for each word (unique type)
• Use both directions of context (bi-directional), with increasing abstractions (stacked)
• Linearly combine all abstract representations (hidden layers) and optimize w.r.t. a particular task (e.g., sentiment classification); see the sketch below

ELMo Slides: https://www.slideshare.net/shuntaroy/a-review-of-deep-contextualized-word-representations-peters-2018
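A minimal sketch of the "linearly combine all abstract representations" step (an ELMo-style scalar mix); this is my own illustration with made-up shapes, not the authors' code.

import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    # Task-specific weighted sum of per-layer representations, as described for ELMo.
    def __init__(self, num_layers):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # learned per downstream task
        self.gamma = nn.Parameter(torch.ones(1))                    # learned global scale

    def forward(self, layer_reps):
        # layer_reps: (num_layers, batch, seq_len, dim) -- stacked biLSTM (+ embedding) layers
        w = torch.softmax(self.layer_weights, dim=0)
        mixed = (w.view(-1, 1, 1, 1) * layer_reps).sum(dim=0)
        return self.gamma * mixed

# e.g., ScalarMix(3)(torch.randn(3, 2, 7, 1024)).shape -> torch.Size([2, 7, 1024])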

Page 80: Lecture 6: LSTMs

ELMo: Stacked Bi-directional LSTMs

Illustration: http://jalammar.github.io/illustrated-bert/

Page 81: Lecture 6: LSTMs

Illustration: http://jalammar.github.io/illustrated-bert/

Page 82: Lecture 6: LSTMs

ELMo: Stacked Bi-directional LSTMs

Deep contextualized word representations. Peters et al. NAACL 2018.

Page 83: Lecture 6: LSTMs

ELMo: Stacked Bi-directional LSTMs

• ELMo yielded incredibly good contextualized embeddings, which produced SOTA results when applied to many NLP tasks.
• Main ELMo takeaway: given enough training data, having tons of explicit connections between your vectors is useful (the system can determine how to best use context).

ELMo Slides: https://www.slideshare.net/shuntaroy/a-review-of-deep-contextualized-word-representations-peters-2018

Page 84: Lecture 6: LSTMs

SUMMARY

• Distributed Representations can be:
  • Type-based ("word embeddings")
  • Token-based ("contextualized representations/embeddings")
• Type-based models include Bengio's 2003 model and word2vec (2013)
• Token-based models include RNNs/LSTMs, which:
  • demonstrated profound results from 2015 onward, and
  • can be used for essentially any NLP task.

Page 85: Lecture 6: LSTMs

BACKUP
