
Page 1:

Deep Learning

Sequence to Sequence models: Attention Models

17 March 2018

1

Page 2:

Sequence-to-sequence modelling

• Problem:

– A sequence 𝑋1…𝑋𝑁 goes in

– A different sequence 𝑌1…𝑌𝑀 comes out

• E.g.

– Speech recognition: Speech goes in, a word sequence comes out

• Alternately output may be phoneme or character sequence

– Machine translation: Word sequence goes in, word sequence comes out

• In general 𝑁 ≠ 𝑀

– No synchrony between 𝑋 and 𝑌.

2

Page 3:

Sequence to sequence

• Sequence goes in, sequence comes out

• No notion of “synchrony” between input and

output

– May even not have a notion of “alignment”

• E.g. “I ate an apple” → “Ich habe einen apfel gegessen”

3

[Figure: a Seq2seq block maps “I ate an apple” to “Ich habe einen apfel gegessen”]

Page 4:

Recap: Have dealt with the “aligned” case: CTC

• The input and output sequences happen in the same

order

– Although they may be asynchronous

– E.g. Speech recognition

• The input speech corresponds to the phoneme sequence output

[Figure: time-aligned input X(t) and output Y(t), unrolled from t=0 with initial state h-1]

Page 5:

Today

• Sequence goes in, sequence comes out

• No notion of “synchrony” between input and

output

– May even not have a notion of “alignment”

• E.g. “I ate an apple” → “Ich habe einen apfel gegessen”

5

[Figure: a Seq2seq block maps “I ate an apple” to “Ich habe einen apfel gegessen”]

Page 6:

Recap: Predicting text

• Simple problem: Given a series of symbols (characters or words) w1 w2… wn, predict the next symbol (character or word) wn+1

6

Page 7:

Simple recurrence : Text Modelling

• Learn a model that can predict the next character given a sequence of characters

– Or, at a higher level, words

• After observing inputs 𝑤0…𝑤𝑘 it predicts 𝑤𝑘+1

[Figure: recurrent network with initial state h-1 reading 𝑤0…𝑤6 and predicting 𝑤1…𝑤7]

Page 8:

Language modelling using RNNs

• Problem: Given a sequence of words (or characters) predict the next one

Four score and seven years ???

A B R A H A M L I N C O L ??

8

Page 9:

Training

• Input: symbols as one-hot vectors

– Dimensionality of the vector is the size of the “vocabulary”

– Projected down to lower-dimensional “embeddings”

• Output: Probability distribution over symbols

$Y(t, i) = P(V_i \mid w_0 \ldots w_{t-1})$

– $V_i$ is the i-th symbol in the vocabulary

• Divergence

$Div\big(\mathbf{Y}_{target}(1 \ldots T), \mathbf{Y}(1 \ldots T)\big) = \sum_t Xent\big(\mathbf{Y}_{target}(t), \mathbf{Y}(t)\big) = -\sum_t \log Y(t, w_{t+1})$

[Figure: RNN language model; the divergence at each step is computed from the probability assigned to the correct next word]
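A minimal sketch (my own illustration, not from the slides) of this training setup in PyTorch; the vocabulary size, embedding width and hidden width are made up. The symbols are embedded, run through an LSTM, projected to a distribution over the vocabulary, and the divergence is the cross-entropy of the correct next word.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only
VOCAB, EMBED, HIDDEN = 1000, 64, 128

embed = nn.Embedding(VOCAB, EMBED)          # one-hot index -> low-dimensional embedding
rnn   = nn.LSTM(EMBED, HIDDEN, batch_first=True)
proj  = nn.Linear(HIDDEN, VOCAB)            # hidden state -> scores over the vocabulary
xent  = nn.CrossEntropyLoss()               # -log Y(t, w_{t+1}), averaged over t

# A toy batch: w_0 ... w_6 as inputs, w_1 ... w_7 as targets
words   = torch.randint(0, VOCAB, (1, 8))   # (batch=1, length=8)
inputs  = words[:, :-1]                     # w_0 ... w_6
targets = words[:, 1:]                      # w_1 ... w_7

hidden_states, _ = rnn(embed(inputs))       # (1, 7, HIDDEN)
logits = proj(hidden_states)                # (1, 7, VOCAB); softmax of this is Y(t, .)

loss = xent(logits.reshape(-1, VOCAB), targets.reshape(-1))
loss.backward()                             # backpropagation through time
```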

Page 10:

Generating Language: The model

• The hidden units are (one or more layers of) LSTM units

• Trained via backpropagation from a lot of text

[Figure: LSTM language model unrolled over words 𝑊1…𝑊9, emitting a distribution 𝑃 over the vocabulary at every step]

Page 11:

Generating Language: Synthesis

• On the trained model: Provide the first few words

– One-hot vectors

• After the last input word, the network generates a probability distribution over words

– Outputs an N-valued probability distribution rather than a one-hot vector

[Figure: after the input words, the network outputs an N-valued distribution $y_4^1, y_4^2, \ldots, y_4^N$]

$y_t^i = P(W_t = V_i \mid W_1 \ldots W_{t-1})$ — the probability that the t-th word in the sequence is the i-th word in the vocabulary, given all previous t−1 words

Page 12:

Generating Language: Synthesis

• On the trained model: Provide the first few words

– One-hot vectors

• After the last input word, the network generates a probability distribution over words

– Outputs an N-valued probability distribution rather than a one-hot vector

• Draw a word from the distribution

– And set it as the next word in the series

[Figure: the drawn word 𝑊4 is set as the next word in the series]

Page 13:

Generating Language: Synthesis

• Feed the drawn word as the next word in the series

– And draw the next word from the output probability distribution

• Continue this process until we terminate generation

– In some cases, e.g. generating programs, there may be a natural termination

[Figure: the drawn word is fed back, and the next word is drawn from the new output distribution]

Page 14:

Generating Language: Synthesis

• Feed the drawn word as the next word in the series

– And draw the next word from the output probability distribution

• Continue this process until we terminate generation

– In some cases, e.g. generating programs, there may be a natural termination

[Figure: 𝑊5 is drawn and fed back; generation continues]

Page 15:

Generating Language: Synthesis

• Feed the drawn word as the next word in the series

– And draw the next word from the output probability distribution

• Continue this process until we terminate generation

– For text generation we will usually end at an <eos> (end-of-sequence) symbol

• The <eos> symbol is a special symbol included in the vocabulary, which indicates the termination of a sequence and occurs only at the final position of sequences

[Figure: generation proceeds word by word until <eos> is produced]
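A sketch of this synthesis loop, reusing the hypothetical `embed`, `rnn` and `proj` modules from the earlier sketch (the <eos> index and priming words are made up): after the priming words, each drawn word is fed back as the next input until <eos> is drawn.

```python
import torch

EOS = 0                                    # assume index 0 is the <eos> symbol
prime = torch.tensor([[5, 17, 42]])        # the first few words, as indices (hypothetical)

generated = []
with torch.no_grad():
    out, state = rnn(embed(prime))                         # read the priming words
    for _ in range(50):                                    # safety cap on length
        probs = torch.softmax(proj(out[:, -1]), dim=-1)    # distribution over the next word
        word = torch.multinomial(probs, num_samples=1)     # draw a word from the distribution
        if word.item() == EOS:
            break
        generated.append(word.item())
        out, state = rnn(embed(word), state)               # feed the drawn word back
```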

Page 16:

Returning to our problem

• Problem:

– A sequence 𝑋1…𝑋𝑁 goes in

– A different sequence 𝑌1…𝑌𝑀 comes out

• Similar to predicting text, but with a difference

– The output is in a different language..

16

[Figure: Seq2seq maps “I ate an apple” to “Ich habe einen apfel gegessen”]

Page 17:

Modelling the problem

• Delayed sequence to sequence

17

Page 18:

Modelling the problem

• Delayed sequence to sequence

– Delayed self-referencing sequence-to-sequence

Page 19:

The “simple” translation model

• The input sequence feeds into a recurrent structure

• The input sequence is terminated by an explicit <eos> symbol

– The hidden activation at the <eos> “stores” all information about the sentence

• Subsequently a second RNN uses the hidden activation as initial state to produce a sequence of outputs

– The output at each time becomes the input at the next time

– Output production continues until an <eos> is produced

[Figure: the encoder RNN reading “I ate an apple <eos>”]

Page 20:

The “simple” translation model

• The input sequence feeds into a recurrent structure

• The input sequence is terminated by an explicit <eos> symbol

– The hidden activation at the <eos> “stores” all information about the sentence

• Subsequently a second RNN uses the hidden activation as initial state to produce a sequence of outputs

– The output at each time becomes the input at the next time

– Output production continues until an <eos> is produced

[Figure: the full input has been read; decoding is about to begin]

Page 21:

The “simple” translation model

• The input sequence feeds into a recurrent structure

• The input sequence is terminated by an explicit <eos> symbol

– The hidden activation at the <eos> “stores” all information about the sentence

• Subsequently a second RNN uses the hidden activation as initial state to produce a sequence of outputs

– The output at each time becomes the input at the next time

– Output production continues until an <eos> is produced

[Figure: the decoder produces the first output word, “Ich”]

Page 22:

The “simple” translation model

• The input sequence feeds into a recurrent structure

• The input sequence is terminated by an explicit <eos> symbol

– The hidden activation at the <eos> “stores” all information about the sentence

• Subsequently a second RNN uses the hidden activation as initial state to produce a sequence of outputs

– The output at each time becomes the input at the next time

– Output production continues until an <eos> is produced

[Figure: “Ich” is fed back and the decoder produces “habe”]

Page 23:

The “simple” translation model

• The input sequence feeds into a recurrent structure

• The input sequence is terminated by an explicit <eos> symbol

– The hidden activation at the <eos> “stores” all information about the sentence

• Subsequently a second RNN uses the hidden activation as initial state to produce a sequence of outputs

– The output at each time becomes the input at the next time

– Output production continues until an <eos> is produced

[Figure: the decoder has produced “Ich habe einen” so far]

Page 24:

The “simple” translation model

• The input sequence feeds into a recurrent structure

• The input sequence is terminated by an explicit <eos> symbol

– The hidden activation at the <eos> “stores” all information about the sentence

• Subsequently a second RNN uses the hidden activation as initial state to produce a sequence of outputs

– The output at each time becomes the input at the next time

– Output production continues until an <eos> is produced

[Figure: the complete output “Ich habe einen apfel gegessen <eos>” has been produced]

Page 25:

• We will illustrate with a single hidden layer, but the discussion generalizes to more layers

25

[Figure: the same encoder–decoder drawn with a single hidden layer and with multiple hidden layers]

Page 26:

The “simple” translation model

• The input sequence feeds into a recurrent structure

• The input sequence is terminated by an explicit <eos> symbol

– The hidden activation at the <eos> “stores” all information about the sentence

• Subsequently a second RNN uses the hidden activation as initial state to produce a sequence of outputs

– The output at each time becomes the input at the next time

– Output production continues until an <eos> is produced

[Figure: the input-side recurrence labeled ENCODER, the output-side recurrence labeled DECODER]
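A compact sketch of this encoder–decoder arrangement (my own illustration with made-up sizes; the separate SOS start token below is an assumption — the slides simply start the decoder from the state at the input's <eos>). One recurrent net consumes the source up to <eos>, its final state initializes a second recurrent net, and output words are drawn until <eos> is produced.

```python
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMBED, HIDDEN, EOS, SOS = 1000, 1200, 64, 128, 0, 1  # hypothetical

src_embed = nn.Embedding(SRC_VOCAB, EMBED)
tgt_embed = nn.Embedding(TGT_VOCAB, EMBED)
encoder   = nn.LSTM(EMBED, HIDDEN, batch_first=True)
decoder   = nn.LSTM(EMBED, HIDDEN, batch_first=True)
proj      = nn.Linear(HIDDEN, TGT_VOCAB)

def translate(src_ids, max_len=30):
    """src_ids: (1, N) source sentence ending in <eos>."""
    with torch.no_grad():
        _, state = encoder(src_embed(src_ids))      # hidden state at <eos> "stores" the sentence
        word = torch.tensor([[SOS]])                # decoder start token (an assumption)
        out_words = []
        for _ in range(max_len):
            dec_out, state = decoder(tgt_embed(word), state)
            probs = torch.softmax(proj(dec_out[:, -1]), dim=-1)
            word = torch.multinomial(probs, 1)      # draw the next output word
            if word.item() == EOS:
                break
            out_words.append(word.item())
        return out_words

print(translate(torch.randint(2, SRC_VOCAB, (1, 5))))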

Page 27:

The “simple” translation model

• A more detailed look: The one-hot word representations may be compressed via embeddings

– Embeddings will be learned along with the rest of the net

– In the following slides we will not represent the projection matrices

27

[Figure: projection matrices 𝑃1 (input side) and 𝑃2 (output side) compress the one-hot words into embeddings]

Page 28:

What the network actually produces

• At each time 𝑘 the network actually produces a probability distribution over the output vocabulary

– $y_k^w = P(O_k = w \mid O_{k-1}, \ldots, O_1, I_1, \ldots, I_N)$

– The probability given the entire input sequence 𝐼1, … , 𝐼𝑁 and the partial output sequence 𝑂1, … , 𝑂𝑘−1 until 𝑘

• At each time a word is drawn from the output distribution

• The drawn word is provided as input to the next time

[Figure: at t=0 the decoder outputs a distribution ($y_0^{ich}, y_0^{bier}, y_0^{apfel}, y_0^{<eos>}, \ldots$)]

Page 29:

What the network actually produces

• At each time 𝑘 the network actually produces a probability distribution over the output vocabulary

– $y_k^w = P(O_k = w \mid O_{k-1}, \ldots, O_1, I_1, \ldots, I_N)$

– The probability given the entire input sequence 𝐼1, … , 𝐼𝑁 and the partial output sequence 𝑂1, … , 𝑂𝑘−1 until 𝑘

• At each time a word is drawn from the output distribution

• The drawn word is provided as input to the next time

[Figure: “Ich” is drawn from the t=0 distribution]

Page 30:

What the network actually produces

• At each time 𝑘 the network actually produces a probability distribution over the output vocabulary

– $y_k^w = P(O_k = w \mid O_{k-1}, \ldots, O_1, I_1, \ldots, I_N)$

– The probability given the entire input sequence 𝐼1, … , 𝐼𝑁 and the partial output sequence 𝑂1, … , 𝑂𝑘−1 until 𝑘

• At each time a word is drawn from the output distribution

• The drawn word is provided as input to the next time

[Figure: the drawn word “Ich” becomes the next decoder input]

Page 31:

What the network actually produces

• At each time 𝑘 the network actually produces a probability distribution over the output vocabulary

– $y_k^w = P(O_k = w \mid O_{k-1}, \ldots, O_1, I_1, \ldots, I_N)$

– The probability given the entire input sequence 𝐼1, … , 𝐼𝑁 and the partial output sequence 𝑂1, … , 𝑂𝑘−1 until 𝑘

• At each time a word is drawn from the output distribution

• The drawn word is provided as input to the next time

[Figure: at t=1 the decoder outputs a new distribution ($y_1^{ich}, y_1^{bier}, y_1^{apfel}, y_1^{<eos>}, \ldots$)]

Page 32:

What the network actually produces

• At each time 𝑘 the network actually produces a probability distribution over the output vocabulary

– $y_k^w = P(O_k = w \mid O_{k-1}, \ldots, O_1, I_1, \ldots, I_N)$

– The probability given the entire input sequence 𝐼1, … , 𝐼𝑁 and the partial output sequence 𝑂1, … , 𝑂𝑘−1 until 𝑘

• At each time a word is drawn from the output distribution

• The drawn word is provided as input to the next time

[Figure: “habe” is drawn from the t=1 distribution]

Page 33:

What the network actually produces

• At each time 𝑘 the network actually produces a probability distribution over the output vocabulary

– $y_k^w = P(O_k = w \mid O_{k-1}, \ldots, O_1, I_1, \ldots, I_N)$

– The probability given the entire input sequence 𝐼1, … , 𝐼𝑁 and the partial output sequence 𝑂1, … , 𝑂𝑘−1 until 𝑘

• At each time a word is drawn from the output distribution

• The drawn word is provided as input to the next time

[Figure: “habe” is fed back as the next decoder input]

Page 34:

What the network actually produces

• At each time 𝑘 the network actually produces a probability distribution over the output vocabulary

– $y_k^w = P(O_k = w \mid O_{k-1}, \ldots, O_1, I_1, \ldots, I_N)$

– The probability given the entire input sequence 𝐼1, … , 𝐼𝑁 and the partial output sequence 𝑂1, … , 𝑂𝑘−1 until 𝑘

• At each time a word is drawn from the output distribution

• The drawn word is provided as input to the next time

[Figure: at t=2 the decoder outputs distribution $y_2$]

Page 35:

What the network actually produces

• At each time 𝑘 the network actually produces a probability distribution over the output vocabulary

– $y_k^w = P(O_k = w \mid O_{k-1}, \ldots, O_1, I_1, \ldots, I_N)$

– The probability given the entire input sequence 𝐼1, … , 𝐼𝑁 and the partial output sequence 𝑂1, … , 𝑂𝑘−1 until 𝑘

• At each time a word is drawn from the output distribution

• The drawn word is provided as input to the next time

[Figure: “einen” is drawn at t=2 and fed back]

Page 36:

What the network actually produces

• At each time 𝑘 the network actually produces a probability distribution over the output vocabulary

– $y_k^w = P(O_k = w \mid O_{k-1}, \ldots, O_1, I_1, \ldots, I_N)$

– The probability given the entire input sequence 𝐼1, … , 𝐼𝑁 and the partial output sequence 𝑂1, … , 𝑂𝑘−1 until 𝑘

• At each time a word is drawn from the output distribution

• The drawn word is provided as input to the next time

[Figure: the process continues until the full output “Ich habe einen apfel gegessen <eos>” has been produced]

Page 37:

Generating an output from the net

• At each time the network produces a probability distribution over words, given the entire input and previous outputs

• At each time a word is drawn from the output distribution

• The drawn word is provided as input to the next time

• The process continues until an <eos> is generated

[Figure: the complete decoding of “Ich habe einen apfel gegessen <eos>” from “I ate an apple <eos>”]

Page 38:

The probability of the output

$P(O_1, \ldots, O_L \mid W_1^{in}, \ldots, W_N^{in}) = y_1^{O_1} y_2^{O_2} \cdots y_L^{O_L}$

• The objective of drawing: Produce the most likely output (that ends in an <eos>)

$\underset{O_1, \ldots, O_L}{\operatorname{argmax}}\ y_1^{O_1} y_2^{O_2} \cdots y_L^{O_L}$

38

[Figure: the decoder unrolled, showing the output distributions $y_0 \ldots y_5$ at each step]
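Under the same assumptions as the earlier translation sketch (reusing its hypothetical modules and the made-up SOS symbol), the probability of one candidate output is the product of the per-step probabilities of its words; a sketch that accumulates it in log space:

```python
import torch

def sequence_log_prob(src_ids, out_ids):
    """log P(O_1..O_L | input) = sum_t log y_t^{O_t}; out_ids is a list ending in <eos>."""
    with torch.no_grad():
        _, state = encoder(src_embed(src_ids))
        word = torch.tensor([[SOS]])                 # hypothetical start symbol
        total = 0.0
        for target in out_ids:
            dec_out, state = decoder(tgt_embed(word), state)
            log_probs = torch.log_softmax(proj(dec_out[:, -1]), dim=-1)
            total += log_probs[0, target].item()     # log y_t^{O_t}
            word = torch.tensor([[target]])          # condition on the given O_t
        return total
```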

Page 39:

The probability of the output

• Cannot just pick the most likely symbol at each time

– That may cause the distribution to be more “confused” at the next time

– Choosing a different, less likely word could cause the distribution at the next time to be more peaky, resulting in a more likely output overall

39

[Figure: the same unrolled decoder; objective: $\underset{O_1, \ldots, O_L}{\operatorname{argmax}}\ y_1^{O_1} y_2^{O_2} \cdots y_L^{O_L}$]

Page 40:

Greedy is not good

• Hypothetical example (from English speech recognition: input is speech, output must be text)

• “Nose” has the highest probability at t=2 and is selected

– The model is very confused at t=3 and assigns low probabilities to many words at the next time

– Selecting any of these will result in low probability for the entire 3-word sequence

• “Knows” has slightly lower probability than “nose”, but is still high and is selected

– “he knows” is a reasonable beginning and the model assigns high probabilities to words such as “something”

– Selecting one of these results in higher overall probability for the 3-word sequence

[Figure: two bar charts of $P(O_3 \mid O_1, O_2, I_1, \ldots, I_N)$ over the vocabulary, after choosing “nose” vs. “knows”]

Page 41:

Greedy is not good

• Problem: Impossible to know a priori which word leads to the more promising future

– Should we draw “nose” or “knows”?

– Effect may not be obvious until several words down the line

– Or the choice of the wrong word early may cumulatively lead to a poorer overall score over time

41

[Figure: bar chart of $P(O_2 \mid O_1, I_1, \ldots, I_N)$ — “What should we have chosen at t=2?? Will selecting ‘nose’ continue to have a bad effect into the distant future?”]

Page 42:

Greedy is not good

• Problem: Impossible to know a priori which word leads to the more promising future

– Even earlier: Choosing the lower probability “the” instead of “he” at T=0 may have made a choice of “nose” more reasonable at T=1..

• In general, making a poor choice at any time commits us to a poor future

– But we cannot know at that time the choice was poor

• Solution: Don’t choose…

[Figure: bar chart of $P(O_1 \mid I_1, \ldots, I_N)$ at T=0 — “What should we have chosen at t=1?? Choose ‘the’ or ‘he’?”]

Page 43:

Solution: Multiple choices

• Retain both choices and fork the network

– With every possible word as input

[Figure: the network forked with every candidate first word: I, He, We, The]

Page 44:

Problem: Multiple choices

• Problem: This will blow up very quickly

– For an output vocabulary of size 𝑉, after 𝑇 output steps we’d have forked out $V^T$ branches

44

[Figure: the tree of forks grows exponentially]

Page 45:

Solution: Prune

• Solution: Prune

– At each time, retain only the top K scoring forks

[Figure: candidate first words, keeping $\mathrm{TopK}\ P(O_1 \mid I_1, \ldots, I_N)$]

Page 46:

Solution: Prune

• Solution: Prune

– At each time, retain only the top K scoring forks

[Figure: only the top-K first words (“He”, “The”) are retained]

Page 47:

Solution: Prune

• Solution: Prune

– At each time, retain only the top K scoring forks

[Figure: each retained fork is extended by one word; retain $\mathrm{TopK}\ P(O_2, O_1 \mid I_1, \ldots, I_N)$ — note: based on the product $P(O_2 \mid O_1, I_1, \ldots, I_N)\, P(O_1 \mid I_1, \ldots, I_N)$]

Page 48:

Solution: Prune

• Solution: Prune

– At each time, retain only the top K scoring forks

[Figure: the same step, pruned again to the top-K two-word prefixes]

Page 49:

Solution: Prune

• Solution: Prune

– At each time, retain only the top K scoring forks

[Figure: forks extended by another word; retain the top K by the running product of conditional probabilities]

Page 50:

Solution: Prune

• Solution: Prune

– At each time, retain only the top K scoring forks

[Figure: pruned again to the top-K extended prefixes]

Page 51:

Solution: Prune

• Solution: Prune

– At each time, retain only the top K scoring forks

[Figure: each surviving path is scored by $\mathrm{TopK}\ \prod_{t=1}^{n} P(O_t \mid O_1, \ldots, O_{t-1}, I_1, \ldots, I_N)$]

Page 52:

Terminate

• Terminate

– When the current most likely path overall ends in <eos>

• Or continue producing more outputs (each of which terminates in <eos>) to get N-best outputs

[Figure: beam paths; one path has ended in <eos>]

Page 53:

Termination: <eos>

• Terminate

– Paths cannot continue once they output an <eos>

• So paths may be of different lengths

– Select the most likely sequence ending in <eos> across all terminating sequences

[Figure: beam search with K = 2; several paths terminate in <eos>]
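A small beam-search sketch along these lines (my own simplification, reusing the hypothetical decoder pieces from the earlier sketches): every surviving path is extended, paths are rescored by the running sum of log conditional probabilities, only the top K forks are kept, and paths that emit <eos> are set aside as complete.

```python
import torch

def beam_search(src_ids, K=2, max_len=30):
    with torch.no_grad():
        _, state = encoder(src_embed(src_ids))
        # each beam entry: (log probability so far, word list, decoder state)
        beams, finished = [(0.0, [SOS], state)], []
        for _ in range(max_len):
            candidates = []
            for logp, words, st in beams:
                inp = torch.tensor([[words[-1]]])
                out, new_st = decoder(tgt_embed(inp), st)
                log_probs = torch.log_softmax(proj(out[:, -1]), dim=-1)[0]
                top_lp, top_ix = log_probs.topk(K)          # only K extensions per fork can survive
                for lp, ix in zip(top_lp.tolist(), top_ix.tolist()):
                    candidates.append((logp + lp, words + [ix], new_st))
            candidates.sort(key=lambda c: c[0], reverse=True)
            beams = []
            for cand in candidates[:K]:                     # prune: keep the top K scoring forks
                (finished if cand[1][-1] == EOS else beams).append(cand)
            if not beams:
                break
        finished += beams                                   # unterminated leftovers, if any
        return max(finished, key=lambda c: c[0])[1]         # most likely path ending in <eos>
```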

Page 54:

Training the system

• Must learn to make predictions appropriately

– Given “I ate an apple <eos>”, produce “Ich habe einen apfel gegessen <eos>”.

54

[Figure: the encoder–decoder with the source and target sentence pair]

Page 55:

Training: Forward pass

• Forward pass: Input the source and target sequences, sequentially

– Output will be a probability distribution over target symbol set (vocabulary)

55

[Figure: forward pass — source and target fed in; outputs 𝑌0…𝑌5 produced]

Page 56:

Training: Backward pass

• Backward pass: Compute the divergence

between the output distribution and target word

sequence

56

[Figure: a divergence (Div) is computed at every output step against the target words]

Page 57:

Training: Backward pass

• Backward pass: Compute the divergence between the output distribution and target word sequence

• Backpropagate the derivatives of the divergence through the network to learn the net

57

[Figure: the per-step divergences are backpropagated through the decoder and encoder]
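A sketch of one such training step, with assumptions carried over from the earlier translation sketch (hypothetical modules, a made-up SOS token, SGD with an arbitrary learning rate): source and target are both fed in the forward pass, the divergence is the cross-entropy at every output position, and its gradients are backpropagated.

```python
import torch
import torch.nn as nn

xent = nn.CrossEntropyLoss()
params = (list(src_embed.parameters()) + list(tgt_embed.parameters()) +
          list(encoder.parameters()) + list(decoder.parameters()) + list(proj.parameters()))
opt = torch.optim.SGD(params, lr=0.1)

def train_step(src_ids, tgt_ids):
    """src_ids: (1, N) ending in <eos>; tgt_ids: (1, M) ending in <eos>."""
    _, state = encoder(src_embed(src_ids))                               # forward pass: source
    dec_in = torch.cat([torch.tensor([[SOS]]), tgt_ids[:, :-1]], dim=1)  # shifted target as decoder input
    dec_out, _ = decoder(tgt_embed(dec_in), state)                       # forward pass: target
    logits = proj(dec_out)                                               # (1, M, TGT_VOCAB)
    loss = xent(logits.reshape(-1, logits.size(-1)), tgt_ids.reshape(-1))
    opt.zero_grad()
    loss.backward()                                                      # backward pass
    opt.step()
    return loss.item()
```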

Page 58:

Training: Backward pass

• In practice, if we apply SGD, we may randomly sample words from the output to actually use for the backprop and update

– Typical usage: Randomly select one word from each input training instance (comprising an input–output pair)

• For each iteration:

– Randomly select a training instance: (input, output)

– Forward pass

– Randomly select a single output y(t) and corresponding desired output d(t) for backprop

[Figure: the same training setup; only one randomly chosen output step contributes to the update]
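A sketch of this sampled variant under the same assumptions as the previous sketch: run the full forward pass, but backpropagate the divergence from only one randomly chosen output position of the training pair.

```python
import torch
import torch.nn.functional as F

def train_step_single_position(src_ids, tgt_ids):
    _, state = encoder(src_embed(src_ids))
    dec_in = torch.cat([torch.tensor([[SOS]]), tgt_ids[:, :-1]], dim=1)
    dec_out, _ = decoder(tgt_embed(dec_in), state)
    logits = proj(dec_out)                                  # (1, M, TGT_VOCAB)

    t = torch.randint(0, tgt_ids.size(1), (1,)).item()      # pick one output step at random
    loss = F.cross_entropy(logits[:, t], tgt_ids[:, t])     # divergence at that step only
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```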

Page 59:

Overall training

• Given several training instances (𝐗, 𝐃)

• Forward pass: Compute the output of the network for (𝐗, 𝐃)

– Note, both 𝐗 and 𝐃 are used in the forward pass

• Backward pass: Compute the divergence between the desired target 𝐃 and the actual output 𝐘

– Propagate derivatives of divergence for updates

59

Page 60:

Trick of the trade: Reversing the input

• Standard trick of the trade: The input

sequence is fed in reverse order

– Things work better this way

60

[Figure: the same training setup with the source sentence fed to the encoder in reverse order]

Page 61:

Trick of the trade: Reversing the input

• Standard trick of the trade: The input

sequence is fed in reverse order

– Things work better this way

61

[Figure: reversed-input training with the divergence computed at each output]

Page 62:

Trick of the trade: Reversing the input

• Standard trick of the trade: The input sequence is fed in reverse order

– Things work better this way

• This happens both for training and during actual decode

62

[Figure: decoding with the source fed in reverse order]

Page 63:

Overall training

• Given several training instances (𝐗, 𝐃)

• Forward pass: Compute the output of the network for (𝐗, 𝐃) with input in reverse order

– Note, both 𝐗 and 𝐃 are used in the forward pass

• Backward pass: Compute the divergence between the desired target 𝐃 and the actual output 𝐘

– Propagate derivatives of divergence for updates

63

Page 64:

Applications

• Machine translation

– “My name is Tom” → “Ich heisse Tom” / “Mein Name ist Tom”

• Automatic speech recognition

– Speech recording → “My name is Tom”

• Dialog

– “I have a problem” → “How may I help you”

• Image to text

– Picture → Caption for picture

64

Page 65:

Machine Translation Example

• Hidden state clusters by meaning!

– From “Sequence-to-sequence learning with neural networks”, Sutskever, Vinyals and Le

65

Page 66:

Machine Translation Example

• Examples of translation

– From “Sequence-to-sequence learning with neural networks”, Sutskever, Vinyals and Le

Page 67:

Human Machine Conversation: Example

• From “A neural conversational model”, Oriol Vinyals and Quoc Le

• Trained on human–human conversations

• Task: Human text in, machine response out

Page 68:

Generating Image Captions

• Not really a seq-to-seq problem, more an image-to-sequence problem

• Initial state is produced by a state-of-the-art CNN-based image classification system

– Subsequent model is just the decoder end of a seq-to-seq model

• “Show and Tell: A Neural Image Caption Generator”, O. Vinyals, A. Toshev, S. Bengio, D. Erhan

[Figure: a CNN processes the image and feeds the caption decoder]

Page 69:

Generating Image Captions

• Decoding: Given image

– Process it with CNN to get output of classification layer

– Sequentially generate words by drawing from the conditional output distribution 𝑃(𝑊𝑡|𝑊0𝑊1…𝑊𝑡−1, 𝐼𝑚𝑎𝑔𝑒)

– In practice, we can perform the beam search explained earlier
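A sketch of this decoding scheme with stand-in modules (the tiny CNN, the sizes and the start symbol below are placeholders of my own, not the Show-and-Tell architecture): the image feature initializes the decoder state, and words are then drawn exactly as in the seq2seq decoder.

```python
import torch
import torch.nn as nn

FEAT, HIDDEN, CAP_VOCAB, EMBED, SOS, EOS = 512, 128, 1000, 64, 1, 0   # hypothetical sizes

cnn       = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2), nn.AdaptiveAvgPool2d(1),
                          nn.Flatten(), nn.Linear(8, FEAT))           # stand-in for a real CNN
img_proj  = nn.Linear(FEAT, HIDDEN)                                   # image feature -> initial state
cap_embed = nn.Embedding(CAP_VOCAB, EMBED)
cap_rnn   = nn.LSTM(EMBED, HIDDEN, batch_first=True)
cap_proj  = nn.Linear(HIDDEN, CAP_VOCAB)

def caption(image, max_len=20):
    with torch.no_grad():
        h0 = torch.tanh(img_proj(cnn(image))).unsqueeze(0)   # (1, 1, HIDDEN)
        state = (h0, torch.zeros_like(h0))
        word, words = torch.tensor([[SOS]]), []
        for _ in range(max_len):
            out, state = cap_rnn(cap_embed(word), state)
            probs = torch.softmax(cap_proj(out[:, -1]), dim=-1)
            word = torch.multinomial(probs, 1)               # or use the earlier beam search
            if word.item() == EOS:
                break
            words.append(word.item())
        return words

print(caption(torch.randn(1, 3, 64, 64)))
```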

Page 70:

Generating Image Captions

• Decoding: Given image

– Process it with CNN to get output of classification layer

– Sequentially generate words by drawing from the conditional output distribution 𝑃(𝑊𝑡|𝑊0𝑊1…𝑊𝑡−1, 𝐼𝑚𝑎𝑔𝑒)

– In practice, we can perform the beam search explained earlier

[Figure: the decoder outputs a distribution ($y_0^{a}, y_0^{boy}, y_0^{cat}, \ldots$) and “A” is drawn]

Page 71:

Generating Image Captions

• Decoding: Given image

– Process it with CNN to get output of classification layer

– Sequentially generate words by drawing from the conditional output distribution 𝑃(𝑊𝑡|𝑊0𝑊1…𝑊𝑡−1, 𝐼𝑚𝑎𝑔𝑒)

– In practice, we can perform the beam search explained earlier

[Figure: “A” is fed back and “boy” is drawn next]

Page 72:

Generating Image Captions

• Decoding: Given image

– Process it with CNN to get output of classification layer

– Sequentially generate words by drawing from the conditional output distribution 𝑃(𝑊𝑡|𝑊0𝑊1…𝑊𝑡−1, 𝐼𝑚𝑎𝑔𝑒)

– In practice, we can perform the beam search explained earlier

[Figure: “A boy” so far; “on” is drawn next]

Page 73:

Generating Image Captions

• Decoding: Given image

– Process it with CNN to get output of classification layer

– Sequentially generate words by drawing from the conditional output distribution 𝑃(𝑊𝑡|𝑊0𝑊1…𝑊𝑡−1, 𝐼𝑚𝑎𝑔𝑒)

– In practice, we can perform the beam search explained earlier

[Figure: “A boy on” so far; “a” is drawn next]

Page 74:

Generating Image Captions

• Decoding: Given image

– Process it with CNN to get output of classification layer

– Sequentially generate words by drawing from the conditional output distribution 𝑃(𝑊𝑡|𝑊0𝑊1…𝑊𝑡−1, 𝐼𝑚𝑎𝑔𝑒)

– In practice, we can perform the beam search explained earlier

[Figure: “A boy on a” so far; “surfboard” is drawn next]

Page 75:

Generating Image Captions

• Decoding: Given image

– Process it with CNN to get output of classification layer

– Sequentially generate words by drawing from the conditional output distribution 𝑃(𝑊𝑡|𝑊0𝑊1…𝑊𝑡−1, 𝐼𝑚𝑎𝑔𝑒)

– In practice, we can perform the beam search explained earlier

[Figure: the caption completes as “A boy on a surfboard <eos>”]

Page 76:

Training

• Training: Given several (Image, Caption) pairs

– The image network is pretrained on a large corpus, e.g. ImageNet

• Forward pass: Produce output distributions given the image and caption

• Backward pass: Compute the divergence w.r.t. the training caption, and backpropagate derivatives

– All components of the network, including the final classification layer of the image classification net, are updated

– The CNN portions of the image classifier are not modified (transfer learning)

[Figure: the CNN image encoder feeding the caption decoder]

Page 77:

• Training: Given several (Image, Caption) pairs

– The image network is pretrained on a large corpus, e.g. ImageNet

• Forward pass: Produce output distributions given the image and caption

• Backward pass: Compute the divergence w.r.t. the training caption, and backpropagate derivatives

– All components of the network, including the final classification layer of the image classification net, are updated

– The CNN portions of the image classifier are not modified (transfer learning)

[Figure: forward pass producing distributions over the caption words]

Page 78:

• Training: Given several (Image, Caption) pairs

– The image network is pretrained on a large corpus, e.g. ImageNet

• Forward pass: Produce output distributions given the image and caption

• Backward pass: Compute the divergence w.r.t. the training caption, and backpropagate derivatives

– All components of the network, including the final classification layer of the image classification net, are updated

– The CNN portions of the image classifier are not modified (transfer learning)

[Figure: a divergence (Div) at each caption position, backpropagated through the network]

Page 79:

Examples from Vinyals et al.

79

Page 80:

Translating Videos to Natural Language Using Deep Recurrent Neural Networks

“Translating Videos to Natural Language Using Deep Recurrent Neural Networks”, Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko. North American Chapter of the Association for Computational Linguistics, Denver, Colorado, June 2015.

80

Page 81:

A problem with this framework

• All the information about the input sequence is

embedded into a single vector

– The “hidden” node layer at the end of the input sequence

– This one node is “overloaded” with information

• Particularly if the input is long

81

[Figure: the encoder–decoder; all information about the source is carried by the single hidden state at <eos>]

Page 82:

A problem with this framework

• In reality: All hidden values carry information

– Some of which may be diluted downstream

82

[Figure: the encoder hidden state at every input word]

Page 83:

A problem with this framework

• In reality: All hidden values carry information

– Some of which may be diluted downstream

• Different outputs are related to different inputs

– Recall input and output may not be in sequence

[Figure: different output words relate to different input words]

Page 84:

A problem with this framework

• In reality: All hidden values carry information

– Some of which may be diluted downstream

• Different outputs are related to different inputs

– Recall input and output may not be in sequence

– Have no way of knowing a priori which input must connect to what output

• Connecting everything to everything is infeasible

– Variable-sized inputs and outputs

– Overparametrized

– Connection pattern ignores the actual asynchronous dependence of output on input

[Figure: connecting every input position to every output position]

Page 85:

Solution: Attention models

• Separating the encoder and decoder in the illustration

[Figure: encoder states $h_{-1}, h_0, \ldots, h_3$ and initial decoder state $s_{-1}$ drawn separately]

Page 86:

Solution: Attention models

• Compute a weighted combination of all the hidden outputs into a single vector

– Weights vary by output time

[Figure: at each output time t the decoder receives the weighted combination $\sum_i w_i(t)\, h_i$]

Page 87:

Solution: Attention models

• Compute a weighted combination of all the hidden outputs into a single vector

– Weights vary by output time

[Figure: input to the hidden decoder layer at time t is $\sum_i w_i(t)\, h_i$; the weights $w_i(t)$ are scalars and vary with output time]

Page 88:

Solution: Attention models

• Require a time-varying weight that specifies relationship of output time to input time

– Weights are functions of the current output state

[Figure: $w_i(t) = a(h_i, s_{t-1})$; input to the hidden decoder layer is $\sum_i w_i(t)\, h_i$]

Page 89:

Attention models

• The weights are a distribution over the input

– Must automatically highlight the most important input components for any output

[Figure: $w_i(t) = a(h_i, s_{t-1})$; the weights over the input positions sum to 1.0]

Page 90:

Attention models

• “Raw” weight at any time: A function 𝑔() that works on the two hidden states

• Actual weight: softmax over raw weights

$e_i(t) = g(h_i, s_{t-1}), \quad w_i(t) = \frac{\exp(e_i(t))}{\sum_j \exp(e_j(t))}$

[Figure: the softmaxed weights feed the decoder as $\sum_i w_i(t)\, h_i$]

Page 91:

Attention models

• Typical options for 𝑔()…

– Variables in red are to be learned

$e_i(t) = g(h_i, s_{t-1}), \quad w_i(t) = \frac{\exp(e_i(t))}{\sum_j \exp(e_j(t))}$

$g(h_i, s_{t-1}) = h_i^T s_{t-1}$

$g(h_i, s_{t-1}) = h_i^T W_g s_{t-1}$

$g(h_i, s_{t-1}) = v_g^T \tanh\!\left(W_g \begin{bmatrix} h_i \\ s_{t-1} \end{bmatrix}\right)$

[Figure: the encoder–decoder with attention, as on the previous slides]
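The three options can be written directly as small functions (a sketch with made-up sizes; in a real model $W_g$, $v_g$ and the MLP weights would be learned parameters):

```python
import torch

H, S = 128, 128                       # hypothetical encoder / decoder state sizes
W_g = torch.randn(H, S)               # learnable in a real model
v_g = torch.randn(64)
W_m = torch.randn(64, H + S)

def g_dot(h_i, s_prev):               # g(h_i, s_{t-1}) = h_i^T s_{t-1}
    return torch.dot(h_i, s_prev)

def g_bilinear(h_i, s_prev):          # g(h_i, s_{t-1}) = h_i^T W_g s_{t-1}
    return torch.dot(h_i, W_g @ s_prev)

def g_mlp(h_i, s_prev):               # g(h_i, s_{t-1}) = v_g^T tanh(W [h_i; s_{t-1}])
    return torch.dot(v_g, torch.tanh(W_m @ torch.cat([h_i, s_prev])))

# attention weights over all encoder states: softmax of the raw scores
def attention_weights(hs, s_prev, score=g_bilinear):
    e = torch.stack([score(h, s_prev) for h in hs])          # e_i(t)
    return torch.softmax(e, dim=0)                           # w_i(t), sums to 1

hs = torch.randn(5, H)                                        # h_0 ... h_4
w = attention_weights(hs, torch.randn(S))
context = (w.unsqueeze(1) * hs).sum(dim=0)                    # sum_i w_i(t) h_i
```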

Page 92:

Converting an input (forward pass)

• Pass the input through the encoder to produce hidden representations $h_i$

[Figure: the encoder produces $h_{-1}, h_0, \ldots, h_4$ from “I ate an apple <eos>”]

Page 93:

• Compute weights for first output

94

[Figure: encoder states $h_i$ and the initial decoder state $s_{-1}$]

What is $s_{-1}$? Multiple options:

– Simplest: $s_{-1} = h_N$

– If $s$ and $h$ are different sizes: $s_{-1} = W_s h_N$, where $W_s$ is a learnable parameter

Page 94:

• Compute weights (for every $h_i$) for the first output

$e_i(0) = g(h_i, s_{-1}), \quad g(h_i, s_{-1}) = h_i^T W_g s_{-1}, \quad w_i(0) = \frac{\exp(e_i(0))}{\sum_j \exp(e_j(0))}$

[Figure: the weights $w_i(0)$ computed against every encoder state $h_i$]

Page 95:

• Compute weights (for every $h_i$) for the first output

• Compute the weighted combination of hidden values

$z_0 = \sum_i w_i(0)\, h_i, \quad w_i(0) = \frac{\exp(e_i(0))}{\sum_j \exp(e_j(0))}, \quad e_i(0) = g(h_i, s_{-1}) = h_i^T W_g s_{-1}$

Page 96:

• Produce the first output

– Will be a distribution over words

[Figure: from $z_0$ and $s_{-1}$ the decoder produces state $s_0$ and output distribution $Y_0$ ($y_0^{ich}, y_0^{du}, y_0^{hat}, \ldots$)]

Page 97:

• Produce the first output

– Will be a distribution over words

– Draw a word from the distribution

[Figure: “Ich” is drawn from $Y_0$]

Page 98:

• Compute the weights for all instances for time = 1

$e_i(1) = g(h_i, s_0) = h_i^T W_g s_0, \quad w_i(1) = \frac{\exp(e_i(1))}{\sum_j \exp(e_j(1))}$

[Figure: the drawn word “Ich” is fed to the decoder at t = 1]

Page 99:

• Compute the weighted sum of hidden input values at t = 1

$z_1 = \sum_i w_i(1)\, h_i$

[Figure: context $z_1$ formed from the encoder states]

Page 100:

• Compute the output at t = 1

– Will be a probability distribution over words

[Figure: $z_1$, $s_0$ and the fed-back “Ich” produce $s_1$ and distribution $Y_1$ ($y_1^{ich}, y_1^{du}, y_1^{hat}, \ldots$)]

Page 101:

• Draw a word from the output distribution at t = 1

[Figure: “habe” is drawn from $Y_1$]

Page 102:

• Compute the weights for all instances for time = 2

$e_i(2) = g(h_i, s_1) = h_i^T W_g s_1, \quad w_i(2) = \frac{\exp(e_i(2))}{\sum_j \exp(e_j(2))}$

[Figure: “habe” is fed back; the weights are recomputed against every $h_i$]

Page 103:

• Compute the weighted sum of hidden input values at t = 2

$z_2 = \sum_i w_i(2)\, h_i$

[Figure: context $z_2$ formed from the encoder states]

Page 104:

• Compute the output at t = 2

– Will be a probability distribution over words

[Figure: $z_2$, $s_1$ and the fed-back “habe” produce $s_2$ and distribution $Y_2$ ($y_2^{ich}, y_2^{du}, y_2^{hat}, \ldots$)]

Page 105:

• Draw a word from the distribution

[Figure: “einen” is drawn from $Y_2$]

Page 106:

• Compute the weights for all instances for time = 3

$e_i(3) = g(h_i, s_2), \quad w_i(3) = \frac{\exp(e_i(3))}{\sum_j \exp(e_j(3))}$

[Figure: “einen” is fed back; the weights are recomputed]

Page 107:

108

I ate an apple <eos>

𝒉0 𝒉1 𝒉2 𝒉3 𝒉4𝒉−1

𝒔−1

• Compute the weighted sum of hidden input

values at t=3

𝒛3 =

𝑖

𝑤𝑖 3 𝒉𝑖

𝒔0

𝒀0

𝒛0

Ich

𝒔1

Ich

𝒀1

habe

𝒛1

𝒔2

𝒀2

einen

𝒛2

habe

Page 108

• Compute the output at t=3
  – Will be a probability distribution over words
  – Draw a word from the distribution

[Figure: the decoder combines z_3 and the previous output "einen" to form state s_3; the word "apfel" is drawn from Y_3]

Page 109

• Continue the process until an end-of-sequence symbol is produced

[Figure: the fully unrolled decoder; states s_0 … s_5 with contexts z_0 … z_5 produce the outputs "Ich habe einen apfel gegessen <eos>"]
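Putting the per-step pieces together, a minimal inference loop might look like the sketch below. It reuses the illustrative attention_context and decoder_step functions from the earlier sketches, and the embedding matrix, the start-from-<eos> convention, and max_len cutoff are assumptions.

```python
import numpy as np

def decode(enc_h, s_init, embed, params, W_g, eos_id, max_len=50, rng=None):
    """Generate output words one at a time until <eos> (or max_len).

    embed : (V, d_e) word-embedding matrix for the output vocabulary (assumed)
    eos_id: index of the <eos> symbol
    Reuses attention_context() and decoder_step() from the sketches above.
    """
    rng = rng or np.random.default_rng(0)
    s, y_prev = s_init, embed[eos_id]        # assumed convention: start from an <eos>/<sos> embedding
    outputs = []
    for _ in range(max_len):
        w, z = attention_context(enc_h, s, W_g)       # attention over encoder states
        s, Y = decoder_step(s, y_prev, z, params)     # new state and word distribution
        word = int(rng.choice(len(Y), p=Y))           # draw a word (use argmax for greedy decoding)
        outputs.append(word)
        if word == eos_id:                            # stop once <eos> is produced
            break
        y_prev = embed[word]                          # feed the drawn word back in
    return outputs
```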

Page 110

[Figure: the same fully unrolled encoder and decoder as on the previous slide]

• As before, the objective of drawing: produce the most likely output (that ends in an <eos>)

  argmax_{O_1, …, O_L}  y_1^{O_1} · y_2^{O_2} ⋯ y_L^{O_L}

• Simply selecting the most likely symbol at each time may result in suboptimal output
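A tiny worked example of why greedy selection can be suboptimal, using made-up two-step probabilities over the words from the slides: greedy commits to "He" (p = 0.6), whose best continuation gives a product of only 0.30, while the sequence starting with "The" reaches 0.36.

```python
# Made-up two-step probabilities, for illustration only.
p_first = {"He": 0.6, "The": 0.4}
p_second = {"He":  {"Knows": 0.3, "Nose": 0.2, "<eos>": 0.5},
            "The": {"Nose": 0.9, "Knows": 0.05, "<eos>": 0.05}}

# Exhaustively score every two-word sequence by the product of its probabilities.
best = max(((w1, w2, p1 * p2)
            for w1, p1 in p_first.items()
            for w2, p2 in p_second[w1].items()),
           key=lambda c: c[2])
print(best)   # ('The', 'Nose', ~0.36) beats every sequence that starts with the greedy choice 'He' (<= 0.30)
```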

Page 111

Solution: Multiple choices

• Retain all choices and fork the network
  – With every possible word as input

[Figure: the decoder forks after the first step, one branch per candidate first word: "I", "He", "We", "The", …]

Page 112

To prevent blowup: Prune

• Prune
  – At each time, retain only the top K scoring forks

  TopK P(O_1 | I_1, …, I_N)

[Figure: only the K best-scoring first-word branches ("I", "He", "We", "The", …) are kept]


Page 114

Decoding

• At each time, retain only the top K scoring forks

  TopK P(O_2, O_1 | I_1, …, I_N)
    = TopK P(O_2 | O_1, I_1, …, I_N) · P(O_1 | I_1, …, I_N)

  Note: the score is based on the product of probabilities

[Figure: the surviving branches "He" and "The" are each extended with candidate second words ("I", "Knows", "Nose", …)]


Page 116

Decoding

• At each time, retain only the top K scoring forks

  TopK P(O_3 | O_2, O_1, I_1, …, I_N) · P(O_2 | O_1, I_1, …, I_N) · P(O_1 | I_1, …, I_N)

[Figure: the retained branches ("He Knows", "The Nose") are extended with candidate third words]


Page 118

Decoding

• At each time, retain only the top K scoring forks

  TopK ∏_{t=1}^{n} P(O_t | O_1, …, O_{t-1}, I_1, …, I_N)

[Figure: the beam after several steps, with only the K best-scoring partial sequences retained]
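A compact beam-search sketch of this pruning rule, scoring hypotheses by the sum of log-probabilities (the log of the product above). step_probs is an assumed callback that returns the next-word distribution for a given partial output, so this illustrates the search itself, not any particular model; the slides' running example corresponds to K = 2.

```python
import numpy as np

def beam_search(step_probs, vocab_size, eos_id, K=2, max_len=20):
    """At each time, keep only the top-K scoring partial output sequences.

    step_probs(prefix) -> length-vocab_size array of P(next word | prefix, input)
    """
    beam = [([], 0.0)]                       # (partial sequence, cumulative log-probability)
    finished = []                            # hypotheses that have produced <eos>
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            probs = step_probs(prefix)
            for w in range(vocab_size):      # fork with every possible next word
                candidates.append((prefix + [w], score + np.log(probs[w] + 1e-12)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for prefix, score in candidates:     # prune: keep only the top K open forks
            if prefix[-1] == eos_id:
                finished.append((prefix, score))   # paths cannot continue past <eos>
            else:
                beam.append((prefix, score))
            if len(beam) == K:
                break
        if not beam:
            break
    # Most likely sequence ending in <eos> (falls back to the best open path if none finished).
    return max(finished + beam, key=lambda c: c[1])
```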

Page 119

Terminate

• Terminate
  – When the current most likely path overall ends in <eos>
    • Or continue producing more outputs (each of which terminates in <eos>) to get N-best outputs

[Figure: the beam at a point where one path has produced <eos> while others are still being extended]

Page 120

Termination: <eos>

• Terminate
  – Paths cannot continue once they output an <eos>
    • So paths may have different lengths
  – Select the most likely sequence ending in <eos> across all terminating sequences

[Figure: both retained paths eventually end in <eos>; the example uses K = 2]

Page 121

What does the attention learn?

• The key component of this model is the attention weight
  – It captures the relative importance of each position in the input to the current output

  w_i(1) = exp(e_i(1)) / Σ_j exp(e_j(1))
  e_i(1) = g(h_i, s_0)
  g(h_i, s_0) = h_i^T W_g s_0

[Figure: the attention weights w_i(1) applied to the encoder states to form z_1]

Page 122

"Alignments" example: Bahdanau et al.

[Figure: plot of w_i(t), output step t against input position i; color shows the weight value (white is larger)]

• Note how the most important input words for any output word get automatically highlighted
• The general trend is somewhat linear because word order is roughly similar in both languages
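A minimal sketch of how such an alignment plot can be produced from a matrix of attention weights; the weight matrix here is made up, and matplotlib's imshow with a gray colormap gives the same white-is-larger convention.

```python
import numpy as np
import matplotlib.pyplot as plt

src = ["I", "ate", "an", "apple", "<eos>"]
tgt = ["Ich", "habe", "einen", "apfel", "gegessen", "<eos>"]

# Made-up attention weights w_i(t): rows = output steps t, columns = input positions i.
W = np.random.default_rng(3).dirichlet(np.ones(len(src)), size=len(tgt))

plt.imshow(W, cmap="gray", aspect="auto")   # white = large weight
plt.xticks(range(len(src)), src)
plt.yticks(range(len(tgt)), tgt)
plt.xlabel("input position i")
plt.ylabel("output step t")
plt.show()
```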

Page 123

Translation Examples

• Bahdanau et al. 2016

Page 124

Training the network

• We have seen how a trained network can be used to compute outputs
  – Convert one sequence to another

• Let's consider training

Page 125

• Given training input (source sequence, target sequence) pairs

• Forward pass: pass the source sequence through the encoder and the target sequence through the decoder
  – At each time the output is a probability distribution over words

[Figure: encoder over "I ate an apple <eos>"; decoder states s_0 … s_5 are fed the target words "Ich habe einen apfel gegessen", and each step produces a distribution (y_t^ich, y_t^du, y_t^hat, …) over the vocabulary]
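A sketch of this forward pass under the usual teacher-forcing convention (feeding the ground-truth previous target word rather than a drawn one, as the figure shows). It reuses the illustrative attention_context and decoder_step sketches from earlier, and the embedding matrix and start symbol are assumptions.

```python
def forward_pass(enc_h, s_init, target_ids, embed, params, W_g, sos_id):
    """Run the decoder over one training pair, feeding the true previous target
    word at each step, and collect the per-step output distributions.
    Reuses attention_context() and decoder_step() from the sketches above."""
    s, y_prev = s_init, embed[sos_id]        # assumed start-of-sequence embedding
    distributions = []
    for target in target_ids:                # target_ids ends with the <eos> index
        _, z = attention_context(enc_h, s, W_g)
        s, Y = decoder_step(s, y_prev, z, params)
        distributions.append(Y)              # distribution to compare against this target word
        y_prev = embed[target]               # teacher forcing: feed the ground-truth word back
    return distributions
```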

Page 126

• Backward pass: compute a divergence between the target output and the output distributions
  – Backpropagate derivatives through the network

[Figure: the same unrolled network; at each step a divergence Div is computed between the predicted distribution and the target word in "Ich habe einen apfel gegessen <eos>"]
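Using cross-entropy as the divergence (one natural choice; the slides do not commit to a specific one), the per-sequence loss over the distributions produced above is a minimal sketch:

```python
import numpy as np

def sequence_divergence(distributions, target_ids):
    """Total divergence between the per-step output distributions and the
    target words: here the sum of per-step cross-entropies -log Y_t[target_t]."""
    return float(sum(-np.log(Y[t] + 1e-12) for Y, t in zip(distributions, target_ids)))

# In training, the gradients of this divergence are backpropagated through the decoder,
# the attention function g(), and the encoder (typically with an autodiff framework).
```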

Page 127

• Backward pass: compute a divergence between the target output and the output distributions
  – Backpropagate derivatives through the network

• Backpropagation also updates the parameters of the "attention" function g()

[Figure: same unrolled network and per-step divergences as on the previous slide]

Page 128

Various extensions

• Attention: local attention vs. global attention
  – E.g. "Effective Approaches to Attention-based Neural Machine Translation", Luong et al., 2015
  – Other variants

• Bidirectional processing of the input sequence
  – Bidirectional networks in the encoder
  – E.g. "Neural Machine Translation by Jointly Learning to Align and Translate", Bahdanau et al., 2016

Page 129

Some impressive results

• Attention-based models are currently responsible for the state of the art in many sequence-conversion systems
  – Machine translation
    • Input: word sequence in the source language
    • Output: word sequence in the target language
  – Speech recognition
    • Input: speech audio feature vector sequence
    • Output: transcribed word or character sequence

Page 130

Attention models in image captioning

• "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", Xu et al., 2016
• The encoder network is a convolutional neural network
  – Filter outputs at each location are the equivalent of h_i in the regular sequence-to-sequence model
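A sketch of how the convolutional encoder slots into the same attention machinery: the spatial grid of filter outputs is flattened into a set of location vectors that play the role of the h_i. The feature-map shape here is illustrative, not taken from the paper.

```python
import numpy as np

# Illustrative CNN feature map: a 7x7 spatial grid of 512-dimensional filter outputs.
feature_map = np.random.default_rng(4).standard_normal((7, 7, 512))

# Flatten the grid into 49 "encoder states" h_0 ... h_48, one per image location.
enc_h = feature_map.reshape(-1, feature_map.shape[-1])   # shape (49, 512)

# enc_h can now be fed to the same attention_context() sketch used for text:
# at each caption step, the decoder attends over image locations instead of input words.
```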

Page 131

In closing

• Have looked at various forms of sequence-to-sequence models
• Generalizations of recurrent neural network formalisms
• For more details, please refer to the papers
  – Post on Piazza if you have questions
• Will appear in HW4: speech recognition with attention models