Modeling Sequential Data with Recurrent Networks
Chris Dyer (DeepMind / Carnegie Mellon University)
July 28, 2016, LxMLS 2016 (lxmls.it.pt/2016/lxmls-dl2.pdf)


Page 1

Modeling Sequential Data with Recurrent Networks

Chris Dyer, DeepMind

Carnegie Mellon University

July 28, 2016, LxMLS 2016

Page 2

Outline: Part I

• Neural networks as feature inducers

• Recurrent neural networks

• Application: language models

• Learning challenges and solutions

  • Vanishing gradients

  • Long short-term memories

  • Gated recurrent units

• Break

Page 3

Outline: Part II

• RNN performance tuning and implementation tricks

• Bidirectional RNNs

  • Application: better word representations

• Sequence-to-sequence transduction with RNNs

  • Applications: machine translation & image caption generation

• Sequences as matrices and attention

  • Application: machine translation

Pages 4-5: Feature Induction

In linear regression, the goal is to learn W and b such that F is minimized for a dataset D consisting of M training instances. An engineer must select/design x carefully.

$y = Wx + b$

$F = \frac{1}{M} \sum_{i=1}^{M} \lVert y_i - \hat{y}_i \rVert_2^2$

Use "naive features" x and learn their transformations (conjunctions, nonlinear transformations, etc.) into h:

$h = g(Vx + c)$
$y = Wh + b$

"nonlinear regression"
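To make the contrast concrete, here is a minimal numpy sketch of both regressors; all sizes and variable names are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: x in R^3, y in R^2, M = 64 instances (hypothetical sizes).
M, d_in, d_hid, d_out = 64, 3, 16, 2
X = rng.normal(size=(M, d_in))
Y = rng.normal(size=(M, d_out))

# Linear regression: y = Wx + b
W = rng.normal(size=(d_out, d_in)) * 0.1
b = np.zeros(d_out)
pred_linear = X @ W.T + b

# "Nonlinear regression": induce features h = g(Vx + c), then y = Wh + b
V = rng.normal(size=(d_hid, d_in)) * 0.1
c = np.zeros(d_hid)
W2 = rng.normal(size=(d_out, d_hid)) * 0.1
b2 = np.zeros(d_out)
H = np.tanh(X @ V.T + c)            # induced features
pred_nonlinear = H @ W2.T + b2

# Objective F = (1/M) * sum_i ||y_i - yhat_i||_2^2
F = np.mean(np.sum((pred_nonlinear - Y) ** 2, axis=1))
print(F)
```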

Pages 6-7: Feature Induction

• What functions can this parametric form compute?

  • If h is big enough (i.e., has enough dimensions), it can represent any vector-valued function to any degree of precision

• This is a much more powerful regression model!

• You can think of h as "induced features" in a linear classifier

  • The network did the job of a feature engineer

$h = g(Vx + c)$
$y = Wh + b$

Pages 8-9: Recurrent Neural Networks

• Lots of interesting data is sequential in nature

  • Words in sentences

  • DNA

  • Stock market returns

  • …

• How do we represent an arbitrarily long history?

  • We will train neural networks to build a representation of these arbitrarily long sequences

Pages 10-13: Recurrent Neural Networks

Feed-forward NN (input x, hidden h, output y):

$h = g(Vx + c)$
$y = Wh + b$

Recurrent NN (input $x_t$, hidden $h_t$ fed by the previous state $h_{t-1}$, output $y_t$):

$h_t = g(Vx_t + Uh_{t-1} + c)$
$y_t = Wh_t + b$

Equivalently, writing the recurrence with a concatenated input:

$h_t = g(V[x_t; h_{t-1}] + c)$
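As a minimal sketch (hypothetical sizes, tanh for g), the recurrence is just a loop that threads the hidden state through time:

```python
import numpy as np

def rnn_step(x_t, h_prev, V, U, c):
    """One recurrence: h_t = g(V x_t + U h_{t-1} + c), with g = tanh."""
    return np.tanh(V @ x_t + U @ h_prev + c)

# Hypothetical sizes: inputs in R^4, hidden state in R^8, outputs in R^3.
rng = np.random.default_rng(1)
d_x, d_h = 4, 8
V = rng.normal(size=(d_h, d_x)) * 0.1
U = rng.normal(size=(d_h, d_h)) * 0.1
c = np.zeros(d_h)
W = rng.normal(size=(3, d_h)) * 0.1   # readout y_t = W h_t + b
b = np.zeros(3)

h = np.zeros(d_h)                      # h_0
for x_t in rng.normal(size=(5, d_x)):  # a length-5 input sequence
    h = rnn_step(x_t, h, V, U, c)
    y_t = W @ h + b                    # per-step output
```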

Pages 14-20: Recurrent Neural Networks

(Figure, built up one step at a time: the network unrolled over a length-4 input. Starting from an initial state $h_0$, each step reads $x_t$, computes $h_t$ from $h_{t-1}$, and emits $y_t$.)

$h_t = g(Vx_t + Uh_{t-1} + c)$
$y_t = Wh_t + b$

How do we train the RNN's parameters?

Pages 21-23: Recurrent Neural Networks

(Figure: the same unrolled network, now with a target $\hat{y}_t$ and a per-step cost_t attached to each output $y_t$; the per-step costs are summed into a single objective F.)

$h_t = g(Vx_t + Uh_{t-1} + c)$
$y_t = Wh_t + b$

Page 24: Recurrent Neural Networks

• The unrolled graph is a well-formed computation graph (a DAG), so we can run backprop

• Parameters are tied across time; derivatives are aggregated across all time steps

• This is historically called "backpropagation through time" (BPTT)

Pages 25-29: Parameter Tying

(Figure: the unrolled network from before, with the recurrence matrix U highlighted on every edge $h_{t-1} \to h_t$.)

Because the same U is used at every time step, its gradient sums contributions from all time steps:

$\frac{\partial F}{\partial U} = \sum_{t=1}^{4} \frac{\partial h_t}{\partial U} \frac{\partial F}{\partial h_t}$

Parameter tying also came up when learning the filters in convolutional networks (and in the transition matrices for HMMs!).
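A hand-written backward pass makes the tied-parameter sum visible. This is an illustrative numpy sketch, not the lecture's code: it assumes g = tanh, a squared loss on the final output only, and hypothetical sizes.

```python
import numpy as np

rng = np.random.default_rng(2)
d_x, d_h, T = 4, 8, 6
V = rng.normal(size=(d_h, d_x)) * 0.1
U = rng.normal(size=(d_h, d_h)) * 0.1
c = np.zeros(d_h)
W = rng.normal(size=(2, d_h)) * 0.1
b = np.zeros(2)
xs = rng.normal(size=(T, d_x))
target = rng.normal(size=2)

# Forward pass, remembering every state (h[0] is h_0).
h = [np.zeros(d_h)]
for t in range(T):
    h.append(np.tanh(V @ xs[t] + U @ h[-1] + c))
y = W @ h[-1] + b
F = 0.5 * np.sum((y - target) ** 2)

# Backward pass: dF/dU accumulates one term per time step,
# exactly the sum dF/dU = sum_t (dh_t/dU)(dF/dh_t) from the slide.
dU = np.zeros_like(U)
dh = W.T @ (y - target)               # dF/dh_T
for t in range(T, 0, -1):
    dz = dh * (1.0 - h[t] ** 2)       # backprop through tanh
    dU += np.outer(dz, h[t - 1])      # tied-parameter contribution at step t
    dh = U.T @ dz                     # send the gradient back to h_{t-1}
```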

Page 30

Parameter Tying

• Why do we want to tie parameters?

  • Reduce the number of parameters to be learned

  • Deal with arbitrarily long sequences

• What if we always have short sequences?

  • Maybe you might untie parameters, then. But you wouldn't have an RNN anymore!

Page 31: What else can we do?

(Figure: the unrolled RNN with per-step costs, as before.)

$h_t = g(Vx_t + Uh_{t-1} + c)$
$y_t = Wh_t + b$

Page 32: "Read and summarize"

(Figure: the RNN reads $x_1, \ldots, x_4$, and only the final state is used to make a single prediction $y$, with one cost F against the target $\hat{y}$.)

$h_t = g(Vx_t + Uh_{t-1} + c)$
$y = Wh_{|x|} + b$

Summarize a sequence into a single vector. (This will be useful later…)

Pages 33-35: View 2: Recursive Definition

• Recall how to construct a list recursively:

  base case: [] is a list (the empty list)
  induction: [t | h], where t is a list and h is an atom, is a list

• RNNs define functions that compute representations recursively according to this definition of a list:

  • Define (learn) a representation of the base case

  • Learn a representation of the inductive step

• Anything you can construct recursively, you can obtain an "embedding" of with neural networks using this general strategy
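As an illustrative sketch of this view (hypothetical names and sizes), embedding a list is just a fold of a learned "inductive step" over a learned embedding of the empty list:

```python
from functools import reduce
import numpy as np

rng = np.random.default_rng(3)
d_atom, d_list = 4, 8
V = rng.normal(size=(d_list, d_atom + d_list)) * 0.1
c = np.zeros(d_list)

h_empty = np.zeros(d_list)            # in practice: a learned embedding of []

def cons(h_list, x_atom):
    """Embedding of the inductive step: combine list-so-far and a new atom."""
    return np.tanh(V @ np.concatenate([x_atom, h_list]) + c)

atoms = rng.normal(size=(5, d_atom))  # a 5-element "list" of atom vectors
embedding = reduce(cons, atoms, h_empty)
```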

Pages 36-39: Example: Language Model

A softmax over the vocabulary (the, a, and, cat, dog, horse, runs, says, walked, walks, walking, pig, Lisbon, sardines, …) turns the hidden state into a distribution over next words:

$u = Wh + b$
$p_i = \frac{\exp u_i}{\sum_j \exp u_j}$

with $h \in \mathbb{R}^d$ and $|V| = 100{,}000$.

What are the dimensions of W? Of b? (W must map $\mathbb{R}^d$ to $\mathbb{R}^{|V|}$, so it is $|V| \times d$; b is a vector of length $|V|$.)

The probability of a sentence decomposes by the chain rule:

$p(e) = p(e_1) \times p(e_2 \mid e_1) \times p(e_3 \mid e_1, e_2) \times p(e_4 \mid e_1, e_2, e_3) \times \cdots$

Histories are sequences of words…
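A direct numpy transcription of these two equations, with a standard max-subtraction for numerical stability (an implementation detail the slide does not mention):

```python
import numpy as np

rng = np.random.default_rng(4)
d, vocab = 32, 100_000
W = rng.normal(size=(vocab, d)) * 0.01   # |V| x d
b = np.zeros(vocab)                      # length |V|
h = rng.normal(size=d)                   # h in R^d

u = W @ h + b
u -= u.max()                             # stability: softmax is shift-invariant
p = np.exp(u) / np.exp(u).sum()          # p_i = exp(u_i) / sum_j exp(u_j)
```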

Pages 40-48: Example: Language Model

(Figure, built up one step at a time: generation walks the unrolled network. From $h_0$ and $x_1 = \langle s\rangle$, compute $h_1$, apply the softmax to get $p_1$, and sample a word: tom. Feed tom in as $x_2$ and sample again: likes; then beer; then $\langle/s\rangle$, which ends the sequence.)

$p(\text{tom} \mid \langle s\rangle) \times p(\text{likes} \mid \langle s\rangle, \text{tom}) \times p(\text{beer} \mid \langle s\rangle, \text{tom}, \text{likes}) \times p(\langle/s\rangle \mid \langle s\rangle, \text{tom}, \text{likes}, \text{beer})$
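A sketch of this sampling loop; `step`, `readout`, and `embed` are hypothetical callables standing in for the recurrence, the softmax layer, and the word-embedding lookup:

```python
import numpy as np

def sample_sentence(step, readout, embed, h0, bos_id, eos_id, rng, max_len=50):
    """Ancestral sampling from an RNN LM: feed each sampled word back in."""
    h, w, words = h0, bos_id, []
    for _ in range(max_len):
        h = step(embed(w), h)            # h_t from x_t = embedding of last word
        p = readout(h)                   # softmax distribution over the vocab
        w = rng.choice(len(p), p=p)      # w ~ p_t
        if w == eos_id:
            break
        words.append(w)
    return words

# Usage (with the pieces above): rng = np.random.default_rng(0), etc.
```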

Pages 49-53: Language Model Training

(Figure: the same unrolled network run over the observed sentence ⟨s⟩ tom likes beer ⟨/s⟩. At each position, cost_t is the log loss / cross entropy of the observed next word under the predicted distribution, and the per-step costs are summed into the objective F.)
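The objective is then just the sum of per-step negative log probabilities; a minimal sketch, assuming `probs` holds one softmax distribution per position:

```python
import numpy as np

def sentence_nll(probs, targets):
    """F = sum_t cost_t, where cost_t = -log p_t[observed word at t]."""
    return -sum(np.log(p[w]) for p, w in zip(probs, targets))

# probs: one distribution per position (computed as on pages 36-39);
# targets: ids of the observed words tom, likes, beer, </s>.
```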

Page 54: RNN Language Models

• Unlike Markov (n-gram) models, RNNs never forget

  • However, we will see they might have trouble learning to use their memories (more soon…)

• Algorithms

  • Sample a sequence from the probability distribution defined by the RNN

  • Train the RNN to minimize cross entropy (aka MLE)

  • What about: what is the most probable sequence?

Page 55

Questions?

Pages 56-63: Training Challenges

What happens to gradients as you go back in time? Consider the "read and summarize" network:

$h_t = g(Vx_t + Uh_{t-1} + c)$
$y = Wh_{|x|} + b$

Backpropagating from F to the first hidden state chains the Jacobians of successive hidden states (shown here for a length-4 input):

$\frac{\partial F}{\partial h_1} = \underbrace{\frac{\partial h_2}{\partial h_1} \frac{\partial h_3}{\partial h_2} \frac{\partial h_4}{\partial h_3}}_{\prod_{t=2}^{4} \partial h_t / \partial h_{t-1}} \frac{\partial y}{\partial h_4} \frac{\partial F}{\partial y}$

and, in general, for an input of length $|x|$:

$\frac{\partial F}{\partial h_1} = \left( \prod_{t=2}^{|x|} \frac{\partial h_t}{\partial h_{t-1}} \right) \frac{\partial y}{\partial h_{|x|}} \frac{\partial F}{\partial y}$

Pages 64-71: Training Challenges

Name the pre-activation $z_t$, so that

$h_t = g(\overbrace{Vx_t + Uh_{t-1} + c}^{z_t})$, $\quad y = Wh_{|x|} + b$.

Each step's Jacobian factors through $z_t$:

$\frac{\partial F}{\partial h_1} = \left( \prod_{t=2}^{|x|} \frac{\partial h_t}{\partial z_t} \frac{\partial z_t}{\partial h_{t-1}} \right) \frac{\partial y}{\partial h_{|x|}} \frac{\partial F}{\partial y}$

What are the two factors? $\frac{\partial h_t}{\partial z_t} = \mathrm{diag}(g'(z_t))$ and $\frac{\partial z_t}{\partial h_{t-1}} = U$, so

$\frac{\partial h_t}{\partial h_{t-1}} = \frac{\partial h_t}{\partial z_t} \frac{\partial z_t}{\partial h_{t-1}} = \mathrm{diag}(g'(z_t))\,U$

and therefore

$\frac{\partial F}{\partial h_1} = \left( \prod_{t=2}^{|x|} \mathrm{diag}(g'(z_t))\,U \right) \frac{\partial y}{\partial h_{|x|}} \frac{\partial F}{\partial y}$

Page 72: Training Challenges

$\frac{\partial F}{\partial h_1} = \left( \prod_{t=2}^{|x|} \mathrm{diag}(g'(z_t))\,U \right) \frac{\partial y}{\partial h_{|x|}} \frac{\partial F}{\partial y}$

Three cases, depending on the largest eigenvalue of U:

• exactly 1: gradient propagation is stable

• < 1: the gradient vanishes (exponential decay)

• > 1: the gradient explodes (exponential growth)
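A quick numeric illustration of the three cases; a sketch that takes the most optimistic $g'(z) = 1$, so each factor in the product is just U:

```python
import numpy as np

rng = np.random.default_rng(5)
d, T = 8, 50
A = rng.normal(size=(d, d))

for radius in (0.9, 1.0, 1.1):
    # Rescale A so its spectral radius (largest |eigenvalue|) equals `radius`.
    U = A * (radius / np.max(np.abs(np.linalg.eigvals(A))))
    prod = np.eye(d)
    for _ in range(T):
        prod = prod @ U        # the product of per-step Jacobians
    # Norm decays, stays roughly stable, or explodes with T:
    print(radius, np.linalg.norm(prod))
```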

Page 73: Vanishing Gradients

• In practice, the spectral radius of U is small, and gradients vanish

• In practice, this means that long-range dependencies are difficult to learn (although in theory they are learnable)

• Solutions

  • Better optimizers (second-order methods, approximate second-order methods)

  • Normalization to keep the gradient norms stable across time

  • Clever initialization so that you at least start with good spectra (e.g., start with random orthonormal matrices)

  • Alternative parameterizations: LSTMs and GRUs

Page 74: Alternative RNNs

• Long short-term memories (LSTMs; Hochreiter and Schmidhuber, 1997)

• Gated recurrent units (GRUs; Cho et al., 2014)

• Intuition: instead of multiplying across time (which leads to exponential growth or decay), we want the error to be carried forward unchanged.

• What function has the identity I as its Jacobian, with spectral radius exactly 1? The identity function.

Pages 75-81: Memory cells

Give the network an additive memory cell that accumulates a nonlinear function of each input:

$c_t = c_{t-1} + f(x_t)$, where $f(v) = \tanh(Wv + b)$
$h_t = g(c_t)$

Note: $\frac{\partial c_t}{\partial c_{t-1}} = I$, so gradients flow through the cells without shrinking or growing.

If the increment also sees the previous hidden state,

$c_t = c_{t-1} + f([x_t; h_{t-1}])$

then $\frac{\partial c_t}{\partial c_{t-1}} = I + \varepsilon$: "almost constant."

Pages 82-84: Memory cells

Gates modulate what is kept and what is written:

$c_t = f_t \odot c_{t-1} + i_t \odot f([x_t; h_{t-1}])$
$f_t = \sigma(f_f([x_t; h_{t-1}]))$  ("forget gate")
$i_t = \sigma(f_i([x_t; h_{t-1}]))$  ("input gate")

Pages 85-86: LSTM

$c_t = f_t \odot c_{t-1} + i_t \odot f([x_t; h_{t-1}])$
$f_t = \sigma(f_f([x_t; h_{t-1}]))$  ("forget gate")
$i_t = \sigma(f_i([x_t; h_{t-1}]))$  ("input gate")
$o_t = \sigma(f_o([x_t; h_{t-1}]))$  ("output gate")
$h_t = o_t \odot g(c_t)$
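A minimal numpy sketch of one LSTM step (hypothetical names; it also uses the trick, mentioned later in these slides, of concatenating all gate parameters so one matrix-vector product computes every gate):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wg, bg):
    """One LSTM step. Wg (shape 4d x (d_x + d)) and bg (length 4d) hold the
    affine maps of all four blocks, stacked row-wise: i, f, o, candidate."""
    v = np.concatenate([x_t, h_prev])        # [x_t; h_{t-1}]
    d = h_prev.shape[0]
    z = Wg @ v + bg                          # one matmul for all gates
    i = sigmoid(z[0:d])                      # input gate
    f = sigmoid(z[d:2*d])                    # forget gate
    o = sigmoid(z[2*d:3*d])                  # output gate
    u = np.tanh(z[3*d:4*d])                  # candidate f([x_t; h_{t-1}])
    c = f * c_prev + i * u                   # c_t = f ⊙ c_{t-1} + i ⊙ u
    h = o * np.tanh(c)                       # h_t = o ⊙ g(c_t)
    return h, c
```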

Pages 87-88: LSTM Variant

Tie the forget gate to the input gate, $f_t = 1 - i_t$, so the cell interpolates rather than accumulates:

$c_t = (1 - i_t) \odot c_{t-1} + i_t \odot f([x_t; h_{t-1}])$
$i_t = \sigma(f_i([x_t; h_{t-1}]))$  ("input gate")
$o_t = \sigma(f_o([x_t; h_{t-1}]))$  ("output gate")
$h_t = o_t \odot g(c_t)$

Pages 89-92: Another Visualization

(Figure: a step-by-step walkthrough of the LSTM cell: forget some of the past, then add new memories. Figure credit: Christopher Olah.)

Page 93: Gated Recurrent Units (GRUs)

$z_t = \sigma(f_z([h_{t-1}; x_t]))$  (update gate)
$r_t = \sigma(f_r([h_{t-1}; x_t]))$  (reset gate)
$\tilde{h}_t = f([r_t \odot h_{t-1}; x_t])$
$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$
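A corresponding sketch of one GRU step (hypothetical parameter names `Wz`, `Wr`, `Wh` and biases):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Wr, Wh, bz, br, bh):
    v = np.concatenate([h_prev, x_t])            # [h_{t-1}; x_t]
    z = sigmoid(Wz @ v + bz)                     # update gate
    r = sigmoid(Wr @ v + br)                     # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]) + bh)
    return (1.0 - z) * h_prev + z * h_tilde      # interpolate old and new
```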

Pages 94-96: Summary

• Better gradient propagation is possible when you use additive rather than multiplicative/highly non-linear recurrent dynamics

• Recurrent architectures are an active area of research; designing them requires a mix of mathematical analysis, creativity, and problem-specific knowledge

• (LSTMs are hard to beat, though!)

RNN:  $h_t = f([x_t; h_{t-1}])$
LSTM: $c_t = f_t \odot c_{t-1} + i_t \odot f([x_t; h_{t-1}])$
GRU:  $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot f([x_t; r_t \odot h_{t-1}])$

Page 97

Questions?

Break?

Page 98

A Few Tricks of the Trade

• Depth

• Dropout

• Implementation tricks

Pages 99-103: "Deep" LSTMs

• This term has been defined in several ways, but the following is the most standard convention

(Figure, built up over several slides: LSTM layers are stacked, with each layer's hidden states serving as the inputs to the layer above; the top layer's final state feeds the output y.)

Page 104: Does Depth Matter?

• Yes, it helps

• It seems to play a less significant role in text than in audio/visual processing

  • H1: More transformation of the input is required for ASR, image recognition, etc., than for common text applications (word vectors become customized to be "good inputs" to RNNs, whereas you're stuck with what nature gives you for speech/vision)

  • H2: Less effort has been made to find good architectures (RNNs are expensive to train and have been widely used for less long)

  • H3: Backprop through time + depth is hard, and we need better optimizers

  • Many other possibilities…

• 2-8 layers seems to be standard

• Input "skip" connections are used often, but by no means universally

Pages 105-106: Dropout and Deep LSTMs

• Applying dropout layers requires some care

• Apply dropout between layers (on the vertical connections into each layer), but not on the recurrent connections
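A sketch of this placement with hypothetical step functions; inverted dropout, which rescales at training time, is an assumption of mine, not something the slide specifies:

```python
import numpy as np

def dropout(v, p, rng, train):
    if not train or p == 0.0:
        return v
    mask = rng.random(v.shape) >= p
    return v * mask / (1.0 - p)       # inverted dropout: rescale at train time

def deep_rnn(xs, steps, h0s, p, rng, train=True):
    """Stacked RNN. Dropout is applied to each layer's *input* (the vertical
    connections only); the recurrent h_{t-1} -> h_t connection is untouched."""
    seq = list(xs)
    for step, h in zip(steps, h0s):   # one (step fn, initial state) per layer
        out = []
        for x_t in seq:
            h = step(dropout(x_t, p, rng, train), h)   # drop the input only
            out.append(h)
        seq = out                     # this layer's states feed the next layer
    return seq
```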

Page 107: Implementation Details

• For speed

  • Use diagonal matrices instead of full matrices (esp. for gates)

  • Concatenate the parameter matrices for all gates and do a single matrix-vector (or matrix-matrix) multiplication

  • Use optimized implementations (from NVIDIA)

  • Use GRUs or a reduced-gate variant of LSTMs

• For learning speed and performance

  • Initialize so that the bias on the forget gate is large (intuitively: at the beginning of training, the signal from the past is unreliable)

  • Use random orthogonal matrices to initialize the square matrices

Page 108: Implementation Details: Minibatching

• GPU hardware is

  • pretty fast for elementwise operations (these are IO-bound: you can't get enough data through the GPU)

  • very fast for matrix-matrix multiplication (usually compute-bound: the GPU will work at 100% capacity, and GPU cores are fast)

• RNNs, LSTMs, and GRUs all consist of

  • lots of elementwise operations (addition, multiplication, nonlinearities, …)

  • lots of matrix-vector products

• Minibatching: convert many matrix-vector products into a single matrix-matrix multiplication

Pages 109-113: Minibatching

Single-instance RNN:

$h_t = g(Vx_t + Uh_{t-1} + c)$
$y_t = Wh_t + b$

Minibatch RNN (stack the instances' vectors $x_1, x_1, x_1, \ldots$ column-wise into a matrix $X_1$, and likewise for the hidden states):

$H_t = g(VX_t + UH_{t-1} + c)$
$Y_t = WH_t + b$

We batch across instances, not across time.

Anything wrong here?

Page 114: Minibatching

• The challenge with working with minibatches of sequences is that … sequences are of different lengths

• This usually means you bucket training instances by similar length, and pad with 0's

• Be careful when padding not to backpropagate a non-zero value!

• Manual minibatching convinces me that this is the era of assembly-language programming for neural networks. Make the future an easier place to program!
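A sketch of the pad-and-mask bookkeeping described above (hypothetical layout: time-major batches):

```python
import numpy as np

def pad_batch(seqs, d_x):
    """Bucketing by length happens upstream; here we right-pad with zero
    vectors and build a mask so padded steps contribute nothing."""
    T = max(len(s) for s in seqs)
    B = len(seqs)
    X = np.zeros((T, B, d_x))                 # time-major: X[t] is one batch
    mask = np.zeros((T, B))
    for b, s in enumerate(seqs):
        X[:len(s), b] = s
        mask[:len(s), b] = 1.0
    return X, mask

# Per-step costs get multiplied by mask[t] before summing into F,
# so no non-zero value is backpropagated from the padding.
```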

Page 115

Questions?

Page 116

Bidirectional RNNs

• We can read a sequence from left to right to obtain a representation

• Or we can read it from right to left

• Or we can read it from both and combine the representations
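One common way to combine them, sketched here with hypothetical forward and backward step functions, is to concatenate the two state sequences position by position:

```python
import numpy as np

def birnn_encode(xs, step_fwd, step_bwd, h0):
    """Read left-to-right and right-to-left, then concatenate the two state
    sequences position by position."""
    fwd, h = [], h0
    for x in xs:
        h = step_fwd(x, h)
        fwd.append(h)
    bwd, h = [], h0
    for x in reversed(xs):
        h = step_bwd(x, h)
        bwd.append(h)
    bwd.reverse()                    # align backward states with positions
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```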

Pages 117-126: Word Embedding Models

(Figure, built up over several slides, contrasting two ways to embed the word "car":)

• Memorize: a one-hot vector ($\cdots\,0\ 0\ 0\ 1\ 0\,\cdots$) looks up a learned embedding for the whole word

• Generalize: run an RNN over the character sequence START c a r STOP and use the final state as the embedding

Pages 127-130: Language modeling: CharLSTM > Word Lookup

Language      ppl Words   ppl Chars   Δ       |θ| Words   |θ| Chars   Morphology
English       59.4        57.4        -2.0    4.3M        0.18M       Analytic
Turkish       44.0        32.9        -11.1   5.7M        0.17M       Agglutinative
German        59.1        43.0        -16.1   6.3M        0.18M       Fusional
Portuguese    46.2        40.9        -5.3    4.2M        0.18M       Fusional
Catalan       35.3        34.9        -0.4    4.3M        0.18M       Fusional

Pages 131-132: Language modeling: Word similarities

increased    John      Noahshire          phding
reduced      Richard   Nottinghamshire    mixing
improved     George    Bucharest          modelling
expected     James     Saxony             styling
decreased    Robert    Johannesburg       blaming
targeted     Edward    Gloucestershire    christening

Page 133

Questions?

Pages 134-139: Recurrent Neural Networks (RNNs)

What is a vector representation of a sequence $x = x_1, x_2, x_3, x_4$? Run the RNN over the sequence, starting from 0 at the START symbol, and take the final state as the summary vector c:

$h_t = f(h_{t-1}, x_t)$
$c = \mathrm{RNN}(x)$

Page 140: Modeling Sequential Data with Recurrent Networkslxmls.it.pt/2016/lxmls-dl2.pdfh istories are sequences of words … Example: Language Model Example: Language Model h 1 h 0 x 1

Aller Anfang

RNN Encoder-Decoders

ist schwer STOP

Cho et al. (2014); Sutskever et al. (2014)

c = RNN(x)

y | c ⇠ RNNLM(c)

What is the probability of a sequence ?y | x

Page 141–150: RNN Encoder-Decoders

RNN Encoder-Decoders

Cho et al. (2014); Sutskever et al. (2014)

What is the probability of a sequence y | x?

c = RNN(x)

y | c ∼ RNNLM(c)

[Figure, built up over pages 141–150: an encoder RNN reads the source sentence "Aller Anfang ist schwer STOP" into a single vector c; conditioned on c, a decoder RNN starts from START and generates "Beginnings are difficult STOP" one word at a time.]
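To make the two equations concrete, here is a minimal runnable sketch of the encoder-decoder idea. It uses plain tanh RNNs with random toy parameters; all names, shapes, and the shared embedding table are illustrative assumptions, not the exact architectures of Cho et al. (2014) or Sutskever et al. (2014).

import numpy as np

rng = np.random.default_rng(0)
n, V = 8, 5                       # hidden size and a toy vocabulary size

E   = rng.normal(size=(V, n))     # word embeddings (shared here for simplicity)
W_e = rng.normal(size=(n, n)); U_e = rng.normal(size=(n, n))   # encoder
W_d = rng.normal(size=(n, n)); U_d = rng.normal(size=(n, n))   # decoder
C   = rng.normal(size=(n, n))     # injects the summary c into each decoder step
P   = rng.normal(size=(V, n))     # output projection

def encode(src_ids):
    # c = RNN(x): run a simple tanh RNN over the source; return the last state.
    h = np.zeros(n)
    for i in src_ids:
        h = np.tanh(W_e @ E[i] + U_e @ h)
    return h

def decode_step(prev_id, s, c):
    # One step of y | c ~ RNNLM(c): the decoder conditions on c at every step.
    s = np.tanh(W_d @ E[prev_id] + U_d @ s + C @ c)
    p = np.exp(P @ s); p /= p.sum()            # softmax over the vocabulary
    return s, p

c = encode([0, 1, 2, 3])                       # e.g. "Aller Anfang ist schwer"
s, p = decode_step(4, np.zeros(n), c)          # feed START; p = P(first word)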

Page 151: Modeling Sequential Data with Recurrent Networkslxmls.it.pt/2016/lxmls-dl2.pdfh istories are sequences of words … Example: Language Model Example: Language Model h 1 h 0 x 1

habeich Hunger </s> <s>

I’m

I’m

hungry

hungry

</s>

Page 152: Modeling Sequential Data with Recurrent Networkslxmls.it.pt/2016/lxmls-dl2.pdfh istories are sequences of words … Example: Language Model Example: Language Model h 1 h 0 x 1

habeich Hunger </s> <s>

I’m

I’m

hungry

hungry

</s>

Page 153: Modeling Sequential Data with Recurrent Networkslxmls.it.pt/2016/lxmls-dl2.pdfh istories are sequences of words … Example: Language Model Example: Language Model h 1 h 0 x 1

Sutskever et al. (2014)

Page 154: Ensembles of NNs

Ensembles of NNs

• Sutskever et al. noticed that their single models did not work well

• But by combining N independently trained models and obtaining a "consensus" of their predictions, the performance could be improved a lot

• This is called ensembling.
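As a minimal sketch of the idea (toy distributions stand in for real model outputs; nothing here is from Sutskever et al.'s implementation), one simple way to form a consensus is to average the per-step next-word distributions of the ensemble members:

import numpy as np

# Next-word distributions from 3 hypothetical models over a 4-word vocabulary
p1 = np.array([0.70, 0.10, 0.10, 0.10])
p2 = np.array([0.20, 0.60, 0.10, 0.10])
p3 = np.array([0.60, 0.20, 0.10, 0.10])

consensus = np.mean([p1, p2, p3], axis=0)   # average the ensemble's predictions
print(consensus, consensus.argmax())        # word 0 wins the consensus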

Page 155–157: Encode anything as a vector!

Encode anything as a vector!

[Figure on page 157: an RNN reads "<s> a spooky old house"; each input x_i produces a hidden state h_i, and a softmax over each h_i gives a next-word distribution p_i.]

Page 158: Limitations

Limitations

• A possible conceptual problem

• Sentences have unbounded lengths

• Vectors have finite capacity

• A possible practical problem

• Translated words can be very distant from their sources: can LSTMs learn such long-range dependencies?

Page 159: Two Goals

Two Goals

• Represent a source sentence as a matrix

• Generate a target sentence from a matrix

• These two steps are:

• An algorithm for neural MT

• A way of introducing attention

Page 160: Sentences as Matrices

Sentences as Matrices

• Problem with the fixed-size vector model in translation (maybe in images?)

• Sentences are of different sizes but vectors are of the same size

• Solution: use matrices instead

• Fixed number of rows, but number of columns depends on the number of words

• Usually |f| = #cols

Page 161–164: Sentences as Matrices

Sentences as Matrices

Ich möchte ein Bier ("I'd like a beer")

Mach's gut ("Take care")

Die Wahrheiten der Menschen sind die unwiderlegbaren Irrtümer ("The truths of men are the irrefutable errors")

Question: How do we build these matrices?

Page 165: With Concatenation

With Concatenation

• Each word type is represented by an n-dimensional vector

• Take all of the vectors for the sentence and concatenate them into a matrix

• Simplest possible model

• So simple, no one has bothered to publish how well/badly it works!

Page 166–168: With Concatenation (example)

Ich möchte ein Bier

x1 x2 x3 x4

fi = xi

F ∈ R^(n×|f|)
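For concreteness, a minimal sketch of the concatenation model (the embedding dimension, vocabulary, and values are toy assumptions): the i-th column of F is just the vector of the i-th word.

import numpy as np

n = 3                                          # embedding dimension
vocab = {"Ich": 0, "möchte": 1, "ein": 2, "Bier": 3}
E = np.random.default_rng(0).normal(size=(len(vocab), n))  # one vector per type

sentence = ["Ich", "möchte", "ein", "Bier"]
F = np.stack([E[vocab[w]] for w in sentence], axis=1)      # column i is f_i = x_i
print(F.shape)                                 # (n, |f|) = (3, 4)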

Page 169: With Convolutional Nets

With Convolutional Nets

• Apply convolutional networks to transform the naive concatenated matrix into a context-dependent matrix

• Closely related to the first "modern" neural translation model proposed (Kalchbrenner et al., 2013)

• No one has been using convnets lately in MT (including Kalchbrenner et al., who are using BiLSTMs these days)

• Note: convnets usually have a "pooling" operation at the top level that results in a fixed-size representation. For sentences, it is probably good to leave this out.

Page 170–174: With Convolutional Nets (example)

Ich möchte ein Bier

x1 x2 x3 x4

[Figure, built up over pages 170–174: Filter 1 and Filter 2 are convolved (∗) over the concatenated word-vector matrix, yielding a context-dependent matrix.]

F ∈ R^(f(n)×g(|f|))
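Here is a sketch of one generic width-3 one-dimensional convolution layer over the concatenated matrix (an assumption for illustration, not Kalchbrenner et al.'s exact model). With "same" padding and, as the previous slide suggests, no pooling, the output keeps one column per word, so the number of columns still depends on sentence length.

import numpy as np

rng = np.random.default_rng(0)
n, length, n_filters, width = 3, 4, 5, 3
F_in = rng.normal(size=(n, length))              # naive concatenated matrix

filters = rng.normal(size=(n_filters, n, width))
pad = np.pad(F_in, ((0, 0), (width // 2, width // 2)))   # "same" padding

F_out = np.empty((n_filters, length))
for j in range(length):                          # slide each filter over columns
    window = pad[:, j:j + width]                 # the word and its neighbours
    F_out[:, j] = np.tanh(np.tensordot(filters, window, axes=([1, 2], [0, 1])))
print(F_out.shape)                               # context-dependent matrix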

Page 175: With Bidirectional RNNs

With Bidirectional RNNs

• By far the most widely used matrix representation, due to Bahdanau et al. (2015)

• One column per word

• Each column (word) has two halves concatenated together:

• a "forward representation", i.e., a word and its left context

• a "reverse representation", i.e., a word and its right context

• Implementation: bidirectional RNNs (GRUs or LSTMs) read f from left to right and from right to left; concatenate the two representations

Page 176–183: With Bidirectional RNNs (example)

Ich möchte ein Bier

x1 x2 x3 x4

Forward states: →h1 →h2 →h3 →h4

Backward states: ←h1 ←h2 ←h3 ←h4

fi = [←hi; →hi]

F ∈ R^(2n×|f|)
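A sketch of this bidirectional encoder follows; simple tanh RNNs stand in for the GRUs/LSTMs used in practice, and all parameters are random stand-ins.

import numpy as np

rng = np.random.default_rng(0)
n, length = 3, 4
X = rng.normal(size=(n, length))                 # word vectors x1..x4 as columns
Wf, Uf = rng.normal(size=(n, n)), rng.normal(size=(n, n))   # forward RNN
Wb, Ub = rng.normal(size=(n, n)), rng.normal(size=(n, n))   # backward RNN

def run(W, U, cols):
    # Run a tanh RNN over the given columns and collect every hidden state.
    h, hs = np.zeros(n), []
    for x in cols:
        h = np.tanh(W @ x + U @ h)
        hs.append(h)
    return hs

fwd = run(Wf, Uf, [X[:, i] for i in range(length)])                    # left to right
bwd = run(Wb, Ub, [X[:, i] for i in range(length - 1, -1, -1)])[::-1]  # right to left

# fi = [←hi; →hi]: each column concatenates the backward and forward states
F = np.stack([np.concatenate([b, f]) for b, f in zip(bwd, fwd)], axis=1)
print(F.shape)                                   # (2n, |f|) = (6, 4)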

Page 184: Where are we in 2016?

Where are we in 2016?

• There are lots of ways to construct F

• Very little (published?) work comparing them

• There are many more undiscovered things out there

• convolutions are particularly interesting and under-explored

• syntactic information could help

• My intuition is that simpler/faster models will work well for the matrix-encoding part, since context dependencies are limited in language.

• try something with phrase types instead of word types?

Page 185: Generation from Matrices

Generation from Matrices

• We have a matrix F representing the input; now we need to generate from it

• Bahdanau et al. (2015) were the first to propose using attention for translating from matrix-encoded sentences

• High-level idea:

• Generate the output sentence word by word using an RNN

• At each output position t, the RNN receives two inputs (in addition to any recurrent inputs):

• a fixed-size vector embedding of the previously generated output symbol e_{t-1}

• a fixed-size vector encoding a "view" of the input matrix

• How do we get a fixed-size vector from a matrix that changes over time?

• Bahdanau et al.: take a weighted sum of the columns of F (i.e., words) based on how important they are at the current time step (i.e., just a matrix-vector product F a_t)

• The weighting of the input columns at each time step (a_t) is called attention
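The "view" of the input at step t really is just a matrix-vector product, a weighted sum of F's columns. A two-line illustration (toy numbers; in the real model a_t comes from the attention mechanism described below):

import numpy as np

F = np.arange(8.0).reshape(2, 4)          # 2-dim encodings of 4 source words
a_t = np.array([0.1, 0.7, 0.1, 0.1])      # attention: mostly on word 2
c_t = F @ a_t                             # fixed-size vector, regardless of |f|
print(c_t)                                # [1.2 5.2]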

Page 186–191: Recall RNNs…

Recall RNNs…

[Figure, built up over pages 186–191: an RNN decoder starts from a zero state, predicts "I'd", feeds "I'd" back in as the next input, predicts "like", and so on.]

Page 192–200:

[Figure, built up over pages 192–200: the source "Ich möchte ein Bier" is encoded as a matrix; at each decoder step, the attention weights a_1ᵀ, …, a_5ᵀ over the source columns are appended to an "attention history" while the decoder generates "I'd like a beer STOP", feeding each output word back in as the next input.]

Page 201: Attention

Attention

• How do we know what to attend to at each time step?

• That is, how do we compute a_t?

Page 202–206: Computing Attention

Computing Attention

• At each time step (one time step = one output word), we want to be able to "attend" to different words in the source sentence

• We need a weight for every word: this is an |f|-length vector a_t

• Here is a simplified version of Bahdanau et al.'s solution:

• Use an RNN to predict the model output; call its hidden states s_t (s_t has a fixed dimensionality, call it m)

• At time t, compute the expected input embedding r_t = V s_{t-1} (V is a learned parameter)

• Take the dot product with every column in the source matrix to compute the attention energy u_t = Fᵀ r_t (since F has |f| columns, u_t has |f| rows; called e_t in the paper)

• Exponentiate and normalize to 1: a_t = softmax(u_t) (called α_t in the paper)

• Finally, the input source vector for time t is c_t = F a_t
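These five steps fit in a few lines. A sketch of the simplified (dot-product) version, with the shapes from the slide and random values standing in for learned parameters and the decoder state:

import numpy as np

rng = np.random.default_rng(0)
n2, f_len, m = 6, 4, 5                    # 2n, |f|, decoder state size
F = rng.normal(size=(n2, f_len))          # source matrix
V = rng.normal(size=(n2, m))              # learned parameter
s_prev = rng.normal(size=m)               # decoder state s_{t-1}

r_t = V @ s_prev                          # expected input embedding
u_t = F.T @ r_t                           # attention energies, one per column
a_t = np.exp(u_t) / np.exp(u_t).sum()     # softmax: exponentiate and normalize
c_t = F @ a_t                             # input source vector for time t
print(a_t.sum())                          # 1.0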

Page 207–209: Nonlinear Attention-Energy Model

Nonlinear Attention-Energy Model

• In the actual model, Bahdanau et al. replace the dot product between the columns of F and r_t with an MLP:

u_t = Fᵀ r_t (simple model)

u_t = vᵀ tanh(WF + r_t) (Bahdanau et al.)

• Here, W and v are learned parameters of appropriate dimension, and + "broadcasts" r_t over the |f| columns in WF

• This can learn more complex interactions

• It is unclear if the added complexity is necessary for good performance
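The MLP variant in the same sketch form (the hidden size k and all values are toy assumptions): r_t is broadcast over the |f| columns of WF before the tanh, and v reduces each column to a scalar energy.

import numpy as np

rng = np.random.default_rng(0)
n2, f_len, m, k = 6, 4, 5, 7              # k = hidden size of the energy MLP
F = rng.normal(size=(n2, f_len))
W = rng.normal(size=(k, n2))              # learned
v = rng.normal(size=k)                    # learned
r_t = rng.normal(size=k)                  # from r_t = V s_{t-1}, as before

u_t = v @ np.tanh(W @ F + r_t[:, None])   # "+" broadcasts r_t across columns
a_t = np.exp(u_t) / np.exp(u_t).sum()     # |f| weights, as before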

Page 210–213: Putting it all together

Putting it all together

F = EncodeAsMatrix(f)    (Part 1 of lecture)
X = WF    (doesn't depend on output decisions, so compute it once)
s_0 = w    (learned initial state; Bahdanau uses U ←h_1)
e_0 = <s>
t = 0
while e_t ≠ </s>:
    t = t + 1
    r_t = V s_{t-1}              ⎫
    u_t = vᵀ tanh(X + r_t)       ⎬ (compute attention)
    a_t = softmax(u_t)           ⎭
    c_t = F a_t
    s_t = RNN(s_{t-1}, [e_{t-1}; c_t])    (e_{t-1} is a learned embedding of the symbol e_{t-1})
    y_t = softmax(P s_t + b)    (P and b are learned parameters)
    e_t | e_{<t} ∼ Categorical(y_t)
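A runnable sketch of this loop follows, under loudly stated assumptions: greedy argmax stands in for sampling e_t ∼ Categorical(y_t), a plain tanh RNN stands in for the GRU, all parameters are random stand-ins, and the vocabulary is just integer ids with 0 = <s> and 1 = </s>.

import numpy as np

rng = np.random.default_rng(0)
n2, flen, m, d, k, vocab = 6, 4, 5, 5, 7, 10
F = rng.normal(size=(n2, flen))             # F = EncodeAsMatrix(f)
W = rng.normal(size=(k, n2)); V = rng.normal(size=(k, m))
v = rng.normal(size=k)                      # attention parameters
Emb = rng.normal(size=(vocab, d))           # embeddings of output symbols
Wr = rng.normal(size=(m, d + n2)); Ur = rng.normal(size=(m, m))
P = rng.normal(size=(vocab, m)); b = rng.normal(size=vocab)

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

X = W @ F                                   # doesn't depend on output decisions
s = rng.normal(size=m)                      # s_0 = w (learned in a real model)
e, t, output = 0, 0, []                     # e_0 = <s>
while e != 1 and t < 20:                    # until </s> (plus a length cap)
    t += 1
    r = V @ s                               # r_t = V s_{t-1}
    u = v @ np.tanh(X + r[:, None])         # u_t = v^T tanh(X + r_t)
    a = softmax(u)                          # a_t
    c = F @ a                               # c_t = F a_t
    s = np.tanh(Wr @ np.concatenate([Emb[e], c]) + Ur @ s)   # s_t = RNN(...)
    y = softmax(P @ s + b)                  # y_t
    e = int(np.argmax(y))                   # greedy stand-in for sampling
    output.append(e)
print(output)

Precomputing X = WF outside the loop is the point of the annotation on the slide: the attention MLP's first layer never changes during decoding, so it should not be recomputed at every output step.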

Page 214: Summary

Summary

• Attention is closely related to "pooling" operations in convnets (and other architectures)

• Bahdanau's attention model seems to care only about "content"

• No obvious bias in favor of diagonals, short jumps, fertility, etc.

• Some work has begun to add other "structural" biases (Luong et al., 2015; Cohn et al., 2016), but there are lots more opportunities

• Attention is similar to alignment, but there are important differences

• alignment makes stochastic but hard decisions: even if the alignment probability distribution is "flat", the model picks one word or phrase at a time

• attention is "soft" (you add together all the words); there is a big difference between "flat" and "peaked" attention weights

Page 215: Attention and Translation

Attention and Translation

• Cho's question: does a translator read and memorize the input sentence/document and then generate the output?

• Compressing the entire input sentence into a vector basically says "memorize the sentence"

• Common-sense experience says translators refer back and forth to the input (this is also backed up by eye-tracking studies)

• Should humans be a model for machines?

Page 216: Questions?

Questions?