
Sequence to Sequence Models

Jon Dehdari

January 25, 2016


Good Morning!


Sentence Vectors

• We’ve seen that words can be represented as vectors. Can sentences be represented as vectors?

• Sure, why not? How? From the hidden state at the end of a sentence: h_i = φ_enc(h_{i−1}, s_i)   (φ_enc = LSTM or GRU; a sketch follows this slide)

• Are they any good? For Elman networks (SRNs), not so much. For LSTMs or GRUs, yes, they’re pretty good

Image courtesy of http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2
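
To make the recursion above concrete, here is a minimal numpy sketch of a GRU encoder that folds a sentence into a single vector. The gate parameterization is the standard GRU one; the dimensions, random weights, and toy sentence are illustrative assumptions, not anything specified in the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUEncoder:
    """Minimal GRU encoder: h_i = phi_enc(h_{i-1}, s_i)."""
    def __init__(self, emb_dim, hid_dim, rng=np.random.default_rng(0)):
        def mat(rows, cols):
            return rng.normal(0, 0.1, (rows, cols))
        # Update gate, reset gate, and candidate-state parameters
        self.Wz, self.Uz = mat(hid_dim, emb_dim), mat(hid_dim, hid_dim)
        self.Wr, self.Ur = mat(hid_dim, emb_dim), mat(hid_dim, hid_dim)
        self.Wh, self.Uh = mat(hid_dim, emb_dim), mat(hid_dim, hid_dim)
        self.hid_dim = hid_dim

    def step(self, h_prev, s_i):
        z = sigmoid(self.Wz @ s_i + self.Uz @ h_prev)        # update gate
        r = sigmoid(self.Wr @ s_i + self.Ur @ h_prev)        # reset gate
        h_tilde = np.tanh(self.Wh @ s_i + self.Uh @ (r * h_prev))
        return (1 - z) * h_prev + z * h_tilde

    def encode(self, word_vectors):
        """Run the recursion over the sentence; the final h is the sentence vector."""
        h = np.zeros(self.hid_dim)
        for s_i in word_vectors:
            h = self.step(h, s_i)
        return h

# Toy usage: a 4-word "sentence" of random 16-dim word embeddings
enc = GRUEncoder(emb_dim=16, hid_dim=32)
sentence = [np.random.default_rng(i).normal(size=16) for i in range(4)]
sentence_vector = enc.encode(sentence)   # one 32-dim vector for the whole sentence
```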


Sentence Vector Examples

[Figure: two 2-D scatter plots of sentence vectors]

Left plot, sentences shown:
I gave her a card in the garden
In the garden, I gave her a card
She was given a card by me in the garden
She gave me a card in the garden
In the garden, she gave me a card
I was given a card by her in the garden

Right plot, sentences shown:
John respects Mary
Mary respects John
John admires Mary
Mary admires John
Mary is in love with John
John is in love with Mary

Sentence vectors were projected to two dimensions using PCA

Images courtesy of Sutskever et al. (2014)
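
As an aside, the 2-D projection mentioned in the caption is easy to reproduce for any set of sentence vectors. This is a plain numpy sketch (centering plus SVD) with stand-in vectors rather than the actual ones from Sutskever et al. (2014).

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their top two principal components."""
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)                  # center the data
    # Right singular vectors = principal directions
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T                     # (n_sentences, 2) coordinates

# Stand-in sentence vectors (e.g. encoder outputs from the previous sketch)
vecs = np.random.default_rng(0).normal(size=(6, 32))
coords = pca_2d(vecs)                       # one (x, y) point per sentence, for plotting
```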


Generating Sentences from Vectors

• We can also try to go the other direction, generating sentences from vectors

• How? Use an RNN to decode, rather than encode a sentence: z_i = φ_dec(z_{i−1}, u_{i−1}, h_T)   (a sketch follows this slide)

• h_T ensures global sentence coherency (& adequacy in MT); u_{i−1} ensures local fluency

Image courtesy of http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2
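
A rough sketch of the decoder recursion z_i = φ_dec(z_{i−1}, u_{i−1}, h_T), again with a GRU-style cell, plus greedy word-by-word generation from a sentence vector. The vocabulary size, weight shapes, and BOS/EOS conventions are assumptions for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class GRUDecoder:
    """z_i = phi_dec(z_{i-1}, u_{i-1}, h_T); a softmax over z_i picks the next word."""
    def __init__(self, vocab_size, emb_dim, hid_dim, rng=np.random.default_rng(1)):
        m = lambda r, c: rng.normal(0, 0.1, (r, c))
        self.E = m(vocab_size, emb_dim)               # target word embeddings (u)
        in_dim = emb_dim + hid_dim                    # input is [u_{i-1}; h_T]
        self.Wz, self.Uz = m(hid_dim, in_dim), m(hid_dim, hid_dim)
        self.Wr, self.Ur = m(hid_dim, in_dim), m(hid_dim, hid_dim)
        self.Wh, self.Uh = m(hid_dim, in_dim), m(hid_dim, hid_dim)
        self.Wo = m(vocab_size, hid_dim)              # output projection to vocabulary
        self.hid_dim = hid_dim

    def step(self, z_prev, u_prev, h_T):
        x = np.concatenate([u_prev, h_T])             # condition on last word and sentence vector
        g = sigmoid(self.Wz @ x + self.Uz @ z_prev)   # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ z_prev)   # reset gate
        z_tilde = np.tanh(self.Wh @ x + self.Uh @ (r * z_prev))
        return (1 - g) * z_prev + g * z_tilde

    def generate(self, h_T, bos_id=0, eos_id=1, max_len=20):
        z, word, out = np.zeros(self.hid_dim), bos_id, []
        for _ in range(max_len):
            z = self.step(z, self.E[word], h_T)
            word = int(np.argmax(softmax(self.Wo @ z)))   # greedy choice of next word
            if word == eos_id:
                break
            out.append(word)
        return out

dec = GRUDecoder(vocab_size=50, emb_dim=16, hid_dim=32)
print(dec.generate(h_T=np.zeros(32)))   # word ids from an (untrained) decoder
```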


Using Neural Encoders & Decoders to Translate

• We can combine the neural encoder and decoder of the previous slides to form an encoder-decoder model

• This can be used for machine translation, and other tasks that map sequences to sequences

• Monolingual word projections (vectors/embeddings) are trained to maximize the likelihood of the next word

• Source-side word projections (s_i) in an encoder-decoder setting are trained to maximize target-side likelihood (a sketch of this objective follows below)

Image courtesy of http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2
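
The training signal for all of this is target-side likelihood: every parameter, including the source-side projections s_i, gets its gradient from log P(target | source). Below is a sketch of that objective, reusing the hypothetical GRUEncoder and GRUDecoder classes from the earlier sketches; the gradients and optimizer are left to whatever toolkit one is using.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def target_log_likelihood(encoder, decoder, source_vectors, target_ids, bos_id=0):
    """log P(target | source) under the encoder-decoder. Training maximizes this,
    so gradients flow all the way back into the source-side projections s_i."""
    h_T = encoder.encode(source_vectors)        # sentence vector of the source
    z = np.zeros(decoder.hid_dim)
    prev, logp = bos_id, 0.0
    for y in target_ids:
        z = decoder.step(z, decoder.E[prev], h_T)
        probs = softmax(decoder.Wo @ z)         # distribution over the target vocabulary
        logp += np.log(probs[y])                # log-probability of the reference word
        prev = y                                # teacher forcing: feed the reference word
    return logp

# Training would adjust all weights (encoder, decoder, both embedding tables)
# to increase this value over the parallel corpus.
```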


Bidirectional RNNs

• The basic encoder-decoder architecture doesn’t handle long sentences very well
• Everything must fit into a fixed-size vector,
• and RNNs remember recent items better

• We can combine left-to-right and right-to-left RNNs to overcome these issues (a sketch follows this slide)

Image courtesy of http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3
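
One standard way to realize this (as in attention-based NMT) is to run one RNN left-to-right and another right-to-left and concatenate their states position by position, so every annotation h_j summarizes the whole sentence. A minimal sketch, reusing the hypothetical GRUEncoder cell from before:

```python
import numpy as np

def bidirectional_annotations(fwd_enc, bwd_enc, word_vectors):
    """Return one annotation h_j per source word: the concatenation of the
    forward state (summarizing words 1..j) and the backward state (words j..T)."""
    # Forward pass, keeping every intermediate state
    h, fwd_states = np.zeros(fwd_enc.hid_dim), []
    for s in word_vectors:
        h = fwd_enc.step(h, s)
        fwd_states.append(h)
    # Backward pass over the reversed sentence
    h, bwd_states = np.zeros(bwd_enc.hid_dim), []
    for s in reversed(word_vectors):
        h = bwd_enc.step(h, s)
        bwd_states.append(h)
    bwd_states.reverse()                        # re-align with the original word order
    # Concatenate per position: each h_j now "sees" the whole sentence
    return [np.concatenate([f, b]) for f, b in zip(fwd_states, bwd_states)]

# Usage, assuming the GRUEncoder sketch from earlier:
# fwd = GRUEncoder(emb_dim=16, hid_dim=32); bwd = GRUEncoder(emb_dim=16, hid_dim=32)
# annotations = bidirectional_annotations(fwd, bwd, sentence)   # list of 64-dim h_j
```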


What if . . .

• Even bidirectional encoder-decoders have a hard time with long sentences

• We need a way to keep track of what’s already been translated and what to translate next

• For neural nets, the solution is often more neural nets . . .


Achtung, Baby!

• Attention-based decoding adds another network (a) that takes as input the encoder’s hidden state (h) and the decoder’s hidden state (z), and outputs a probability for each source word at each time step (when and where to pay attention): e_{i,j} = a(z_{i−1}, h_j) = v_a^⊤ tanh(W_a z_{i−1} + U_a h_j), where a softmax over j turns the scores e_{i,j} into the attention weights (see the sketch below)

• The attention weights can also function as soft word alignments. They’re trained via target-side MLE

Image courtesy of http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3
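
A sketch of the attention computation above: the scores e_{i,j}, their softmax normalization into weights α_{i,j}, and the resulting context vector that the decoder consumes. The weight shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(z_prev, annotations, W_a, U_a, v_a):
    """Scores e_{i,j} = v_a^T tanh(W_a z_{i-1} + U_a h_j); a softmax over j gives the
    attention weights alpha_{i,j}, whose weighted sum of the h_j is the context vector."""
    scores = np.array([v_a @ np.tanh(W_a @ z_prev + U_a @ h_j) for h_j in annotations])
    alpha = softmax(scores)                                   # soft alignment over source words
    context = sum(a * h_j for a, h_j in zip(alpha, annotations))
    return context, alpha

# Toy shapes: decoder state of size 32, bidirectional annotations of size 64
rng = np.random.default_rng(2)
W_a = rng.normal(size=(16, 32))
U_a = rng.normal(size=(16, 64))
v_a = rng.normal(size=16)
annotations = [rng.normal(size=64) for _ in range(5)]
context, alpha = attention_context(np.zeros(32), annotations, W_a, U_a, v_a)
print(alpha)    # one weight per source word, summing to 1
```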


Image Caption Generation

• You can use attention-based decoding to give textual descriptions of images

Image courtesy of http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3


Image Caption Generation Examples

Images courtesy of http://arxiv.org/abs/1502.03044


Image Caption Generation, Step by Step

Images courtesy of http://arxiv.org/abs/1502.03044
