machine translation 2 - people.cs.umass.edu


Page 1: machine translation 2 - people.cs.umass.edu

machine translation 2: seq2seq / decoding / eval

CS 585, Fall 2018
Introduction to Natural Language Processing
http://people.cs.umass.edu/~miyyer/cs585/

Mohit Iyyer
College of Information and Computer Sciences
University of Massachusetts Amherst

some slides adapted from Richard Socher and Marine Carpuat

Page 2: machine translation 2 - people.cs.umass.edu

questions from last time…

• info for project proposal? template now posted!
• teammates for project?
• HW2???
• recorded lecture audio not happening :(
• midterm???
  • will cover text classification / language modeling / word embeddings / sequence labeling / machine translation (including today's lecture)
  • will not cover CFGs / parsing
  • 20% multiple choice, 80% short answer/computational qs
  • 1-page "cheat sheet" allowed, must be hand-written
• Mohit out next lecture and 11/1

Page 3: machine translation 2 - people.cs.umass.edu

limitations of IBM models

• discrete alignments
• all alignments equally likely (model 1 only)
• translation of each f word depends only on the aligned e word!

3

Page 4: machine translation 2 - people.cs.umass.edu

4

Recap: The Noisy Channel Model

• Goal: translation system from French to English

• Have a model p(e | f) which estimates the conditional probability of any English sentence e given the French sentence f. Use the training corpus to set the parameters.

• A Noisy Channel Model has two components:

  p(e)       the language model
  p(f | e)   the translation model

• Giving:

  p(e | f) = p(e, f) / p(f) = p(e) p(f | e) / Σ_{e'} p(e') p(f | e')

  and

  argmax_e p(e | f) = argmax_e p(e) p(f | e)
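To make the decision rule concrete, here is a minimal Python sketch of noisy channel scoring over a toy candidate list; the two candidate sentences and their log-probabilities are made up for illustration and do not come from any real language or translation model.

    import math

    # Hypothetical toy log-probabilities: in a real system these would come
    # from a trained language model p(e) and a translation model p(f | e).
    log_p_e = {"the poor are broke": -4.2, "poor the broke are": -9.1}
    log_p_f_given_e = {"the poor are broke": -3.0, "poor the broke are": -2.5}

    def noisy_channel_score(e: str) -> float:
        # argmax_e p(e | f) = argmax_e p(e) * p(f | e), i.e. a sum of log-probs
        return log_p_e[e] + log_p_f_given_e[e]

    best = max(log_p_e, key=noisy_channel_score)
    print(best)  # "the poor are broke": the fluent candidate wins thanks to the LM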

Page 5: machine translation 2 - people.cs.umass.edu

5

Phrase-Based Model

• Foreign input is segmented into phrases
• Each phrase is translated into English
• Phrases are reordered

Page 6: machine translation 2 - people.cs.umass.edu

phrase-based MT

• better way of modeling p(f | e): phrase alignments instead of word alignments

  p(f | e) = ∏_{i=1}^{I} ϕ(f_i, e_i) d(start_i − end_{i−1} − 1)

  where the product runs over the set of I phrases in f, ϕ(f_i, e_i) is the phrase translation probability, and d(start_i − end_{i−1} − 1) is the reordering probability
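A minimal sketch of scoring one phrase segmentation under this model; the tiny phrase table, the exponential distortion parameter alpha, and all function names here are made up for illustration, not part of any real phrase-based decoder.

    import math

    # Hypothetical phrase table: log phi(f_phrase, e_phrase). Real tables are
    # learned from word-aligned parallel text (e.g., the Europarl example below).
    log_phi = {
        ("les pauvres", "the poor"): -0.4,
        ("sont démunis", "don't have any money"): -1.1,
    }

    def log_distortion(start_i: int, end_prev: int, alpha: float = 0.7) -> float:
        # simple exponential reordering penalty: d(start_i - end_{i-1} - 1) = alpha^|jump|
        return abs(start_i - end_prev - 1) * math.log(alpha)

    def score_segmentation(phrase_pairs, spans):
        """phrase_pairs: list of (f_phrase, e_phrase); spans: (start, end) source
        word positions of each f_phrase, in the order the English is produced."""
        total, end_prev = 0.0, 0
        for (f, e), (start, end) in zip(phrase_pairs, spans):
            total += log_phi[(f, e)] + log_distortion(start, end_prev)
            end_prev = end
        return total

    pairs = [("les pauvres", "the poor"), ("sont démunis", "don't have any money")]
    spans = [(1, 2), (3, 4)]   # 1-based source positions; monotone order, no penalty
    print(score_segmentation(pairs, spans))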

Page 7: machine translation 2 - people.cs.umass.edu

7

Word Alignment

[Figure: word alignment matrix between the German sentence "michael geht davon aus , dass er im haus bleibt" and the English sentence "michael assumes that he will stay in the house"]

Phrase alignment from word alignment! Use IBM models to get word alignments!

Page 8: machine translation 2 - people.cs.umass.edu

8

Word Alignment

[Figure: the same word alignment matrix, with phrase pairs extracted from it]

Phrase alignment from word alignment!

  assumes ↔ geht davon aus
  assumes that ↔ geht davon aus , dass

use IBM models to get word alignments!

Page 9: machine translation 2 - people.cs.umass.edu

9

Real Example

• Phrase translations for den Vorschlag learned from the Europarl corpus:

  English            ϕ(ē | f̄)    English            ϕ(ē | f̄)
  the proposal       0.6227      the suggestions    0.0114
  's proposal        0.1068      the proposed       0.0114
  a proposal         0.0341      the motion         0.0091
  the idea           0.0250      the idea of        0.0091
  this proposal      0.0227      the proposal ,     0.0068
  proposal           0.0205      its proposal       0.0068
  of the proposal    0.0159      it                 0.0068
  the proposals      0.0159      ...                ...

  – lexical variation (proposal vs suggestions)
  – morphological variation (proposal vs proposals)
  – included function words (the, a, ...)
  – noise (it)

in general, we learn a phrase table to store these translation probabilities. what are some limitations of phrase-based MT?

Page 10: machine translation 2 - people.cs.umass.edu

today: neural MT

• instead of using the noisy channel model to decompose p(e | f) ∝ p(f | e) p(e), let's directly model p(e | f)

  p(e | f) = p(e_1, e_2, …, e_L | f)
           = p(e_1 | f) · p(e_2 | e_1, f) · p(e_3 | e_2, e_1, f) · …
           = ∏_{i=1}^{L} p(e_i | e_1, …, e_{i−1}, f)

this is a conditional language model. how is this different than the LMs we saw in the IBM models?

Page 11: machine translation 2 - people.cs.umass.edu

seq2seq models

• use two different RNNs to model ∏_{i=1}^{L} p(e_i | e_1, …, e_{i−1}, f)

• first we have the encoder, which encodes the foreign sentence f

• then, we have the decoder, which produces the English sentence e

Page 12: machine translation 2 - people.cs.umass.edu

12

Reminder: RNN language models!

[Figure: an RNN language model reads the words c_1, c_2, c_3, c_4 = "the students opened their" (one-hot vectors → word embeddings → hidden states) and outputs a distribution over the next word, e.g. "books", "laptops", "a", "zoo". Note: this input sequence could be much longer, but this slide doesn't have space!]

h(t) = f(W_h h(t−1) + W_e c_t + b_1), where h(0) is the initial hidden state
ŷ = softmax(W_2 h(t) + b_2)
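As a concrete illustration of this recurrence, here is a small numpy sketch; the toy vocabulary/embedding/hidden sizes, the choice of tanh as the nonlinearity f, and the random weights are all assumptions, not the course's actual setup.

    import numpy as np

    rng = np.random.default_rng(0)
    V, d_emb, d_hid = 10, 8, 16          # toy vocabulary / embedding / hidden sizes (assumed)

    C  = rng.normal(size=(V, d_emb))     # word embedding matrix
    We = rng.normal(size=(d_hid, d_emb))
    Wh = rng.normal(size=(d_hid, d_hid))
    b1 = np.zeros(d_hid)
    W2 = rng.normal(size=(V, d_hid))
    b2 = np.zeros(V)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def rnn_lm(word_ids):
        """Return the next-word distribution after reading word_ids."""
        h = np.zeros(d_hid)                          # h(0): initial hidden state
        for t in word_ids:
            h = np.tanh(Wh @ h + We @ C[t] + b1)     # h(t) = f(W_h h(t-1) + W_e c_t + b_1)
        return softmax(W2 @ h + b2)                  # y_hat = softmax(W_2 h(t) + b_2)

    print(rnn_lm([1, 4, 2, 7]).round(3))             # e.g. "the students opened their" as ids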

Page 13: machine translation 2 - people.cs.umass.edu

13

Neural Machine Translation (NMT): the sequence-to-sequence model

[Figure: the encoder RNN reads the source sentence "les pauvres sont démunis" and produces an encoding of it; this encoding provides the initial hidden state for the decoder RNN. The decoder RNN is a language model that generates the target sentence conditioned on the encoding, producing "the poor don't have any money <END>" from <START> by taking the argmax at each step.]

Note: this diagram shows test-time behavior: the decoder output is fed in as the next step's input.


Page 15: machine translation 2 - people.cs.umass.edu

15

Training a Neural Machine Translation system

[Figure: the encoder RNN reads the source sentence "les pauvres sont démunis" (from corpus); the decoder RNN is fed the target sentence "<START> the poor don't have any money" (from corpus) and produces predictions ŷ_1, …, ŷ_7. Each step t contributes a loss J_t = negative log probability of the true next word ("the", …, "have", …, "<END>"), and the total loss is J = (1/T) Σ_{t=1}^{T} J_t.]

Seq2seq is optimized as a single system. Backpropagation operates "end to end".

what are the parameters of this model?

Page 16: machine translation 2 - people.cs.umass.edu

16

Training a Neural Machine Translation system

(same training figure as the previous slide)

what are the parameters of this model?

  W_h^enc, W_e^enc, C_enc, W_h^dec, W_e^dec, C_dec, W_out
  (C is the word embedding matrix; a model sketch follows below)
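A minimal PyTorch-style sketch (not the course's actual implementation) of a seq2seq model whose trainable parameters correspond to the groups named above: encoder/decoder embeddings C_enc and C_dec, encoder/decoder RNN weights, and the output projection W_out. All sizes and the random toy batch are assumptions; real systems typically use LSTMs/GRUs and proper data loading.

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        """Minimal sketch; parameter groups match the slide above."""
        def __init__(self, src_vocab, tgt_vocab, d_emb=64, d_hid=128):
            super().__init__()
            self.C_enc = nn.Embedding(src_vocab, d_emb)       # C_enc
            self.C_dec = nn.Embedding(tgt_vocab, d_emb)       # C_dec
            self.encoder = nn.RNN(d_emb, d_hid, batch_first=True)  # W_h^enc, W_e^enc
            self.decoder = nn.RNN(d_emb, d_hid, batch_first=True)  # W_h^dec, W_e^dec
            self.W_out = nn.Linear(d_hid, tgt_vocab)          # W_out

        def forward(self, src_ids, tgt_in_ids):
            _, h_enc = self.encoder(self.C_enc(src_ids))           # encode the source
            dec_out, _ = self.decoder(self.C_dec(tgt_in_ids), h_enc)  # init decoder with encoding
            return self.W_out(dec_out)                              # logits over target vocab

    model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
    src = torch.randint(0, 1000, (1, 4))       # e.g. "les pauvres sont démunis"
    tgt_in = torch.randint(0, 1000, (1, 7))    # "<START> the poor don't have any money"
    tgt_out = torch.randint(0, 1000, (1, 7))   # "the poor don't have any money <END>"
    # sum of negative log probs of the true next words (cross-entropy)
    loss = nn.CrossEntropyLoss()(model(src, tgt_in).transpose(1, 2), tgt_out)
    loss.backward()                             # backpropagation operates end to end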

Page 17: machine translation 2 - people.cs.umass.edu

decoding

• given that we trained a seq2seq model, how do we find the most probable English sentence?

• more concretely, how do we find arg max_e ∏_{i=1}^{L} p(e_i | e_1, …, e_{i−1}, f)?

• can we enumerate all possible English sentences e?

Page 18: machine translation 2 - people.cs.umass.edu

decoding

• given that we trained a seq2seq model, how do we find the most probable English sentence?

• more concretely, how do we find arg max_e ∏_{i=1}^{L} p(e_i | e_1, …, e_{i−1}, f)?

• can we enumerate all possible English sentences e?

• can we use the Viterbi algorithm like we did for HMMs?


Page 20: machine translation 2 - people.cs.umass.edu

decoding

• given that we trained a seq2seq model, how do we find the most probable English sentence?

• easiest option: greedy decoding

Better-than-greedy decoding?

• We showed how to generate (or "decode") the target sentence by taking argmax on each step of the decoder
• This is greedy decoding (take the most probable word on each step); see the sketch below
• Problems?

[Figure: greedy decoding from <START>, taking the argmax at each decoder step to produce "the poor don't have any money <END>"]

issues?
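A minimal sketch of greedy decoding, assuming a hypothetical step(prev_id, state) function that wraps one decoder RNN step and returns next-word log-probabilities plus the new decoder state; the interface and names are illustrative, not the course's code.

    def greedy_decode(step, start_id, end_id, max_len=50):
        """step(prev_id, state) -> (log_probs over vocab, new state), an assumed
        interface around one decoder RNN step. Take the argmax at every step."""
        out, state, prev = [], None, start_id
        for _ in range(max_len):
            log_probs, state = step(prev, state)
            prev = max(range(len(log_probs)), key=log_probs.__getitem__)  # argmax
            if prev == end_id:
                break
            out.append(prev)
        return out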

Page 21: machine translation 2 - people.cs.umass.edu

Beam search

• in greedy decoding, we cannot go back and revise previous decisions!

• fundamental idea of beam search: explore several different hypotheses instead of just a single one

• keep track of the k most probable partial translations at each decoder step instead of just one! (see the sketch below)

Better-than-greedy decoding?

• Greedy decoding has no way to undo decisions!
  • les pauvres sont démunis (the poor don't have any money)
  • → the ____
  • → the poor ____
  • → the poor are ____
• Better option: use beam search (a search algorithm) to explore several hypotheses and select the best one

the beam size k is usually 5-10
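A sketch of beam search under the same assumed step(prev_id, state) interface as the greedy sketch above; details such as when to stop and whether to length-normalize scores vary across real systems, so treat this as one simple variant rather than the standard implementation.

    def beam_search(step, start_id, end_id, k=5, max_len=50):
        """Keep the k most probable partial translations (hypotheses) at each step."""
        beam = [(0.0, [start_id], None)]          # (sum of log-probs, tokens, state)
        finished = []
        for _ in range(max_len):
            candidates = []
            for score, toks, state in beam:
                log_probs, new_state = step(toks[-1], state)
                # expand each hypothesis with every vocabulary item (k * V probabilities per step)
                for w, lp in enumerate(log_probs):
                    candidates.append((score + lp, toks + [w], new_state))
            candidates.sort(key=lambda c: c[0], reverse=True)
            beam = []
            for score, toks, state in candidates:
                if toks[-1] == end_id:
                    finished.append((score, toks))    # hypothesis terminated with <END>
                else:
                    beam.append((score, toks, state))
                if len(beam) == k:
                    break
            if not beam:
                break
        if not finished:
            finished = [(s, t) for s, t, _ in beam]
        # return the highest-scoring hypothesis (no length normalization in this sketch)
        return max(finished, key=lambda c: c[0])[1]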

Page 22: machine translation 2 - people.cs.umass.edu

22

Beam search decoding: example (beam size = 2)

[Figure: starting from <START>, the two most probable first words are "the" (log-prob −1.05) and "a" (−1.39)]

Page 23: machine translation 2 - people.cs.umass.edu

23

Beam search decoding: example (beam size = 2)

[Figure: each of "the" and "a" is expanded with its most probable continuations ("poor", "people", "poor", "person", with cumulative log-probs −1.90, −1.54, −2.3, −3.2); only the two best partial translations are kept]

Page 24: machine translation 2 - people.cs.umass.edu

24

Beam search decoding: example (beam size = 2)

[Figure: the two surviving hypotheses are expanded again (continuations "are", "don't", "person", "but", with cumulative log-probs −2.42, −3.12, −2.13, −3.53); the two best are kept]

Page 25: machine translation 2 - people.cs.umass.edu

25

Beam search decoding: example (beam size = 2)

[Figure: another expansion step (continuations "always", "not", "have", "take", with cumulative log-probs −3.82, −3.32, −2.67, −3.61)]

and so on…

Page 26: machine translation 2 - people.cs.umass.edu

26

Beam search decoding: example

Beam size = 2

2/15/1834

poor

people

poor

person

are

don’t

person

but

always

not

have

take

in

with

any

enough<START>

the

a

Page 27: machine translation 2 - people.cs.umass.edu

27

Beam search decoding: example (beam size = 2)

[Figure: the search continues through "money" / "funds"; the best-scoring complete hypothesis is "the poor don't have any money"]


Page 29: machine translation 2 - people.cs.umass.edu

29

how many probabilities do we need to evaluate at each time step with a beam size of k?

what are the termination conditions for beam search?

does beam search always produce the best translation (i.e., does it always find the argmax?)

Page 30: machine translation 2 - people.cs.umass.edu

30

Advantages of NMT

Compared to SMT, NMT has many advantages:

• Better performance
  • More fluent
  • Better use of context
  • Better use of phrase similarities
• A single neural network to be optimized end-to-end
  • No subcomponents to be individually optimized
• Requires much less human engineering effort
  • No feature engineering
  • Same method for all language pairs

Page 31: machine translation 2 - people.cs.umass.edu

31

Disadvantages of NMT?

Compared to SMT:

• NMT is less interpretable
  • Hard to debug
• NMT is difficult to control
  • For example, can't easily specify rules or guidelines for translation
  • Safety concerns!

Page 32: machine translation 2 - people.cs.umass.edu

32

Sequence-to-sequence: the bottleneck problem

[Figure: the encoder RNN must compress the entire source sentence "les pauvres sont démunis" into a single encoding, which is all the decoder RNN has about the source when generating "the poor don't have any money <END>". This encoding needs to capture all information about the source sentence: an information bottleneck!]

Page 33: machine translation 2 - people.cs.umass.edu

The solution: attention

• Attention mechanisms allow the decoder to focus on a particular part of the source sequence at each time step
• Conceptually similar to alignments

33

Page 34: machine translation 2 - people.cs.umass.edu

34

Sequence-to-sequence with attention

[Figure: decoder, first time step. Take the dot product of the decoder hidden state with each encoder hidden state of the source sentence "les pauvres sont démunis" to get attention scores.]

Page 35: machine translation 2 - people.cs.umass.edu

35

Sequence-to-sequence with attention

[Figure: take a softmax to turn the attention scores into a probability distribution (the attention distribution). On this decoder timestep, we're mostly focusing on the first encoder hidden state ("les").]

Page 36: machine translation 2 - people.cs.umass.edu

36

Sequence-to-sequence with attention

[Figure: use the attention distribution to take a weighted sum of the encoder hidden states. The attention output mostly contains information from the hidden states that received high attention.]

Page 37: machine translation 2 - people.cs.umass.edu

37

Sequence-to-sequence with attention

[Figure: concatenate the attention output with the decoder hidden state, then use it to compute ŷ_1 as before; the first output word is "the".]

Page 38: machine translation 2 - people.cs.umass.edu

38

Sequence-to-sequence with attention

[Figure: decoder, second time step. The same attention computation is repeated with the new decoder hidden state to compute ŷ_2 and produce "poor".]
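A small numpy sketch of one dot-product attention step as described on these slides; the shapes, the random example vectors, and the function names are assumptions for illustration only.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def attention_step(dec_hidden, enc_hiddens):
        """One decoder time step of dot-product attention.
        dec_hidden: (d,), enc_hiddens: (src_len, d)."""
        scores = enc_hiddens @ dec_hidden   # attention scores: dot product with each encoder state
        alpha = softmax(scores)             # attention distribution over source positions
        context = alpha @ enc_hiddens       # weighted sum of encoder hidden states
        # concatenate with the decoder state, then compute the output distribution as before
        return np.concatenate([context, dec_hidden]), alpha

    enc = np.random.randn(4, 8)             # e.g. hidden states for "les pauvres sont démunis"
    out, alpha = attention_step(np.random.randn(8), enc)
    print(alpha.round(2), out.shape)        # alpha shows which source words we attend to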

Page 39: machine translation 2 - people.cs.umass.edu

39

Attention is great

• Attention significantly improves NMT performance
  • It's very useful to allow the decoder to focus on certain parts of the source
• Attention solves the bottleneck problem
  • Attention allows the decoder to look directly at the source; bypass the bottleneck
• Attention helps with the vanishing gradient problem
  • Provides a shortcut to faraway states
• Attention provides some interpretability
  • By inspecting the attention distribution, we can see what the decoder was focusing on
  • We get alignment for free!
  • This is cool because we never explicitly trained an alignment system
  • The network just learned alignment by itself

[Embedded recap from the word alignment slides: examples of many-to-one alignments ("The balance was the territory of the aboriginal people" / "Le reste appartenait aux autochtones") and many-to-many / phrase alignments ("The poor don't have any money" / "Les pauvres sont démunis"); alignment as a vector a_j = i, used in all IBM models (a is a vector of length J that maps indexes j to indexes i, each a_j ∈ {0, 1, …, I}, a_j = 0 means f_j is "spurious"; no one-to-many or many-to-many alignments, but it provides the foundation for phrase-based alignment); the IBM Model 1 generative story (given English sentence e_1, …, e_I: choose a length J for the French sentence, then for each j in 1 to J choose a_j uniformly from 0, 1, …, I and choose f_j by translating e_{a_j}; we want to learn P(f | e)); and applying Model 1: P(f, a | e) (in fact any IBM model) can be used as a translation model or as an alignment model.]

Page 40: machine translation 2 - people.cs.umass.edu

on to evaluation…

40

Page 41: machine translation 2 - people.cs.umass.edu

41

How good is a translation?

Problem: no single right answer

Page 42: machine translation 2 - people.cs.umass.edu

42

Evaluation

• How good is a given machine translation system?
• Many different translations acceptable
• Evaluation metrics
  • Subjective judgments by human evaluators
  • Automatic evaluation metrics
  • Task-based evaluation

Page 43: machine translation 2 - people.cs.umass.edu

43

Automatic Evaluation Metrics

• Goal: computer program that computes quality of translations
• Advantages: low cost, optimizable, consistent
• Basic strategy
  • Given: MT output
  • Given: human reference translation
  • Task: compute similarity between them

Page 44: machine translation 2 - people.cs.umass.edu

44

Precision and Recall of Words
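The worked numbers from this slide are not in the transcript, so here is a generic sketch of word-level precision and recall of an MT output against a single reference; the example sentences are made up.

    from collections import Counter

    def word_prec_recall(hypothesis: str, reference: str):
        hyp, ref = hypothesis.split(), reference.split()
        overlap = sum((Counter(hyp) & Counter(ref)).values())   # clipped word matches
        precision = overlap / len(hyp)     # matching words / length of MT output
        recall = overlap / len(ref)        # matching words / length of reference
        return precision, recall

    print(word_prec_recall("the poor don't have money",
                           "the poor don't have any money"))   # (1.0, 0.833...)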

Page 45: machine translation 2 - people.cs.umass.edu

45

Precision and Recall of Words

Page 46: machine translation 2 - people.cs.umass.edu

46

BLEU: Bilingual Evaluation Understudy

Page 47: machine translation 2 - people.cs.umass.edu

47

Multiple Reference Translations

Page 48: machine translation 2 - people.cs.umass.edu

48

BLEU examples
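The example sentences from this slide are not in the transcript, so here is a from-scratch sketch of sentence-level BLEU: the geometric mean of modified n-gram precisions (n = 1..4) times a brevity penalty based on the closest reference length. Real BLEU is computed at the corpus level and usually with smoothing, so treat this as illustrative only; the example hypothesis and references are made up.

    import math
    from collections import Counter

    def sentence_bleu(hyp, refs, max_n=4):
        hyp = hyp.split()
        refs = [r.split() for r in refs]
        log_prec = 0.0
        for n in range(1, max_n + 1):
            hyp_ngrams = Counter(tuple(hyp[i:i+n]) for i in range(len(hyp) - n + 1))
            max_ref = Counter()
            for r in refs:
                max_ref |= Counter(tuple(r[i:i+n]) for i in range(len(r) - n + 1))
            clipped = sum(min(c, max_ref[g]) for g, c in hyp_ngrams.items())  # modified (clipped) counts
            total = max(sum(hyp_ngrams.values()), 1)
            log_prec += math.log(max(clipped, 1e-9) / total) / max_n          # geometric mean in log space
        ref_len = len(min(refs, key=lambda r: abs(len(r) - len(hyp))))        # closest reference length
        bp = 1.0 if len(hyp) > ref_len else math.exp(1 - ref_len / len(hyp))  # brevity penalty
        return bp * math.exp(log_prec)

    print(sentence_bleu("the poor don't have any money",
                        ["the poor don't have any money", "the poor lack money"]))  # 1.0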

Page 49: machine translation 2 - people.cs.umass.edu

neural MT usually > phrase-based MT!

English-to-Catalan novel translation (from Toral & Way, 2018)

Page 50: machine translation 2 - people.cs.umass.edu

what are some drawbacks of BLEU?

50

Page 51: machine translation 2 - people.cs.umass.edu

what are some drawbacks of BLEU?

• all words/n-grams treated as equally relevant
• operates on a local level
• scores are meaningless (absolute value not informative)
• human translators also score low on BLEU

51

Page 52: machine translation 2 - people.cs.umass.edu

52

Yet automatic metrics such as BLEU correlate with human judgement

Page 53: machine translation 2 - people.cs.umass.edu

exercise!

53