
Neural Machine Translation by Jointly Learning to Align and Translate

Reporter: Fandong Meng

Institute of Computing Technology, Chinese Academy of Sciences

[email protected]

October 11, 2014

To appear in proceedings of NIPS 2014. [Download]


Overview

1 Motivation

2 RNN Encoder-Decoder

3 Learning to Align and Translate
  Decoder: General Description
  Encoder: Bidirectional RNN for Annotation Sequences
  Hidden Unit that Adaptively Remembers and Forgets

4 Experiments

5 Conclusion


Motivation

Framework

Basic RNN encoder-decoder based machine translation.

Fit a parameterized model to maximize the conditional probability of the target sentence $y$ given a source sentence $x$, i.e., $\arg\max_y p(y \mid x)$, using a parallel training corpus.

Compress all the necessary information of a source sentence into a fixed-length vector.

Decode the vector into a variable-length target sentence.


Motivation

Issues

Compressing all the necessary information of a source sentence into a fixed-length vector may make it difficult for the neural network to cope with long sentences, especially those that are longer than the sentences in the training corpus.

To address this issue, this paper introduces an extension to the encoder-decoder model which learns to align and translate jointly.


Motivation

Differences from the basic encoder-decoder

It encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding.

To generate a word in a translation, the model searches for a set of positions in a source sentence where the most relevant information is concentrated.

The model then predicts a target word based on the context vectors associated with these source positions and all the previously generated target words.


RNN

Basic RNN framework

Figure: Basic RNN framework.


RNN Encoder-Decoder: Encoder

In the encoder-decoder framework, an encoder reads the input sentence, a sequence of vectors $x = (x_1, \ldots, x_{T_x})$, into a vector $c$,

Figure: Basic encoder-decoder framework.

where $h_t = f(x_t, h_{t-1})$ and $c = q(h_1, \ldots, h_{T_x})$; $f$ and $q$ are some nonlinear functions.
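A minimal sketch of such an encoder, assuming $f$ is a plain tanh recurrence and $q$ simply returns the last hidden state (one common choice, not the only one); the weight names W_x and W_h are illustrative, not from the paper:

```python
# Minimal sketch: plain RNN encoder that compresses x_1..x_Tx into a vector c,
# assuming f is a tanh recurrence and q returns the last hidden state.
import numpy as np

def encode(xs, W_x, W_h, h0):
    """xs: list of source word vectors x_1..x_Tx; returns all hidden states and c."""
    h, hs = h0, []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)   # h_t = f(x_t, h_{t-1})
        hs.append(h)
    c = hs[-1]                           # c = q(h_1, ..., h_Tx): here, the last state
    return hs, c
```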


RNN Encoder-Decoder: Decoder

The decoder is often trained to predict the next word $y_t$ given the context vector $c$ and all the previously predicted words $\{y_1, \ldots, y_{t-1}\}$. In other words, the decoder defines a probability over the translation $y$ by decomposing the joint probability into the ordered conditionals:

$$p(y) = \prod_{t=1}^{T} p(y_t \mid \{y_1, \ldots, y_{t-1}\}, c) \qquad (1)$$

where $y = \{y_1, \ldots, y_{T_y}\}$. With an RNN, each conditional probability is modeled as

$$p(y_t \mid \{y_1, \ldots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c) \qquad (2)$$

where $g$ is a nonlinear, potentially multi-layered, function that outputs the probability of $y_t$, and $s_t$ is the hidden state of the RNN.
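A minimal sketch of one decoding step with the fixed context vector $c$; the tanh state update and softmax output layer are simplifying assumptions, and all weight names are illustrative:

```python
# Minimal sketch: one decoding step with a *fixed* context vector c,
# assuming a tanh recurrence for the state and a softmax output layer g.
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def decode_step(y_prev, s_prev, c, W_y, W_s, W_c, W_out):
    s = np.tanh(W_y @ y_prev + W_s @ s_prev + W_c @ c)  # s_t from y_{t-1}, s_{t-1}, c
    p = softmax(W_out @ s)                               # p(y_t | y_1..y_{t-1}, c)
    return p, s
```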


RNN Encoder-Decoder

The context vector $c$ is a single fixed vector for the whole target sentence!


Learning to Align and Translate

Two parts are introduced in detail:

Decoder: General Description

Encoder: Bidirectional RNN for Annotation Sequences


Decoder: General Description

Derived in steps.

The conditional probability is

$$p(y_i \mid \{y_1, \ldots, y_{i-1}\}, x) = g(y_{i-1}, s_i, c_i) \qquad (3)$$

where $s_i$ is an RNN hidden state for time $i$, computed by

$$s_i = f(s_{i-1}, y_{i-1}, c_i). \qquad (4)$$

Here the probability is conditioned on a distinct context vector $c_i$ for each target word $y_i$, which is different from the basic RNN decoder.


Decoder: General Description

Derived in steps.

In $s_i = f(s_{i-1}, y_{i-1}, c_i)$, the context vector $c_i$ depends on a sequence of annotations $(h_1, \ldots, h_{T_x})$ to which an encoder maps the input sentence.

Each annotation $h_i$ contains information about the whole input sequence with a strong focus on the parts surrounding the $i$-th word of the input sequence.

How to compute $c_i$?

$$c_i = \sum_{j=1}^{T_x} a_{ij} h_j \qquad (5)$$


Decoder: General Description

Derived in steps.

In $c_i = \sum_{j=1}^{T_x} a_{ij} h_j$, the weight $a_{ij}$ of each annotation $h_j$ is computed by

$$a_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} \qquad (6)$$

where $e_{ij} = a(s_{i-1}, h_j)$; $a$ is the so-called alignment model, which can be jointly trained with all the other components of the proposed system.
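A minimal sketch of how the weights $a_{ij}$ and the context vector $c_i$ could be computed; `score` stands in for the alignment model $a(s_{i-1}, h_j)$, which is defined on a later slide, and all names are illustrative:

```python
# Minimal sketch: softmax over alignment scores and the resulting context
# vector for one decoding step i (illustrative, not the paper's code).
import numpy as np

def context_vector(s_prev, hs, score):
    """hs: list of annotations h_1..h_Tx; score: the alignment model a(s_{i-1}, h_j)."""
    e = np.array([score(s_prev, h) for h in hs])     # e_ij = a(s_{i-1}, h_j)
    a = np.exp(e - e.max())
    a /= a.sum()                                     # a_ij = exp(e_ij) / sum_k exp(e_ik)
    c = sum(a_ij * h_j for a_ij, h_j in zip(a, hs))  # c_i = sum_j a_ij h_j
    return c, a
```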


Decoder: General Description

Derived in steps.

The probability $a_{ij}$, or its associated energy $e_{ij}$, reflects the importance of the annotation $h_j$ with respect to the previous hidden state $s_{i-1}$ in deciding the next hidden state $s_i$ and generating $y_i$.

Figure: The graphical illustration of the proposed model trying to generate the $t$-th target word $y_t$ given a source sentence $(x_1, \ldots, x_T)$.

Decoder: General Description

Derivation from top to bottom.

$$p(y_i \mid \{y_1, \ldots, y_{i-1}\}, x) = g(y_{i-1}, s_i, c_i)$$

$$s_i = f(s_{i-1}, y_{i-1}, c_i)$$

$$c_i = \sum_{j=1}^{T_x} a_{ij} h_j$$

$$a_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$$

$$e_{ij} = a(s_{i-1}, h_j)$$

What is left?

$a(s_{i-1}, h_j)$

$h_j$


Decoder: General Description

Alignment model

Use a multilayer perceptron with a single hidden layer.

$$a(s_{i-1}, h_j) = v_a^\top \tanh(W_a s_{i-1} + U_a h_j) \qquad (7)$$

where $v_a$ is a weight vector and $W_a$, $U_a$ are weight matrices; they can be jointly trained with all the other components of the proposed system.
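A minimal sketch of this alignment model as a small feedforward scorer; shapes and names are illustrative, and the returned function can be plugged into the attention sketch shown earlier:

```python
# Minimal sketch of the alignment model: v_a is a weight vector,
# W_a and U_a are weight matrices (shapes/initialization left unspecified).
import numpy as np

def make_score(v_a, W_a, U_a):
    def score(s_prev, h_j):
        # e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j)
        return float(v_a @ np.tanh(W_a @ s_prev + U_a @ h_j))
    return score
```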


Decoder: General Description

Derivation from top to bottom.

$$p(y_i \mid \{y_1, \ldots, y_{i-1}\}, x) = g(y_{i-1}, s_i, c_i)$$

$$s_i = f(s_{i-1}, y_{i-1}, c_i)$$

$$c_i = \sum_{j=1}^{T_x} a_{ij} h_j$$

$$a_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$$

$$e_{ij} = a(s_{i-1}, h_j)$$

$$a(s_{i-1}, h_j) = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$$

What is left now?

$h_j$


Encoder: Bidirectional RNN for Annotation Sequences

A BiRNN consists of forward and backward RNNs.

The forward RNN $\overrightarrow{f}$ reads the input sequence as it is ordered (from $x_1$ to $x_{T_x}$) and calculates a sequence of forward hidden states $(\overrightarrow{h}_1, \ldots, \overrightarrow{h}_{T_x})$.

The backward RNN $\overleftarrow{f}$ reads the input sequence in the reverse order (from $x_{T_x}$ to $x_1$) and calculates a sequence of backward hidden states $(\overleftarrow{h}_1, \ldots, \overleftarrow{h}_{T_x})$.

Therefore, $h_j = [\overrightarrow{h}_j^\top; \overleftarrow{h}_j^\top]^\top$.
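A minimal sketch of the annotation computation, reusing a plain tanh recurrence for both directions (the paper's encoder uses gated units); parameter names are illustrative:

```python
# Minimal sketch: annotations from a bidirectional RNN; each h_j concatenates
# the forward and backward hidden states at source position j.
import numpy as np

def rnn_states(xs, W_x, W_h, h0):
    h, states = h0, []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return states

def annotations(xs, fwd_params, bwd_params):
    h_fwd = rnn_states(xs, *fwd_params)               # reads x_1 .. x_Tx
    h_bwd = rnn_states(xs[::-1], *bwd_params)[::-1]   # reads x_Tx .. x_1, re-aligned
    # h_j = concatenation of forward and backward states at position j
    return [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]
```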


Encoder: Bidirectional RNN for Annotation Sequences

Take a look at the annotation h again.

Figure: The graphical illustration of the proposed model trying to generate the $t$-th target word $y_t$ given a source sentence $(x_1, \ldots, x_T)$.


Hidden Unit that Adaptively Remembers and Forgets

A new type of hidden unit.

In $s_i = f(s_{i-1}, y_{i-1}, c_i)$, $f$ may be as simple as an element-wise logistic sigmoid function and as complex as a long short-term memory (LSTM) unit. This paper uses a new type of hidden unit (Cho et al., 2014) that has been motivated by the LSTM unit but is much simpler to compute and implement. Let us describe how the activation of the $j$-th hidden unit is computed:

The hidden state $s_i$ of the decoder given the annotations from the encoder is computed by

$$s_i = (1 - z_i) \circ s_{i-1} + z_i \circ \tilde{s}_i \qquad (8)$$

where

$$\tilde{s}_i = \tanh(W E y_{i-1} + U[r_i \circ s_{i-1}] + C c_i)$$
$$z_i = \mathrm{sigmoid}(W_z E y_{i-1} + U_z s_{i-1} + C_z c_i)$$
$$r_i = \mathrm{sigmoid}(W_r E y_{i-1} + U_r s_{i-1} + C_r c_i)$$
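A minimal sketch of this gated update in plain numpy; `Ey_prev` stands for the embedded previous target word $E y_{i-1}$, and all weight shapes and initialization are left unspecified:

```python
# Minimal sketch of the gated hidden unit update above (Cho et al., 2014 style).
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gated_step(s_prev, Ey_prev, c_i, W, U, C, W_z, U_z, C_z, W_r, U_r, C_r):
    z = sigmoid(W_z @ Ey_prev + U_z @ s_prev + C_z @ c_i)        # update gate z_i
    r = sigmoid(W_r @ Ey_prev + U_r @ s_prev + C_r @ c_i)        # reset gate r_i
    s_tilde = np.tanh(W @ Ey_prev + U @ (r * s_prev) + C @ c_i)  # candidate state
    return (1.0 - z) * s_prev + z * s_tilde                      # s_i
```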


Hidden Unit that Adaptively Remembers and Forgets

A new type of hidden unit.

Look at the following illustration; we just take $h$ as $s$ and $x$ as $y$.

$$s_i = f(s_{i-1}, y_{i-1}, c_i) = (1 - z_i) \circ s_{i-1} + z_i \circ \tilde{s}_i$$
$$\tilde{s}_i = \tanh(W E y_{i-1} + U[r_i \circ s_{i-1}] + C c_i)$$
$$z_i = \mathrm{sigmoid}(W_z E y_{i-1} + U_z s_{i-1} + C_z c_i)$$
$$r_i = \mathrm{sigmoid}(W_r E y_{i-1} + U_r s_{i-1} + C_r c_i)$$

Figure: An illustration of the proposed hidden activation function. The update gate $z$ selects whether the hidden state is to be updated with a new hidden state $\tilde{h}$. The reset gate $r$ decides whether the previous hidden state is ignored.

Experiments

Setup

Training Set

WMT 2014 English-French parallel corpora: http://www.statmt.org/wmt14/translation-task.html

Development Set

news-test-2012 and news-test-2013

Test Set

news-test-2014

Other Details

data selection
30,000 most frequent words
map other words to [UNK]
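An illustrative sketch of the vocabulary step (not the authors' preprocessing code): keep the 30,000 most frequent words of the training data and map every other word to [UNK]:

```python
# Illustrative sketch: build a 30k-word vocabulary and map the rest to [UNK].
from collections import Counter

def build_vocab(tokenized_sentences, size=30000):
    counts = Counter(w for sent in tokenized_sentences for w in sent)
    return {w for w, _ in counts.most_common(size)}

def map_unk(sentence, vocab):
    return [w if w in vocab else "[UNK]" for w in sentence]
```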


Experiments

Results with respect to the lengths of the sentences.

Figure: The BLEU scores of the generated translations on the test set with respect to the lengths of the sentences. The results are on the full test set, which includes sentences containing words unknown to the models.


Experiments

Main Results

Figure: BLEU scores of the trained models computed on the test set. The second and third columns show, respectively, the scores on all the sentences and on the sentences without any unknown word in themselves or in the reference translations. Note that RNNsearch-50★ was trained much longer, until the performance on the development set stopped improving. We disallowed the models to generate [UNK] tokens when only the sentences having no unknown words were evaluated (last column).

Conclusion

Extend the basic encoder-decoder by letting the model search for a set of input words, or their annotations computed by an encoder, when generating each target word. This novel architecture makes it cope better with long-sentence translation.

One of the challenges left for the future is to better handle unknown or rare words.


The End! Thanks!
