TRANSCRIPT
Natural Language Processing with Deep Learning
CS224N/Ling284
Lecture 9: Recap and Fancy Recurrent Neural Networks for Machine Translation
Christopher Manning and Richard Socher
Overview
• Recap of most important concepts & equations
• Machine translation
• Fancy RNN models tackling MT:
  • Gated Recurrent Units by Cho et al. (2014)
  • Long Short-Term Memories by Hochreiter and Schmidhuber (1997)
Machine Translation

• Methods are statistical
• Use parallel corpora
  • European Parliament
  • First parallel corpus: the Rosetta Stone
• Traditional systems are very complex
Picture from Wikipedia
Current statistical machine translation systems

• Source language f, e.g. French
• Target language e, e.g. English
• Probabilistic formulation (using Bayes rule)
  • Translation model p(f|e) trained on a parallel corpus
  • Language model p(e) trained on an English-only corpus (lots of it, free!)

[Pipeline: French → Translation Model p(f|e) → Pieces of English → Language Model p(e) → Decoder: argmax p(f|e) p(e) → Proper English]
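Written out, the formulation on this slide is the standard noisy-channel decomposition (a minimal restatement in the slide's own notation):

```latex
% Noisy-channel view of statistical MT: choose the English sentence e
% that best explains the observed French sentence f. By Bayes rule,
% p(e|f) = p(f|e) p(e) / p(f), and p(f) is constant over e, so it
% drops out of the argmax.
\hat{e} = \operatorname*{argmax}_{e} \, p(e \mid f)
        = \operatorname*{argmax}_{e} \, p(f \mid e)\, p(e)
```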
Step 1: Alignment

Goal: know which words or phrases in the source language would translate to what words or phrases in the target language? → Hard already!

Alignment examples from Chris Manning / CS224n
Statistical MT

Pioneered at IBM in the early 1990s. Let's make a probabilistic model of translation P(e | f). Suppose f is "de rien":
  P(you're welcome | de rien) = 0.45
  P(nothing | de rien) = 0.13
  P(piddling | de rien) = 0.01
  P(underpants | de rien) = 0.000000001
Statistical Solution

• Parallel Texts
  – Rosetta Stone (Hieroglyphs / Demotic / Greek)
Statistical Solution

• Parallel Texts
  – Instruction Manuals
  – Hong Kong/Macao Legislation
  – Canadian Parliament Hansards
  – United Nations Reports
  – Official Journal of the European Communities
  – Translated news

Hmm, every time one sees "banco", the translation is "bank" or "bench" … If it's "banco de…", it always becomes "bank", never "bench"…
A Division of Labor

[Diagram: Spanish → Broken English → English]
Spanish/English Bilingual Text → Statistical Analysis → Translation Model P(f|e) ("Fidelity")
English Text → Statistical Analysis → Language Model P(e) ("Fluency")
Decoding algorithm: argmax_e P(f|e) · P(e)

Example: "Que hambre tengo yo" → translation model proposes "What hunger have I", "Hungry I am so", "I am so hungry", "Have I that hunger", … → language model picks "I am so hungry"
Alignments

We can factor the translation model P(f | e) by identifying alignments (correspondences) between words in f and words in e.
"Japan shaken by two new quakes" ↔ "Le Japon secoué par deux nouveaux séismes"

[Alignment figure: each English word links to its French counterpart; "Le" is a "spurious" word with no English source]
Alignments: harder

"And the program has been implemented" ↔ "Le programme a été mis en application"

[Alignment figure: a "zero fertility" word is not translated; "implemented" → "mis en application" is a one-to-many alignment]
Alignments: harder

"The balance was the territory of the aboriginal people" ↔ "Le reste appartenait aux autochtones"

[Alignment figure: many-to-one alignments]
Alignments: hardest

"The poor don't have any money" ↔ "Les pauvres sont démunis"

[Alignment figure: many-to-many alignment, i.e. phrase alignment]
Alignment as a vector

English (i): Mary(1) did(2) not(3) slap(4) the(5) green(6) witch(7)
Spanish (j): Maria(1) no(2) daba(3) una(4) bofetada(5) a(6) la(7) bruja(8) verde(9)
Alignment vector: a = (1, 3, 4, 4, 4, 0, 5, 7, 6), where a_j = i

• used in all IBM models
• a is a vector of length J
• maps indexes j to indexes i
• each a_j ∈ {0, 1, …, I}
• a_j = 0 means f_j is "spurious"
• no one-to-many alignments
• no many-to-many alignments
• but provides foundation for phrase-based alignment
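To make the data structure concrete, here is a short illustrative Python rendering of the slide's alignment vector (the sentences and indices come from the slide; the variable names are mine):

```python
# Alignment vector from the slide: position j in the Spanish sentence maps
# to position i in the English sentence; i = 0 means "spurious" (generated
# from NULL, with no English source word).
english = ["NULL", "Mary", "did", "not", "slap", "the", "green", "witch"]  # e_0 = NULL
spanish = ["Maria", "no", "daba", "una", "bofetada", "a", "la", "bruja", "verde"]
a = [1, 3, 4, 4, 4, 0, 5, 7, 6]  # a[j-1] = i, one entry per Spanish word

for j, (f_word, i) in enumerate(zip(spanish, a), start=1):
    print(f"f_{j} = {f_word!r:12} -> e_{i} = {english[i]!r}")
```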
IBM Model 1 generative story

Given English sentence e_1, e_2, …, e_I:
• Choose length J for the French sentence
• For each j in 1 to J:
  – Choose a_j uniformly from 0, 1, …, I
  – Choose f_j by translating e_{a_j}

Example: "And the program has been implemented" → "Le programme a été mis en application" with a = (2, 3, 4, 5, 6, 6, 6)

We want to learn how to do this. Want: P(f|e)
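In equation form, the generative story above corresponds to the textbook Model 1 likelihood (this is the standard formulation, not copied from the slide; ε is the constant probability of choosing length J, and t(f|e) is the lexical translation table):

```latex
% Joint probability of French sentence f and alignment a given English e:
% each a_j is uniform over the I+1 choices (including NULL), and each f_j
% is generated from e_{a_j} via the translation table t.
P(f, a \mid e) = \frac{\epsilon}{(I+1)^J} \prod_{j=1}^{J} t\big(f_j \mid e_{a_j}\big)

% Summing out the alignments (each a_j is chosen independently):
P(f \mid e) = \frac{\epsilon}{(I+1)^J} \prod_{j=1}^{J} \sum_{i=0}^{I} t\big(f_j \mid e_i\big)
```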
IBM Model 1 parameters

[Figure: "And the program has been implemented" aligned to "Le programme a été mis en application", a_j = (2, 3, 4, 5, 6, 6, 6)]

Applying Model 1*

• As a translation model
• As an alignment model
P(f, a | e) can be used as a translation model or an alignment model.
* Actually, any P(f, a | e), e.g., any IBM model
Really hard :/
Step 1: Alignment
• We could spend an entire lecture on alignment models
• Not only single words but could use phrases, syntax
• Then consider reordering of translated phrases

Example from Philipp Koehn
Translation Process

• Task: translate this sentence from German into English
  er geht ja nicht nach hause
• Pick phrase in input, translate
  er geht ja nicht nach hause → he does not go home
After many steps

Each phrase in the source language has many possible translations, resulting in a large search space:
Translation Options

er geht ja nicht nach hause

[Figure: translation-option lattice; every word and phrase has several candidates, e.g. er → he / it; geht → goes / go / is / are; ja → yes / , of course; nicht → not / do not / does not / is not; nach → after / to / according to / in; hause → house / home / chamber / at home; plus multi-word options such as nach hause → home / return home]

• Many translation options to choose from
  – in Europarl phrase table: 2727 matching phrase pairs for this sentence
  – by pruning to the top 20 per phrase, 202 translation options remain
Decode: Search for best of many hypotheses

Hard search problem that also includes the language model.

Decoding: Find Best Path

er geht ja nicht nach hause

[Figure: decoding search graph over partial hypotheses (he / it / are / goes / does not / yes / go / to / home); backtrack from highest scoring complete hypothesis]
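To make the search concrete, here is a toy beam-search sketch in Python. It is not the actual phrase-based decoder from Koehn's slides: the phrase table, scores, and the stand-in language model below are invented for illustration, and real decoders also handle reordering, coverage bitmaps, and future-cost estimates.

```python
# Toy beam-search decoder (illustrative sketch only; all scores are made up).

phrase_table = {  # source phrase -> [(translation, translation-model log score)]
    ("er",): [("he", -0.1), ("it", -0.7)],
    ("geht",): [("goes", -0.3), ("go", -0.9)],
    ("geht", "ja", "nicht"): [("does not go", -0.5), ("is not going", -1.2)],
    ("ja", "nicht"): [("does not", -0.4), ("not", -0.6)],
    ("nach", "hause"): [("home", -0.2), ("to house", -1.5)],
}

def lm_score(words):
    """Stand-in language model: rewards a few plausible bigrams."""
    good = {("he", "does"), ("does", "not"), ("not", "go"), ("go", "home"),
            ("he", "goes")}
    return sum(-0.1 if b in good else -0.5 for b in zip(words, words[1:]))

def beam_decode(source, beam_size=3):
    # Hypothesis: (source words covered so far, output words, TM score).
    # Simplification: source phrases are consumed strictly left to right.
    beam = [(0, (), 0.0)]
    while any(cov < len(source) for cov, _, _ in beam):
        expanded = []
        for cov, out, score in beam:
            if cov == len(source):          # complete hypothesis: keep it
                expanded.append((cov, out, score))
                continue
            for plen in (1, 2, 3):          # try source phrases of length 1-3
                src = tuple(source[cov:cov + plen])
                for trans, tm in phrase_table.get(src, []):
                    expanded.append((cov + plen,
                                     out + tuple(trans.split()),
                                     score + tm))
        if not expanded:                    # dead end (cannot happen here)
            break
        # Prune: keep only the best few hypotheses by TM + LM score.
        expanded.sort(key=lambda h: h[2] + lm_score(h[1]), reverse=True)
        beam = expanded[:beam_size]
    return " ".join(max(beam, key=lambda h: h[2] + lm_score(h[1]))[1])

print(beam_decode("er geht ja nicht nach hause".split()))
# -> "he does not go home" with this toy table
```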
Traditional MT

• Skipped hundreds of important details
• A lot of human feature engineering
• Very complex systems
• Many different, independent machine learning problems
Deep learning to the rescue! …?

Maybe we could translate directly with an RNN?
[Figure: encoder RNN reads x1, x2, x3 ("Echt dicke Kiste") into hidden states h1, h2, h3 with shared weights W; a decoder then emits y1, y2 ("Awesome sauce")]

This needs to capture the entire phrase!
MT with RNNs – Simplest Model

Encoder: (equation in the sketch below)
Decoder: (equation in the sketch below)

Minimize cross-entropy error for all target words conditioned on source words.
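A sketch of the missing encoder/decoder equations, written in the notation of the lecture's earlier RNN slides (the matrix names W^{(hh)}, W^{(hx)}, W^{(S)} are assumed, and the decoder here is the bare "simplest model" before the extensions below):

```latex
% Encoder: an ordinary RNN over the source words x_t
h_t = \phi(h_{t-1}, x_t) = f\big(W^{(hh)} h_{t-1} + W^{(hx)} x_t\big)

% Decoder: continue the recurrence from the final encoder state and
% predict each target word with a softmax
h_t = \phi(h_{t-1}) = f\big(W^{(hh)} h_{t-1}\big), \qquad
y_t = \mathrm{softmax}\big(W^{(S)} h_t\big)

% Objective: cross entropy over target words given the source sentence
\max_{\theta} \; \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}\big(y^{(n)} \mid x^{(n)}\big)
```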
It's not quite that simple ;)
RNN Translation Model Extensions

1. Train different RNN weights for encoding and decoding

[Same encoder-decoder figure as above: "Echt dicke Kiste" → "Awesome sauce"]
RNN Translation Model Extensions

Notation: each input of φ has its own linear transformation matrix. Simple:

2. Compute every hidden state in the decoder from:
• Previous hidden state (standard)
• Last hidden vector of the encoder, c = h_T
• Previous predicted output word y_{t−1}
2 RNN Encoder–Decoder

2.1 Preliminary: Recurrent Neural Networks
A recurrent neural network (RNN) is a neural network that consists of a hidden state h and an optional output y which operates on a variable-length sequence x = (x_1, …, x_T). At each time step t, the hidden state h_⟨t⟩ of the RNN is updated by

    h_{\langle t \rangle} = f\big(h_{\langle t-1 \rangle}, x_t\big), \qquad (1)

where f is a non-linear activation function. f may be as simple as an element-wise logistic sigmoid function and as complex as a long short-term memory (LSTM) unit (Hochreiter and Schmidhuber, 1997).

An RNN can learn a probability distribution over a sequence by being trained to predict the next symbol in a sequence. In that case, the output at each timestep t is the conditional distribution p(x_t | x_{t−1}, …, x_1). For example, a multinomial distribution (1-of-K coding) can be output using a softmax activation function

    p(x_{t,j} = 1 \mid x_{t-1}, \ldots, x_1) = \frac{\exp\big(w_j h_{\langle t \rangle}\big)}{\sum_{j'=1}^{K} \exp\big(w_{j'} h_{\langle t \rangle}\big)}, \qquad (2)

for all possible symbols j = 1, …, K, where w_j are the rows of a weight matrix W. By combining these probabilities, we can compute the probability of the sequence x using

    p(x) = \prod_{t=1}^{T} p(x_t \mid x_{t-1}, \ldots, x_1). \qquad (3)

From this learned distribution, it is straightforward to sample a new sequence by iteratively sampling a symbol at each time step.

2.2 RNN Encoder–Decoder
In this paper, we propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. p(y_1, …, y_{T′} | x_1, …, x_T), where one should note that the input and output sequence lengths T and T′ may differ.

[Figure 1: An illustration of the proposed RNN Encoder–Decoder.]

The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes according to Eq. (1). After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary c of the whole input sequence.

The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol y_t given the hidden state h_⟨t⟩. However, unlike the RNN described in Sec. 2.1, both y_t and h_⟨t⟩ are also conditioned on y_{t−1} and on the summary c of the input sequence. Hence, the hidden state of the decoder at time t is computed by

    h_{\langle t \rangle} = f\big(h_{\langle t-1 \rangle}, y_{t-1}, c\big),

and similarly, the conditional distribution of the next symbol is

    P(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1, c) = g\big(h_{\langle t \rangle}, y_{t-1}, c\big),

for given activation functions f and g (the latter must produce valid probabilities, e.g. with a softmax).

See Fig. 1 for a graphical depiction of the proposed model architecture.

The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood

    \max_{\theta} \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}(y_n \mid x_n), \qquad (4)
Cho et al. 2014
Different picture, same idea

[Figure (Kyunghyun Cho et al. 2014): the RNN Encoder–Decoder. Encoder: 1-of-K-coded source words w_i → continuous-space word representations s_i → recurrent states h_i. Decoder: recurrent states z_i → word probabilities p_i → word samples u_i.
e = (Economic, growth, has, slowed, down, in, recent, years, .)
f = (La, croissance, économique, s'est, ralentie, ces, dernières, années, .)]
RNN Translation Model Extensions

3. Train stacked/deep RNNs with multiple layers
4. Potentially train a bidirectional encoder
5. Train on the input sequence in reverse order for a simpler optimization problem: instead of ABC → XY, train with CBA → XY
Going Deep

    \overrightarrow{h}_t^{(i)} = f\big(\overrightarrow{W}^{(i)} h_t^{(i-1)} + \overrightarrow{V}^{(i)} \overrightarrow{h}_{t-1}^{(i)} + \overrightarrow{b}^{(i)}\big)
    \overleftarrow{h}_t^{(i)} = f\big(\overleftarrow{W}^{(i)} h_t^{(i-1)} + \overleftarrow{V}^{(i)} \overleftarrow{h}_{t+1}^{(i)} + \overleftarrow{b}^{(i)}\big)
    y_t = g\big(U\big[\overrightarrow{h}_t^{(L)}; \overleftarrow{h}_t^{(L)}\big] + c\big)

[Figure: stacked recurrent layers x → h^(1) → h^(2) → h^(3) → y] Each memory layer passes an intermediate sequential representation to the next.
6. Main Improvement: Better Units

• More complex hidden unit computation in recurrence!
• Gated Recurrent Units (GRU) introduced by Cho et al. 2014 (see reading list)
• Main ideas:
  • keep around memories to capture long distance dependencies
  • allow error messages to flow at different strengths depending on the inputs
GRUs

• Standard RNN computes the hidden layer at the next time step directly: h_t = f(W^{(hh)} h_{t−1} + W^{(hx)} x_t)
• GRU first computes an update gate (another layer) based on the current input word vector and hidden state
• Compute the reset gate similarly but with different weights
GRUs

• Update gate
• Reset gate
• New memory content: if the reset gate unit is ~0, then this ignores previous memory and only stores the new word information
• Final memory at time step combines current and previous time steps (see the equations right after this list)
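The equations these bullets refer to, in standard form (they match Cho et al.'s Eqs. (5)–(8), which are quoted in full later in this lecture):

```latex
z_t = \sigma\big(W^{(z)} x_t + U^{(z)} h_{t-1}\big)          % update gate
r_t = \sigma\big(W^{(r)} x_t + U^{(r)} h_{t-1}\big)          % reset gate
\tilde{h}_t = \tanh\big(W x_t + r_t \circ U h_{t-1}\big)     % new memory content
h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t        % final memory
```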
Attempt at a clean illustration

[Figure: GRU dataflow across time steps t−1 and t: the input x_t feeds the reset gate r_t and update gate z_t; the reset gate shapes the candidate memory h̃_t; the update gate mixes h̃_t with h_{t−1} into the final memory h_t. Labels: Input, Reset gate, Update gate, Memory (reset), Final memory]
GRU intuition

• If reset is close to 0, ignore previous hidden state → allows the model to drop information that is irrelevant in the future
• Update gate z controls how much of the past state should matter now
• If z is close to 1, then we can copy information in that unit through many time steps! Less vanishing gradient!
• Units with short-term dependencies often have very active reset gates
GRU intuition

• Units with long-term dependencies have active update gates z
• Illustration:
• Derivative of ? → the rest is the same chain rule, but implement with modularization or automatic differentiation (see the sketch below)
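As a concrete example of the "implement with modularization" point, here is a minimal numpy sketch of one GRU step following the equations above (parameter names and shapes are my own, not from any library):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU time step; x_t: (d_in,), h_prev: (d_h,). Follows the
    convention h_t = z * h_{t-1} + (1 - z) * h~ used in this lecture."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev)               # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev)               # reset gate
    h_tilde = np.tanh(p["W"] @ x_t + p["U"] @ (r * h_prev))     # new memory content
    return z * h_prev + (1.0 - z) * h_tilde                     # final memory

# Tiny usage example with random parameters (illustrative sizes)
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
p = {name: rng.normal(scale=0.1, size=(d_h, d_in if name.startswith("W") else d_h))
     for name in ("Wz", "Wr", "W", "Uz", "Ur", "U")}
h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):   # run over a length-5 input sequence
    h = gru_step(x_t, h, p)
print(h)   # final hidden state, shape (3,)
```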
where θ is the set of the model parameters and each (x_n, y_n) is an (input sequence, output sequence) pair from the training set. In our case, as the output of the decoder, starting from the input, is differentiable, we can use a gradient-based algorithm to estimate the model parameters.

Once the RNN Encoder–Decoder is trained, the model can be used in two ways. One way is to use the model to generate a target sequence given an input sequence. On the other hand, the model can be used to score a given pair of input and output sequences, where the score is simply a probability p_θ(y | x) from Eqs. (3) and (4).

2.3 Hidden Unit that Adaptively Remembers and Forgets
In addition to a novel model architecture, we also propose a new type of hidden unit (f in Eq. (1)) that has been motivated by the LSTM unit but is much simpler to compute and implement.[1] Fig. 2 shows the graphical depiction of the proposed hidden unit.

Let us describe how the activation of the j-th hidden unit is computed. First, the reset gate r_j is computed by

    r_j = \sigma\big([W_r x]_j + [U_r h_{\langle t-1 \rangle}]_j\big), \qquad (5)

where σ is the logistic sigmoid function, and [.]_j denotes the j-th element of a vector. x and h_{t−1} are the input and the previous hidden state, respectively. W_r and U_r are weight matrices which are learned.

Similarly, the update gate z_j is computed by

    z_j = \sigma\big([W_z x]_j + [U_z h_{\langle t-1 \rangle}]_j\big). \qquad (6)

The actual activation of the proposed unit h_j is then computed by

    h_j^{\langle t \rangle} = z_j h_j^{\langle t-1 \rangle} + (1 - z_j) \tilde{h}_j^{\langle t \rangle}, \qquad (7)

where

    \tilde{h}_j^{\langle t \rangle} = \phi\big([W x]_j + [U (r \odot h_{\langle t-1 \rangle})]_j\big). \qquad (8)

In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is found to be irrelevant later in the future, thus allowing a more compact representation.

On the other hand, the update gate controls how much information from the previous hidden state will carry over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember long-term information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit (Bengio et al., 2013).

As each hidden unit has separate reset and update gates, each hidden unit will learn to capture dependencies over different time scales. Those units that learn to capture short-term dependencies will tend to have reset gates that are frequently active, but those that capture longer-term dependencies will have update gates that are mostly active.

In our preliminary experiments, we found that it is crucial to use this new unit with gating units. We were not able to get meaningful result with an oft-used tanh unit without any gating.

[1] The LSTM unit, which has shown impressive results in several applications such as speech recognition, has a memory cell and four gating units that adaptively control the information flow inside the unit, compared to only two gating units in the proposed hidden unit. For details on LSTM networks, see, e.g., (Graves, 2012).

[Figure 2: An illustration of the proposed hidden activation function. The update gate z selects whether the hidden state is to be updated with a new hidden state h̃. The reset gate r decides whether the previous hidden state is ignored. See Eqs. (5)–(8) for the detailed equations of r, z, h and h̃.]

3 Statistical Machine Translation
In a commonly used statistical machine translation system (SMT), the goal of the system (decoder, specifically) is to find a translation f given a source sentence e, which maximizes

    p(f \mid e) \propto p(e \mid f)\, p(f),

where the first term on the right hand side is called the translation model and the latter the language model (see, e.g., (Koehn, 2005)). In practice, however, most SMT systems model log p(f | e) as a log-linear model with additional features and corre-
Long Short-Term Memories (LSTMs)

• We can make the units even more complex
• Allow each time step to modify:
  • Input gate (current cell matters)
  • Forget (gate 0, forget past)
  • Output (how much cell is exposed)
  • New memory cell
• Final memory cell:
• Final hidden state:
(see the equations after this list)
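Spelled out, the standard LSTM equations behind these bullets (notation parallel to the GRU equations above; ∘ is the element-wise product):

```latex
i_t = \sigma\big(W^{(i)} x_t + U^{(i)} h_{t-1}\big)                  % input gate
f_t = \sigma\big(W^{(f)} x_t + U^{(f)} h_{t-1}\big)                  % forget gate
o_t = \sigma\big(W^{(o)} x_t + U^{(o)} h_{t-1}\big)                  % output gate
\tilde{c}_t = \tanh\big(W^{(c)} x_t + U^{(c)} h_{t-1}\big)           % new memory cell
c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t                      % final memory cell
h_t = o_t \circ \tanh(c_t)                                           % final hidden state
```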
Some visualizations

By Chris Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Most illustrations are a bit overwhelming ;)

http://people.idsia.ch/~juergen/lstm/sld017.htm
http://deeplearning.net/tutorial/lstm.html

Intuition: memory cells can keep information intact, unless inputs make them forget it or overwrite it with new input. The cell can decide to output this information or just store it.
Long Short-Term Memory by Hochreiter and Schmidhuber (1997)

[Figure: the original LSTM memory cell, with input gate in_j and output gate out_j (activations y^{in_j} and y^{out_j}), cell state s_{c_j} with a self-recurrent connection of weight 1.0, and squashing functions g and h]
LSTMs are currently very hip!

• En vogue default model for most sequence labeling tasks
• Very powerful, especially when stacked and made even deeper (each hidden layer is already computed by a deep internal network)
• Most useful if you have lots and lots of data
Deep LSTMs compared to traditional systems 2015

Method                                                                  test BLEU score (ntst14)
Bahdanau et al. [2]                                                     28.45
Baseline System [29]                                                    33.30
Single forward LSTM, beam size 12                                       26.17
Single reversed LSTM, beam size 12                                      30.59
Ensemble of 5 reversed LSTMs, beam size 1                               33.00
Ensemble of 2 reversed LSTMs, beam size 12                              33.27
Ensemble of 5 reversed LSTMs, beam size 2                               34.50
Ensemble of 5 reversed LSTMs, beam size 12                              34.81

Table 1: The performance of the LSTM on WMT'14 English to French test set (ntst14). Note that an ensemble of 5 LSTMs with a beam of size 2 is cheaper than a single LSTM with a beam of size 12.

Method                                                                  test BLEU score (ntst14)
Baseline System [29]                                                    33.30
Cho et al. [5]                                                          34.54
Best WMT'14 result [9]                                                  37.0
Rescoring the baseline 1000-best with a single forward LSTM             35.61
Rescoring the baseline 1000-best with a single reversed LSTM            35.85
Rescoring the baseline 1000-best with an ensemble of 5 reversed LSTMs   36.5
Oracle Rescoring of the Baseline 1000-best lists                        ∼45

Table 2: Methods that use neural networks together with an SMT system on the WMT'14 English to French test set (ntst14).
task by a sizeable margin, despite its inability to handle out-of-vocabulary words. The LSTM is within 0.5 BLEU points of the best WMT'14 result if it is used to rescore the 1000-best list of the baseline system.

3.7 Performance on long sentences
We were surprised to discover that the LSTM did well on long sentences, which is shown quantitatively in Figure 3. Table 3 presents several examples of long sentences and their translations.
3.8 Model Analysis

[Figure 2: 2-dimensional PCA projection of LSTM hidden states obtained after processing the phrases "John respects Mary" / "Mary respects John" / "John admires Mary" / "Mary admires John" / "Mary is in love with John" / "John is in love with Mary" (left panel) and "I gave her a card in the garden" / "In the garden, I gave her a card" / "She was given a card by me in the garden" / "She gave me a card in the garden" / "In the garden, she gave me a card" / "I was given a card by her in the garden" (right panel). The phrases are clustered by meaning, which in these examples is primarily a function of word order, which would be difficult to capture with a bag-of-words model. Notice that both clusters have similar internal structure.]

One of the attractive features of our model is its ability to turn a sequence of words into a vector of fixed dimensionality. Figure 2 visualizes some of the learned representations. The figure clearly shows that the representations are sensitive to the order of words, while being fairly insensitive to the

Sequence to Sequence Learning by Sutskever et al. 2014
Deep LSTMs (with a lot more tweaks)

WMT 2016 competition results from last year
Deep LSTM for Machine Translation

[Repeats Tables 1–2 and Figure 2 from Sutskever et al. 2014, shown above]

PCA of vectors from the last time step hidden layer
Further Improvements: More Gates!

Gated Feedback Recurrent Neural Networks, Chung et al. 2015

[Figure 1. Illustrations of (a) conventional stacked RNN and (b) gated-feedback RNN approaches to forming a deep RNN architecture. Bullets in (b) correspond to global reset gates. Skip connections are omitted to simplify the visualization of networks.]
The global reset gate is computed as:

    g^{i \to j} = \sigma\big(w_g^{i \to j} h_t^{j-1} + u_g^{i \to j} h_{t-1}^{*}\big), \qquad (12)

where L is the number of hidden layers, w_g^{i→j} and u_g^{i→j} are the weight vectors for the input and the hidden states of all the layers at time-step t − 1, respectively. For j = 1, h_t^{j−1} is x_t.

The global reset gate g^{i→j} is applied collectively to the signal from the i-th layer h_{t−1}^{i} to the j-th layer h_t^{j}. In other words, the signal from the layer i to the layer j is controlled based on the input and the previous hidden states.
Fig. 1 illustrates the difference between the conventional stacked RNN and our proposed GF-RNN. In both models, information flows from lower layers to upper layers, respectively corresponding to finer timescale and coarser timescale. The GF-RNN, however, further allows information from the upper recurrent layer, corresponding to coarser timescale, to flow back into the lower layers, corresponding to finer timescales.

We call this RNN with a fully-connected recurrent transition and global reset gates a gated-feedback RNN (GF-RNN). In the remainder of this section, we describe how to use the previously described LSTM unit, GRU, and more traditional tanh unit in the GF-RNN.

3.1. Practical Implementation of GF-RNN

3.1.1. tanh Unit
For a stacked tanh-RNN, the signal from the previous time-step is gated. The hidden state of the j-th layer is computed by

    h_t^{j} = \tanh\Big(W^{j-1 \to j} h_t^{j-1} + \sum_{i=1}^{L} g^{i \to j}\, U^{i \to j} h_{t-1}^{i}\Big),

where W^{j−1→j} and U^{i→j} are the weight matrices of the incoming connections from the input and the i-th module, respectively. Compared to Eq. (2), the only difference is that the previous hidden states are controlled by the global reset gates.
3.1.2. Long Short-Term Memory and Gated Recurrent Unit
In the cases of LSTM and GRU, we do not use the global reset gates when computing the unit-wise gates. In other words, Eqs. (5)–(7) for LSTM, and Eqs. (9) and (11) for GRU are not modified. We only use the global reset gates when computing the new state (see Eq. (4) for LSTM, and Eq. (10) for GRU).

The new memory content of an LSTM at the j-th layer is computed by

    \tilde{c}_t^{j} = \tanh\Big(W_c^{j-1 \to j} h_t^{j-1} + \sum_{i=1}^{L} g^{i \to j}\, U_c^{i \to j} h_{t-1}^{i}\Big).

In the case of a GRU, similarly,

    \tilde{h}_t^{j} = \tanh\Big(W^{j-1 \to j} h_t^{j-1} + r_t^{j} \odot \sum_{i=1}^{L} g^{i \to j}\, U^{i \to j} h_{t-1}^{i}\Big).

4. Experiment Settings

4.1. Tasks
We evaluated the proposed gated-feedback RNN (GF-RNN) on character-level language modeling and Python
Summary

• Recurrent Neural Networks are powerful
• A lot of ongoing work right now
• Gated Recurrent Units even better
• LSTMs maybe even better (jury still out)
• This was an advanced lecture → gain intuition, encourage exploration
• Next up: Midterm review