CS224d: Deep NLP
Lecture 9: Wrap up: LSTMs and Recursive Neural Networks
Richard Socher ([email protected])

Page 1: CS224d: Deep NLP
Lecture 9: Wrap up: LSTMs and Recursive Neural Networks
[email protected]

Page 2: Overview

Video issues and fire alarm

Finish LSTMs

Recursive Neural Networks
• Motivation: Compositionality
• Structure prediction: Parsing
• Backpropagation through Structure
• Vision Example

Next Lecture:
• RvNN improvements

Page 3: Long short-term memories (LSTMs)

• We can make the units even more complex
• Allow each time step to modify:
  • Input gate (current cell matters)
  • Forget gate (gate 0: forget the past)
  • Output gate (how much of the cell is exposed)
  • New memory cell
• Final memory cell:
• Final hidden state:
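The gate equations on this slide were embedded as images and did not survive text extraction. For reference, a standard LSTM formulation consistent with the course notation (x_t is the input at time t, h_{t-1} the previous hidden state; a reconstruction, not the slide's own rendering):

i_t = \sigma(W^{(i)} x_t + U^{(i)} h_{t-1})            (input gate)
f_t = \sigma(W^{(f)} x_t + U^{(f)} h_{t-1})            (forget gate)
o_t = \sigma(W^{(o)} x_t + U^{(o)} h_{t-1})            (output gate)
\tilde{c}_t = \tanh(W^{(c)} x_t + U^{(c)} h_{t-1})     (new memory cell)
c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t        (final memory cell)
h_t = o_t \circ \tanh(c_t)                             (final hidden state)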

Page 4: Illustrations all a bit overwhelming ;)

http://people.idsia.ch/~juergen/lstm/sld017.htm
http://deeplearning.net/tutorial/lstm.html

Intuition: memory cells can keep information intact, unless inputs make them forget it or overwrite it with new input. The cell can decide to output this information or just store it.

Long Short-Term Memory by Hochreiter and Schmidhuber (1997)

[Figure: the original LSTM memory-cell diagram from Hochreiter & Schmidhuber, showing the input gate in_j, output gate out_j, cell state s_{c_j} with a self-recurrent connection of weight 1.0, and the squashing functions g and h.]

Page 5: LSTMs are currently very hip!

• En vogue default model for most sequence labeling tasks
• Very powerful, especially when stacked and made even deeper (each hidden layer is already computed by a deep internal network)
• Most useful if you have lots and lots of data

Page 6: Deep LSTMs compared to traditional systems, 2015

Method                                        | test BLEU score (ntst14)
Bahdanau et al. [2]                           | 28.45
Baseline System [29]                          | 33.30
Single forward LSTM, beam size 12             | 26.17
Single reversed LSTM, beam size 12            | 30.59
Ensemble of 5 reversed LSTMs, beam size 1     | 33.00
Ensemble of 2 reversed LSTMs, beam size 12    | 33.27
Ensemble of 5 reversed LSTMs, beam size 2     | 34.50
Ensemble of 5 reversed LSTMs, beam size 12    | 34.81

Table 1: The performance of the LSTM on the WMT'14 English to French test set (ntst14). Note that an ensemble of 5 LSTMs with a beam of size 2 is cheaper than a single LSTM with a beam of size 12.

Method                                                                 | test BLEU score (ntst14)
Baseline System [29]                                                   | 33.30
Cho et al. [5]                                                         | 34.54
Best WMT'14 result [9]                                                 | 37.0
Rescoring the baseline 1000-best with a single forward LSTM            | 35.61
Rescoring the baseline 1000-best with a single reversed LSTM           | 35.85
Rescoring the baseline 1000-best with an ensemble of 5 reversed LSTMs  | 36.5
Oracle Rescoring of the Baseline 1000-best lists                       | ~45

Table 2: Methods that use neural networks together with an SMT system on the WMT'14 English to French test set (ntst14).

task by a sizeable margin, despite its inability to handle out-of-vocabulary words. The LSTM is within 0.5 BLEU points of the best WMT'14 result if it is used to rescore the 1000-best list of the baseline system.

3.7 Performance on long sentences

We were surprised to discover that the LSTM did well on long sentences, which is shown quantitatively in figure 3. Table 3 presents several examples of long sentences and their translations.

3.8 Model Analysis

[Figure 2: a 2-dimensional PCA projection of the LSTM hidden states obtained after processing phrases such as "John respects Mary" / "Mary respects John" / "John admires Mary" / "Mary admires John" / "Mary is in love with John" / "John is in love with Mary", and "I gave her a card in the garden" / "In the garden, I gave her a card" / "She was given a card by me in the garden" / "She gave me a card in the garden" / "In the garden, she gave me a card" / "I was given a card by her in the garden". The phrases are clustered by meaning, which in these examples is primarily a function of word order, which would be difficult to capture with a bag-of-words model. Notice that both clusters have similar internal structure.]

One of the attractive features of our model is its ability to turn a sequence of words into a vector of fixed dimensionality. Figure 2 visualizes some of the learned representations. The figure clearly shows that the representations are sensitive to the order of words, while being fairly insensitive to the

Sequence to Sequence Learning by Sutskever et al. 2014

Page 7: Deep LSTMs (with a lot more tweaks) today

WMT 2016 competition results from yesterday

Page 8: Deep LSTM for Machine Translation

[Figure: the same Sutskever et al. 2014 tables and PCA projection shown on the previous slide.]

Sequence to Sequence Learning by Sutskever et al. 2014

PCA of vectors from the last time step's hidden layer

Page 9: Further Improvements: More Gates!

Gated Feedback Recurrent Neural Networks, Chung et al. 2015

Figure 1: Illustrations of (a) the conventional stacking approach and (b) the gated-feedback approach to form a deep RNN architecture. Bullets in (b) correspond to global reset gates. Skip connections are omitted to simplify the visualization of networks.

The global reset gate is computed as:

g^{i \to j} = \sigma\big( w_g^{i \to j} h_t^{j-1} + u_g^{i \to j} h_{t-1}^{*} \big),   (12)

where L is the number of hidden layers, w_g^{i \to j} and u_g^{i \to j} are the weight vectors for the input and the hidden states of all the layers at time-step t-1, respectively. For j = 1, h_t^{j-1} is x_t.

The global reset gate g^{i \to j} is applied collectively to the signal from the i-th layer h_{t-1}^{i} to the j-th layer h_t^{j}. In other words, the signal from layer i to layer j is controlled based on the input and the previous hidden states.

Fig. 1 illustrates the difference between the conventional stacked RNN and our proposed GF-RNN. In both models, information flows from lower layers to upper layers, corresponding to finer and coarser timescales, respectively. The GF-RNN, however, further allows information from the upper recurrent layer, corresponding to the coarser timescale, to flow back into the lower layers, corresponding to finer timescales.

We call this RNN with a fully-connected recurrent transition and global reset gates a gated-feedback RNN (GF-RNN). In the remainder of this section, we describe how to use the previously described LSTM unit, GRU, and the more traditional tanh unit in the GF-RNN.

3.1. Practical Implementation of GF-RNN

3.1.1. tanh unit

For a stacked tanh-RNN, the signal from the previous time-step is gated. The hidden state of the j-th layer is computed by

h_t^{j} = \tanh\Big( W^{j-1 \to j} h_t^{j-1} + \sum_{i=1}^{L} g^{i \to j} U^{i \to j} h_{t-1}^{i} \Big),

where W^{j-1 \to j} and U^{i \to j} are the weight matrices of the incoming connections from the input and the i-th module, respectively. Compared to Eq. (2), the only difference is that the previous hidden states are controlled by the global reset gates.

3.1.2. Long Short-Term Memory and Gated Recurrent Unit

In the cases of LSTM and GRU, we do not use the global reset gates when computing the unit-wise gates. In other words, Eqs. (5)–(7) for LSTM, and Eqs. (9) and (11) for GRU are not modified. We only use the global reset gates when computing the new state (see Eq. (4) for LSTM, and Eq. (10) for GRU).

The new memory content of an LSTM at the j-th layer is computed by

\tilde{c}_t^{j} = \tanh\Big( W_c^{j-1 \to j} h_t^{j-1} + \sum_{i=1}^{L} g^{i \to j} U_c^{i \to j} h_{t-1}^{i} \Big).

In the case of a GRU, similarly,

\tilde{h}_t^{j} = \tanh\Big( W^{j-1 \to j} h_t^{j-1} + r_t^{j} \odot \sum_{i=1}^{L} g^{i \to j} U^{i \to j} h_{t-1}^{i} \Big).

4. Experiment Settings

4.1. Tasks

We evaluated the proposed gated-feedback RNN (GF-RNN) on character-level language modeling and Python
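To make the tanh-unit update above concrete, here is a minimal numpy sketch of one gated-feedback time step; the parameter containers (W, U, w_g, u_g), shapes, and function name are illustrative assumptions, not the authors' code:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gf_rnn_tanh_step(h_prev, x_t, W, U, w_g, u_g):
    """One gated-feedback update for a stack of L tanh layers.

    h_prev  : list of L hidden-state vectors from time t-1
    x_t     : input vector at time t
    W[j]    : weight matrix from layer j-1 (or the input, for j = 0) to layer j
    U[i][j] : recurrent weight matrix from layer i (at t-1) to layer j (at t)
    w_g, u_g: global reset gate parameter vectors, indexed like U
    """
    L = len(h_prev)
    h_star = np.concatenate(h_prev)    # hidden states of all layers at t-1
    h_new = []
    below = x_t                        # h_t^{j-1}; the input for the first layer
    for j in range(L):
        # global reset gates g^{i->j}, one scalar per source layer i
        g = [sigmoid(w_g[i][j].dot(below) + u_g[i][j].dot(h_star)) for i in range(L)]
        # gated sum of feedback from every layer at the previous time step
        feedback = sum(g[i] * U[i][j].dot(h_prev[i]) for i in range(L))
        h_j = np.tanh(W[j].dot(below) + feedback)
        h_new.append(h_j)
        below = h_j
    return h_new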

Page 10: Summary

• Recurrent Neural Networks are powerful
• A lot of ongoing work right now
• Gated Recurrent Units even better
• LSTMs maybe even better (jury still out)
• This was an advanced lecture → gain intuition, encourage exploration
• Next up: Recursive Neural Networks, simpler and also powerful :)

Page 11: Building on Word Vector Space Models

[Figure: a 2-D word vector space (axes x1, x2) containing points for Monday, Tuesday, France, Germany, and the phrases "the country of my birth" and "the place where I was born".]

By mapping them into the same vector space!

But how can we represent the meaning of longer phrases?

Page 12: Semantic Vector Spaces

Single Word Vectors
• Distributional techniques
• Brown clusters
• Useful as features inside models, e.g. CRFs for NER, etc.
• Cannot capture longer phrases

Document Vectors
• Bag of words models
• PCA (LSA, LDA)
• Great for IR, document exploration, etc.
• Ignore word order, no detailed understanding

Vectors representing phrases and sentences that do not ignore word order and capture semantics for NLP tasks

Page 13: How should we map phrases into a vector space?

[Figure: a parse tree over "the country of my birth" in which each word and each phrase node carries a 2-D vector, plotted in the same (x1, x2) space as Monday, Tuesday, France, Germany, "the country of my birth", and "the place where I was born".]

Use the principle of compositionality:
The meaning (vector) of a sentence is determined by
(1) the meanings of its words and
(2) the rules that combine them.

Models in this section can jointly learn parse trees and compositional vector representations.

Page 14: Sentence Parsing: What we want

[Figure: the desired parse tree for "The cat sat on the mat.", with syntactic categories (S, VP, PP, NP, NP) at the internal nodes and a 2-D vector at each word.]

Page 15: Learn Structure and Representation

[Figure: the same parse tree for "The cat sat on the mat.", now with a learned vector at every internal node (NP, PP, VP, S) as well as at each word.]

Page 16: Learn Structure and Representation?

• Do we really need to learn this structure?

Page 17: Side note: Recursive vs. recurrent neural networks

[Figure: the phrase "the country of my birth", with a 2-D vector per word, composed (left) by a recursive neural network following a tree structure and (right) by a recurrent neural network that combines the words strictly left to right; both produce a vector for the whole phrase.]

Page 18: Side note: Are languages recursive?

• Cognitively debatable
• But: recursion is helpful in describing natural language
• Example: "the church which has nice windows", a noun phrase containing a relative clause that contains a noun phrase
• Arguments for now: 1) Helpful in disambiguation:

Page 19: Is recursion useful?

2) Helpful for some tasks to refer to specific phrases:
• John and Jane went to a big festival. They enjoyed the trip and the music there.
• "they": John and Jane
• "the trip": went to a big festival
• "there": big festival

3) Labeling less clear if specific to only subphrases:
• I liked the bright screen but not the buggy slow keyboard of the phone. It was a pain to type with. It was nice to look at.

4) Works better for some tasks to use grammatical tree structure
• This is still up for debate.

Page 20: Recursive Neural Networks for Structure Prediction

[Figure: over the fragment "on the mat.", two candidate children, e.g. vectors (8, 5) and (3, 3), are fed into a neural network that outputs a parent vector (8, 3) and a plausibility score of 1.3.]

Inputs: two candidate children's representations
Outputs:
1. The semantic representation if the two nodes are merged.
2. Score of how plausible the new node would be.

Page 21: Recursive Neural Network Definition

[Figure: child vectors c_1 = (8, 5) and c_2 = (3, 3) are fed into the neural network, which outputs the parent vector p = (8, 3) and the score 1.3.]

score = U^T p
p = \tanh(W [c_1; c_2] + b)

Same W parameters at all nodes of the tree
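As an illustration of this definition, here is a minimal numpy sketch of the composition and scoring step (the variable names and dimensions are assumptions for the example, not code from the lecture):

import numpy as np

d = 2                                        # dimensionality of the toy vectors on the slides
rng = np.random.default_rng(0)
W = rng.standard_normal((d, 2 * d)) * 0.1    # composition matrix
b = np.zeros(d)                              # bias
U = rng.standard_normal(d) * 0.1             # scoring vector

def compose_and_score(c1, c2):
    """Merge two children into a parent vector and score the merge."""
    children = np.concatenate([c1, c2])      # [c1; c2]
    p = np.tanh(W.dot(children) + b)         # parent representation
    score = U.dot(p)                         # plausibility of this merge
    return p, score

p, s = compose_and_score(np.array([8.0, 5.0]), np.array([3.0, 3.0]))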

Page 22: Parsing a sentence with an RNN

[Figure: greedy parsing of "The cat sat on the mat." The network scores every pair of adjacent word vectors; the candidate merges shown have scores 0.1, 0.4, 2.3, 3.1, and 0.3.]

Page 23: Parsing a sentence

[Figure: the highest-scoring pair is merged into a new node with vector (5, 2); the remaining candidate merges are re-scored.]

Page 24: Parsing a sentence

[Figure: greedy merging continues; the next best-scoring pair is merged into a node with vector (3, 3) and the candidates are scored again.]

Page 25: Parsing a sentence

[Figure: the completed greedy parse of "The cat sat on the mat.", with a vector at every node of the resulting tree.]
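The greedy procedure animated on the last few slides can be sketched as follows; this is a toy illustration that reuses the hypothetical compose_and_score from the Page 21 sketch and ignores the beam search discussed later:

def greedy_parse(word_vectors):
    """Repeatedly merge the adjacent pair with the highest score (toy greedy parser)."""
    nodes = list(word_vectors)           # current frontier, left to right
    tree_score = 0.0
    while len(nodes) > 1:
        candidates = [compose_and_score(nodes[i], nodes[i + 1])
                      for i in range(len(nodes) - 1)]
        best = max(range(len(candidates)), key=lambda i: candidates[i][1])
        parent, score = candidates[best]
        nodes[best:best + 2] = [parent]  # replace the two children by their parent
        tree_score += score              # tree score = sum of merge decisions
    return nodes[0], tree_score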

Page 26: Max-Margin Framework: Details

• The score of a tree is computed by the sum of the parsing decision scores at each node:

[Figure: a single merge of children (8, 5) and (3, 3) into parent (8, 3) with score 1.3, computed by the RNN.]

Page 27: Max-Margin Framework: Details

• Similar to max-margin parsing (Taskar et al. 2004), a supervised max-margin objective (a standard form is given below)

• The loss penalizes all incorrect decisions

• Structure search for A(x) was maximally greedy
• Instead: beam search with chart
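The equations on this slide were images; a standard max-margin formulation consistent with the slide's description (a reconstruction, with x_i a sentence, y_i its correct tree, and A(x_i) the set of candidate trees) is roughly:

s(x, y) = \sum_{n \in \text{nodes}(y)} s_n

J(\theta) = \sum_i \Big[ s(x_i, y_i) - \max_{y \in A(x_i)} \big( s(x_i, y) + \Delta(y, y_i) \big) \Big]

Here J is maximized, and the margin \Delta(y, y_i) grows with the number of incorrect decisions (spans) in the candidate tree y, which is how the loss penalizes all incorrect decisions.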

Page 28: Backpropagation Through Structure

Introduced by Goller & Küchler (1996)

Principally the same as general backpropagation.

Three differences resulting from the recursion and tree structure:
1. Sum derivatives of W from all nodes
2. Split derivatives at each node
3. Add error messages from parent + node itself

The second derivative in eq. 28 for output units is simply

\frac{\partial a_i^{(n_l)}}{\partial W_{ij}^{(n_l-1)}} = \frac{\partial}{\partial W_{ij}^{(n_l-1)}} z_i^{(n_l)} = \frac{\partial}{\partial W_{ij}^{(n_l-1)}} \big( W_{i\cdot}^{(n_l-1)} a^{(n_l-1)} \big) = a_j^{(n_l-1)}.   (46)

We adopt standard notation and introduce the error \delta related to an output unit:

\frac{\partial E_n}{\partial W_{ij}^{(n_l-1)}} = (y_i - t_i)\, a_j^{(n_l-1)} = \delta_i^{(n_l)} a_j^{(n_l-1)}.   (47)

So far, we only computed errors for output units; now we will derive \delta's for normal hidden units and show how these errors are backpropagated to compute weight derivatives of lower levels. We will start with the second-to-top-layer weights, from which a generalization to arbitrarily deep layers will become obvious. Similar to eq. 28, we start with the error derivative:

\frac{\partial E}{\partial W_{ij}^{(n_l-2)}} = \sum_n \underbrace{\frac{\partial E_n}{\partial a^{(n_l)}}}_{\delta^{(n_l)}} \frac{\partial a^{(n_l)}}{\partial W_{ij}^{(n_l-2)}} + \lambda W_{ji}^{(n_l-2)}.   (48)

Now,

(\delta^{(n_l)})^T \frac{\partial a^{(n_l)}}{\partial W_{ij}^{(n_l-2)}}
= (\delta^{(n_l)})^T \frac{\partial z^{(n_l)}}{\partial W_{ij}^{(n_l-2)}}   (49)
= (\delta^{(n_l)})^T \frac{\partial}{\partial W_{ij}^{(n_l-2)}} W^{(n_l-1)} a^{(n_l-1)}   (50)
= (\delta^{(n_l)})^T \frac{\partial}{\partial W_{ij}^{(n_l-2)}} W_{\cdot i}^{(n_l-1)} a_i^{(n_l-1)}   (51)
= (\delta^{(n_l)})^T W_{\cdot i}^{(n_l-1)} \frac{\partial}{\partial W_{ij}^{(n_l-2)}} a_i^{(n_l-1)}   (52)
= (\delta^{(n_l)})^T W_{\cdot i}^{(n_l-1)} \frac{\partial}{\partial W_{ij}^{(n_l-2)}} f(z_i^{(n_l-1)})   (53)
= (\delta^{(n_l)})^T W_{\cdot i}^{(n_l-1)} \frac{\partial}{\partial W_{ij}^{(n_l-2)}} f(W_{i\cdot}^{(n_l-2)} a^{(n_l-2)})   (54)
= (\delta^{(n_l)})^T W_{\cdot i}^{(n_l-1)} f'(z_i^{(n_l-1)})\, a_j^{(n_l-2)}   (55)
= \big( (\delta^{(n_l)})^T W_{\cdot i}^{(n_l-1)} \big) f'(z_i^{(n_l-1)})\, a_j^{(n_l-2)}   (56)
= \underbrace{\Big( \sum_{j'=1}^{s_{l+1}} W_{j'i}^{(n_l-1)} \delta_{j'}^{(n_l)} \Big) f'(z_i^{(n_l-1)})}_{\delta_i^{(n_l-1)}} \, a_j^{(n_l-2)}   (57)
= \delta_i^{(n_l-1)} a_j^{(n_l-2)},   (58)

where we used in the first line that the top layer is linear. This is a very detailed account of essentially just the chain rule.

So, we can write the \delta errors of all layers l (except the top layer) in vector format, using the Hadamard product \circ:

\delta^{(l)} = \big( (W^{(l)})^T \delta^{(l+1)} \big) \circ f'(z^{(l)}),   (59)

where the sigmoid derivative from eq. 14 gives f'(z^{(l)}) = (1 - a^{(l)})\, a^{(l)}. Using that definition, we get the hidden layer backprop derivatives:

\frac{\partial}{\partial W_{ij}^{(l)}} E_R = a_j^{(l)} \delta_i^{(l+1)} + \lambda W_{ij}^{(l)},   (60)

which in one simplified vector notation becomes:

\frac{\partial}{\partial W^{(l)}} E_R = \delta^{(l+1)} (a^{(l)})^T + \lambda W^{(l)}.   (62)

In summary, the backprop procedure consists of four steps:

1. Apply an input x_n and forward propagate it through the network to get the hidden and output activations using eq. 18.
2. Evaluate \delta^{(n_l)} for output units using eq. 42.
3. Backpropagate the \delta's to obtain a \delta^{(l)} for each hidden layer in the network using eq. 59.
4. Evaluate the required derivatives with eq. 62 and update all the weights using an optimization procedure such as conjugate gradient or L-BFGS. CG seems to be faster and work better when using mini-batches of training data to estimate the derivatives.

If you have any further questions or found errors, please send an email to [email protected]

5 Recursive Neural Networks

Same as backprop in the previous section, but splitting error derivatives and noting that the derivatives of the same W at each node can all be added up. Lastly, the deltas from the parent node and possible deltas from a softmax classifier at each node are just added.

Page 29: BTS: 1) Sum derivatives of all nodes

You can actually assume it is a different W at each node.
Intuition via example:

If we take separate derivatives of each occurrence, we get the same:
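The worked example on this slide was an image; the usual argument can be reconstructed as follows (the nested expression is illustrative, not necessarily the slide's exact one). For a composition that uses the same W twice, such as f(W\, f(W x)), treat the two occurrences as distinct matrices W_1 and W_2 and add their derivatives:

\frac{\partial}{\partial W} f(W f(W x)) = \frac{\partial}{\partial W_2} f(W_2 f(W_1 x))\Big|_{W_1 = W_2 = W} + \frac{\partial}{\partial W_1} f(W_2 f(W_1 x))\Big|_{W_1 = W_2 = W}

This is exactly why, in a recursive network, the gradient of the shared W is the sum of the gradients computed at each tree node.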

Page 30: BTS: 2) Split derivatives at each node

During forward prop, the parent is computed using the two children:

p = \tanh(W [c_1; c_2] + b)

Hence, the errors need to be computed with respect to each of them, where each child's error is n-dimensional.

[Figure: the parent node (8, 3) with children c_1 = (8, 5) and c_2 = (3, 3); the parent's error message is split between c_1 and c_2.]
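The split itself was shown as an image; a reconstruction consistent with the definition above: the down-going error for the stacked children is

\delta_{[c_1; c_2]} = \big( W^T \delta_p \big) \circ f'([c_1; c_2]),

where for tanh the elementwise derivative evaluated at the children is 1 - [c_1; c_2]^2. The first n entries of \delta_{[c_1; c_2]} are the error message for c_1 and the last n entries are the error message for c_2.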

Page 31: BTS: 3) Add error messages

• At each node:
  • What came up (fprop) must come down (bprop)
  • Total error message δ = error message from parent + error message from its own score

[Figure: the node (8, 3) with children c_1 and c_2, receiving one error message from its parent and another from its own score.]

Page 32: BTS Python Code: forwardProp
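The code screenshot on this slide did not survive extraction. Below is a minimal sketch of what a recursive forwardProp over a binary tree might look like, following the definitions on the earlier slides; the class layout and field names (Node, .left, .right, .word_id) are assumptions, not the course's actual code:

import numpy as np

def forward_prop(node, W, b, U, L):
    """Compute vectors and scores bottom-up for every node of a binary tree.

    node : tree node with .is_leaf, .word_id, .left, .right
    W, b : composition parameters; U : scoring vector; L : word embedding matrix
    """
    if node.is_leaf:
        node.h = L[:, node.word_id]                 # leaf vector = word embedding
        node.score = 0.0
        return node.score
    left_score = forward_prop(node.left, W, b, U, L)
    right_score = forward_prop(node.right, W, b, U, L)
    children = np.concatenate([node.left.h, node.right.h])
    node.h = np.tanh(W.dot(children) + b)           # p = tanh(W [c1; c2] + b)
    node.score = U.dot(node.h)                      # s = U^T p
    return node.score + left_score + right_score    # tree score = sum of node scores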

Page 33: BTS Python Code: backProp

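As with the previous slide, the backProp screenshot was lost. A matching sketch under the same assumptions: it sums gradients of the shared W over all nodes, splits the down-going error between the two children, and adds the error from each node's own score to the error arriving from its parent.

import numpy as np

def back_prop(node, delta_down, W, U, grads):
    """Backpropagation through structure for the toy recursive net sketched above.

    delta_down : error message arriving from the parent (zeros at the root)
    grads      : dict accumulating the gradients of the shared parameters
    """
    if node.is_leaf:
        # (the gradient for the word vector would be delta_down; omitted here)
        return
    # 3) add error messages: what arrives from the parent plus this node's own score
    delta = delta_down + U                      # d(score)/d(node.h) = U
    delta_z = delta * (1.0 - node.h ** 2)       # back through tanh
    children = np.concatenate([node.left.h, node.right.h])
    # 1) sum derivatives of the shared parameters over all nodes
    grads['U'] += node.h
    grads['W'] += np.outer(delta_z, children)
    grads['b'] += delta_z
    # 2) split the down-going error between the two children
    delta_children = W.T.dot(delta_z)
    n = node.left.h.shape[0]
    back_prop(node.left, delta_children[:n], W, U, grads)
    back_prop(node.right, delta_children[n:], W, U, grads)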

Page 34: BTS: Optimization

• As before, we can plug the gradients into a standard off-the-shelf L-BFGS optimizer or SGD

• Best results with AdaGrad (Duchi et al. 2011):

• For a non-continuous objective, use the subgradient method (Ratliff et al. 2007)
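The AdaGrad update shown on the slide was an image; the standard form of the rule (with g_{t,i} the gradient for parameter i at step t and \alpha a global learning rate) is:

\theta_{t,i} = \theta_{t-1,i} - \frac{\alpha}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}} \, g_{t,i}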

Page 35: Discussion: Simple RNN

• Good results with a single-matrix RNN (more later)

• A single-weight-matrix RNN could capture some phenomena, but is not adequate for more complex, higher-order composition and for parsing long sentences

• The composition function is the same for all syntactic categories, punctuation, etc.

[Figure: the single-matrix RNN: children c_1 and c_2 are combined with one matrix W into the parent p, which is scored with W^score to give s.]

Page 36: Solution: Syntactically-Untied RNN

• Idea: Condition the composition function on the syntactic categories, "untie the weights"

• Allows for different composition functions for pairs of syntactic categories, e.g. Adv + AdjP, VP + NP (see the sketch below)

• Combines discrete syntactic categories with continuous semantic information
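A minimal sketch of how the untied composition could look in code, assuming a dictionary of per-category-pair matrices; the container and lookup are illustrative assumptions, not the paper's implementation:

import numpy as np

def su_rnn_compose(c1, cat1, c2, cat2, W_pairs, b, U):
    """Compose two children whose syntactic categories select the weight matrix."""
    W = W_pairs[(cat1, cat2)]             # e.g. W_pairs[('DT', 'NP')]
    children = np.concatenate([c1, c2])
    p = np.tanh(W.dot(children) + b)
    score = U.dot(p)
    return p, score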

Page 37: Solution: Compositional Vector Grammars

• Problem: Speed. Every candidate score in beam search needs a matrix-vector product.

• Solution: Compute the score only for a subset of trees coming from a simpler, faster model (PCFG)
  • Prunes very unlikely candidates for speed
  • Provides coarse syntactic categories of the children for each beam candidate

• Compositional Vector Grammars: CVG = PCFG + RNN

Page 38: Details: Compositional Vector Grammar

• Scores at each node computed by a combination of PCFG and SU-RNN:

• Interpretation: Factoring discrete and continuous parsing in one model

• Socher et al. (2013)
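The combined score on this slide was an image; in Socher et al. (2013) the score of a node p built from children with syntactic categories B and C under rule P_1 → B C has roughly the form (a reconstruction from the paper, notation approximate):

s(p) = (v^{(B,C)})^T p + \log P(P_1 \to B\, C)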

Page 39: Related work for recursive neural networks

Pollack (1990): Recursive auto-associative memories

Previous recursive neural network work by Goller & Küchler (1996) and Costa et al. (2003) assumed fixed tree structure and used one-hot vectors.

Hinton (1990) and Bottou (2011): Related ideas about recursive models and recursive operators as smooth versions of logic operations

Page 40: Related work for parsing

• The resulting CVG parser is related to previous work that extends PCFG parsers

• Klein and Manning (2003a): manual feature engineering
• Petrov et al. (2006): learning algorithm that splits and merges syntactic categories
• Lexicalized parsers (Collins, 2003; Charniak, 2000): describe each category with a lexical item
• Hall and Klein (2012) combine several such annotation schemes in a factored parser
• CVGs extend these ideas from discrete representations to richer continuous ones

Page 41: Experiments

• Standard WSJ split, labeled F1
• Based on a simple PCFG with fewer states
• Fast pruning of the search space, few matrix-vector products
• 3.8% higher F1 and 20% faster than the Stanford factored parser

Parser                                                     | Test, All Sentences
Stanford PCFG (Klein and Manning, 2003a)                   | 85.5
Stanford Factored (Klein and Manning, 2003b)               | 86.6
Factored PCFGs (Hall and Klein, 2012)                      | 89.4
Collins (Collins, 1997)                                    | 87.7
SSN (Henderson, 2004)                                      | 89.4
Berkeley Parser (Petrov and Klein, 2007)                   | 90.1
CVG (RNN) (Socher et al., ACL 2013)                        | 85.0
CVG (SU-RNN) (Socher et al., ACL 2013)                     | 90.4
Charniak - Self-Trained (McClosky et al. 2006)             | 91.0
Charniak - Self-Trained, Re-Ranked (McClosky et al. 2006)  | 92.1

Page 42: SU-RNN Analysis

• Learns a notion of soft head words

[Figure: learned composition weight matrices for the DT-NP and VP-NP category pairs.]

Page 43: Analysis of resulting vector representations

All the figures are adjusted for seasonal variations
1. All the numbers are adjusted for seasonal fluctuations
2. All the figures are adjusted to remove usual seasonal patterns

Knight-Ridder wouldn't comment on the offer
1. Harsco declined to say what country placed the order
2. Coastal wouldn't disclose the terms

Sales grew almost 7% to $UNK m. from $UNK m.
1. Sales rose more than 7% to $94.9 m. from $88.3 m.
2. Sales surged 40% to UNK b. yen from UNK b.

Page 44: SU-RNN Analysis

• Can transfer semantic information from a single related example

• Train sentences:
  • He eats spaghetti with a fork.
  • She eats spaghetti with pork.

• Test sentences:
  • He eats spaghetti with a spoon.
  • He eats spaghetti with meat.

Page 45: SU-RNN Analysis

Page 46: Labeling in Recursive Neural Networks

• We can use each node's representation as features for a softmax classifier:

• Training is similar to the model in part 1, with standard cross-entropy error + scores

[Figure: a softmax layer on top of a node vector (8, 3) produced by the neural network predicts the label NP.]
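The classifier equation on this slide was an image; the standard form, for a node vector p and a label weight matrix W^s (a reconstruction), is:

\hat{y} = \mathrm{softmax}(W^{s} p)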

Page 47: Scene Parsing

• The meaning of a scene image is also a function of smaller regions,
• how they combine as parts to form larger objects,
• and how the objects interact.

Similar principle of compositionality.

Page 48: Algorithm for Parsing Images

Same Recursive Neural Network as for natural language parsing! (Socher et al. ICML 2011)

[Figure: "Parsing Natural Scene Images": an image is split into segments; features of each segment are mapped to semantic representations, which are recursively merged into regions labeled Grass, Tree, People, and Building.]

Page 49: Multi-class segmentation

Method                                            | Accuracy
Pixel CRF (Gould et al., ICCV 2009)               | 74.3
Classifier on superpixel features                 | 75.9
Region-based energy (Gould et al., ICCV 2009)     | 76.4
Local labelling (Tighe & Lazebnik, ECCV 2010)     | 76.9
Superpixel MRF (Tighe & Lazebnik, ECCV 2010)      | 77.5
Simultaneous MRF (Tighe & Lazebnik, ECCV 2010)    | 77.5
Recursive Neural Network                          | 78.1

Stanford Background Dataset (Gould et al. 2009)

Page 50: