Natural Language Processing with Deep Learning CS224N/Ling284 Richard Socher Lecture 14: Tree Recursive Neural Networks and Constituency Parsing


Page 1

Natural Language Processing with Deep Learning

CS224N/Ling284

Richard Socher

Lecture 14: Tree Recursive Neural Networks and Constituency Parsing

Page 2

Lecture Plan

1. Motivation: Compositionality and Recursion
2. Structure prediction with a simple Tree RNN: Parsing
3. Backpropagation through Structure
4. More complex units

Reminders/comments:
• Learn up on GPUs, Azure, Docker
• Ass4: Get something working, using a GPU for milestone
• Final project discussions – come meet with us! OH today after class. You have to come to every OH. No additional feedback beyond OH. Nothing on gradescope.


Page 3

1. The spectrum of language in CS


Page 4

Semantic interpretation of language – Not just word vectors

How can we know when larger units are similar in meaning?

• The snowboarder is leaping over a mogul

• A person on a snowboard jumps into the air

People interpret the meaning of larger text units – entities, descriptive terms, facts, arguments, stories – by semantic composition of smaller elements


Page 5

Compositionality

Page 6

Page 7

Language understanding – and Artificial Intelligence – requires being able to understand bigger things from knowing about smaller parts


Page 8


Page 9

Are languages recursive?

• Cognitively somewhat debatable
• But: recursion is natural for describing language
• [The man from [the company that you spoke with about [the project] yesterday]]
• noun phrase containing a noun phrase containing a noun phrase
• Arguments for now:
  1) Helpful in disambiguation:


Page 10

Is recursion useful?

2) Helpful for some tasks to refer to specific phrases:
• John and Jane went to a big festival. They enjoyed the trip and the music there.
• "they": John and Jane
• "the trip": went to a big festival
• "there": big festival

3) Works better for some tasks to use grammatical tree structure
• It's a powerful prior for language structure


Page 11

Building on Word Vector Space Models

How can we represent the meaning of longer phrases?

By mapping them into the same vector space!

[Figure: a 2-D word vector space (axes x1, x2) with points for Monday, Tuesday, France, and Germany, and the phrases "the country of my birth" and "the place where I was born" mapped into the same space near each other.]

Page 12

How should we map phrases into a vector space?

Use principle of compositionality:
The meaning (vector) of a sentence is determined by
(1) the meanings of its words and
(2) the rules that combine them.

Models in this section can jointly learn parse trees and compositional vector representations.

[Figure: a binary tree over "the country of my birth" with a vector at every word and every internal node; the resulting phrase vector lands near "the place where I was born" in the same 2-D space as Monday, Tuesday, France, and Germany.]

Page 13

Constituency Sentence Parsing: What we want

[Figure: the sentence "The cat sat on the mat." with a constituency parse above it (NP, VP, PP, S) and a vector at every node.]

Page 14

Learn Structure and Representation

[Figure: the same sentence "The cat sat on the mat.", where both the tree structure (NP, VP, PP, S) and the vector at each internal node are learned.]

Page 15

Recursive vs. recurrent neural networks

[Figure: a recursive (tree-structured) network and a recurrent (left-to-right chain) network, each composing the phrase "the country of my birth" into a single vector.]

Page 16

Recursive vs. recurrent neural networks

• Recursive neural nets require a tree structure
• Recurrent neural nets cannot capture phrases without prefix context and often capture too much of the last words in the final vector

[Figure: the same tree-structured vs. chain-structured composition of "the country of my birth" as on the previous slide.]

Page 17

Recursive Neural Networks for Structure Prediction

Inputs: two candidate children's representations
Outputs:
1. The semantic representation if the two nodes are merged.
2. Score of how plausible the new node would be.

[Figure: a neural network takes the vector representations of two candidate children (from "on the mat .") and outputs the merged parent vector together with a plausibility score of 1.3.]

Page 18

Recursive Neural Network Definition

$$\text{score} = U^T p$$

$$p = \tanh\left( W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b \right)$$

Same W parameters at all nodes of the tree

[Figure: the network merges the children c1 and c2 into the parent p with score 1.3.]
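A minimal NumPy sketch of this definition, just to make the shapes concrete (the dimensionality and the random parameters are made up; the same W and U are reused at every merge):

```python
import numpy as np

n = 4                                       # assumed dimensionality of every node vector
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(n, 2 * n))  # composition matrix, shared across the whole tree
b = np.zeros(n)
U = rng.normal(scale=0.1, size=n)           # scoring vector

def compose(c1, c2):
    """Merge two children into a parent vector p = tanh(W[c1;c2]+b) and a score U^T p."""
    p = np.tanh(W @ np.concatenate([c1, c2]) + b)
    return p, float(U @ p)

# Example: merge two made-up child vectors
c1, c2 = rng.normal(size=n), rng.normal(size=n)
parent, score = compose(c1, c2)
```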


Page 19

Parsing a sentence with an RNN

[Figure: greedy parsing, step 1: the network scores every pair of adjacent words in "The cat sat on the mat."]

Page 20

Parsing a sentence

[Figure: the highest-scoring pair is merged into a new parent node, and the remaining adjacent pairs are re-scored.]

Page 21

Parsing a sentence

[Figure: the greedy merging continues, building larger constituents with their own vectors and scores.]

Page 22

Parsing a sentence

[Figure: the completed tree over "The cat sat on the mat.", with a vector at every node.]
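A rough sketch of the greedy search these slides walk through (a toy stand-in rather than the course's actual implementation; it reuses `compose`, `rng`, and `n` from the earlier sketch):

```python
def greedy_parse(leaves):
    """Repeatedly merge the highest-scoring adjacent pair until one node remains."""
    nodes, total = list(leaves), 0.0
    while len(nodes) > 1:
        # Score every pair of adjacent nodes, as pictured on the previous slides.
        scored = [(compose(nodes[i][1], nodes[i + 1][1]), i) for i in range(len(nodes) - 1)]
        (p, score), i = max(scored, key=lambda item: item[0][1])
        total += float(score)
        nodes[i:i + 2] = [((nodes[i][0], nodes[i + 1][0]), p)]   # replace the pair by its parent
    return nodes[0][0], total

words = ["The", "cat", "sat", "on", "the", "mat", "."]
tree, tree_score = greedy_parse([(w, rng.normal(size=n)) for w in words])
print(tree, tree_score)
```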


Page 23

Max-Margin Framework - Details

• The score of a tree is computed by the sum of the parsing decision scores at each node:
• x is sentence; y is parse tree
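The equation itself did not survive the transcript; following Socher et al. (ACL 2013), the tree score is usually written as

$$s(x, y) = \sum_{d \in \text{nodes}(y)} s_d,$$

where $s_d$ is the score the network assigns to the merge decision at node $d$.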


Page 24

Max-Margin Framework - Details

• Similar to max-margin parsing (Taskar et al. 2004), a supervised max-margin objective (a standard form is written out below)
• The loss penalizes all incorrect decisions
• Structure search for A(x) was greedy (join best nodes each time)
  • Instead: Beam search with chart
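The objective is likewise only described in words in this transcript; the usual max-margin form (as in Taskar et al. 2004 and Socher et al. 2013) is roughly

$$J = \sum_i s(x_i, y_i) \;-\; \max_{y \in A(x_i)} \big( s(x_i, y) + \Delta(y, y_i) \big),$$

where $A(x_i)$ is the set of candidate trees for sentence $x_i$ and the margin $\Delta(y, y_i)$ grows with the number of incorrect decisions (spans) in $y$, which is how the loss penalizes all incorrect decisions.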


Page 25

Backpropagation Through Structure

Introduced by Goller & Küchler (1996)

Principally the same as general backpropagation

Three differences resulting from the recursion and tree structure:
1. Sum derivatives of W from all nodes (like RNN)
2. Split derivatives at each node (for tree)
3. Add error messages from parent + node itself

[Excerpt from the accompanying backpropagation lecture notes, reproduced on the slide:]

The second derivative in eq. 28 for output units is simply

$$\frac{\partial a^{(n_l)}_i}{\partial W^{(n_l-1)}_{ij}} = \frac{\partial}{\partial W^{(n_l-1)}_{ij}} z^{(n_l)}_i = \frac{\partial}{\partial W^{(n_l-1)}_{ij}} \left( W^{(n_l-1)}_{i\cdot}\, a^{(n_l-1)} \right) = a^{(n_l-1)}_j. \tag{46}$$

We adopt standard notation and introduce the error $\delta$ related to an output unit:

$$\frac{\partial E_n}{\partial W^{(n_l-1)}_{ij}} = (y_i - t_i)\, a^{(n_l-1)}_j = \delta^{(n_l)}_i\, a^{(n_l-1)}_j. \tag{47}$$

So far, we only computed errors for output units; now we will derive $\delta$'s for normal hidden units and show how these errors are backpropagated to compute weight derivatives of lower levels. We will start with second-to-top layer weights, from which a generalization to arbitrarily deep layers will become obvious. Similar to eq. 28, we start with the error derivative:

$$\frac{\partial E}{\partial W^{(n_l-2)}_{ij}} = \sum_n \underbrace{\frac{\partial E_n}{\partial a^{(n_l)}}}_{\delta^{(n_l)}} \frac{\partial a^{(n_l)}}{\partial W^{(n_l-2)}_{ij}} + \lambda W^{(n_l-2)}_{ij}. \tag{48}$$

Now,

$$\begin{aligned}
(\delta^{(n_l)})^T \frac{\partial a^{(n_l)}}{\partial W^{(n_l-2)}_{ij}}
&= (\delta^{(n_l)})^T \frac{\partial z^{(n_l)}}{\partial W^{(n_l-2)}_{ij}} && (49)\\
&= (\delta^{(n_l)})^T \frac{\partial}{\partial W^{(n_l-2)}_{ij}} W^{(n_l-1)} a^{(n_l-1)} && (50)\\
&= (\delta^{(n_l)})^T \frac{\partial}{\partial W^{(n_l-2)}_{ij}} W^{(n_l-1)}_{\cdot i}\, a^{(n_l-1)}_i && (51)\\
&= (\delta^{(n_l)})^T W^{(n_l-1)}_{\cdot i} \frac{\partial}{\partial W^{(n_l-2)}_{ij}} a^{(n_l-1)}_i && (52)\\
&= (\delta^{(n_l)})^T W^{(n_l-1)}_{\cdot i} \frac{\partial}{\partial W^{(n_l-2)}_{ij}} f\big(z^{(n_l-1)}_i\big) && (53)\\
&= (\delta^{(n_l)})^T W^{(n_l-1)}_{\cdot i} \frac{\partial}{\partial W^{(n_l-2)}_{ij}} f\big(W^{(n_l-2)}_{i\cdot}\, a^{(n_l-2)}\big) && (54)\\
&= (\delta^{(n_l)})^T W^{(n_l-1)}_{\cdot i}\, f'\big(z^{(n_l-1)}_i\big)\, a^{(n_l-2)}_j && (55)\\
&= \left( (\delta^{(n_l)})^T W^{(n_l-1)}_{\cdot i} \right) f'\big(z^{(n_l-1)}_i\big)\, a^{(n_l-2)}_j && (56)\\
&= \underbrace{\left( \sum_{j=1}^{s_{l+1}} W^{(n_l-1)}_{ji}\, \delta^{(n_l)}_j \right) f'\big(z^{(n_l-1)}_i\big)}_{\delta^{(n_l-1)}_i}\; a^{(n_l-2)}_j && (57)\\
&= \delta^{(n_l-1)}_i\, a^{(n_l-2)}_j, && (58)
\end{aligned}$$

where we used in the first line that the top layer is linear. This is a very detailed account of essentially just the chain rule.

So, we can write the $\delta$ errors of all layers $l$ (except the top layer) (in vector format, using the Hadamard product $\circ$):

$$\delta^{(l)} = \left( (W^{(l)})^T \delta^{(l+1)} \right) \circ f'\big(z^{(l)}\big), \tag{59}$$

where the sigmoid derivative from eq. 14 gives $f'(z^{(l)}) = (1 - a^{(l)})\, a^{(l)}$. Using that definition, we get the hidden layer backprop derivatives:

$$\frac{\partial}{\partial W^{(l)}_{ij}} E_R = a^{(l)}_j\, \delta^{(l+1)}_i + \lambda W^{(l)}_{ij}, \tag{60}$$

which in one simplified vector notation becomes:

$$\frac{\partial}{\partial W^{(l)}} E_R = \delta^{(l+1)} \big(a^{(l)}\big)^T + \lambda W^{(l)}. \tag{62}$$

In summary, the backprop procedure consists of four steps:

1. Apply an input $x_n$ and forward propagate it through the network to get the hidden and output activations using eq. 18.
2. Evaluate $\delta^{(n_l)}$ for output units using eq. 42.
3. Backpropagate the $\delta$'s to obtain a $\delta^{(l)}$ for each hidden layer in the network using eq. 59.
4. Evaluate the required derivatives with eq. 62 and update all the weights using an optimization procedure such as conjugate gradient or L-BFGS. CG seems to be faster and work better when using mini-batches of training data to estimate the derivatives.

If you have any further questions or found errors, please send an email to [email protected]

5 Recursive Neural Networks

Same as backprop in the previous section, but splitting error derivatives and noting that the derivatives of the same W at each node can all be added up. Lastly, the delta's from the parent node and possible delta's from a softmax classifier at each node are just added.

Page 26

BTS: 1) Sum derivatives of all nodes

You can actually assume it's a different W at each node. Intuition via example:

If we take separate derivatives of each occurrence, we get the same:
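The slide's worked example is not in the transcript; a minimal scalar stand-in makes the same point. If the same $W$ appears twice, e.g. $y = W\,\sigma(Wx)$, then treating the two occurrences as separate variables, differentiating each, and summing gives exactly the single-variable derivative:

$$\frac{\partial y}{\partial W} \;=\; \underbrace{\sigma(Wx)}_{\text{outer occurrence}} \;+\; \underbrace{W\,\sigma'(Wx)\,x}_{\text{inner occurrence}}.$$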


Page 27

BTS: 2) Split derivatives at each node

During forward prop, the parent is computed using 2 children:

$$p = \tanh\left( W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b \right)$$

Hence, the errors need to be computed wrt each of them, where each child's error is n-dimensional.

[Figure: a parent node with children c1 and c2; the parent's error is split between the two children.]
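The split formula itself is not in the transcript; in the usual tanh TreeRNN it looks like

$$\delta_{[c_1;c_2]} = \big( W^T \delta_p \big) \circ f'\big([c_1; c_2]\big), \qquad \delta_{c_1} = \delta_{[c_1;c_2]}[1\!:\!n], \quad \delta_{c_2} = \delta_{[c_1;c_2]}[n\!+\!1\!:\!2n],$$

where $\delta_p$ is the parent's error at its pre-activation and, for tanh, $f'([c_1;c_2]) = 1 - [c_1;c_2] \circ [c_1;c_2]$.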


Page 28

BTS: 3) Add error messages

• At each node:
  • What came up (fprop) must come down (bprop)
  • Total error messages = error messages from parent + error message from own score

[Figure: a node receives an error message from its parent and from its own score, and passes the sum down to its children c1 and c2.]

Page 29

BTS Python Code: forwardProp
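The slide's code screenshot did not survive extraction; here is a small stand-in sketch of a forward pass over a binary tree (the node class, parameter names, and shapes are assumptions, not the course's actual assignment code):

```python
import numpy as np

class Node:
    """Binary tree node: a leaf holds a word vector, an internal node two children."""
    def __init__(self, word_vec=None, left=None, right=None):
        self.word_vec, self.left, self.right = word_vec, left, right
        self.h = None        # composed hidden vector, filled in by forward_prop
        self.score = None    # plausibility score of this merge

def forward_prop(node, W, b, U):
    """Recursively compute p = tanh(W[c1;c2]+b) and score = U^T p at every internal node."""
    if node.word_vec is not None:               # leaf: just use the word vector
        node.h = node.word_vec
        return node.h
    c1 = forward_prop(node.left, W, b, U)
    c2 = forward_prop(node.right, W, b, U)
    node.h = np.tanh(W @ np.concatenate([c1, c2]) + b)
    node.score = float(U @ node.h)
    return node.h

# Tiny usage example with made-up 4-d vectors
n = 4
rng = np.random.default_rng(0)
W, b, U = rng.normal(scale=0.1, size=(n, 2 * n)), np.zeros(n), rng.normal(scale=0.1, size=n)
tree = Node(left=Node(word_vec=rng.normal(size=n)), right=Node(word_vec=rng.normal(size=n)))
forward_prop(tree, W, b, U)
```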


Page 30

BTS Python Code: backProp
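This code screenshot is also missing; a sketch that follows the three BTS rules above, continuing the forwardProp sketch (same Node class, parameters, and tree). The toy objective here, E = sum of all merge scores, is an assumption for illustration, not the course's actual loss:

```python
def back_prop(node, delta_down, grads):
    """Backprop through structure for the toy objective E = sum of all merge scores.

    delta_down: error message arriving from the parent (zeros at the root).
    grads: dict accumulating dW, db, dU over all nodes (rule 1: sum them up).
    """
    if node.word_vec is not None:
        # Leaf: a fuller version would add delta_down into the word-vector gradient here.
        return
    # Rule 3: error from the parent plus the error from this node's own score.
    dE_dh = delta_down + U                      # d(score)/dh = U, and dE/d(score) = 1
    delta = dE_dh * (1.0 - node.h ** 2)         # back through tanh
    children = np.concatenate([node.left.h, node.right.h])
    grads["dW"] += np.outer(delta, children)
    grads["db"] += delta
    grads["dU"] += node.h
    # Rule 2: split the downward message between the two children.
    down = W.T @ delta
    back_prop(node.left, down[:n], grads)
    back_prop(node.right, down[n:], grads)

grads = {"dW": np.zeros_like(W), "db": np.zeros_like(b), "dU": np.zeros_like(U)}
back_prop(tree, np.zeros(n), grads)
```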


Page 31

Discussion: Simple RNN

• Decent results with single matrix TreeRNN
• Single weight matrix TreeRNN could capture some phenomena but is not adequate for more complex, higher-order composition and parsing long sentences
• There is no real interaction between the input words
• The composition function is the same for all syntactic categories, punctuation, etc.

[Figure: the single-matrix TreeRNN: one W composes c1 and c2 into p, and one W_score produces the score s.]

Page 32

Version 2: Syntactically-Untied RNN

• A symbolic Context-Free Grammar (CFG) backbone is adequate for basic syntactic structure
• We use the discrete syntactic categories of the children to choose the composition matrix (a small sketch follows below)
• A TreeRNN can do better with different composition matrix for different syntactic environments
• The result gives us a better semantics
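A sketch of the syntactic untying (the category names and dictionary layout are made up for illustration; the real model indexes matrices by the children's PCFG categories):

```python
import numpy as np

n = 4
rng = np.random.default_rng(0)

def new_params():
    # One (n, 2n) composition matrix and (n,) bias per pair of child categories.
    return rng.normal(scale=0.1, size=(n, 2 * n)), np.zeros(n)

W_default = new_params()
W_by_cats = {("DT", "NN"): new_params(), ("NP", "VP"): new_params()}

def compose_untied(c1, c2, cat1, cat2):
    """Choose the composition matrix from the children's syntactic categories."""
    W, b = W_by_cats.get((cat1, cat2), W_default)
    return np.tanh(W @ np.concatenate([c1, c2]) + b)

p = compose_untied(rng.normal(size=n), rng.normal(size=n), "DT", "NN")
```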


Page 33

Compositional Vector Grammars

• Problem: Speed. Every candidate score in beam search needs a matrix-vector product.
• Solution: Compute score only for a subset of trees coming from a simpler, faster model (PCFG)
  • Prunes very unlikely candidates for speed
  • Provides coarse syntactic categories of the children for each beam candidate
• Compositional Vector Grammar = PCFG + TreeRNN

Page 34

Related Work for parsing

• Resulting CVG Parser is related to previous work that extends PCFG parsers
• Klein and Manning (2003a): manual feature engineering
• Petrov et al. (2006): learning algorithm that splits and merges syntactic categories
• Lexicalized parsers (Collins, 2003; Charniak, 2000): describe each category with a lexical item
• Hall and Klein (2012) combine several such annotation schemes in a factored parser
• CVGs extend these ideas from discrete representations to richer continuous ones

Page 35

Experiments

• Standard WSJ split, labeled F1
• Based on simple PCFG with fewer states
• Fast pruning of search space, few matrix-vector products
• 3.8% higher F1, 20% faster than Stanford factored parser

Parser | Test, All Sentences
Stanford PCFG (Klein and Manning, 2003a) | 85.5
Stanford Factored (Klein and Manning, 2003b) | 86.6
Factored PCFGs (Hall and Klein, 2012) | 89.4
Collins (Collins, 1997) | 87.7
SSN (Henderson, 2004) | 89.4
Berkeley Parser (Petrov and Klein, 2007) | 90.1
CVG (RNN) (Socher et al., ACL 2013) | 85.0
CVG (SU-RNN) (Socher et al., ACL 2013) | 90.4
Charniak - Self-Trained (McClosky et al. 2006) | 91.0
Charniak - Self-Trained, Re-Ranked (McClosky et al. 2006) | 92.1

Page 36

SU-RNN / CVG [Socher, Bauer, Manning, Ng 2013]

Learns soft notion of head words
Initialization:

[Figure: the composition-matrix blocks for the child-category pairs NP-CC, NP-PP, PP-NP, and PRP$-NP.]

Page 37

SU-RNN / CVG [Socher, Bauer, Manning, Ng 2013]

[Figure: the composition-matrix blocks for the child-category pairs ADJP-NP, ADVP-ADJP, JJ-NP, and DT-NP.]

Page 38

Analysis of resulting vector representations

All the figures are adjusted for seasonal variations
1. All the numbers are adjusted for seasonal fluctuations
2. All the figures are adjusted to remove usual seasonal patterns

Knight-Ridder wouldn't comment on the offer
1. Harsco declined to say what country placed the order
2. Coastal wouldn't disclose the terms

Sales grew almost 7% to $UNK m. from $UNK m.
1. Sales rose more than 7% to $94.9 m. from $88.3 m.
2. Sales surged 40% to UNK b. yen from UNK b.

Page 39

Version 3: Compositionality Through Recursive Matrix-Vector Spaces

One way to make the composition function more powerful was by untying the weights W

But what if words act mostly as an operator, e.g. "very" in "very good"

Proposal: A new composition function

Before:
$$p = \tanh\left( W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b \right)$$

Page 40

Compositionality Through Recursive Matrix-Vector Recursive Neural Networks

Before:
$$p = \tanh\left( W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b \right)$$

Now:
$$p = \tanh\left( W \begin{bmatrix} C_2\, c_1 \\ C_1\, c_2 \end{bmatrix} + b \right)$$
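A rough sketch of that composition in NumPy (the shapes and the parent-matrix rule P = W_M [C1; C2] follow the MV-RNN paper cited on the next slide; treat the details as an illustration rather than the exact course code):

```python
import numpy as np

n = 4
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(n, 2 * n))    # composes the two matrix-transformed child vectors
b = np.zeros(n)
W_M = rng.normal(scale=0.1, size=(n, 2 * n))  # composes the two child matrices into the parent matrix

def mv_compose(c1, C1, c2, C2):
    """Each word/phrase is a (vector, matrix) pair; the matrices act as operators."""
    p = np.tanh(W @ np.concatenate([C2 @ c1, C1 @ c2]) + b)
    P = W_M @ np.vstack([C1, C2])             # (n, 2n) @ (2n, n) -> (n, n) parent matrix
    return p, P

# "very good": 'very' mostly acts as an operator on 'good'
very = (rng.normal(size=n), np.eye(n) + 0.5 * rng.normal(size=(n, n)))
good = (rng.normal(size=n), np.eye(n))
p, P = mv_compose(very[0], very[1], good[0], good[1])
```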


Page 41

Matrix-vector RNNs [Socher, Huval, Bhat, Manning, & Ng, 2012]

[Figure: each node carries a vector and a matrix; the children (a, A) and (b, B) are combined into a parent vector p and a parent matrix P.]

Page 42

Predicting Sentiment Distributions

Good example for non-linearity in language

Page 43

Classification of Semantic Relationships

• Can an MV-RNN learn how a large syntactic context conveys a semantic relationship?
• My [apartment]e1 has a pretty large [kitchen]e2 → component-whole relationship (e2, e1)
• Build a single compositional semantics for the minimal constituent including both terms

Page 44

Classification of Semantic Relationships

Classifier | Features | F1
SVM | POS, stemming, syntactic patterns | 60.1
MaxEnt | POS, WordNet, morphological features, noun compound system, thesauri, Google n-grams | 77.6
SVM | POS, WordNet, prefixes, morphological features, dependency parse features, Levin classes, PropBank, FrameNet, NomLex-Plus, Google n-grams, paraphrases, TextRunner | 82.2
RNN | – | 74.8
MV-RNN | – | 79.1
MV-RNN | POS, WordNet, NER | 82.4

Page 45

Scene Parsing

• The meaning of a scene image is also a function of smaller regions,
• how they combine as parts to form larger objects,
• and how the objects interact.

Similar principle of compositionality.

Page 46

Algorithm for Parsing Images

Same Recursive Neural Network as for natural language parsing! (Socher et al. ICML 2011)

[Figure: Parsing Natural Scene Images: image segments (grass, tree, people, building) are mapped to features and semantic representations, then merged recursively into a scene parse.]

Page 47

Multi-class segmentation

Method | Accuracy
Pixel CRF (Gould et al., ICCV 2009) | 74.3
Classifier on superpixel features | 75.9
Region-based energy (Gould et al., ICCV 2009) | 76.4
Local labelling (Tighe & Lazebnik, ECCV 2010) | 76.9
Superpixel MRF (Tighe & Lazebnik, ECCV 2010) | 77.5
Simultaneous MRF (Tighe & Lazebnik, ECCV 2010) | 77.5
Recursive Neural Network | 78.1

Stanford Background Dataset (Gould et al. 2009)

Page 48

Next lecture

• Model overview, comparison, extensions, combinations, etc.