Deep Learning for NLP Part 2 CS224N Christopher Manning (Many slides borrowed from ACL 2012/NAACL 2013 Tutorials by me, Richard Socher and Yoshua Bengio)


Page 1:

Deep Learning for NLP Part 2

CS224N Christopher Manning

(Many slides borrowed from ACL 2012/NAACL 2013 Tutorials by me, Richard Socher and Yoshua Bengio)

Page 2:

Word Representations
Part 1.3: The Basics

Page 3:

The standard word representation

The vast majority of rule-based and statistical NLP work regards words as atomic symbols: hotel, conference, walk

In vector space terms, this is a vector with one 1 and a lot of zeroes:

[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]

Dimensionality: 20K (speech) – 50K (PTB) – 500K (big vocab) – 13M (Google 1T)

We call this a "one-hot" representation. Its problem:

motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0] AND
hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0] = 0
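A minimal numpy sketch of the problem (the toy vocabulary size and the indices of "motel" and "hotel" are made up for illustration): any two distinct one-hot vectors have dot product 0, so the representation encodes no notion of similarity.

```python
import numpy as np

VOCAB_SIZE = 15  # toy vocabulary size, purely illustrative

def one_hot(index, size=VOCAB_SIZE):
    """Return a vector of zeros with a single 1 at the word's index."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

motel = one_hot(10)  # hypothetical index of "motel"
hotel = one_hot(7)   # hypothetical index of "hotel"

# Distinct one-hot vectors are orthogonal: their similarity is always 0.
print(np.dot(motel, hotel))  # 0.0
```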

Page 4:

Distributional similarity based representations

You can get a lot of value by representing a word by means of its neighbors

"You shall know a word by the company it keeps" (J. R. Firth 1957: 11)

One of the most successful ideas of modern statistical NLP

government debt problems turning into banking crises as has happened in

saying that Europe needs unified banking regulation to replace the hodgepodge

→ These words will represent "banking"

You can vary whether you use local or large context to get a more syntactic or semantic clustering
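A small sketch of the idea, assuming nothing more than a tokenized corpus: represent a word by counts of the words that occur within a fixed window around it (the window size is what controls how syntactic or semantic the resulting clustering is).

```python
from collections import Counter

def context_counts(tokens, target, window=4):
    """Count words occurring within +/- `window` positions of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

tokens = "government debt problems turning into banking crises as has happened in".split()
print(context_counts(tokens, "banking"))
# e.g. Counter({'debt': 1, 'problems': 1, 'turning': 1, 'into': 1,
#               'crises': 1, 'as': 1, 'has': 1, 'happened': 1})
```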

Page 5:

Distributional word vectors

Distributional counts give a same-dimension but denser representation

[Table: co-occurrence counts of motel, hotel, and bank with the context words debt, crises, unified, hodgepodge, the, pillow, reception, internet; the individual counts did not survive extraction]

Page 6:

Two traditional word representations: class-based and soft clustering

Class-based models learn word classes of similar words based on distributional information (~ class HMM)
• Brown clustering (Brown et al. 1992, Liang 2005)
• Exchange clustering (Martin et al. 1998, Clark 2003)

1. Clinton, Jiang, Bush, Wilensky, Suharto, Reagan, …
5. also, still, already, currently, actually, typically, …
6. recovery, strength, expansion, freedom, resistance, …

Soft clustering models learn, for each cluster/topic, a distribution over words giving how likely each word is in that cluster
• Latent Semantic Analysis (LSA/LSI), random projections
• Latent Dirichlet Allocation (LDA), HMM clustering

Page 7:

Neural word embeddings as a distributed representation

Similar idea: word meaning is represented as a (dense) vector – a point in a (medium-dimensional) vector space

Neural word embeddings combine vector space semantics with the prediction of probabilistic models (Bengio et al. 2003, Collobert & Weston 2008, Huang et al. 2012)

linguistics = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]

Page 8:

Neural word embeddings – visualization

Page 9:

Kulkarni, Al-Rfou, Perozzi, & Skiena (2015)

Page 10:

Distributed vs. distributional representations

Distributed: a concept is represented as continuous activation levels in a number of elements [Contrast: local]

Distributional: meaning is represented by contexts of use [Contrast: denotational]

Page 11:

Advantages of the neural word embedding approach

Compared to other methods, neural word embeddings can become more meaningful through adding supervision from one or multiple tasks

For instance, sentiment is usually not captured in unsupervised word embeddings but can be in neural word vectors

We can build compositional vector representations for longer phrases (next lecture)

Page 12:

Unsupervised word vector learning
Part 1.4: The Basics

Page 13:

The mainstream methods for neural word vector learning in 2015

1. word2vec
• https://code.google.com/p/word2vec/
• Code for two different algorithms (Skip-gram with negative sampling – SGNS, and Continuous Bag of Words – CBOW)
• Uses simple, fast bilinear models to learn very good word representations
• Mikolov, Sutskever, Chen, Corrado, and Dean. NIPS 2013

2. GloVe
• http://nlp.stanford.edu/projects/glove/
• A non-linear matrix factorization: starts with a count matrix and factorizes it with an explicit loss function
• Sort of similar to SGNS, really, but from Stanford ☺
• Pennington, Socher, and Manning. EMNLP 2014
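In practice, vectors from either family are usually consumed as pretrained files; a sketch using gensim, assuming gensim is installed and a word2vec-format vector file is available (the file name below is only a placeholder):

```python
from gensim.models import KeyedVectors

# Placeholder path: point this at any word2vec-format vector file.
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

print(vectors.most_similar("banking", topn=5))  # nearest neighbours in the space
print(vectors.similarity("hotel", "motel"))     # cosine similarity of two words
```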

Page 14:

But we won't look at either of those; instead: contrastive estimation of word vectors

A neural network for learning word vectors (Collobert et al. JMLR 2011)

Idea: a word and its context is a positive training sample; a random word in that same context gives a negative training sample:

cat chills on a mat       (positive)
cat chills Ohio a mat     (negative)
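A tiny sketch of how such pairs are formed, assuming a list of window tokens and a vocabulary to sample from: the observed window is the positive example, and replacing its centre word with a random vocabulary word gives a negative example.

```python
import random

def corrupt_center(window, vocab):
    """Replace the centre word of a window with a random vocabulary word."""
    corrupted = list(window)
    corrupted[len(window) // 2] = random.choice(vocab)
    return corrupted

window = ["cat", "chills", "on", "a", "mat"]            # positive sample
vocab = ["Ohio", "hotel", "banking", "debt", "pillow"]  # toy vocabulary
print(corrupt_center(window, vocab))                    # e.g. ['cat', 'chills', 'Ohio', 'a', 'mat']
```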

Page 15:

A neural network for learning word vectors

How do we formalize this idea? Ask that
score(cat chills on a mat) > score(cat chills Ohio a mat)

How do we compute the score?
• With a neural network
• Each word is associated with an n-dimensional vector

Page 16:

Word embedding matrix

• Initialize all word vectors randomly to form a word embedding matrix L of size n × |V|:

L = [ the | cat | mat | … ]

• These are the word features we want to learn
• Also called a look-up table
• Mathematically you get a word's vector by multiplying L with a one-hot vector e: x = Le
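A sketch of the look-up table with made-up sizes: multiplying L by a one-hot column vector e just selects one column, which is why implementations retrieve the vector by indexing rather than by an actual matrix multiply.

```python
import numpy as np

n, V = 4, 10                      # embedding size and vocabulary size (toy values)
L = np.random.randn(n, V) * 0.01  # randomly initialised word embedding matrix

word_index = 3                    # hypothetical index of some word
e = np.zeros(V)
e[word_index] = 1.0

x_by_multiply = L @ e             # x = Le, as on the slide
x_by_lookup = L[:, word_index]    # equivalent, and what code actually does

assert np.allclose(x_by_multiply, x_by_lookup)
```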

Page 17:

Word vectors as input to a neural network

• score(cat chills on a mat)
• To describe a phrase, retrieve (via index) the corresponding vectors from L:

cat  chills  on  a  mat

• Then concatenate them to form a 5n vector:
• x = [ x_cat ; x_chills ; x_on ; x_a ; x_mat ]
• How do we then compute score(x)?

Page 18:

Scoring a Single Layer Neural Network

• A single layer is a combination of a linear layer and a nonlinearity: a = f(Wx + b)

• The neural activations a can then be used to compute some function.

• For instance, the score we care about: score(x) = Uᵀa
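A minimal sketch combining this slide with the previous one (the sizes and the choice of tanh as the nonlinearity are assumptions, consistent with the rest of the deck): concatenate the five word vectors into one input x, apply one linear layer plus a nonlinearity, then reduce the hidden activations to a scalar score.

```python
import numpy as np

n, window, hidden = 4, 5, 2                # toy dimensions
L = np.random.randn(n, 10) * 0.01          # embedding matrix for a 10-word vocabulary

indices = [0, 1, 2, 3, 4]                  # hypothetical indices of "cat chills on a mat"
x = np.concatenate([L[:, i] for i in indices])   # 5n-dimensional window vector

W = np.random.randn(hidden, window * n) * 0.01   # layer weights
b = np.zeros(hidden)                             # layer bias
U = np.random.randn(hidden)                      # scoring vector

a = np.tanh(W @ x + b)   # single layer: linear transform followed by a nonlinearity
s = U @ a                # the score we care about
print(s)
```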

Page 19:

Summary: Feed-forward Computation

Computing a window's score with a 3-layer neural net: s = score(cat chills on a mat)

[Figure: the window "cat chills on a mat" fed through the embedding layer, hidden layer, and scoring layer]

Page 20:

Summary: Feed-forward Computation

• s = score(cat chills on a mat)
• s_c = score(cat chills Ohio a mat)

• Idea for training objective: make the score of the true window larger and the corrupt window's score lower (until they're sufficiently separated). Minimize J.

• This is continuous, so we can perform SGD
• Look at a few examples, nudge the weights to make J smaller
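The objective J on this slide is commonly written as the margin loss J = max(0, 1 − s + s_c). A sketch of one SGD nudge on the scoring vector U only (the learning rate and toy activations are illustrative; the full model also updates W, b, and the word vectors):

```python
import numpy as np

def hinge(s, s_c):
    """Margin objective: push the true window's score above the corrupt one's."""
    return max(0.0, 1.0 - s + s_c)

# a and a_c are the hidden activations of the true and corrupt windows (toy values).
a, a_c = np.array([0.3, -0.2]), np.array([0.1, 0.4])
U = np.array([0.5, 0.5])
lr = 0.01

s, s_c = U @ a, U @ a_c
if hinge(s, s_c) > 0:        # the derivative is zero once the margin is satisfied
    grad_U = -a + a_c        # dJ/dU, since s = U.a and s_c = U.a_c
    U -= lr * grad_U         # nudge the weights to make J smaller
print(U)
```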

Page 21:

The Backpropagation Algorithm

• Backpropagation is a way of computing gradients of expressions efficiently through recursive application of the chain rule.
• Either Rumelhart, Hinton & McClelland 1986, or Werbos 1974, or Linnainmaa 1970

• The derivative of a loss (an objective function) on each variable tells you the sensitivity of the loss to its value.

• An individual step of the chain rule is local. It tells you how the sensitivity of the objective function is modulated by that function in the network:
• The input changes at some rate for a variable
• The function at the node scales that rate

Page 22:

Training the parameters of the model

If cost J is < 0 (or = 0!), the derivative is 0: do nothing. Assuming cost J is > 0, it is simple to see that we can compute the derivatives of s and s_c with respect to all the involved variables: U, W, b, x. Then take differences for J.

Page 23:

Training with Backpropagation

• Let's consider the derivative of a single weight W_ij

• This only appears inside a_i

• For example: W_23 is only used to compute a_2

[Figure: single-layer network with inputs x1, x2, x3 and bias +1, hidden units a1, a2, score s = Uᵀa; the weight W_23 connects x3 to a2]

Page 24:

Training with Backpropagation

Derivative of weight W_ij:

[Figure: the same single-layer network as on the previous slide]

Page 25:

Training with Backpropagation

Derivative of a single weight W_ij: a local error signal times a local input signal

where for logistic f, f′(z) = f(z)(1 − f(z))

[Figure: the same single-layer network as before]

Page 26:

Training with Backpropagation

• From the single weight W_ij to the full W:
• We want all combinations of i = 1, 2 and j = 1, 2, 3
• Solution: the outer product δxᵀ, where δ is the "responsibility" coming from each activation a:

[ δ1·x1  δ1·x2  δ1·x3 ]
[ δ2·x1  δ2·x2  δ2·x3 ]

[Figure: the same single-layer network as before]
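A sketch of the full gradient for W in the single-layer scorer, using the logistic nonlinearity named on the previous slide: the local error signal is δ_i = U_i · f(z_i)(1 − f(z_i)), and the whole gradient of the score with respect to W is the outer product δxᵀ (toy sizes matching the figure).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dimensions matching the figure: 3 inputs, 2 hidden units.
x = np.array([0.5, -1.0, 0.25])
W = np.random.randn(2, 3) * 0.1
b = np.zeros(2)
U = np.array([1.0, -1.0])

z = W @ x + b
a = sigmoid(z)
s = U @ a                      # the score

delta = U * a * (1 - a)        # local error signal: U_i * f'(z_i) for logistic f
grad_W = np.outer(delta, x)    # ds/dW: one outer product instead of per-entry loops
print(grad_W.shape)            # (2, 3): delta_i * x_j in each cell
```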

Page 27:

Training with Backpropagation

• For the biases b, we get: ∂s/∂b_i = δ_i

[Figure: the same single-layer network as before]

Page 28:

Training with Backpropagation

That's almost backpropagation: it's simply taking derivatives and using the chain rule!

Remaining trick: efficiency – we can re-use derivatives computed for higher layers in computing derivatives for lower layers

Example: the last derivatives of the model, the word vectors in x

Page 29:

Training with Backpropagation

• Take the derivative of the score with respect to a single word vector (for simplicity a 1d vector, but the same if it were longer)

• Now, we cannot just take into consideration one a_i, because each x_j is connected to all the neurons above and hence x_j influences the overall score through all of them; we therefore sum over them, re-using part of the previous derivative.
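The re-use on this slide amounts to one matrix-vector product: a small self-contained sketch (toy sizes) in which the gradient of the score with respect to the input window x is Wᵀδ, with δ the same local error signal already computed for the layer above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 0.25])      # toy input window
W = np.random.randn(2, 3) * 0.1
b = np.zeros(2)
U = np.array([1.0, -1.0])

a = sigmoid(W @ x + b)
delta = U * a * (1 - a)              # same local error signal as for the W gradient

# Each x_j feeds every hidden unit, so its derivative sums over all of them:
grad_x = W.T @ delta                 # re-uses delta instead of recomputing anything
print(grad_x)
```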

Page 30:

Learning word-level classifiers: POS and NER
Part 1.6: The Basics

Page 31:

The Model (Collobert & Weston 2008; Collobert et al. 2011)

• Similar to word vector learning, but replaces the single scalar score with a Softmax/Maxent classifier

• Training is again done via backpropagation, which gives an error similar to (but not the same as!) the score in the unsupervised word vector learning model

[Figure: the same window network, with a classifier on top instead of the scalar score]

Page 32:

The Model – Training

• We already know softmax/MaxEnt and how to optimize it
• The interesting twist in deep learning is that the input features are also learned, similarly to learning word vectors with a score

[Figure: two copies of the window network side by side – one ending in a scalar score s, one ending in softmax outputs c1, c2, c3 – over the same inputs x1, x2, x3, +1 and hidden units a1, a2]

Page 33:

Training with Backpropagation: softmax

What is the major benefit of learned word vectors?

The ability to also propagate labeled information into them, via the softmax/maxent classifier and the hidden layer:

P(c | d, λ) = exp(λᵀ f(c, d)) / Σ_c′ exp(λᵀ f(c′, d))

[Figure: the window network with softmax outputs c1, c2, c3 on top]
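A sketch of the softmax itself, applied to hypothetical unnormalised class scores λᵀf(c, d): it produces the probabilities P(c | d, λ) above, and it is the gradient of this classifier that flows back through the hidden layer into the word vectors.

```python
import numpy as np

def softmax(scores):
    """Turn unnormalised class scores into probabilities (numerically stable)."""
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical unnormalised scores lambda^T f(c, d) for three classes c1, c2, c3.
scores = np.array([2.0, 0.5, -1.0])
probs = softmax(scores)
print(probs, probs.sum())   # probabilities over the classes, summing to 1
```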

Page 34:

For small supervised data sets, unsupervised pre-training helps a lot

                                              POS WSJ (acc.)   NER CoNLL (F1)
State-of-the-art*                                  97.24            89.31
Supervised NN                                      96.37            81.47
Unsupervised pre-training
  followed by supervised NN**                      97.20            88.87
+ hand-crafted features***                         97.29            89.59

* Results used in Collobert & Weston (2011). Representative systems: POS (Toutanova et al. 2003), NER (Ando & Zhang 2005)
** 130,000-word embedding trained on Wikipedia and Reuters with an 11-word window and a 100-unit hidden layer – for 7 weeks! – then supervised task training
*** Features are character suffixes for POS and a gazetteer for NER

Page 35:

Supervised refinement of the unsupervised word representation helps

                                POS WSJ (acc.)   NER CoNLL (F1)
Supervised NN                        96.37            81.47
NN with Brown clusters               96.92            87.15
Fixed embeddings*                    97.10            88.87
C&W 2011**                           97.29            89.59

* Same architecture as C&W 2011, but word embeddings are kept constant during the supervised training phase
** C&W is the unsupervised pre-training + supervised NN + features model of the last slide

Page 36:

Backpropagation Training
Part 1.5: The Basics

Page 37:

Back-Prop

• Compute the gradient of the example-wise loss with respect to the parameters

• Simply applying the derivative chain rule wisely

• If computing the loss (example, parameters) is O(n) computation, then so is computing the gradient

Page 38:

Simple Chain Rule
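The slide's equation did not survive extraction; for a single composition of two functions it is the standard single-path chain rule (stated here as a reconstruction, not a copy of the slide):

```latex
z = f(y), \quad y = g(x)
\quad\Longrightarrow\quad
\frac{\partial z}{\partial x} \;=\; \frac{\partial z}{\partial y}\,\frac{\partial y}{\partial x}
```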

Page 39:

Multiple Paths Chain Rule

Page 40:

Multiple Paths Chain Rule – General
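The corresponding general form, when x influences z through several intermediate variables y₁, …, yₙ (again a reconstruction of the standard formula rather than the slide's own image):

```latex
\frac{\partial z}{\partial x} \;=\; \sum_{i=1}^{n} \frac{\partial z}{\partial y_i}\,\frac{\partial y_i}{\partial x}
```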

Page 41:

Chain Rule in Flow Graph

Flow graph: any directed acyclic graph
node = computation result
arc = computation dependency

{y1, y2, …} = successors of x; the gradient at x sums the chain rule over its successors

Page 42:

Back-Prop in Multi-Layer Net

h = sigmoid(Vx)
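A sketch of backprop through a small multi-layer net; the slide only specifies h = sigmoid(Vx), so the scalar score on top is an assumption chosen to stay in line with the rest of the deck.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, -0.5, 0.25])    # toy input
V = np.random.randn(4, 3) * 0.1    # first-layer weights
U = np.random.randn(4) * 0.1       # output weights (assumed scalar-score head)

# Forward pass
h = sigmoid(V @ x)                 # h = sigmoid(Vx), as on the slide
s = U @ h                          # scalar output

# Backward pass: outermost derivative first, re-using each intermediate result.
grad_U = h                          # ds/dU
delta_h = U * h * (1 - h)           # ds/d(Vx), through the sigmoid
grad_V = np.outer(delta_h, x)       # ds/dV
grad_x = V.T @ delta_h              # ds/dx, if gradients should reach the inputs
print(grad_V.shape, grad_x.shape)
```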

Page 43:

Back-Prop in General Flow Graph

Single scalar output; {y1, y2, …} = successors of x

1. Fprop: visit nodes in topo-sort order
   - Compute the value of each node given its predecessors
2. Bprop:
   - initialize the output gradient = 1
   - visit nodes in reverse order:
     compute the gradient wrt each node using the gradient wrt its successors

Page 44:

Automatic Differentiation

• The gradient computation can be automatically inferred from the symbolic expression of the fprop.

• Each node type needs to know how to compute its output and how to compute the gradient wrt its inputs given the gradient wrt its output.

• Easy and fast prototyping
• See Theano (Python)
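A compact sketch of the recipe from the last two slides: each node knows how to compute its value from its predecessors and how to pass a gradient back to them; fprop visits nodes in topological order, bprop in reverse order starting from an output gradient of 1. Everything here (class names, the example graph) is illustrative rather than any particular library's API.

```python
class Node:
    """One node of a flow graph: a value plus a rule for back-propagating."""
    def __init__(self, parents=()):
        self.parents, self.value, self.grad = list(parents), None, 0.0
    def fprop(self): ...
    def bprop(self): ...

class Input(Node):
    def __init__(self, value):
        super().__init__()
        self.value = value
    def fprop(self): pass
    def bprop(self): pass

class Mul(Node):
    def fprop(self):
        a, b = self.parents
        self.value = a.value * b.value
    def bprop(self):
        a, b = self.parents
        a.grad += self.grad * b.value     # d(ab)/da = b
        b.grad += self.grad * a.value     # d(ab)/db = a

class Add(Node):
    def fprop(self):
        self.value = sum(p.value for p in self.parents)
    def bprop(self):
        for p in self.parents:
            p.grad += self.grad           # d(a+b)/da = 1

# Example graph for s = x*y + x, listed in topological order.
x, y = Input(2.0), Input(3.0)
prod = Mul([x, y])
s = Add([prod, x])
order = [x, y, prod, s]

for node in order:                 # 1. fprop in topo-sort order
    node.fprop()
s.grad = 1.0                       # 2. bprop: output gradient initialised to 1
for node in reversed(order):
    node.bprop()

print(s.value, x.grad, y.grad)     # 8.0, 4.0 (= y + 1), 2.0 (= x)
```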

Page 45:

Sharing Statistical Strength
Part 1.7

Page 46:

Sharing Statistical Strength

• Besides very fast prediction, the main advantage of deep learning is statistical

• Potential to learn from fewer labeled examples because of sharing of statistical strength:
• Unsupervised pre-training & multi-task learning
• Semi-supervised learning →

Page 47:

Semi-Supervised Learning

• Hypothesis: P(c|x) can be more accurately computed using shared structure with P(x)

[Figure: purely supervised]

Page 48:

Semi-Supervised Learning

• Hypothesis: P(c|x) can be more accurately computed using shared structure with P(x)

[Figure: semi-supervised]

Page 49:

Multi-Task Learning

• Generalizing better to new tasks is crucial to approach AI

• Deep architectures learn good intermediate representations that can be shared across tasks

• Good representations make sense for many tasks

[Figure: raw input x feeding a shared intermediate representation h, which feeds task 1 output y1, task 2 output y2, and task 3 output y3]