Deep Learning for NLP, Part 2
CS224N, Christopher Manning
(Many slides borrowed from ACL 2012 / NAACL 2013 tutorials by me, Richard Socher, and Yoshua Bengio)
Word Representations
Part 1.3: The Basics
The standard word representation
The vast majority of rule-based and statistical NLP work regards words as atomic symbols: hotel, conference, walk

In vector space terms, this is a vector with one 1 and a lot of zeroes:
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
Dimensionality: 20K (speech), 50K (PTB), 500K (big vocab), 13M (Google 1T)

We call this a "one-hot" representation. Its problem:
motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]  AND  hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]  =  0
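To see the problem concretely, here is a minimal sketch (NumPy, with a hypothetical 15-word vocabulary): every pair of distinct one-hot vectors has dot product 0, so no notion of similarity survives.

    import numpy as np

    V = 15                               # hypothetical vocabulary size
    motel = np.zeros(V); motel[10] = 1   # one-hot vector for "motel"
    hotel = np.zeros(V); hotel[7] = 1    # one-hot vector for "hotel"

    # The dot product of any two distinct one-hot vectors is 0:
    print(motel @ hotel)   # 0.0 -- the representation sees no similarity at all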
Distributional similarity-based representations

You can get a lot of value by representing a word by means of its neighbors.
"You shall know a word by the company it keeps" (J. R. Firth 1957: 11)
One of the most successful ideas of modern statistical NLP
government debt problems turning into banking crises as has happened in
saying that Europe needs unified banking regulation to replace the hodgepodge
↑ These words will represent banking ↑

You can vary whether you use local or large context to get a more syntactic or semantic clustering
Distributional word vectors
Distributional counts give a same-dimensionality, denser representation:

[Table: co-occurrence counts of motel, hotel, and bank with the context words debt, crises, unified, hodgepodge, the, pillow, reception, internet]
Two traditional word representations: class-based and soft clustering
Class-based models learn word classes of similar words based on distributional information (~ class HMM)
• Brown clustering (Brown et al. 1992, Liang 2005)
• Exchange clustering (Martin et al. 1998, Clark 2003)

Example word classes:
1. Clinton, Jiang, Bush, Wilensky, Suharto, Reagan, …
5. also, still, already, currently, actually, typically, …
6. recovery, strength, expansion, freedom, resistance, …

Soft clustering models learn, for each cluster/topic, a distribution over words of how likely that word is in each cluster
• Latent Semantic Analysis (LSA/LSI), random projections
• Latent Dirichlet Allocation (LDA), HMM clustering
Neural word embeddings as a distributed representation
Similar idea: word meaning is represented as a (dense) vector, a point in a (medium-dimensional) vector space

Neural word embeddings combine vector space semantics with the prediction of probabilistic models (Bengio et al. 2003, Collobert & Weston 2008, Huang et al. 2012)
linguistics = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]
Neural word embeddings: visualization

[Figure: visualization of word embeddings; Kulkarni, Al-Rfou, Perozzi, & Skiena (2015)]
Distributed vs. distributional representations
Distributed: a concept is represented as continuous activation levels in a number of elements [contrast: local]

Distributional: meaning is represented by contexts of use [contrast: denotational]
Advantages of the neural word embedding approach
Compared to other methods, neural word embeddings can become more meaningful through adding supervision from one or multiple tasks.

For instance, sentiment is usually not captured in unsupervised word embeddings, but it can be in neural word vectors.

We can build compositional vector representations for longer phrases (next lecture).
Unsupervised word vector learning
Part 1.4: The Basics
The mainstream methods for neural word vector learning in 2015
1. word2vec
   • https://code.google.com/p/word2vec/
   • Code for two different algorithms: Skip-gram with negative sampling (SGNS) and Continuous Bag of Words (CBOW)
   • Uses simple, fast bilinear models to learn very good word representations
   • Mikolov, Sutskever, Chen, Corrado, and Dean. NIPS 2013

2. GloVe
   • http://nlp.stanford.edu/projects/glove/
   • A non-linear matrix factorization: starts with a count matrix and factorizes it with an explicit loss function
   • Sort of similar to SGNS, really, but from Stanford ☺
   • Pennington, Socher, and Manning. EMNLP 2014
But we won't look at either; instead … Contrastive Estimation of Word Vectors
A neural network for learning word vectors (Collobert et al. JMLR 2011)

Idea: a word and its context form a positive training sample; a random word in that same context gives a negative training sample:

cat chills on a mat (positive)    vs.    cat chills Ohio a mat (negative)
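A sketch of how such pairs can be generated (the vocabulary and window below are stand-ins, not Collobert et al.'s actual setup): corrupt each observed window by swapping its center word for a random vocabulary word.

    import random

    vocab = ["cat", "chills", "on", "a", "mat", "Ohio", "the", "sat"]  # stand-in vocabulary

    def corrupt(window, vocab, center=2):
        # Replace the center word with a different random word -> negative sample
        replacement = random.choice([w for w in vocab if w != window[center]])
        bad = list(window)
        bad[center] = replacement
        return bad

    positive = ["cat", "chills", "on", "a", "mat"]
    negative = corrupt(positive, vocab)   # e.g. ["cat", "chills", "Ohio", "a", "mat"]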
A neural network for learning word vectors
How do we formalize this idea? Ask that
score(cat chills on a mat) > score(cat chills Ohio a mat)

How do we compute the score?
• With a neural network
• Each word is associated with an n-dimensional vector
Word embedding matrix
• Initialize all word vectors randomly to form a word embedding matrix L ∈ ℝ^{n × |V|}, with one n-dimensional column per vocabulary word (the, cat, mat, …)
• These are the word features we want to learn
• Also called a look-up table
• Mathematically, you get a word's vector by multiplying L with a one-hot vector e: x = Le
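A minimal sketch of the look-up (random L, hypothetical word indices): multiplying L by a one-hot e selects a column, so in practice the lookup is implemented as indexing.

    import numpy as np

    n, V = 8, 100                    # embedding size and vocabulary size (stand-ins)
    L = 0.1 * np.random.randn(n, V)  # randomly initialized embedding matrix
    word_index = {"the": 0, "cat": 1, "mat": 2}   # hypothetical look-up table indices

    e = np.zeros(V); e[word_index["cat"]] = 1     # one-hot vector for "cat"
    x = L @ e                                     # x = Le, as on the slide
    assert np.allclose(x, L[:, word_index["cat"]])  # same result via cheap indexing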
• score(cat chills on a mat)
• To describe a phrase, retrieve (via index) the corresponding vectors from L:
  cat chills on a mat
• Then concatenate them to form a 5n-vector: x = [x_cat  x_chills  x_on  x_a  x_mat]
• How do we then compute score(x)?
Word vectors as input to a neural network
Scoring with a Single Layer Neural Network

• A single layer is a combination of a linear layer and a nonlinearity: z = Wx + b, a = f(z)
• The neural activations a can then be used to compute some function
• For instance, the score we care about: score(x) = Uᵀa
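As a sketch (shapes illustrative; logistic f assumed, as in the derivative slides below):

    import numpy as np

    def f(z):
        # logistic nonlinearity
        return 1.0 / (1.0 + np.exp(-z))

    def score(x, W, b, U):
        # single layer: a = f(Wx + b), then scalar score s = U^T a
        a = f(W @ x + b)
        return U @ a

    n_in, n_hidden = 5 * 8, 2        # e.g. five 8-d word vectors, 2 hidden units
    W = 0.1 * np.random.randn(n_hidden, n_in)
    b = np.zeros(n_hidden)
    U = 0.1 * np.random.randn(n_hidden)
    s = score(np.random.randn(n_in), W, b, U)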
Summary: Feed-forward Computation
Computing a window's score with a 3-layer neural net: s = score(cat chills on a mat)
s = Uᵀ f(Wx + b), where x concatenates the word vectors of cat, chills, on, a, mat
Summary: Feed-forward Computation
• s = score(cat chills on a mat)
• s_c = score(cat chills Ohio a mat)
• Idea for training objective: make the score of the true window larger and the corrupt window's score lower (until they're sufficiently separated). Minimize
  J = max(0, 1 − s + s_c)
• This is continuous, so we can perform SGD
• Look at a few examples, nudge weights to make J smaller
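A sketch of that objective over one (true, corrupt) pair, reusing the score function sketched above:

    def hinge_loss(s, s_c, margin=1.0):
        # J = max(0, margin - s + s_c): zero once the true window
        # outscores the corrupt one by at least the margin
        return max(0.0, margin - s + s_c)

    # SGD idea: if J > 0, nudge the weights to increase s and decrease s_c;
    # if J == 0, the gradient is zero and the example is skipped.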
The Backpropagation Algorithm
• Backpropagation is a way of computing gradients of expressions efficiently through recursive application of the chain rule
• Either Rumelhart, Hinton & Williams 1986, or Werbos 1974, or Linnainmaa 1970
• The derivative of a loss (an objective function) on each variable tells you the sensitivity of the loss to its value
• An individual step of the chain rule is local. It tells you how the sensitivity of the objective function is modulated by that function in the network:
  • The input changes at some rate for a variable
  • The function at the node scales that rate
Training the parameters of the model
If the cost J is 0 (i.e., 1 − s + s_c < 0), the derivative is 0: do nothing. Assuming the cost J is > 0, it is simple to see that we can compute the derivatives of s and s_c with respect to all the involved variables: U, W, b, x. Then take differences for J.
Training with Backpropagation
• Let's consider the derivative of a single weight W_ij
• This only appears inside a_i
• For example: W_23 is only used to compute a_2

[Diagram: inputs x1, x2, x3 and bias +1; hidden activations a1, a2; output s = Uᵀa; the weight W_23 connects x3 to a2]
Training with Backpropagation
Derivative of weight W_ij:

∂s/∂W_ij = U_i f′(z_i) x_j

where, for logistic f: f′(z) = f(z) (1 − f(z))
Training with Backpropagation
Derivative of a single weight W_ij:

∂s/∂W_ij = U_i f′(z_i) · x_j = δ_i x_j

• δ_i = U_i f′(z_i) is the local error signal; x_j is the local input signal
• We want all combinations of i = 1, 2 and j = 1, 2, 3
• Solution: the outer product ∂s/∂W = δ xᵀ, where δ is the "responsibility" coming from each activation a
Training with Backpropagation
• From single weight W_ij to full W:

∂s/∂W = δ xᵀ =
⎡ δ1x1  δ1x2  δ1x3 ⎤
⎣ δ2x1  δ2x2  δ2x3 ⎦
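In NumPy this is one line; a sketch with the 2-hidden-unit, 3-input shapes of the figure:

    import numpy as np

    x = np.random.randn(3)        # local input signal (x1, x2, x3)
    delta = np.random.randn(2)    # local error signals, delta_i = U_i * f'(z_i)

    grad_W = np.outer(delta, x)   # 2x3 matrix with entries grad_W[i, j] = delta[i] * x[j]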
Training with Backpropagation
• For biases b, we get: ∂s/∂b_i = δ_i
Training with Backpropagation
That's almost backpropagation: it's simply taking derivatives and using the chain rule!

Remaining trick: efficiency. We can re-use derivatives computed for higher layers in computing derivatives for lower layers.

Example: the last derivatives of the model, the word vectors in x
Training with Backpropagation
• Take the derivative of the score with respect to a single word vector (for simplicity a 1-d vector, but the same if it were longer)
• Now, we cannot just take into consideration one a_i, because each x_j is connected to all the neurons above, and hence x_j influences the overall score through all of them. Hence:

∂s/∂x_j = Σ_i ∂s/∂a_i · ∂a_i/∂x_j = Σ_i δ_i W_ij,  i.e.  ∂s/∂x = Wᵀδ

where δ is the re-used part of the previous derivative
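Putting the pieces together, a sketch of the full backward pass for s = Uᵀ f(Wx + b) (logistic f assumed), showing δ computed once and re-used for U, W, b, and x:

    import numpy as np

    def backward(x, W, b, U):
        # forward pass (values are re-used in the backward pass)
        a = 1.0 / (1.0 + np.exp(-(W @ x + b)))
        # delta_i = U_i * f'(z_i), computed once for the whole layer
        delta = U * a * (1 - a)
        grad_U = a                     # ds/dU = a
        grad_W = np.outer(delta, x)    # ds/dW = delta x^T (previous slides)
        grad_b = delta                 # ds/db = delta
        grad_x = W.T @ delta           # ds/dx_j = sum_i delta_i W_ij
        return grad_U, grad_W, grad_b, grad_x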
Learning word-level classifiers: POS and NER
Part 1.6: The Basics
The Model (Collobert & Weston 2008; Collobert et al. 2011)

• Similar to word vector learning, but replaces the single scalar score with a softmax/maxent classifier
• Training is again done via backpropagation, which gives an error similar to (but not the same as!) the score in the unsupervised word-vector learning model
The Model: Training

• We already know softmax/maxent and how to optimize it
• The interesting twist in deep learning is that the input features are also learned, similar to learning word vectors with a score

[Diagram: the same window network, now with softmax outputs c1, c2, c3 on top of the hidden layer]
Training with Backpropagation: softmax
What is the major benefit of learned word vectors?

The ability to also propagate labeled information into them, via the softmax/maxent classifier and the hidden layer:

P(c | d, λ) = exp(λᵀ f(c, d)) / Σ_{c′} exp(λᵀ f(c′, d))
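A sketch of the softmax computation itself, over a vector of per-class scores (in the deep model, these scores are produced from the hidden layer a):

    import numpy as np

    def softmax(scores):
        # P(c) = exp(score_c) / sum_c' exp(score_c'), shifted for numerical stability
        e = np.exp(scores - scores.max())
        return e / e.sum()

    probs = softmax(np.array([2.0, 1.0, 0.1]))   # e.g. scores for classes c1, c2, c3
    assert abs(probs.sum() - 1.0) < 1e-9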
Method                                          POS WSJ (acc.)   NER CoNLL (F1)
State-of-the-art*                                    97.24            89.31
Supervised NN                                        96.37            81.47
Unsupervised pre-training
  followed by supervised NN**                        97.20            88.87
  + hand-crafted features***                         97.29            89.59

* Results used in Collobert & Weston (2011). Representative systems: POS: (Toutanova et al. 2003), NER: (Ando & Zhang 2005)
** 130,000-word embedding trained on Wikipedia and Reuters with an 11-word window and a 100-unit hidden layer (for 7 weeks!), then supervised task training
*** Features are character suffixes for POS and a gazetteer for NER
For small supervised data sets, unsupervised pre-training helps a lot
Method                       POS WSJ (acc.)   NER CoNLL (F1)
Supervised NN                     96.37            81.47
NN with Brown clusters            96.92            87.15
Fixed embeddings*                 97.10            88.87
C&W 2011**                        97.29            89.59

* Same architecture as C&W 2011, but word embeddings are kept constant during the supervised training phase
** C&W is the unsupervised pre-training + supervised NN + features model of the last slide
Supervised refinement of the unsupervised word representation helps
Backpropagation Training
Part 1.5: The Basics
Back-Prop
• Compute the gradient of the example-wise loss with respect to the parameters
• Simply applying the derivative chain rule wisely
• If computing the loss(example, parameters) is O(n) computation, then so is computing the gradient
Simple Chain Rule

If z = f(y) and y = g(x), then dz/dx = (dz/dy)(dy/dx)
Multiple Paths Chain Rule

If z depends on x through two paths y1 and y2:
∂z/∂x = (∂z/∂y1)(∂y1/∂x) + (∂z/∂y2)(∂y2/∂x)
Multiple Paths Chain Rule: General

∂z/∂x = Σ_i (∂z/∂y_i)(∂y_i/∂x)
Chain Rule in Flow Graph

Flow graph: any directed acyclic graph
• node = computation result
• arc = computation dependency

For a node x whose successors are succ(x):
∂z/∂x = Σ_{y ∈ succ(x)} (∂z/∂y)(∂y/∂x)
Back-Prop in Multi-Layer Net

[Diagram: multi-layer network; hidden layer h = sigmoid(Vx)]
Back-Prop in General Flow Graph

(single scalar output; succ(x) = successors of x)

1. Fprop: visit nodes in topological-sort order
   • Compute the value of each node given its predecessors
2. Bprop:
   • Initialize the output gradient to 1
   • Visit nodes in reverse order: compute the gradient with respect to each node using the gradient with respect to its successors
Automatic Differentiation

• The gradient computation can be automatically inferred from the symbolic expression of the fprop
• Each node type needs to know how to compute its output, and how to compute the gradient with respect to its inputs given the gradient with respect to its output
• Easy and fast prototyping
• See Theano (Python)
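A minimal sketch of this idea (not Theano's actual API): each node type implements a forward computation and a backward rule, fprop runs in topological order, and bprop runs in reverse with the output gradient initialized to 1.

    import numpy as np

    class Multiply:
        # node type: output = a * b
        def forward(self, a, b):
            self.a, self.b = a, b
            return a * b
        def backward(self, grad_out):
            # gradients wrt the inputs, given the gradient wrt the output
            return grad_out * self.b, grad_out * self.a

    class Sigmoid:
        # node type: output = 1 / (1 + exp(-z))
        def forward(self, z):
            self.out = 1.0 / (1.0 + np.exp(-z))
            return self.out
        def backward(self, grad_out):
            return (grad_out * self.out * (1.0 - self.out),)

    # fprop in topological order ...
    mul, sig = Multiply(), Sigmoid()
    y = sig.forward(mul.forward(3.0, -0.5))
    # ... then bprop in reverse order, starting from output gradient = 1
    (dz,) = sig.backward(1.0)
    da, db = mul.backward(dz)   # chain rule applied recursively through the graph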
Sharing Statistical Strength
Part 1.7
Sharing Statistical Strength
• Besides very fast prediction, the main advantage of deep learning is statistical
• Potential to learn from fewer labeled examples because of the sharing of statistical strength:
   • Unsupervised pre-training and multi-task learning
   • Semi-supervised learning →
Semi-Supervised Learning
• Hypothesis: P(c|x) can be more accurately computed using shared structure with P(x)

[Figure: purely supervised]
Semi-Supervised Learning
• Hypothesis: P(c|x) can be more accurately computed using shared structure with P(x)

[Figure: semi-supervised]
Multi-Task Learning
• Generalizing better to new tasks is crucial to approach AI
• Deep architectures learn good intermediate representations that can be shared across tasks
• Good representations make sense for many tasks

[Diagram: raw input x → shared intermediate representation h → task outputs y1, y2, y3]