Deep Learning for NLP, Part 2
CS224N, Christopher Manning
(Many slides borrowed from ACL 2012 / NAACL 2013 tutorials by me, Richard Socher, and Yoshua Bengio)
Word Representations
Part 1.3: The Basics
The standard word representation
The vast majority of rule-based and statistical NLP work regards words as atomic symbols: hotel, conference, walk

In vector space terms, this is a vector with one 1 and a lot of zeroes:
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
Dimensionality: 20K (speech), 50K (PTB), 500K (big vocab), 13M (Google 1T)

We call this a "one-hot" representation. Its problem:
motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]  AND  hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]  =  0
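To see the problem concretely, here is a minimal sketch (NumPy, with a hypothetical 15-word vocabulary): every pair of distinct one-hot vectors has dot product 0, so no notion of similarity survives.

    import numpy as np

    V = 15                               # hypothetical vocabulary size
    motel = np.zeros(V); motel[10] = 1   # one-hot vector for "motel"
    hotel = np.zeros(V); hotel[7] = 1    # one-hot vector for "hotel"

    # The dot product of any two distinct one-hot vectors is 0:
    print(motel @ hotel)   # 0.0 -- the representation sees no similarity at all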
Distributional similarity-based representations

You can get a lot of value by representing a word by means of its neighbors.
"You shall know a word by the company it keeps" (J. R. Firth 1957: 11)
One of the most successful ideas of modern statistical NLP
government debt problems turning into banking crises as has happened in
saying that Europe needs unified banking regulation to replace the hodgepodge
↑ These words will represent banking ↑

You can vary whether you use local or large context to get a more syntactic or semantic clustering
Distributional word vectors
Distributional counts give a same-dimensionality, denser representation:

[Table: co-occurrence counts of motel, hotel, and bank with the context words debt, crises, unified, hodgepodge, the, pillow, reception, internet]
Two traditional word representations: class-based and soft clustering
Class-based models learn word classes of similar words based on distributional information (~ class HMM)
• Brown clustering (Brown et al. 1992, Liang 2005)
• Exchange clustering (Martin et al. 1998, Clark 2003)

Example word classes:
1. Clinton, Jiang, Bush, Wilensky, Suharto, Reagan, …
5. also, still, already, currently, actually, typically, …
6. recovery, strength, expansion, freedom, resistance, …

Soft clustering models learn, for each cluster/topic, a distribution over words of how likely that word is in each cluster
• Latent Semantic Analysis (LSA/LSI), random projections
• Latent Dirichlet Allocation (LDA), HMM clustering
Neural word embeddings as a distributed representation
Similar idea: word meaning is represented as a (dense) vector, a point in a (medium-dimensional) vector space

Neural word embeddings combine vector space semantics with the prediction of probabilistic models (Bengio et al. 2003, Collobert & Weston 2008, Huang et al. 2012)
linguistics = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]
Neural word embeddings: visualization

[Figure: visualization of word embeddings; Kulkarni, Al-Rfou, Perozzi, & Skiena (2015)]
Distributed vs. distributional representations
Distributed: a concept is represented as continuous activation levels in a number of elements [contrast: local]

Distributional: meaning is represented by contexts of use [contrast: denotational]
Advantages of the neural word embedding approach
Compared to other methods, neural word embeddings can become more meaningful through adding supervision from one or multiple tasks.

For instance, sentiment is usually not captured in unsupervised word embeddings, but it can be in neural word vectors.

We can build compositional vector representations for longer phrases (next lecture).
Unsupervised word vector learning
Part 1.4: The Basics
The mainstream methods for neural word vector learning in 2015
1. word2vec
   • https://code.google.com/p/word2vec/
   • Code for two different algorithms: Skip-gram with negative sampling (SGNS) and Continuous Bag of Words (CBOW)
   • Uses simple, fast bilinear models to learn very good word representations
   • Mikolov, Sutskever, Chen, Corrado, and Dean. NIPS 2013

2. GloVe
   • http://nlp.stanford.edu/projects/glove/
   • A non-linear matrix factorization: starts with a count matrix and factorizes it with an explicit loss function
   • Sort of similar to SGNS, really, but from Stanford ☺
   • Pennington, Socher, and Manning. EMNLP 2014
But we won't look at either; instead … Contrastive Estimation of Word Vectors
A neural network for learning word vectors (Collobert et al. JMLR 2011)

Idea: a word and its context form a positive training sample; a random word in that same context gives a negative training sample:

cat chills on a mat (positive)    vs.    cat chills Ohio a mat (negative)
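A sketch of how such pairs can be generated (the vocabulary and window below are stand-ins, not Collobert et al.'s actual setup): corrupt each observed window by swapping its center word for a random vocabulary word.

    import random

    vocab = ["cat", "chills", "on", "a", "mat", "Ohio", "the", "sat"]  # stand-in vocabulary

    def corrupt(window, vocab, center=2):
        # Replace the center word with a different random word -> negative sample
        replacement = random.choice([w for w in vocab if w != window[center]])
        bad = list(window)
        bad[center] = replacement
        return bad

    positive = ["cat", "chills", "on", "a", "mat"]
    negative = corrupt(positive, vocab)   # e.g. ["cat", "chills", "Ohio", "a", "mat"]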
A neural network for learning word vectors
How do we formalize this idea? Ask that
score(cat chills on a mat) > score(cat chills Ohio a mat)

How do we compute the score?
• With a neural network
• Each word is associated with an n-dimensional vector
Word embedding matrix
• Initialize all word vectors randomly to form a word embedding matrix L ∈ ℝ^{n × |V|}, with one n-dimensional column per vocabulary word (the, cat, mat, …)
• These are the word features we want to learn
• Also called a look-up table
• Mathematically, you get a word's vector by multiplying L with a one-hot vector e: x = Le
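A minimal sketch of the look-up (random L, hypothetical word indices): multiplying L by a one-hot e selects a column, so in practice the lookup is implemented as indexing.

    import numpy as np

    n, V = 8, 100                    # embedding size and vocabulary size (stand-ins)
    L = 0.1 * np.random.randn(n, V)  # randomly initialized embedding matrix
    word_index = {"the": 0, "cat": 1, "mat": 2}   # hypothetical look-up table indices

    e = np.zeros(V); e[word_index["cat"]] = 1     # one-hot vector for "cat"
    x = L @ e                                     # x = Le, as on the slide
    assert np.allclose(x, L[:, word_index["cat"]])  # same result via cheap indexing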
• score(cat chills on a mat)
• To describe a phrase, retrieve (via index) the corresponding vectors from L:
  cat chills on a mat
• Then concatenate them to form a 5n-vector: x = [x_cat  x_chills  x_on  x_a  x_mat]
• How do we then compute score(x)?
Word vectors as input to a neural network
Scoring with a Single Layer Neural Network

• A single layer is a combination of a linear layer and a nonlinearity: z = Wx + b, a = f(z)
• The neural activations a can then be used to compute some function
• For instance, the score we care about: score(x) = Uᵀa
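As a sketch (shapes illustrative; logistic f assumed, as in the derivative slides below):

    import numpy as np

    def f(z):
        # logistic nonlinearity
        return 1.0 / (1.0 + np.exp(-z))

    def score(x, W, b, U):
        # single layer: a = f(Wx + b), then scalar score s = U^T a
        a = f(W @ x + b)
        return U @ a

    n_in, n_hidden = 5 * 8, 2        # e.g. five 8-d word vectors, 2 hidden units
    W = 0.1 * np.random.randn(n_hidden, n_in)
    b = np.zeros(n_hidden)
    U = 0.1 * np.random.randn(n_hidden)
    s = score(np.random.randn(n_in), W, b, U)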
Summary: Feed-forward Computation
Computing a window's score with a 3-layer neural net: s = score(cat chills on a mat)
s = Uᵀ f(Wx + b), where x concatenates the word vectors of cat, chills, on, a, mat
Summary: Feed-forward Computation
• s = score(cat chills on a mat)
• s_c = score(cat chills Ohio a mat)
• Idea for training objective: make the score of the true window larger and the corrupt window's score lower (until they're sufficiently separated). Minimize
  J = max(0, 1 − s + s_c)
• This is continuous, so we can perform SGD
• Look at a few examples, nudge weights to make J smaller
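A sketch of that objective over one (true, corrupt) pair, reusing the score function sketched above:

    def hinge_loss(s, s_c, margin=1.0):
        # J = max(0, margin - s + s_c): zero once the true window
        # outscores the corrupt one by at least the margin
        return max(0.0, margin - s + s_c)

    # SGD idea: if J > 0, nudge the weights to increase s and decrease s_c;
    # if J == 0, the gradient is zero and the example is skipped.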
The Backpropagation Algorithm
• Backpropagation is a way of computing gradients of expressions efficiently through recursive application of the chain rule
• Either Rumelhart, Hinton & Williams 1986, or Werbos 1974, or Linnainmaa 1970
• The derivative of a loss (an objective function) on each variable tells you the sensitivity of the loss to its value
• An individual step of the chain rule is local. It tells you how the sensitivity of the objective function is modulated by that function in the network:
  • The input changes at some rate for a variable
  • The function at the node scales that rate
Training the parameters of the model
If the cost J is 0 (i.e., 1 − s + s_c < 0), the derivative is 0: do nothing. Assuming the cost J is > 0, it is simple to see that we can compute the derivatives of s and s_c with respect to all the involved variables: U, W, b, x. Then take differences for J.
Training with Backpropagation
• Let's consider the derivative of a single weight W_ij
• This only appears inside a_i
• For example: W_23 is only used to compute a_2

[Diagram: inputs x1, x2, x3 and bias +1; hidden activations a1, a2; output s = Uᵀa; the weight W_23 connects x3 to a2]
Training with Backpropagation
Derivative of weight W_ij:

∂s/∂W_ij = U_i f′(z_i) x_j

where, for logistic f: f′(z) = f(z) (1 − f(z))
Training with Backpropagation
Derivative of a single weight W_ij:

∂s/∂W_ij = U_i f′(z_i) · x_j = δ_i x_j

• δ_i = U_i f′(z_i) is the local error signal; x_j is the local input signal
• We want all combinations of i = 1, 2 and j = 1, 2, 3
• Solution: the outer product ∂s/∂W = δ xᵀ, where δ is the "responsibility" coming from each activation a
Training with Backpropagation
• From single weight W_ij to full W:

∂s/∂W = δ xᵀ =
⎡ δ1x1  δ1x2  δ1x3 ⎤
⎣ δ2x1  δ2x2  δ2x3 ⎦
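In NumPy this is one line; a sketch with the 2-hidden-unit, 3-input shapes of the figure:

    import numpy as np

    x = np.random.randn(3)        # local input signal (x1, x2, x3)
    delta = np.random.randn(2)    # local error signals, delta_i = U_i * f'(z_i)

    grad_W = np.outer(delta, x)   # 2x3 matrix with entries grad_W[i, j] = delta[i] * x[j]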
Training with Backpropagation
• For biases b, we get: ∂s/∂b_i = δ_i
Training with Backpropagation
That's almost backpropagation: it's simply taking derivatives and using the chain rule!

Remaining trick: efficiency. We can re-use derivatives computed for higher layers in computing derivatives for lower layers.

Example: the last derivatives of the model, the word vectors in x
Training with Backpropagation
• Take the derivative of the score with respect to a single word vector (for simplicity a 1-d vector, but the same if it were longer)
• Now, we cannot just take into consideration one a_i, because each x_j is connected to all the neurons above, and hence x_j influences the overall score through all of them. Hence:

∂s/∂x_j = Σ_i ∂s/∂a_i · ∂a_i/∂x_j = Σ_i δ_i W_ij,  i.e.  ∂s/∂x = Wᵀδ

where δ is the re-used part of the previous derivative
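Putting the pieces together, a sketch of the full backward pass for s = Uᵀ f(Wx + b) (logistic f assumed), showing δ computed once and re-used for U, W, b, and x:

    import numpy as np

    def backward(x, W, b, U):
        # forward pass (values are re-used in the backward pass)
        a = 1.0 / (1.0 + np.exp(-(W @ x + b)))
        # delta_i = U_i * f'(z_i), computed once for the whole layer
        delta = U * a * (1 - a)
        grad_U = a                     # ds/dU = a
        grad_W = np.outer(delta, x)    # ds/dW = delta x^T (previous slides)
        grad_b = delta                 # ds/db = delta
        grad_x = W.T @ delta           # ds/dx_j = sum_i delta_i W_ij
        return grad_U, grad_W, grad_b, grad_x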
Learning word-level classifiers: POS and NER
Part 1.6: The Basics
The Model (Collobert & Weston 2008; Collobert et al. 2011)

• Similar to word vector learning, but replaces the single scalar score with a softmax/maxent classifier
• Training is again done via backpropagation, which gives an error similar to (but not the same as!) the score in the unsupervised word-vector learning model
The Model: Training

• We already know softmax/maxent and how to optimize it
• The interesting twist in deep learning is that the input features are also learned, similar to learning word vectors with a score

[Diagram: the same window network, now with softmax outputs c1, c2, c3 on top of the hidden layer]
Training with Backpropagation: softmax
What is the major benefit of learned word vectors?

The ability to also propagate labeled information into them, via the softmax/maxent classifier and the hidden layer:

P(c | d, λ) = exp(λᵀ f(c, d)) / Σ_{c′} exp(λᵀ f(c′, d))
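A sketch of the softmax computation itself, over a vector of per-class scores (in the deep model, these scores are produced from the hidden layer a):

    import numpy as np

    def softmax(scores):
        # P(c) = exp(score_c) / sum_c' exp(score_c'), shifted for numerical stability
        e = np.exp(scores - scores.max())
        return e / e.sum()

    probs = softmax(np.array([2.0, 1.0, 0.1]))   # e.g. scores for classes c1, c2, c3
    assert abs(probs.sum() - 1.0) < 1e-9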
Method                                          POS WSJ (acc.)   NER CoNLL (F1)
State-of-the-art*                                    97.24            89.31
Supervised NN                                        96.37            81.47
Unsupervised pre-training
  followed by supervised NN**                        97.20            88.87
  + hand-crafted features***                         97.29            89.59

* Results used in Collobert & Weston (2011). Representative systems: POS: (Toutanova et al. 2003), NER: (Ando & Zhang 2005)
** 130,000-word embedding trained on Wikipedia and Reuters with an 11-word window and a 100-unit hidden layer (for 7 weeks!), then supervised task training
*** Features are character suffixes for POS and a gazetteer for NER
For small supervised data sets, unsupervised pre-training helps a lot
Method                       POS WSJ (acc.)   NER CoNLL (F1)
Supervised NN                     96.37            81.47
NN with Brown clusters            96.92            87.15
Fixed embeddings*                 97.10            88.87
C&W 2011**                        97.29            89.59

* Same architecture as C&W 2011, but word embeddings are kept constant during the supervised training phase
** C&W is the unsupervised pre-training + supervised NN + features model of the last slide
Supervised refinement of the unsupervised word representation helps
Backpropagation Training
Part 1.5: The Basics
Back-Prop
• Compute the gradient of the example-wise loss with respect to the parameters
• Simply applying the derivative chain rule wisely
• If computing the loss(example, parameters) is O(n) computation, then so is computing the gradient
Simple Chain Rule

If z = f(y) and y = g(x), then dz/dx = (dz/dy)(dy/dx)
Multiple Paths Chain Rule

If z depends on x through two paths y1 and y2:
∂z/∂x = (∂z/∂y1)(∂y1/∂x) + (∂z/∂y2)(∂y2/∂x)
Multiple Paths Chain Rule: General

∂z/∂x = Σ_i (∂z/∂y_i)(∂y_i/∂x)
Chain Rule in Flow Graph

Flow graph: any directed acyclic graph
• node = computation result
• arc = computation dependency

For a node x whose successors are succ(x):
∂z/∂x = Σ_{y ∈ succ(x)} (∂z/∂y)(∂y/∂x)
Back-Prop in Multi-Layer Net

[Diagram: multi-layer network; hidden layer h = sigmoid(Vx)]
Back-Prop in General Flow Graph

(single scalar output; succ(x) = successors of x)

1. Fprop: visit nodes in topological-sort order
   • Compute the value of each node given its predecessors
2. Bprop:
   • Initialize the output gradient to 1
   • Visit nodes in reverse order: compute the gradient with respect to each node using the gradient with respect to its successors
Automatic Differentiation

• The gradient computation can be automatically inferred from the symbolic expression of the fprop
• Each node type needs to know how to compute its output, and how to compute the gradient with respect to its inputs given the gradient with respect to its output
• Easy and fast prototyping
• See Theano (Python)
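A minimal sketch of this idea (not Theano's actual API): each node type implements a forward computation and a backward rule, fprop runs in topological order, and bprop runs in reverse with the output gradient initialized to 1.

    import numpy as np

    class Multiply:
        # node type: output = a * b
        def forward(self, a, b):
            self.a, self.b = a, b
            return a * b
        def backward(self, grad_out):
            # gradients wrt the inputs, given the gradient wrt the output
            return grad_out * self.b, grad_out * self.a

    class Sigmoid:
        # node type: output = 1 / (1 + exp(-z))
        def forward(self, z):
            self.out = 1.0 / (1.0 + np.exp(-z))
            return self.out
        def backward(self, grad_out):
            return (grad_out * self.out * (1.0 - self.out),)

    # fprop in topological order ...
    mul, sig = Multiply(), Sigmoid()
    y = sig.forward(mul.forward(3.0, -0.5))
    # ... then bprop in reverse order, starting from output gradient = 1
    (dz,) = sig.backward(1.0)
    da, db = mul.backward(dz)   # chain rule applied recursively through the graph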
Sharing Statistical Strength
Part 1.7
Sharing Statistical Strength
• Besides very fast prediction, the main advantage of deep learning is statistical
• Potential to learn from fewer labeled examples because of the sharing of statistical strength:
   • Unsupervised pre-training and multi-task learning
   • Semi-supervised learning →
Semi-Supervised Learning
• Hypothesis: P(c|x) can be more accurately computed using shared structure with P(x)

[Figure: purely supervised]
Semi-Supervised Learning
• Hypothesis: P(c|x) can be more accurately computed using shared structure with P(x)

[Figure: semi-supervised]
Multi-Task Learning
• Generalizing better to new tasks is crucial to approach AI
• Deep architectures learn good intermediate representations that can be shared across tasks
• Good representations make sense for many tasks

[Diagram: raw input x → shared intermediate representation h → task outputs y1, y2, y3]