Deep Learning for NLP
CS224N, Christopher Manning
(Many slides borrowed from ACL 2012 / NAACL 2013 Tutorials by me, Richard Socher and Yoshua Bengio)
Machine Learning and NLP
Usually machine learning works well because of human-designed representations and input features
Machine learning becomes just optimizing weights to best make a final prediction
PA1 and PA3?
Representation learning attempts to automatically learn good features or representations
[Slide diagram: NLP tasks such as NER and SRL built on hand-designed resources like WordNet and parsers]
What is Deep Learning?
Three senses of “deep learning”:
• #1 Learning a data-dependent hierarchy of multiple intermediate levels of representation of increasing abstraction
• #2 Use of distributed representations and neural net learning methods, even when the architecture isn’t technically deep
• #3 Anything to do with artificial intelligence: “Deep learning is the idea that machines could eventually learn by themselves to perform functions like speech recognition” http://readwrite.com/2014/01/29/
Deep Learning and its Architectures
Deep learning attempts to learn multiple levels of representation
Focus: Multi-layer neural networks
Output layer: Here predicting a supervised target
Hidden layers: These learn more abstract representations as you head up
Input layer: Raw sensory inputs (roughly)
Advantages of Deep Learning (Part 1)
Part 1.1: The Basics
#1 Learning representations
Handcrafting features is time-consuming
The features are often both over-specified and incomplete
The work has to be done again for each task/domain/…
Humans develop representations for learning and reasoning
Our computers should do the same
Deep learning provides a way of doing this
#2 The need for distributed representations
Current NLP systems are incredibly fragile because of their atomic symbol representations
Crazy sentential complement, such as for “likes [(being) crazy]”
Distributed representations deal with the curse of dimensionality
Generalizing locally (e.g., nearest neighbors) requires representative examples for all relevant variations!
Classic solutions:
• Manual feature design
• Assuming a smooth target function (e.g., linear models)
• Kernel methods (linear in terms of kernel based on data points)
Neural networks parameterize and learn a “similarity” kernel
#3 Learning multiple levels of representation
Successive model layers learn deeper intermediate representations
[Slide figure: Layer 1, Layer 2, Layer 3 — high-level linguistic representations]
[Lee et al. ICML 2009; Lee et al. NIPS 2009]
From logistic regression to neural nets
Part 1.2: The Basics
Demystifying neural networks
Terminology
Math is similar to maxent/logistic regression
A single neuron: a computational unit with n (here 3) inputs and 1 output, and parameters W, b
[Slide diagram labels: Inputs, Activation function, Output; the bias unit corresponds to the intercept term]
From Maxent Classifiers to Neural Networks
Refresher: Maxent/softmax classifier (from last lecture)
Supervised learning gives us a distribution for datum d over classes in C:
$$P(c \mid d, \lambda) = \frac{\exp \sum_i \lambda_i f_i(c, d)}{\sum_{c' \in C} \exp \sum_i \lambda_i f_i(c', d)}$$
Vector form:
$$P(c \mid d, \lambda) = \frac{e^{\lambda^T f(c, d)}}{\sum_{c'} e^{\lambda^T f(c', d)}}$$
Such a classifier is used as-is in a neural network (“a softmax layer”)
• Often as the top layer
But for now we’ll derive a two-class logistic model for one neuron
From Maxent Classifiers to Neural Networks
Vector form:
$$P(c \mid d, \lambda) = \frac{e^{\lambda^T f(c, d)}}{\sum_{c'} e^{\lambda^T f(c', d)}}$$
Make two-class:
$$P(c_1 \mid d, \lambda) = \frac{e^{\lambda^T f(c_1, d)}}{e^{\lambda^T f(c_1, d)} + e^{\lambda^T f(c_2, d)}}
= \frac{e^{\lambda^T f(c_1, d)}}{e^{\lambda^T f(c_1, d)} + e^{\lambda^T f(c_2, d)}} \cdot \frac{e^{-\lambda^T f(c_1, d)}}{e^{-\lambda^T f(c_1, d)}}$$
$$= \frac{1}{1 + e^{\lambda^T [f(c_2, d) - f(c_1, d)]}}
= \frac{1}{1 + e^{-\lambda^T x}} \quad \text{for } x = f(c_1, d) - f(c_2, d)$$
$$= f(\lambda^T x)$$
for f(z) = 1/(1 + exp(−z)), the logistic function – a sigmoid non-linearity.
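To make the algebra concrete, here is a minimal numpy sketch (with made-up weight and feature values) checking numerically that the two-class softmax reduces to the logistic sigmoid of λᵀx:

```python
# Illustrative check: two-class softmax over scores lambda^T f(c, d)
# equals the logistic sigmoid of lambda^T x with x = f(c1, d) - f(c2, d).
# All numbers here are made up for the sake of the example.
import numpy as np

lam = np.array([0.5, -1.2, 2.0])        # weight vector lambda
f_c1 = np.array([1.0, 0.0, 1.0])        # features f(c1, d)
f_c2 = np.array([0.0, 1.0, 0.5])        # features f(c2, d)

def softmax(scores):
    e = np.exp(scores - scores.max())   # subtract max for numerical stability
    return e / e.sum()

p_softmax = softmax(np.array([lam @ f_c1, lam @ f_c2]))[0]

x = f_c1 - f_c2
p_sigmoid = 1.0 / (1.0 + np.exp(-(lam @ x)))

print(p_softmax, p_sigmoid)             # the two probabilities agree
```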
This is exactly what an artificial “neuron” computes
$$h_{w,b}(x) = f(w^T x + b) \qquad f(z) = \frac{1}{1 + e^{-z}}$$
w, b are the parameters of this neuron, i.e., this logistic regression model
b: We can have an “always on” feature, which gives a class prior, or separate it out as a bias term
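A minimal numpy sketch of this single neuron; the weights, bias and 3-dimensional input are arbitrary illustrative values:

```python
# One "neuron": exactly the logistic regression h_{w,b}(x) = f(w^T x + b).
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.7, -0.3, 1.5])   # weights of the neuron (made up)
b = -0.5                         # bias (intercept / "always on" feature)
x = np.array([1.0, 2.0, 0.5])    # n = 3 inputs

h = logistic(w @ x + b)          # the neuron's output, a value in (0, 1)
print(h)
```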
A neural network = running several logistic regressions at the same time
If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs
But we don’t have to decide ahead of time what variables these logistic regressions are trying to predict!
A neural network = running several logistic regressions at the same time
…which we can feed into another logistic regression function
and it is the training criterion that will decide what those intermediate binary variables should be, so as to do a good job of predicting the targets for the next layer, etc.
A neural network = running several logistic regressions at the same time
Before we know it, we have a multilayer neural network…
Matrix notation for a layer
We have
$$a_1 = f(W_{11}x_1 + W_{12}x_2 + W_{13}x_3 + b_1)$$
$$a_2 = f(W_{21}x_1 + W_{22}x_2 + W_{23}x_3 + b_2)$$
etc.
In matrix notation
$$z = Wx + b \qquad a = f(z)$$
where f is applied element-wise:
$$f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]$$
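A minimal numpy sketch of this layer (illustrative weights and inputs), plus feeding the resulting activations a into one more logistic regression, echoing the previous slides:

```python
# One hidden layer in matrix notation: z = Wx + b, a = f(z) element-wise,
# then a final logistic regression on a. Values are made up for illustration.
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))   # numpy applies this element-wise

x = np.array([1.0, 2.0, 0.5])         # 3 inputs
W = np.array([[0.7, -0.3, 1.5],       # row i holds the weights of neuron a_i
              [0.1,  0.8, -0.6],
              [-1.2, 0.4,  0.9]])
b = np.array([-0.5, 0.2, 0.0])

z = W @ x + b                         # z = Wx + b
a = logistic(z)                       # a = f(z): three logistic regressions at once

# ...which we can feed into another logistic regression (the next layer)
u = np.array([1.0, -2.0, 0.5])
out = logistic(u @ a + 0.1)
print(a, out)
```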
How do we train the weights W?
• For a supervised neural net, we can train the model just like a maxent model – we calculate and use gradients
• Stochastic gradient descent (SGD)
  • Usual
• Conjugate gradient or L-BFGS
  • Sometimes; slightly dangerous since the space may not be convex
• We can use the same ideas and techniques
  • Just without guarantees…
• This leads to “backpropagation”, which we cover later (a minimal SGD step is sketched below)
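A minimal sketch of one stochastic gradient descent step for the single logistic neuron, assuming the standard cross-entropy loss (the loss and its gradient are not derived on these slides): for sigmoid output h and target y, the gradients are (h − y)·x for the weights and (h − y) for the bias.

```python
# One SGD step for the logistic neuron under cross-entropy loss.
# Training example, learning rate, and initialization are illustrative.
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(3)                       # weights
b = 0.0                               # bias
lr = 0.1                              # learning rate

x = np.array([1.0, 2.0, 0.5])         # one training example
y = 1.0                               # its label

h = logistic(w @ x + b)               # forward pass
w -= lr * (h - y) * x                 # gradient step on the weights
b -= lr * (h - y)                     # gradient step on the bias
```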
Non-linearities: Why they’re needed
• For logistic regression: map to probabilities
• Here: function approximation, e.g., regression or classification
• Without non-linearities, deep neural networks can’t do anything more than a linear transform
  • Extra layers could just be compiled down into a single linear transform
• People often use other non-linearities, such as tanh, ReLU
Non-linearities: What’s used
logistic (“sigmoid”)   tanh
tanh is just a rescaled and shifted sigmoid:
$$\tanh(z) = 2\,\mathrm{logistic}(2z) - 1$$
tanh is often used and often performs better for deep nets
• Its output is symmetric around 0
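A quick numerical check of this identity (illustrative only):

```python
# Verify tanh(z) = 2*logistic(2z) - 1 on a few sample points.
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3, 3, 7)
print(np.allclose(np.tanh(z), 2 * logistic(2 * z) - 1))   # True
```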
Non-linearities: Other choices
hardtanh   softsign   linear rectifier (ReLU)
• hardtanh is similar to but computationally cheaper than tanh, and saturates hard
• [Glorot and Bengio AISTATS 2010, 2011] discuss softsign and the rectifier
• The rectified linear unit is now common – it transfers a linear signal if active (all three are sketched in code below)
$$\mathrm{rect}(z) = \max(z, 0) \qquad \mathrm{softsign}(a) = \frac{a}{1 + |a|}$$
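Minimal, illustrative definitions of these non-linearities in numpy (not library code):

```python
import numpy as np

def hardtanh(z):
    # like tanh but linear on [-1, 1] and saturating hard outside it
    return np.clip(z, -1.0, 1.0)

def softsign(z):
    return z / (1.0 + np.abs(z))

def relu(z):
    # rectified linear unit: passes the signal through when active
    return np.maximum(z, 0.0)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(hardtanh(z), softsign(z), relu(z))
```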
Summary
Knowing the meaning of words!
You now understand the basics and the relation to other models
• Neuron = logistic regression or similar function
• Input layer = input training/test vector
• Bias unit = intercept term / always-on feature
• Activation = response/output
• Activation function is a logistic or other nonlinearity
• Backpropagation = running stochastic gradient descent through a multilayer network efficiently
• Weight decay = regularization / Bayesian prior smoothing
Word Representations
Part 1.3: The Basics
The standard word representation
The vast majority of rule-based and statistical NLP work regards words as atomic symbols: hotel, conference, walk
In vector space terms, this is a vector with one 1 and a lot of zeroes
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
Dimensionality: 20K (speech) – 50K (PTB) – 500K (big vocab) – 13M (Google 1T)
We call this a “one-hot” representation. Its problem:
motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0] AND
hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0] = 0
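A small sketch of the problem with a made-up toy vocabulary: the one-hot vectors for motel and hotel are orthogonal, so their dot product carries no notion of similarity.

```python
# One-hot representation: every word gets its own basis vector,
# so any two different words have similarity exactly 0.
import numpy as np

vocab = ["conference", "walk", "hotel", "motel"]   # toy vocabulary

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

print(one_hot("motel") @ one_hot("hotel"))   # 0.0 — no notion of similarity
```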
Distributional similarity based representations
You can get a lot of value by representing a word by means of its neighbors
“You shall know a word by the company it keeps” (J. R. Firth 1957: 11)
One of the most successful ideas of modern statistical NLP
government debt problems turning into banking crises as has happened in
saying that Europe needs unified banking regulation to replace the hodgepodge
These context words will represent banking
You can vary whether you use local or large context to get a more syntactic or semantic clustering
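A small sketch of the idea, using the two example sentences above as a made-up corpus: count the words within a small window around banking. The window size is the knob that trades off the more syntactic (small window) vs. more semantic (large window) flavor mentioned above.

```python
# Represent "banking" by counts of its neighboring words within a window.
from collections import Counter

corpus = [
    "government debt problems turning into banking crises as has happened in".split(),
    "saying that Europe needs unified banking regulation to replace the hodgepodge".split(),
]

window = 2                          # local context; make it larger for a more semantic flavor
context = Counter()
for sent in corpus:
    for i, w in enumerate(sent):
        if w == "banking":
            context.update(sent[max(0, i - window):i] + sent[i + 1:i + 1 + window])

print(context)   # counts for 'into', 'crises', 'unified', 'regulation', ...
```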
Two traditional word representations: Class-based and soft clustering
Class-based models learn word classes of similar words based on distributional information (~ class HMM)
• Brown clustering (Brown et al. 1992, Liang 2005)
• Exchange clustering (Martin et al. 1998, Clark 2003)
Soft clustering models learn for each cluster/topic a distribution over words of how likely that word is in each cluster
• Latent Semantic Analysis (LSA/LSI), Random projections
• Latent Dirichlet Allocation (LDA), HMM clustering
Neural word embeddings as a distributed representation
Similar idea
Combine vector space semantics with the prediction of probabilistic models (Bengio et al. 2003, Collobert & Weston 2008, Huang et al. 2012)
In all of these approaches, including deep learning models, a word is represented as a dense vector
linguistics = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]
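A small sketch of what a dense representation buys us: the linguistics vector is the one shown above; the second vector and the cosine-similarity computation are purely illustrative, not taken from any trained model.

```python
# Dense word vectors support graded similarity, unlike one-hot vectors.
import numpy as np

linguistics = np.array([0.286, 0.792, -0.177, -0.107,
                        0.109, -0.542, 0.349, 0.271])   # vector from the slide
syntax = np.array([0.25, 0.70, -0.20, -0.05,            # made-up neighbor vector
                   0.15, -0.50, 0.30, 0.30])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(linguistics, syntax))   # close to 1 for similar words
```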
Neural word embeddings – visualization
[Slide figure: 2-D visualization of word embeddings, from Kulkarni, Al-Rfou, Perozzi, & Skiena (2015)]
Advantages of the neural word embedding approach
Compared to other methods, neural word embeddings can become more meaningful through adding supervision from one or multiple tasks
For instance, sentiment is usually not captured in unsupervised word embeddings but can be in neural word vectors
We can build compositional vector representations for longer phrases (next lecture)