Deep Learning for NLP
CS224N, Christopher Manning
(Many slides borrowed from ACL 2012 / NAACL 2013 Tutorials by me, Richard Socher and Yoshua Bengio)
Machine Learning and NLP
Usually machine learning works well because of human-designed representations and input features
Machine learning becomes just optimizing weights to best make a final prediction
PA1 and PA3?
Representation learning attempts to automatically learn good features or representations
[Slide diagram: NLP tasks such as NER and SRL built on hand-designed resources like WordNet and parsers]
What is Deep Learning?
Three senses of “deep learning”:
• #1 Learning a data-dependent hierarchy of multiple intermediate levels of representation of increasing abstraction
• #2 Use of distributed representations and neural net learning methods, even when the architecture isn’t technically deep
• #3 Anything to do with artificial intelligence: “Deep learning is the idea that machines could eventually learn by themselves to perform functions like speech recognition” http://readwrite.com/2014/01/29/
Deep Learning and its Architectures
Deep learning attempts to learn multiple levels of representation
Focus: Multi-layer neural networks
Output layer: Here predicting a supervised target
Hidden layers: These learn more abstract representations as you head up
Input layer: Raw sensory inputs (roughly)
Advantages of Deep Learning (Part 1)
Part 1.1: The Basics
#1 Learning representations
Handcrafting features is time-consuming
The features are often both over-specified and incomplete
The work has to be done again for each task/domain/…
Humans develop representations for learning and reasoning
Our computers should do the same
Deep learning provides a way of doing this
#2 The need for distributed representations
Current NLP systems are incredibly fragile because of their atomic symbol representations
Crazy sentential complement, such as for “likes [(being) crazy]”
Distributed representations deal with the curse of dimensionality
Generalizing locally (e.g., nearest neighbors) requires representative examples for all relevant variations!
Classic solutions:
• Manual feature design
• Assuming a smooth target function (e.g., linear models)
• Kernel methods (linear in terms of kernel based on data points)
Neural networks parameterize and learn a “similarity” kernel
#3 Learning multiple levels of representation
Successive model layers learn deeper intermediate representations
[Slide figure: Layer 1, Layer 2, Layer 3 — high-level linguistic representations]
[Lee et al. ICML 2009; Lee et al. NIPS 2009]
From logistic regression to neural nets
Part 1.2: The Basics
Demystifying neural networks
Terminology
Math is similar to maxent/logistic regression
A single neuron: a computational unit with n (here 3) inputs and 1 output, and parameters W, b
[Slide diagram labels: Inputs, Activation function, Output; the bias unit corresponds to the intercept term]
From Maxent Classifiers to Neural Networks
Refresher: Maxent/softmax classifier (from last lecture)
Supervised learning gives us a distribution for datum d over classes in C:
$$P(c \mid d, \lambda) = \frac{\exp \sum_i \lambda_i f_i(c, d)}{\sum_{c' \in C} \exp \sum_i \lambda_i f_i(c', d)}$$
Vector form:
$$P(c \mid d, \lambda) = \frac{e^{\lambda^T f(c, d)}}{\sum_{c'} e^{\lambda^T f(c', d)}}$$
Such a classifier is used as-is in a neural network (“a softmax layer”)
• Often as the top layer
But for now we’ll derive a two-class logistic model for one neuron
From Maxent Classifiers to Neural Networks
Vector form:
$$P(c \mid d, \lambda) = \frac{e^{\lambda^T f(c, d)}}{\sum_{c'} e^{\lambda^T f(c', d)}}$$
Make two-class:
$$P(c_1 \mid d, \lambda) = \frac{e^{\lambda^T f(c_1, d)}}{e^{\lambda^T f(c_1, d)} + e^{\lambda^T f(c_2, d)}}
= \frac{e^{\lambda^T f(c_1, d)}}{e^{\lambda^T f(c_1, d)} + e^{\lambda^T f(c_2, d)}} \cdot \frac{e^{-\lambda^T f(c_1, d)}}{e^{-\lambda^T f(c_1, d)}}$$
$$= \frac{1}{1 + e^{\lambda^T [f(c_2, d) - f(c_1, d)]}}
= \frac{1}{1 + e^{-\lambda^T x}} \quad \text{for } x = f(c_1, d) - f(c_2, d)$$
$$= f(\lambda^T x)$$
for f(z) = 1/(1 + exp(−z)), the logistic function – a sigmoid non-linearity.
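To make the algebra concrete, here is a minimal numpy sketch (with made-up weight and feature values) checking numerically that the two-class softmax reduces to the logistic sigmoid of λᵀx:

```python
# Illustrative check: two-class softmax over scores lambda^T f(c, d)
# equals the logistic sigmoid of lambda^T x with x = f(c1, d) - f(c2, d).
# All numbers here are made up for the sake of the example.
import numpy as np

lam = np.array([0.5, -1.2, 2.0])        # weight vector lambda
f_c1 = np.array([1.0, 0.0, 1.0])        # features f(c1, d)
f_c2 = np.array([0.0, 1.0, 0.5])        # features f(c2, d)

def softmax(scores):
    e = np.exp(scores - scores.max())   # subtract max for numerical stability
    return e / e.sum()

p_softmax = softmax(np.array([lam @ f_c1, lam @ f_c2]))[0]

x = f_c1 - f_c2
p_sigmoid = 1.0 / (1.0 + np.exp(-(lam @ x)))

print(p_softmax, p_sigmoid)             # the two probabilities agree
```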
This is exactly what an artificial “neuron” computes
$$h_{w,b}(x) = f(w^T x + b) \qquad f(z) = \frac{1}{1 + e^{-z}}$$
w, b are the parameters of this neuron, i.e., this logistic regression model
b: We can have an “always on” feature, which gives a class prior, or separate it out as a bias term
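A minimal numpy sketch of this single neuron; the weights, bias and 3-dimensional input are arbitrary illustrative values:

```python
# One "neuron": exactly the logistic regression h_{w,b}(x) = f(w^T x + b).
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.7, -0.3, 1.5])   # weights of the neuron (made up)
b = -0.5                         # bias (intercept / "always on" feature)
x = np.array([1.0, 2.0, 0.5])    # n = 3 inputs

h = logistic(w @ x + b)          # the neuron's output, a value in (0, 1)
print(h)
```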
A neural network = running several logistic regressions at the same time
If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs
But we don’t have to decide ahead of time what variables these logistic regressions are trying to predict!
A neural network = running several logistic regressions at the same time
…which we can feed into another logistic regression function
and it is the training criterion that will decide what those intermediate binary variables should be, so as to do a good job of predicting the targets for the next layer, etc.
A neural network = running several logistic regressions at the same time
Before we know it, we have a multilayer neural network…
Matrix notation for a layer
We have
$$a_1 = f(W_{11}x_1 + W_{12}x_2 + W_{13}x_3 + b_1)$$
$$a_2 = f(W_{21}x_1 + W_{22}x_2 + W_{23}x_3 + b_2)$$
etc.
In matrix notation
$$z = Wx + b \qquad a = f(z)$$
where f is applied element-wise:
$$f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]$$
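A minimal numpy sketch of this layer (illustrative weights and inputs), plus feeding the resulting activations a into one more logistic regression, echoing the previous slides:

```python
# One hidden layer in matrix notation: z = Wx + b, a = f(z) element-wise,
# then a final logistic regression on a. Values are made up for illustration.
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))   # numpy applies this element-wise

x = np.array([1.0, 2.0, 0.5])         # 3 inputs
W = np.array([[0.7, -0.3, 1.5],       # row i holds the weights of neuron a_i
              [0.1,  0.8, -0.6],
              [-1.2, 0.4,  0.9]])
b = np.array([-0.5, 0.2, 0.0])

z = W @ x + b                         # z = Wx + b
a = logistic(z)                       # a = f(z): three logistic regressions at once

# ...which we can feed into another logistic regression (the next layer)
u = np.array([1.0, -2.0, 0.5])
out = logistic(u @ a + 0.1)
print(a, out)
```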
How do we train the weights W?
• For a supervised neural net, we can train the model just like a maxent model – we calculate and use gradients
• Stochastic gradient descent (SGD)
  • Usual
• Conjugate gradient or L-BFGS
  • Sometimes; slightly dangerous since the space may not be convex
• We can use the same ideas and techniques
  • Just without guarantees…
• This leads to “backpropagation”, which we cover later (a minimal SGD step is sketched below)
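A minimal sketch of one stochastic gradient descent step for the single logistic neuron, assuming the standard cross-entropy loss (the loss and its gradient are not derived on these slides): for sigmoid output h and target y, the gradients are (h − y)·x for the weights and (h − y) for the bias.

```python
# One SGD step for the logistic neuron under cross-entropy loss.
# Training example, learning rate, and initialization are illustrative.
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(3)                       # weights
b = 0.0                               # bias
lr = 0.1                              # learning rate

x = np.array([1.0, 2.0, 0.5])         # one training example
y = 1.0                               # its label

h = logistic(w @ x + b)               # forward pass
w -= lr * (h - y) * x                 # gradient step on the weights
b -= lr * (h - y)                     # gradient step on the bias
```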
Non-linearities: Why they’re needed
• For logistic regression: map to probabilities
• Here: function approximation, e.g., regression or classification
• Without non-linearities, deep neural networks can’t do anything more than a linear transform
  • Extra layers could just be compiled down into a single linear transform
• People often use other non-linearities, such as tanh, ReLU
Non-linearities: What’s used
logistic (“sigmoid”)   tanh
tanh is just a rescaled and shifted sigmoid:
$$\tanh(z) = 2\,\mathrm{logistic}(2z) - 1$$
tanh is often used and often performs better for deep nets
• Its output is symmetric around 0
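A quick numerical check of this identity (illustrative only):

```python
# Verify tanh(z) = 2*logistic(2z) - 1 on a few sample points.
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3, 3, 7)
print(np.allclose(np.tanh(z), 2 * logistic(2 * z) - 1))   # True
```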
Non-linearities: Other choices
hardtanh   softsign   linear rectifier (ReLU)
• hardtanh is similar to but computationally cheaper than tanh, and saturates hard
• [Glorot and Bengio AISTATS 2010, 2011] discuss softsign and the rectifier
• The rectified linear unit is now common – it transfers a linear signal if active (all three are sketched in code below)
$$\mathrm{rect}(z) = \max(z, 0) \qquad \mathrm{softsign}(a) = \frac{a}{1 + |a|}$$
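Minimal, illustrative definitions of these non-linearities in numpy (not library code):

```python
import numpy as np

def hardtanh(z):
    # like tanh but linear on [-1, 1] and saturating hard outside it
    return np.clip(z, -1.0, 1.0)

def softsign(z):
    return z / (1.0 + np.abs(z))

def relu(z):
    # rectified linear unit: passes the signal through when active
    return np.maximum(z, 0.0)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(hardtanh(z), softsign(z), relu(z))
```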
Summary
Knowing the meaning of words!
You now understand the basics and the relation to other models
• Neuron = logistic regression or similar function
• Input layer = input training/test vector
• Bias unit = intercept term / always-on feature
• Activation = response/output
• Activation function is a logistic or other nonlinearity
• Backpropagation = running stochastic gradient descent through a multilayer network efficiently
• Weight decay = regularization / Bayesian prior smoothing
Word Representations
Part 1.3: The Basics
The standard word representation
The vast majority of rule-based and statistical NLP work regards words as atomic symbols: hotel, conference, walk
In vector space terms, this is a vector with one 1 and a lot of zeroes
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
Dimensionality: 20K (speech) – 50K (PTB) – 500K (big vocab) – 13M (Google 1T)
We call this a “one-hot” representation. Its problem:
motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0] AND
hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0] = 0
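A small sketch of the problem with a made-up toy vocabulary: the one-hot vectors for motel and hotel are orthogonal, so their dot product carries no notion of similarity.

```python
# One-hot representation: every word gets its own basis vector,
# so any two different words have similarity exactly 0.
import numpy as np

vocab = ["conference", "walk", "hotel", "motel"]   # toy vocabulary

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

print(one_hot("motel") @ one_hot("hotel"))   # 0.0 — no notion of similarity
```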
Distributional similarity based representations
You can get a lot of value by representing a word by means of its neighbors
“You shall know a word by the company it keeps” (J. R. Firth 1957: 11)
One of the most successful ideas of modern statistical NLP
government debt problems turning into banking crises as has happened in
saying that Europe needs unified banking regulation to replace the hodgepodge
These context words will represent banking
You can vary whether you use local or large context to get a more syntactic or semantic clustering
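A small sketch of the idea, using the two example sentences above as a made-up corpus: count the words within a small window around banking. The window size is the knob that trades off the more syntactic (small window) vs. more semantic (large window) flavor mentioned above.

```python
# Represent "banking" by counts of its neighboring words within a window.
from collections import Counter

corpus = [
    "government debt problems turning into banking crises as has happened in".split(),
    "saying that Europe needs unified banking regulation to replace the hodgepodge".split(),
]

window = 2                          # local context; make it larger for a more semantic flavor
context = Counter()
for sent in corpus:
    for i, w in enumerate(sent):
        if w == "banking":
            context.update(sent[max(0, i - window):i] + sent[i + 1:i + 1 + window])

print(context)   # counts for 'into', 'crises', 'unified', 'regulation', ...
```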
Two traditional word representations: Class-based and soft clustering
Class-based models learn word classes of similar words based on distributional information (~ class HMM)
• Brown clustering (Brown et al. 1992, Liang 2005)
• Exchange clustering (Martin et al. 1998, Clark 2003)
Soft clustering models learn for each cluster/topic a distribution over words of how likely that word is in each cluster
• Latent Semantic Analysis (LSA/LSI), Random projections
• Latent Dirichlet Allocation (LDA), HMM clustering
Neural word embeddings as a distributed representation
Similar idea
Combine vector space semantics with the prediction of probabilistic models (Bengio et al. 2003, Collobert & Weston 2008, Huang et al. 2012)
In all of these approaches, including deep learning models, a word is represented as a dense vector
linguistics = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]
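A small sketch of what a dense representation buys us: the linguistics vector is the one shown above; the second vector and the cosine-similarity computation are purely illustrative, not taken from any trained model.

```python
# Dense word vectors support graded similarity, unlike one-hot vectors.
import numpy as np

linguistics = np.array([0.286, 0.792, -0.177, -0.107,
                        0.109, -0.542, 0.349, 0.271])   # vector from the slide
syntax = np.array([0.25, 0.70, -0.20, -0.05,            # made-up neighbor vector
                   0.15, -0.50, 0.30, 0.30])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(linguistics, syntax))   # close to 1 for similar words
```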
Neural word embeddings – visualization
[Slide figure: 2-D visualization of word embeddings, from Kulkarni, Al-Rfou, Perozzi, & Skiena (2015)]
Advantages of the neural word embedding approach
Compared to other methods, neural word embeddings can become more meaningful through adding supervision from one or multiple tasks
For instance, sentiment is usually not captured in unsupervised word embeddings but can be in neural word vectors
We can build compositional vector representations for longer phrases (next lecture)