CS224N DeepNLP Week 7 Lecture 3 - Stanford University




Deep Learning for NLP, Part 3

CS224N, Christopher Manning

(Many slides borrowed from ACL 2012 / NAACL 2013 tutorials by me, Richard Socher, and Yoshua Bengio)

Backpropagation Training
Part 1.5: The Basics

Backprop

• Compute gradient of example-wise loss wrt parameters

• Simply applying the derivative chain rule wisely

• If computing the loss(example, parameters) is O(n) computation, then so is computing the gradient
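To make the O(n) claim concrete, here is a minimal sketch (not from the slides; the one-hidden-layer network, squared-error loss, and all names are illustrative assumptions). The backward pass reuses the forward intermediates, so it costs the same order as the forward pass:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def loss_and_grads(x, y, V, u):
        # Forward pass: h = sigmoid(V x), score = u . h, loss = 0.5 (score - y)^2
        a = V @ x
        h = sigmoid(a)
        score = u @ h
        loss = 0.5 * (score - y) ** 2
        # Backward pass: one chain-rule step per forward step (same order of cost)
        dscore = score - y            # dL/dscore
        du = dscore * h               # dL/du
        dh = dscore * u               # dL/dh
        da = dh * h * (1 - h)         # dL/da, through the sigmoid
        dV = np.outer(da, x)          # dL/dV
        return loss, dV, du

    rng = np.random.default_rng(0)
    x, y = rng.standard_normal(4), 1.0
    V, u = rng.standard_normal((3, 4)), rng.standard_normal(3)
    loss, dV, du = loss_and_grads(x, y, V, u)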

Simple Chain Rule

  dz/dx = (dz/dy) (dy/dx)

Multiple Paths Chain Rule

  dz/dx = (dz/dy1) (dy1/dx) + (dz/dy2) (dy2/dx)

Multiple Paths Chain Rule - General

  dz/dx = Σ_i (dz/dy_i) (dy_i/dx)


Chain Rule in Flow Graph

Flow graph: any directed acyclic graph
  node = computation result
  arc = computation dependency

With {y1, y2, ..., yn} = successors of x:

  dz/dx = Σ_i (dz/dy_i) (dy_i/dx)

Back-Prop in Multi-Layer Net


h = sigmoid(Vx)

Back-Prop in General Flow Graph

{y1, y2, ..., yn} = successors of x   (single scalar output z)

1. Fprop: visit nodes in topo-sort order
   - Compute the value of each node given its predecessors

2. Bprop:
   - Initialize the output gradient = 1
   - Visit nodes in reverse order:
     compute the gradient wrt each node using the gradient wrt its successors

Automatic Differentiation

• The gradient computation can be automatically inferred from the symbolic expression of the fprop.

• Each node type needs to know how to compute its output and how to compute the gradient wrt its inputs given the gradient wrt its output.

• Easy and fast prototyping. See:
  • Theano (Python),
  • TensorFlow (Python/C++), or
  • Autograd (Lua/C++ for Torch)
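A minimal sketch of this idea (purely illustrative; this is not the API of Theano, TensorFlow, or torch-autograd): each node type implements a forward rule and a backward rule, values are computed in topo-sort order, and gradients flow in reverse order starting from an output gradient of 1.

    import math

    class Node:
        def __init__(self, *inputs):
            self.inputs = list(inputs)   # predecessor nodes
            self.value = None            # filled in by forward()
            self.grad = 0.0              # accumulated dz/d(this node)

    class Input(Node):
        def __init__(self, value):
            super().__init__()
            self.value = value
        def forward(self): pass
        def backward(self): pass

    class Mul(Node):
        def forward(self):
            a, b = self.inputs
            self.value = a.value * b.value
        def backward(self):
            a, b = self.inputs
            a.grad += self.grad * b.value    # d(ab)/da = b
            b.grad += self.grad * a.value    # d(ab)/db = a

    class Sigmoid(Node):
        def forward(self):
            (a,) = self.inputs
            self.value = 1.0 / (1.0 + math.exp(-a.value))
        def backward(self):
            (a,) = self.inputs
            a.grad += self.grad * self.value * (1.0 - self.value)

    def run(nodes_in_topo_order):
        for node in nodes_in_topo_order:            # fprop: topo-sort order
            node.forward()
        nodes_in_topo_order[-1].grad = 1.0          # initialize output gradient = 1
        for node in reversed(nodes_in_topo_order):  # bprop: reverse order
            node.backward()

    # Tiny example: z = sigmoid(x * w); after run(), w.grad holds dz/dw.
    x, w = Input(2.0), Input(-1.0)
    h = Mul(x, w)
    z = Sigmoid(h)
    run([x, w, h, z])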

Deep Learning General Strategy and Tricks


General Strategy

1. Select network structure appropriate for problem
   1. Structure: single words, fixed windows vs. convolutional vs. recurrent/recursive sentence-based vs. bag of words
   2. Nonlinearities [covered earlier]
2. Check for implementation bugs with gradient checks
3. Parameter initialization
4. Optimization
5. Check if the model is powerful enough to overfit
   1. If not, change model structure or make model "larger"
   2. If you can overfit: regularize


Gradient Checks are Awesome!

• Allows you to know that there are no bugs in your neural network implementation! (But makes it run really slow.)

• Steps:
  1. Implement your gradient
  2. Implement a finite difference computation by looping through the parameters of your network, adding and subtracting a small epsilon (~10^-4), and estimate derivatives
  3. Compare the two and make sure they are almost the same
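A minimal sketch of such a gradient check (assumed interface; loss_fn, grad_fn, and the tolerance are illustrative, not from the slides):

    import numpy as np

    def grad_check(loss_fn, grad_fn, theta, eps=1e-4, tol=1e-6):
        # Compare the analytic gradient with centered finite differences.
        analytic = grad_fn(theta)
        numeric = np.zeros_like(theta)
        for i in range(theta.size):
            old = theta.flat[i]
            theta.flat[i] = old + eps
            loss_plus = loss_fn(theta)
            theta.flat[i] = old - eps
            loss_minus = loss_fn(theta)
            theta.flat[i] = old                          # restore the parameter
            numeric.flat[i] = (loss_plus - loss_minus) / (2 * eps)
        denom = np.maximum(1e-8, np.abs(analytic) + np.abs(numeric))
        return np.max(np.abs(analytic - numeric) / denom) < tol

    # Sanity check on a quadratic loss whose gradient is known exactly.
    theta = np.random.randn(5)
    assert grad_check(lambda t: 0.5 * np.sum(t ** 2), lambda t: t.copy(), theta)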

Parameter Initialization

• Parameter initialization can be very important for success!
• Initialize hidden layer biases to 0 and output (or reconstruction) biases to optimal value if weights were 0 (e.g., mean target or inverse sigmoid of mean target).
• Initialize weights ~ Uniform(−r, r), with r inversely proportional to fan-in (previous layer size) and fan-out (next layer size): for tanh units, and 4x bigger for sigmoid units [Glorot AISTATS 2010]
• Make initialization slightly positive for ReLU - to avoid dead units
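A minimal sketch of this recipe (the constant is an assumption taken from Glorot & Bengio's AISTATS 2010 uniform scheme, r = sqrt(6 / (fan-in + fan-out)); names are illustrative):

    import numpy as np

    def init_layer(fan_in, fan_out, unit="tanh", rng=np.random.default_rng(0)):
        r = np.sqrt(6.0 / (fan_in + fan_out))     # Glorot-style range for tanh units
        if unit == "sigmoid":
            r *= 4.0                              # 4x bigger for sigmoid units
        W = rng.uniform(-r, r, size=(fan_out, fan_in))
        b = np.zeros(fan_out)                     # hidden-layer biases start at 0
        return W, b

    W1, b1 = init_layer(300, 100, unit="tanh")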

Stochastic Gradient Descent (SGD)

• Gradient descent uses the total gradient over all examples per update
  • Shouldn't be used. Very slow.
• SGD updates after each example:

  θ ← θ − ε_t ∇_θ L(z_t, θ)

• L = loss function, z_t = current example, θ = parameter vector, and ε_t = learning rate.
• You process an example and then move each parameter a small distance by subtracting a fraction of the gradient
• ε_t should be small ... more in following slide
• Important: apply all SGD updates at once after the backprop pass
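A minimal SGD loop in this spirit (illustrative sketch; grad_fn, the learning rate, and the toy least-squares example are assumptions):

    import numpy as np

    def sgd(grad_fn, theta, examples, lr=0.01, epochs=1):
        for _ in range(epochs):
            np.random.shuffle(examples)
            for z in examples:
                g = grad_fn(theta, z)     # gradient of the loss on one example
                theta -= lr * g           # apply the whole update after the backprop pass
        return theta

    # Toy example: least squares, loss 0.5 * (w . x - y)^2 per example (x, y).
    rng = np.random.default_rng(0)
    data = [(rng.standard_normal(3), rng.standard_normal()) for _ in range(100)]
    grad = lambda w, z: (w @ z[0] - z[1]) * z[0]
    w = sgd(grad, np.zeros(3), data, lr=0.1, epochs=5)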

Stochastic Gradient Descent (SGD)

• Rather than do SGD on a single example, people usually do it on a minibatch of 32, 64, or 128 examples.
• You sum the gradients in the minibatch (and scale down the learning rate)
• Minor advantage: the gradient estimate is much more robust when estimated on a bunch of examples rather than just one.
• Major advantage: code can run much faster iff you can do a whole minibatch at once via matrix-matrix multiplies
• There is a whole panoply of fancier online learning algorithms commonly used now with NNs. Good ones include:
  • Adagrad
  • RMSprop
  • ADAM
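A minimal sketch of why minibatching is fast (assumed one-hidden-layer model with h = sigmoid(V x); shapes and names are illustrative): stacking the minibatch as columns of X turns many matrix-vector products into one matrix-matrix multiply.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def minibatch_grads(V, u, X, Y):
        # Summed gradients of 0.5 * (u . h_i - y_i)^2 over a minibatch,
        # where h_i = sigmoid(V x_i) and X stacks the x_i as columns.
        H = sigmoid(V @ X)                 # (hidden, batch): one matrix-matrix multiply
        scores = u @ H                     # (batch,)
        dscores = scores - Y
        du = H @ dscores                   # sum of per-example gradients for u
        dA = (u[:, None] * dscores) * H * (1 - H)
        dV = dA @ X.T                      # sum of per-example gradients for V
        return dV, du

    rng = np.random.default_rng(0)
    V, u = rng.standard_normal((50, 100)), rng.standard_normal(50)
    X, Y = rng.standard_normal((100, 64)), rng.standard_normal(64)   # minibatch of 64
    dV, du = minibatch_grads(V, u, X, Y)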

Learning Rates

• Setting α correctly is tricky
• Simplest recipe: α fixed and same for all parameters
• Or start with a learning rate just small enough to be stable in the first pass through the data (epoch), then halve it on subsequent epochs
• Better results can usually be obtained by using a curriculum for decreasing learning rates, typically in O(1/t) because of theoretical convergence guarantees, e.g., with hyper-parameters ε_0 and τ
• Better yet: no hand-set learning rates, by using methods like AdaGrad (Duchi, Hazan, & Singer 2011) [but may converge too soon - try resetting accumulated gradients]
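Two illustrative sketches of the ideas above (assumptions, not taken verbatim from the slides): an O(1/t) decay in the style recommended by Bengio (2012) with hyper-parameters ε_0 and τ, and a bare-bones AdaGrad update.

    import numpy as np

    def lr_schedule(t, eps0=0.1, tau=1000):
        # O(1/t) decay: roughly constant for the first tau updates, then ~ eps0 * tau / t.
        return eps0 * tau / max(t, tau)

    def adagrad_step(theta, grad, accum, lr=0.01, eps=1e-8):
        accum += grad ** 2                              # accumulated squared gradients
        theta -= lr * grad / (np.sqrt(accum) + eps)     # per-parameter scaled step
        return theta, accum

    # To counter "may converge too soon", accum can periodically be reset to zeros.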

Attempt to overfit training data

Assuming you found the right network structure, implemented it correctly, and optimized it properly, you can make your model totally overfit on your training data (99%+ accuracy)

• If not:
  • Change architecture
  • Make model bigger (bigger vectors, more layers)
  • Fix optimization
• If yes:
  • Now, it's time to regularize the network


Prevent Overfitting: Model Size and Regularization

• Simple first step: Reduce model size by lowering number of units and layers and other parameters
• Standard L1 or L2 regularization on weights
• Early Stopping: Use parameters that gave best validation error
• Sparsity constraints on hidden activations, e.g., added as an extra term to the cost
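A minimal sketch of two of these regularizers (assumed training interface; train_epoch, validation_error, and patience are illustrative names, not from the slides):

    import numpy as np

    def l2_penalty(weights, lam=1e-4):
        # Added to the cost; contributes lam * W to each weight's gradient.
        return 0.5 * lam * sum(np.sum(W ** 2) for W in weights)

    def train_with_early_stopping(train_epoch, validation_error, params, patience=5):
        best_err, best_params, bad_epochs = np.inf, params, 0
        while bad_epochs < patience:
            params = train_epoch(params)
            err = validation_error(params)
            if err < best_err:
                best_err, best_params, bad_epochs = err, params, 0
            else:
                bad_epochs += 1
        return best_params          # the parameters that gave the best validation error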

Prevent Feature Co-adaptation

Dropout (Hinton et al. 2012) http://jmlr.org/papers/v15/srivastava14a.html

• Training time: at each instance of evaluation (in online SGD-training), randomly set 50% of the inputs to each neuron to 0
• Test time: halve the model weights (now twice as many)
• This prevents feature co-adaptation: a feature cannot only be useful in the presence of particular other features
• A kind of middle ground between Naïve Bayes (where all feature weights are set independently) and logistic regression models (where weights are set in the context of all the others)
• Can be thought of as a form of model bagging
• It acts as a strong regularizer; see (Wager et al. 2013) http://arxiv.org/abs/1307.1493
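A minimal sketch of the recipe above (standard, non-inverted dropout; shapes, the 50% default rate, and names are illustrative assumptions):

    import numpy as np

    def dropout_train(h, p_drop=0.5, rng=np.random.default_rng(0)):
        mask = rng.random(h.shape) >= p_drop    # randomly zero out 50% of the inputs
        return h * mask

    def dropout_test_weights(W, p_drop=0.5):
        return W * (1.0 - p_drop)               # at test time, halve the weights instead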

Deep Learning Tricks of the Trade

• Y. Bengio (2012), "Practical Recommendations for Gradient-Based Training of Deep Architectures" http://arxiv.org/abs/1206.5533
  • Unsupervised pre-training
  • Stochastic gradient descent and setting learning rates
  • Main hyper-parameters
    • Learning rate schedule & early stopping, minibatches, parameter initialization, number of hidden units, L1 or L2 weight decay, ...
• Y. Bengio, I. Goodfellow, and A. Courville (in press), "Deep Learning". MIT Press, ms. http://goodfeli.github.io/dlbook/
  • Many chapters on deep learning, including optimization tricks
  • Some more recent stuff than the 2012 paper

Sharing Statistical Strength
Part 1.7

Sharing Statistical Strength

• Besides very fast prediction, the main advantage of deep learning is statistical

• Potential to learn from less labeled examples because of sharing of statistical strength:
  • Unsupervised pre-training
  • Multi-task learning
  • Semi-supervised learning

Multi-Task Learning

• Generalizing better to new tasks is crucial to approach AI
• Deep architectures learn good intermediate representations that can be shared across tasks
• Good representations make sense for many tasks

[Figure: task outputs y1, y2, y3 each computed from a shared intermediate representation h, which is computed from the raw input x]


Semi-Supervised Learning

• Hypothesis: P(c|x) can be more accurately computed using shared structure with P(x)

[Figure: the purely supervised setting]

Semi-Supervised Learning

• Hypothesis: P(c|x) can be more accurately computed using shared structure with P(x)

[Figure: the semi-supervised setting]

Advantages of Deep Learning, Part 2
Part 1.1: The Basics

#4 Unsupervised feature learning

Today, most practical, good NLP & ML methods require labeled training data (i.e., supervised learning)

But almost all data is unlabeled

Most information must be acquired unsupervised

Fortunately, a good model of observed data can really help you learn classification decisions

Commentary: This is more the dream than the reality; most of the recent successes of deep learning have come from regular supervised learning over very large datasets

#5 Handling the recursivity of language

Human sentences are composed from words and phrases

We need compositionality in our ML models

Recursion: the same operator (same parameters) is applied repeatedly on different components

Recurrent models: recursion along a temporal sequence

[Figure: a parse tree for "A small crowd quietly enters the historic church" (NP "A small crowd"; VP "quietly enters the historic church") mapped to semantic representations, alongside a recurrent chain with inputs x_{t-1}, x_t, x_{t+1} and hidden states z_{t-1}, z_t, z_{t+1}]

#6 Why now?

Despite prior investigation and understanding of many of the algorithmic techniques ...

Before 2006, training deep architectures was unsuccessful :(

What has changed?
• New methods for unsupervised pre-training (Restricted Boltzmann Machines = RBMs, autoencoders, contrastive estimation, etc.) and deep model training developed
• More efficient parameter estimation methods
• Better understanding of model regularization
• More data and more computational power


Deep Learning models have already achieved impressive results for HLT

Neural Language Model [Mikolov et al. Interspeech 2011]

  Model (WSJ ASR task)       | Eval WER
  KN5 Baseline               | 17.2
  Discriminative LM          | 16.9
  Recurrent NN combination   | 14.4

MSR MAVIS Speech System [Dahl et al. 2012; Seide et al. 2011; following Mohamed et al. 2011]

"The algorithms represent the first time a company has released a deep-neural-networks (DNN)-based speech-recognition algorithm in a commercial product."

  Acoustic model & training          | Recog         | WER: RT03S FSH | WER: Hub5 SWB
  GMM 40-mix, BMMI, SWB 309h         | 1-pass −adapt | 27.4           | 23.6
  DBN-DNN 7 layer x 2048, SWB 309h   | 1-pass −adapt | 18.5 (−33%)    | 16.1 (−32%)
  GMM 72-mix, BMMI, FSH 2000h        | k-pass +adapt | 18.6           | 17.1

Deep Learning Models Have Interesting Performance Characteristics

Deep learning models can now be very fast in some circumstances

• SENNA [Collobert et al. 2011] can do POS or NER faster than other SOTA taggers (16x to 122x), using 25x less memory
  • WSJ POS 97.29% acc; CoNLL NER 89.59% F1; CoNLL Chunking 94.32% F1

Changes in computing technology favor deep learning
• In NLP, speed has traditionally come from exploiting sparsity
• But with modern machines, branches and widely spaced memory accesses are costly
• Uniform parallel operations on dense vectors are faster

These trends are even stronger with multi-core CPUs and GPUs

TREE STRUCTURES WITH CONTINUOUS VECTORS


Compositionality: We need more than word vectors and bags! What of larger semantic units?

How can we know when larger units are similar in meaning?

• The snowboarder is leaping over the mogul
• A person on a snowboard jumps into the air

People interpret the meaning of larger text units – entities, descriptive terms, facts, arguments, stories – by semantic composition of smaller elements


Representing Phrases as Vectors

[Figure: a 2D vector space (x1, x2) with word points Monday (9, 2), Tuesday (9.5, 1.5), France (2, 2.5), Germany (1, 3), and the phrases "the country of my birth" and "the place where I was born" plotted close together]

Vectors for single words are useful as features but limited

Can we extend ideas of word vector spaces to phrases?

If the vector space captures syntactic and semantic information, the vectors can be used as features for parsing and interpretation

How should we map phrases into a vector space?

[Figure: the phrase "the country of my birth" parsed into a tree, with a learned vector at each word and each internal node, plotted in the same 2D space as the word vectors above, near "the place where I was born"]

Use the principle of compositionality!

The meaning (vector) of a sentence is determined by
(1) the meanings of its words and
(2) the rules that combine them.

The model jointly learns compositional vector representations and tree structure.

Tree Recursive Neural Networks (TreeRNNs)

Computational unit: a simple neural network layer, applied recursively

[Figure: the words of a phrase such as "on the mat." each start with a vector; the same neural network layer combines two child vectors into a parent vector and a score at each tree node]

(Goller & Küchler 1996, Costa et al. 2003, Socher et al. ICML 2011)

Version 1: Simple concatenation TreeRNN

• Only a single weight matrix = the composition function!
• No real interaction between the input words!
• Not adequate as a composition function for human language

  p = tanh(W [c1; c2] + b),   where tanh is applied elementwise
  score = V^T p
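A minimal sketch of this composition function (dimensions, initialization, and names are illustrative assumptions):

    import numpy as np

    d = 4                                           # word/phrase vector dimension
    rng = np.random.default_rng(0)
    W = rng.standard_normal((d, 2 * d)) * 0.1       # the single composition matrix
    b = np.zeros(d)
    v = rng.standard_normal(d) * 0.1                # scoring vector (V in the slide)

    def compose(c1, c2):
        p = np.tanh(W @ np.concatenate([c1, c2]) + b)   # p = tanh(W [c1; c2] + b)
        score = v @ p                                   # score = V^T p
        return p, score

    c1, c2 = rng.standard_normal(d), rng.standard_normal(d)
    p, score = compose(c1, c2)       # p can itself be composed further up the tree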

Version 4: Recursive Neural Tensor Network

Allows the two word or phrase vectors to interact multiplicatively

Beyond the bag of words: Sentiment detection

Is the tone of a piece of text positive, negative, or neutral?

• The sentiment is that sentiment is "easy"
• Detection accuracy for longer documents ~90%, BUT

  ... loved ... great ... impressed ... marvelous ...


Stanford Sentiment Treebank

• 215,154 phrases labeled in 11,855 sentences
• Can actually train and test compositions

http://nlp.stanford.edu:8080/sentiment/

Better Dataset Helped All Models

• Hard negation cases are still mostly incorrect
• We also need a more powerful model!

[Figure: bar chart of positive/negative accuracy (roughly 75-84%) for BiNB, RNN, and MV-RNN, comparing training with sentence labels vs. training with the Treebank]

Version 4: Recursive Neural Tensor Network

Idea: Allow both additive and mediated multiplicative interactions of vectors
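A minimal sketch of a tensor-based composition in this spirit (illustrative only; it follows the recursive neural tensor network of Socher et al. approximately, and dimensions and names are assumptions):

    import numpy as np

    d = 4
    rng = np.random.default_rng(0)
    W = rng.standard_normal((d, 2 * d)) * 0.1          # additive (plain TreeRNN) part
    V = rng.standard_normal((d, 2 * d, 2 * d)) * 0.1   # one 2d x 2d tensor slice per output dim
    b = np.zeros(d)

    def rntn_compose(c1, c2):
        c = np.concatenate([c1, c2])                       # [c1; c2]
        tensor_term = np.einsum('i,kij,j->k', c, V, c)     # c^T V[k] c for each slice k
        return np.tanh(tensor_term + W @ c + b)            # multiplicative + additive interactions

    p = rntn_compose(rng.standard_normal(d), rng.standard_normal(d))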

Recursive Neural Tensor Network

• Use resulting vectors in tree as input to a classifier like logistic regression
• Train all weights jointly with gradient descent
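A minimal sketch of the classifier step (the 5 sentiment classes follow the Sentiment Treebank setup; the softmax layer and names are illustrative assumptions):

    import numpy as np

    n_classes, d = 5, 4
    rng = np.random.default_rng(0)
    Ws = rng.standard_normal((n_classes, d)) * 0.1   # classifier weights, trained jointly

    def predict_sentiment(node_vector):
        logits = Ws @ node_vector
        exp = np.exp(logits - logits.max())          # softmax over sentiment classes
        return exp / exp.sum()

    probs = predict_sentiment(rng.standard_normal(d))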


Positive/Negative Results on Treebank

[Figure: bar chart of sentence-level positive/negative accuracy (roughly 74-86%) for BiNB, RNN, MV-RNN, and RNTN, comparing training with sentence labels vs. training with the Treebank]

Classifying sentences: accuracy improves to 85.4

Note: for more recent work, see Le & Mikolov (2014), Irsoy & Cardie (2014), Tai et al. (2015)

Experimental Results on Treebank

• RNTN can capture constructions like X but Y
• RNTN accuracy of 72%, compared to MV-RNN (65%), biword NB (58%), and RNN (54%)

Negation Results

When negating negatives, positive activation should increase!

Demo: http://nlp.stanford.edu:8080/sentiment/

Conclusion

Developing intelligent machines involves being able to recognize and exploit compositional structure

It also involves other things like top-down prediction, of course

Work is now under way on how to do more complex tasks than straight classification inside deep learning systems