CS224N DeepNLP Week 7 Lecture 3 - Stanford University




Deep Learning for NLP, Part 3

CS224N, Christopher Manning

(Many slides borrowed from ACL 2012 / NAACL 2013 tutorials by me, Richard Socher, and Yoshua Bengio)

Backpropagation Training
Part 1.5: The Basics

Backprop

• Compute gradient of example-wise loss wrt parameters

• Simply applying the derivative chain rule wisely

• If computing the loss(example, parameters) is O(n) computation, then so is computing the gradient
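To make the O(n) claim concrete, here is a minimal sketch (not from the slides; the one-hidden-layer network, squared-error loss, and all names are illustrative assumptions). The backward pass reuses the forward intermediates, so it costs the same order as the forward pass:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def loss_and_grads(x, y, V, u):
        # Forward pass: h = sigmoid(V x), score = u . h, loss = 0.5 (score - y)^2
        a = V @ x
        h = sigmoid(a)
        score = u @ h
        loss = 0.5 * (score - y) ** 2
        # Backward pass: one chain-rule step per forward step (same order of cost)
        dscore = score - y            # dL/dscore
        du = dscore * h               # dL/du
        dh = dscore * u               # dL/dh
        da = dh * h * (1 - h)         # dL/da, through the sigmoid
        dV = np.outer(da, x)          # dL/dV
        return loss, dV, du

    rng = np.random.default_rng(0)
    x, y = rng.standard_normal(4), 1.0
    V, u = rng.standard_normal((3, 4)), rng.standard_normal(3)
    loss, dV, du = loss_and_grads(x, y, V, u)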

Simple Chain Rule

  dz/dx = (dz/dy) (dy/dx)

Multiple Paths Chain Rule

  dz/dx = (dz/dy1) (dy1/dx) + (dz/dy2) (dy2/dx)

Multiple Paths Chain Rule - General

  dz/dx = Σ_i (dz/dy_i) (dy_i/dx)


Chain Rule in Flow Graph

Flow graph: any directed acyclic graph
  node = computation result
  arc = computation dependency

With {y1, y2, ..., yn} = successors of x:

  dz/dx = Σ_i (dz/dy_i) (dy_i/dx)

Back-Prop in Multi-Layer Net


h = sigmoid(Vx)

Back-Prop in General Flow Graph

{y1, y2, ..., yn} = successors of x   (single scalar output z)

1. Fprop: visit nodes in topo-sort order
   - Compute the value of each node given its predecessors

2. Bprop:
   - Initialize the output gradient = 1
   - Visit nodes in reverse order:
     compute the gradient wrt each node using the gradient wrt its successors

Automatic Differentiation

• The gradient computation can be automatically inferred from the symbolic expression of the fprop.

• Each node type needs to know how to compute its output and how to compute the gradient wrt its inputs given the gradient wrt its output.

• Easy and fast prototyping. See:
  • Theano (Python),
  • TensorFlow (Python/C++), or
  • Autograd (Lua/C++ for Torch)
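A minimal sketch of this idea (purely illustrative; this is not the API of Theano, TensorFlow, or torch-autograd): each node type implements a forward rule and a backward rule, values are computed in topo-sort order, and gradients flow in reverse order starting from an output gradient of 1.

    import math

    class Node:
        def __init__(self, *inputs):
            self.inputs = list(inputs)   # predecessor nodes
            self.value = None            # filled in by forward()
            self.grad = 0.0              # accumulated dz/d(this node)

    class Input(Node):
        def __init__(self, value):
            super().__init__()
            self.value = value
        def forward(self): pass
        def backward(self): pass

    class Mul(Node):
        def forward(self):
            a, b = self.inputs
            self.value = a.value * b.value
        def backward(self):
            a, b = self.inputs
            a.grad += self.grad * b.value    # d(ab)/da = b
            b.grad += self.grad * a.value    # d(ab)/db = a

    class Sigmoid(Node):
        def forward(self):
            (a,) = self.inputs
            self.value = 1.0 / (1.0 + math.exp(-a.value))
        def backward(self):
            (a,) = self.inputs
            a.grad += self.grad * self.value * (1.0 - self.value)

    def run(nodes_in_topo_order):
        for node in nodes_in_topo_order:            # fprop: topo-sort order
            node.forward()
        nodes_in_topo_order[-1].grad = 1.0          # initialize output gradient = 1
        for node in reversed(nodes_in_topo_order):  # bprop: reverse order
            node.backward()

    # Tiny example: z = sigmoid(x * w); after run(), w.grad holds dz/dw.
    x, w = Input(2.0), Input(-1.0)
    h = Mul(x, w)
    z = Sigmoid(h)
    run([x, w, h, z])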

Deep Learning General Strategy and Tricks


General Strategy

1. Select network structure appropriate for problem
   1. Structure: single words, fixed windows vs. convolutional vs. recurrent/recursive sentence-based vs. bag of words
   2. Nonlinearities [covered earlier]
2. Check for implementation bugs with gradient checks
3. Parameter initialization
4. Optimization
5. Check if the model is powerful enough to overfit
   1. If not, change model structure or make model "larger"
   2. If you can overfit: regularize


Gradient Checks are Awesome!

• Allows you to know that there are no bugs in your neural network implementation! (But makes it run really slow.)

• Steps:
  1. Implement your gradient
  2. Implement a finite difference computation by looping through the parameters of your network, adding and subtracting a small epsilon (~10^-4), and estimate derivatives
  3. Compare the two and make sure they are almost the same
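A minimal sketch of such a gradient check (assumed interface; loss_fn, grad_fn, and the tolerance are illustrative, not from the slides):

    import numpy as np

    def grad_check(loss_fn, grad_fn, theta, eps=1e-4, tol=1e-6):
        # Compare the analytic gradient with centered finite differences.
        analytic = grad_fn(theta)
        numeric = np.zeros_like(theta)
        for i in range(theta.size):
            old = theta.flat[i]
            theta.flat[i] = old + eps
            loss_plus = loss_fn(theta)
            theta.flat[i] = old - eps
            loss_minus = loss_fn(theta)
            theta.flat[i] = old                          # restore the parameter
            numeric.flat[i] = (loss_plus - loss_minus) / (2 * eps)
        denom = np.maximum(1e-8, np.abs(analytic) + np.abs(numeric))
        return np.max(np.abs(analytic - numeric) / denom) < tol

    # Sanity check on a quadratic loss whose gradient is known exactly.
    theta = np.random.randn(5)
    assert grad_check(lambda t: 0.5 * np.sum(t ** 2), lambda t: t.copy(), theta)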

Parameter Initialization

• Parameter initialization can be very important for success!
• Initialize hidden layer biases to 0 and output (or reconstruction) biases to optimal value if weights were 0 (e.g., mean target or inverse sigmoid of mean target).
• Initialize weights ~ Uniform(−r, r), with r inversely proportional to fan-in (previous layer size) and fan-out (next layer size): for tanh units, and 4x bigger for sigmoid units [Glorot AISTATS 2010]
• Make initialization slightly positive for ReLU - to avoid dead units
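A minimal sketch of this recipe (the constant is an assumption taken from Glorot & Bengio's AISTATS 2010 uniform scheme, r = sqrt(6 / (fan-in + fan-out)); names are illustrative):

    import numpy as np

    def init_layer(fan_in, fan_out, unit="tanh", rng=np.random.default_rng(0)):
        r = np.sqrt(6.0 / (fan_in + fan_out))     # Glorot-style range for tanh units
        if unit == "sigmoid":
            r *= 4.0                              # 4x bigger for sigmoid units
        W = rng.uniform(-r, r, size=(fan_out, fan_in))
        b = np.zeros(fan_out)                     # hidden-layer biases start at 0
        return W, b

    W1, b1 = init_layer(300, 100, unit="tanh")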

Stochastic Gradient Descent (SGD)

• Gradient descent uses the total gradient over all examples per update
  • Shouldn't be used. Very slow.
• SGD updates after each example:

  θ ← θ − ε_t ∇_θ L(z_t, θ)

• L = loss function, z_t = current example, θ = parameter vector, and ε_t = learning rate.
• You process an example and then move each parameter a small distance by subtracting a fraction of the gradient
• ε_t should be small ... more in following slide
• Important: apply all SGD updates at once after the backprop pass
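A minimal SGD loop in this spirit (illustrative sketch; grad_fn, the learning rate, and the toy least-squares example are assumptions):

    import numpy as np

    def sgd(grad_fn, theta, examples, lr=0.01, epochs=1):
        for _ in range(epochs):
            np.random.shuffle(examples)
            for z in examples:
                g = grad_fn(theta, z)     # gradient of the loss on one example
                theta -= lr * g           # apply the whole update after the backprop pass
        return theta

    # Toy example: least squares, loss 0.5 * (w . x - y)^2 per example (x, y).
    rng = np.random.default_rng(0)
    data = [(rng.standard_normal(3), rng.standard_normal()) for _ in range(100)]
    grad = lambda w, z: (w @ z[0] - z[1]) * z[0]
    w = sgd(grad, np.zeros(3), data, lr=0.1, epochs=5)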

Stochastic Gradient Descent (SGD)

• Rather than do SGD on a single example, people usually do it on a minibatch of 32, 64, or 128 examples.
• You sum the gradients in the minibatch (and scale down the learning rate)
• Minor advantage: the gradient estimate is much more robust when estimated on a bunch of examples rather than just one.
• Major advantage: code can run much faster iff you can do a whole minibatch at once via matrix-matrix multiplies
• There is a whole panoply of fancier online learning algorithms commonly used now with NNs. Good ones include:
  • Adagrad
  • RMSprop
  • ADAM
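A minimal sketch of why minibatching is fast (assumed one-hidden-layer model with h = sigmoid(V x); shapes and names are illustrative): stacking the minibatch as columns of X turns many matrix-vector products into one matrix-matrix multiply.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def minibatch_grads(V, u, X, Y):
        # Summed gradients of 0.5 * (u . h_i - y_i)^2 over a minibatch,
        # where h_i = sigmoid(V x_i) and X stacks the x_i as columns.
        H = sigmoid(V @ X)                 # (hidden, batch): one matrix-matrix multiply
        scores = u @ H                     # (batch,)
        dscores = scores - Y
        du = H @ dscores                   # sum of per-example gradients for u
        dA = (u[:, None] * dscores) * H * (1 - H)
        dV = dA @ X.T                      # sum of per-example gradients for V
        return dV, du

    rng = np.random.default_rng(0)
    V, u = rng.standard_normal((50, 100)), rng.standard_normal(50)
    X, Y = rng.standard_normal((100, 64)), rng.standard_normal(64)   # minibatch of 64
    dV, du = minibatch_grads(V, u, X, Y)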

Learning Rates

• Setting α correctly is tricky
• Simplest recipe: α fixed and same for all parameters
• Or start with a learning rate just small enough to be stable in the first pass through the data (epoch), then halve it on subsequent epochs
• Better results can usually be obtained by using a curriculum for decreasing learning rates, typically in O(1/t) because of theoretical convergence guarantees, e.g., with hyper-parameters ε_0 and τ
• Better yet: no hand-set learning rates, by using methods like AdaGrad (Duchi, Hazan, & Singer 2011) [but may converge too soon - try resetting accumulated gradients]
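Two illustrative sketches of the ideas above (assumptions, not taken verbatim from the slides): an O(1/t) decay in the style recommended by Bengio (2012) with hyper-parameters ε_0 and τ, and a bare-bones AdaGrad update.

    import numpy as np

    def lr_schedule(t, eps0=0.1, tau=1000):
        # O(1/t) decay: roughly constant for the first tau updates, then ~ eps0 * tau / t.
        return eps0 * tau / max(t, tau)

    def adagrad_step(theta, grad, accum, lr=0.01, eps=1e-8):
        accum += grad ** 2                              # accumulated squared gradients
        theta -= lr * grad / (np.sqrt(accum) + eps)     # per-parameter scaled step
        return theta, accum

    # To counter "may converge too soon", accum can periodically be reset to zeros.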

Attempt to overfit training data

Assuming you found the right network structure, implemented it correctly, and optimized it properly, you can make your model totally overfit on your training data (99%+ accuracy)

• If not:
  • Change architecture
  • Make model bigger (bigger vectors, more layers)
  • Fix optimization
• If yes:
  • Now, it's time to regularize the network


Prevent Overfitting: Model Size and Regularization

• Simple first step: Reduce model size by lowering number of units and layers and other parameters
• Standard L1 or L2 regularization on weights
• Early Stopping: Use parameters that gave best validation error
• Sparsity constraints on hidden activations, e.g., added as an extra term to the cost
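A minimal sketch of two of these regularizers (assumed training interface; train_epoch, validation_error, and patience are illustrative names, not from the slides):

    import numpy as np

    def l2_penalty(weights, lam=1e-4):
        # Added to the cost; contributes lam * W to each weight's gradient.
        return 0.5 * lam * sum(np.sum(W ** 2) for W in weights)

    def train_with_early_stopping(train_epoch, validation_error, params, patience=5):
        best_err, best_params, bad_epochs = np.inf, params, 0
        while bad_epochs < patience:
            params = train_epoch(params)
            err = validation_error(params)
            if err < best_err:
                best_err, best_params, bad_epochs = err, params, 0
            else:
                bad_epochs += 1
        return best_params          # the parameters that gave the best validation error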

Prevent Feature Co-adaptation

Dropout (Hinton et al. 2012) http://jmlr.org/papers/v15/srivastava14a.html

• Training time: at each instance of evaluation (in online SGD-training), randomly set 50% of the inputs to each neuron to 0
• Test time: halve the model weights (now twice as many)
• This prevents feature co-adaptation: a feature cannot only be useful in the presence of particular other features
• A kind of middle ground between Naïve Bayes (where all feature weights are set independently) and logistic regression models (where weights are set in the context of all the others)
• Can be thought of as a form of model bagging
• It acts as a strong regularizer; see (Wager et al. 2013) http://arxiv.org/abs/1307.1493
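A minimal sketch of the recipe above (standard, non-inverted dropout; shapes, the 50% default rate, and names are illustrative assumptions):

    import numpy as np

    def dropout_train(h, p_drop=0.5, rng=np.random.default_rng(0)):
        mask = rng.random(h.shape) >= p_drop    # randomly zero out 50% of the inputs
        return h * mask

    def dropout_test_weights(W, p_drop=0.5):
        return W * (1.0 - p_drop)               # at test time, halve the weights instead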

Deep Learning Tricks of the Trade

• Y. Bengio (2012), "Practical Recommendations for Gradient-Based Training of Deep Architectures" http://arxiv.org/abs/1206.5533
  • Unsupervised pre-training
  • Stochastic gradient descent and setting learning rates
  • Main hyper-parameters
    • Learning rate schedule & early stopping, minibatches, parameter initialization, number of hidden units, L1 or L2 weight decay, ...
• Y. Bengio, I. Goodfellow, and A. Courville (in press), "Deep Learning". MIT Press, ms. http://goodfeli.github.io/dlbook/
  • Many chapters on deep learning, including optimization tricks
  • Some more recent stuff than the 2012 paper

Sharing Statistical Strength
Part 1.7

Sharing Statistical Strength

• Besides very fast prediction, the main advantage of deep learning is statistical

• Potential to learn from less labeled examples because of sharing of statistical strength:
  • Unsupervised pre-training
  • Multi-task learning
  • Semi-supervised learning

Multi-Task Learning

• Generalizing better to new tasks is crucial to approach AI
• Deep architectures learn good intermediate representations that can be shared across tasks
• Good representations make sense for many tasks

[Figure: task outputs y1, y2, y3 each computed from a shared intermediate representation h, which is computed from the raw input x]


Semi-Supervised Learning

• Hypothesis: P(c|x) can be more accurately computed using shared structure with P(x)

[Figure: the purely supervised setting]

Semi-Supervised Learning

• Hypothesis: P(c|x) can be more accurately computed using shared structure with P(x)

[Figure: the semi-supervised setting]

Advantages of Deep Learning, Part 2
Part 1.1: The Basics

#4 Unsupervised feature learning

Today, most practical, good NLP & ML methods require labeled training data (i.e., supervised learning)

But almost all data is unlabeled

Most information must be acquired unsupervised

Fortunately, a good model of observed data can really help you learn classification decisions

Commentary: This is more the dream than the reality; most of the recent successes of deep learning have come from regular supervised learning over very large datasets

#5 Handling the recursivity of language

Human sentences are composed from words and phrases

We need compositionality in our ML models

Recursion: the same operator (same parameters) is applied repeatedly on different components

Recurrent models: recursion along a temporal sequence

[Figure: a parse tree for "A small crowd quietly enters the historic church" (NP "A small crowd"; VP "quietly enters the historic church") mapped to semantic representations, alongside a recurrent chain with inputs x_{t-1}, x_t, x_{t+1} and hidden states z_{t-1}, z_t, z_{t+1}]

#6 Why now?

Despite prior investigation and understanding of many of the algorithmic techniques ...

Before 2006, training deep architectures was unsuccessful :(

What has changed?
• New methods for unsupervised pre-training (Restricted Boltzmann Machines = RBMs, autoencoders, contrastive estimation, etc.) and deep model training developed
• More efficient parameter estimation methods
• Better understanding of model regularization
• More data and more computational power


Deep Learning models have already achieved impressive results for HLT

Neural Language Model [Mikolov et al. Interspeech 2011]

  Model (WSJ ASR task)       | Eval WER
  KN5 Baseline               | 17.2
  Discriminative LM          | 16.9
  Recurrent NN combination   | 14.4

MSR MAVIS Speech System [Dahl et al. 2012; Seide et al. 2011; following Mohamed et al. 2011]

"The algorithms represent the first time a company has released a deep-neural-networks (DNN)-based speech-recognition algorithm in a commercial product."

  Acoustic model & training          | Recog         | WER: RT03S FSH | WER: Hub5 SWB
  GMM 40-mix, BMMI, SWB 309h         | 1-pass −adapt | 27.4           | 23.6
  DBN-DNN 7 layer x 2048, SWB 309h   | 1-pass −adapt | 18.5 (−33%)    | 16.1 (−32%)
  GMM 72-mix, BMMI, FSH 2000h        | k-pass +adapt | 18.6           | 17.1

Deep Learning Models Have Interesting Performance Characteristics

Deep learning models can now be very fast in some circumstances

• SENNA [Collobert et al. 2011] can do POS or NER faster than other SOTA taggers (16x to 122x), using 25x less memory
  • WSJ POS 97.29% acc; CoNLL NER 89.59% F1; CoNLL Chunking 94.32% F1

Changes in computing technology favor deep learning
• In NLP, speed has traditionally come from exploiting sparsity
• But with modern machines, branches and widely spaced memory accesses are costly
• Uniform parallel operations on dense vectors are faster

These trends are even stronger with multi-core CPUs and GPUs

TREE STRUCTURES WITH CONTINUOUS VECTORS


Compositionality: We need more than word vectors and bags! What of larger semantic units?

How can we know when larger units are similar in meaning?

• The snowboarder is leaping over the mogul
• A person on a snowboard jumps into the air

People interpret the meaning of larger text units – entities, descriptive terms, facts, arguments, stories – by semantic composition of smaller elements


Representing Phrases as Vectors

[Figure: a 2D vector space (x1, x2) with word points Monday (9, 2), Tuesday (9.5, 1.5), France (2, 2.5), Germany (1, 3), and the phrases "the country of my birth" and "the place where I was born" plotted close together]

Vectors for single words are useful as features but limited

Can we extend ideas of word vector spaces to phrases?

If the vector space captures syntactic and semantic information, the vectors can be used as features for parsing and interpretation

How should we map phrases into a vector space?

[Figure: the phrase "the country of my birth" parsed into a tree, with a learned vector at each word and each internal node, plotted in the same 2D space as the word vectors above, near "the place where I was born"]

Use the principle of compositionality!

The meaning (vector) of a sentence is determined by
(1) the meanings of its words and
(2) the rules that combine them.

The model jointly learns compositional vector representations and tree structure.

Tree Recursive Neural Networks (TreeRNNs)

Computational unit: a simple neural network layer, applied recursively

[Figure: the words of a phrase such as "on the mat." each start with a vector; the same neural network layer combines two child vectors into a parent vector and a score at each tree node]

(Goller & Küchler 1996, Costa et al. 2003, Socher et al. ICML 2011)

Version 1: Simple concatenation TreeRNN

• Only a single weight matrix = the composition function!
• No real interaction between the input words!
• Not adequate as a composition function for human language

  p = tanh(W [c1; c2] + b),   where tanh is applied elementwise
  score = V^T p
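A minimal sketch of this composition function (dimensions, initialization, and names are illustrative assumptions):

    import numpy as np

    d = 4                                           # word/phrase vector dimension
    rng = np.random.default_rng(0)
    W = rng.standard_normal((d, 2 * d)) * 0.1       # the single composition matrix
    b = np.zeros(d)
    v = rng.standard_normal(d) * 0.1                # scoring vector (V in the slide)

    def compose(c1, c2):
        p = np.tanh(W @ np.concatenate([c1, c2]) + b)   # p = tanh(W [c1; c2] + b)
        score = v @ p                                   # score = V^T p
        return p, score

    c1, c2 = rng.standard_normal(d), rng.standard_normal(d)
    p, score = compose(c1, c2)       # p can itself be composed further up the tree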

Version 4: Recursive Neural Tensor Network

Allows the two word or phrase vectors to interact multiplicatively

Beyond the bag of words: Sentiment detection

Is the tone of a piece of text positive, negative, or neutral?

• The sentiment is that sentiment is "easy"
• Detection accuracy for longer documents ~90%, BUT

  ... loved ... great ... impressed ... marvelous ...


Stanford Sentiment Treebank

• 215,154 phrases labeled in 11,855 sentences
• Can actually train and test compositions

http://nlp.stanford.edu:8080/sentiment/

Better Dataset Helped All Models

• Hard negation cases are still mostly incorrect
• We also need a more powerful model!

[Figure: bar chart of positive/negative accuracy (roughly 75-84%) for BiNB, RNN, and MV-RNN, comparing training with sentence labels vs. training with the Treebank]

Version 4: Recursive Neural Tensor Network

Idea: Allow both additive and mediated multiplicative interactions of vectors
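A minimal sketch of a tensor-based composition in this spirit (illustrative only; it follows the recursive neural tensor network of Socher et al. approximately, and dimensions and names are assumptions):

    import numpy as np

    d = 4
    rng = np.random.default_rng(0)
    W = rng.standard_normal((d, 2 * d)) * 0.1          # additive (plain TreeRNN) part
    V = rng.standard_normal((d, 2 * d, 2 * d)) * 0.1   # one 2d x 2d tensor slice per output dim
    b = np.zeros(d)

    def rntn_compose(c1, c2):
        c = np.concatenate([c1, c2])                       # [c1; c2]
        tensor_term = np.einsum('i,kij,j->k', c, V, c)     # c^T V[k] c for each slice k
        return np.tanh(tensor_term + W @ c + b)            # multiplicative + additive interactions

    p = rntn_compose(rng.standard_normal(d), rng.standard_normal(d))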

Recursive Neural Tensor Network

• Use resulting vectors in tree as input to a classifier like logistic regression
• Train all weights jointly with gradient descent
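A minimal sketch of the classifier step (the 5 sentiment classes follow the Sentiment Treebank setup; the softmax layer and names are illustrative assumptions):

    import numpy as np

    n_classes, d = 5, 4
    rng = np.random.default_rng(0)
    Ws = rng.standard_normal((n_classes, d)) * 0.1   # classifier weights, trained jointly

    def predict_sentiment(node_vector):
        logits = Ws @ node_vector
        exp = np.exp(logits - logits.max())          # softmax over sentiment classes
        return exp / exp.sum()

    probs = predict_sentiment(rng.standard_normal(d))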


Positive/Negative Results on Treebank

[Figure: bar chart of sentence-level positive/negative accuracy (roughly 74-86%) for BiNB, RNN, MV-RNN, and RNTN, comparing training with sentence labels vs. training with the Treebank]

Classifying sentences: accuracy improves to 85.4

Note: for more recent work, see Le & Mikolov (2014), Irsoy & Cardie (2014), Tai et al. (2015)

Experimental Results on Treebank

• RNTN can capture constructions like X but Y
• RNTN accuracy of 72%, compared to MV-RNN (65%), biword NB (58%), and RNN (54%)

Negation Results

When negating negatives, positive activation should increase!

Demo: http://nlp.stanford.edu:8080/sentiment/

Conclusion

Developing intelligent machines involves being able to recognize and exploit compositional structure

It also involves other things like top-down prediction, of course

Work is now under way on how to do more complex tasks than straight classification inside deep learning systems