Deep Learning: Restricted Boltzmann Machines & Deep Belief Nets
Based on slides by Geoffrey Hinton, Sue Becker, Yann LeCun, Yoshua Bengio, Frank Wood
Robot Image Credit: Viktoriya Sukhanova © 123RF.com
These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric.
Neural Networks

[Figure: a feedforward network with inputs, hidden layers, and outputs. Compare the outputs with the correct answer to get an error signal; back-propagate the error signal to get derivatives for learning.]
What is wrong with back-propagation?
• It requires labeled training data
  – Almost all data is unlabeled
• The learning time does not scale well
  – It is very slow in nets with multiple hidden layers
• It can get stuck in poor local optima
  – These are often quite good, but for deep nets they are far from optimal
Motivations
• Supervised training of deep models (e.g., many-layered neural nets) is difficult (an optimization problem)
• Shallow models (SVMs, one-hidden-layer neural nets, boosting, etc.) are unlikely candidates for learning the high-level abstractions needed for AI
• Unsupervised learning could do "local learning" (each module tries its best to model what it sees)
• Inference (+ learning) is intractable in directed graphical models with many hidden variables
• Current unsupervised learning methods don't easily extend to learning multiple levels of representation
Belief Nets
• A belief net is a directed acyclic graph composed of stochastic variables.
• We can observe some of the variables, and we would like to solve two problems:
  – The inference problem: infer the states of the unobserved variables.
  – The learning problem: adjust the interactions between variables to make the network more likely to generate the observed data.

[Figure: stochastic hidden causes with directed connections down to visible effects.]

We use nets composed of layers of stochastic binary variables with weighted connections. Later, we will generalize to other types of variable.
Explaining away (Judea Pearl)
• Even if two hidden causes are independent, they can become dependent when we observe an effect that they can both influence.
  – If we learn that there was an earthquake, it reduces the probability that the house jumped because of a truck.

[Figure: two independent causes, "truck hits house" (T) and "earthquake" (E), both pointing to the effect "house jumps" (J), with $P(T) = e^{-10}$, $P(E) = e^{-10}$, $P(J \mid T) = 0.9$, $P(J \mid E) = 0.9$, and $P(J) \approx e^{-20}$ when neither cause is present.]
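To make the effect concrete, here is a minimal Python sketch that computes the posterior over the two causes by brute-force enumeration. The priors and the 0.9 conditionals come from the figure above; the noisy-OR combination used for $P(J \mid T, E)$ is an assumption added for illustration.

```python
import itertools
import math

p_T = math.exp(-10)   # prior that a truck hits the house
p_E = math.exp(-10)   # prior that an earthquake occurs

def p_J_given(t, e):
    """Noisy-OR (an assumption): each present cause independently makes the
    house jump with probability 0.9; with no cause, P(jump) ~ e^-20."""
    p_not_jump = (0.1 if t else 1.0) * (0.1 if e else 1.0) * (1 - math.exp(-20))
    return 1 - p_not_jump

# Joint P(T=t, E=e, J=1) for all four cause configurations
joint = {}
for t, e in itertools.product([0, 1], repeat=2):
    p = (p_T if t else 1 - p_T) * (p_E if e else 1 - p_E) * p_J_given(t, e)
    joint[(t, e)] = p
z = sum(joint.values())

p_T_given_J = (joint[(1, 0)] + joint[(1, 1)]) / z
p_T_given_J_E = joint[(1, 1)] / (joint[(0, 1)] + joint[(1, 1)])
print(f"P(T=1 | J=1)      = {p_T_given_J:.3f}")    # truck is a plausible cause
print(f"P(T=1 | J=1, E=1) = {p_T_given_J_E:.3e}")  # earthquake explains it away
```

Observing only the jump leaves the truck about 50% likely; also observing the earthquake drives the truck's posterior down by many orders of magnitude, even though the two causes are independent a priori.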
Why multilayer learning is hard in a sigmoid belief net
• To learn Θ, we need the posterior distribution in the first hidden layer.
• Problem 1: The posterior is typically intractable because of "explaining away".
• Problem 2: The posterior depends on the prior created by higher layers as well as on the likelihood.
  – So to learn Θ, we need to know the weights in higher layers, even if we are only approximating the posterior. All the weights interact.
• Problem 3: We need to integrate over all possible configurations of the higher variables to get the prior for the first hidden layer. Yuk!

[Figure: a stack of hidden-variable layers above the data; the weights Θ between the data and the first hidden layer define the likelihood, and the layers above define the prior.]
Stochastic binary neurons
Each neuron has a state of 1 or 0, which is a stochastic function of the neuron's bias $b_i$ and the input states $s_j$ it receives from other neurons:

$$p(a_i = 1) = \frac{1}{1 + \exp\left(-b_i - \sum_j s_j \theta_{ji}\right)}$$

[Figure: the logistic function of the total input $b_i + \sum_j s_j \theta_{ji}$, rising from 0 through 0.5 to 1.]
Stochastic units
Replace the binary threshold units by binary stochastic units that make biased random decisions:

$$P(a_i = 1) = \frac{1}{1 + \exp\left(-\sum_j s_j \theta_{ji} / T\right)} = \frac{1}{1 + \exp(-\Delta E_i / T)}$$

where the energy gap is $\Delta E_i = E(a_i = 0) - E(a_i = 1)$.
– The temperature $T$ controls the amount of noise.
– Decreasing all the energy gaps between configurations is equivalent to raising the noise level.
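A minimal numpy sketch of this sampling rule; the function name and example sizes are illustrative, and the bias is folded into the total input.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_stochastic_units(s, theta, b, T=1.0):
    """Sample binary states: a_i = 1 with probability
    sigmoid((b_i + sum_j s_j * theta_ji) / T)."""
    total_input = b + s @ theta                    # b_i + sum_j s_j theta_ji
    p_on = 1.0 / (1.0 + np.exp(-total_input / T))  # logistic of scaled input
    return (rng.random(p_on.shape) < p_on).astype(np.float64)

# Example: 4 input units driving 3 stochastic units
s = np.array([1.0, 0.0, 1.0, 1.0])
theta = rng.normal(scale=0.5, size=(4, 3))
b = np.zeros(3)
print(sample_stochastic_units(s, theta, b, T=1.0))
```

Raising T flattens p_on toward 0.5 (more noise); lowering T toward 0 recovers a deterministic threshold unit.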
Restricted Boltzmann Machines
• Restrict the connectivity to make learning easier
  – Only one layer of hidden units (we deal with more layers later)
  – No connections between hidden units
• In an RBM, the hidden units are conditionally independent given the visible states
  – So we can quickly get an unbiased sample from the posterior distribution when given a data vector
  – This is a big advantage over directed belief nets

[Figure: a bipartite graph with a layer of hidden units j connected to a layer of visible units i.]
The energy of a joint configuration (ignoring bias terms)

$$E(v, h) = -\sum_{i,j} v_i h_j \theta_{ij}$$

where $v_i$ is the binary state of visible unit $i$, $h_j$ is the binary state of hidden unit $j$, and $\theta_{ij}$ is the weight between units $i$ and $j$. The energy has a very simple derivative:

$$-\frac{\partial E(v, h)}{\partial \theta_{ij}} = v_i h_j$$
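As a quick numeric illustration of the two formulas above (names and sizes are illustrative):

```python
import numpy as np

# Energy of a joint configuration (biases ignored, as on the slide):
# E(v, h) = -sum_ij v_i h_j theta_ij = -v^T Theta h
def energy(v, h, theta):
    return -v @ theta @ h

rng = np.random.default_rng(1)
theta = rng.normal(scale=0.1, size=(4, 3))  # 4 visible, 3 hidden units
v = np.array([1.0, 0.0, 1.0, 0.0])
h = np.array([0.0, 1.0, 1.0])

# -dE/dtheta_ij = v_i h_j: the whole gradient is just an outer product
grad = np.outer(v, h)
print(energy(v, h, theta))
print(grad)
```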
Weights → Energies → Probabilities
• Each possible joint configuration of the visible and hidden units has an energy
  – The energy is determined by the weights and biases
• The energy of a joint configuration of the visible and hidden units determines its probability:

$$P(v, h) \propto e^{-E(v, h)}$$

• The probability of a configuration over the visible units is found by summing the probabilities of all the joint configurations that contain it.
Using energies to define probabilities
• Probability of a joint configuration over both visible and hidden units:

$$P(v, h) = \frac{e^{-E(v, h)}}{\sum_{u, g} e^{-E(u, g)}}$$

• Probability of a particular configuration of the visible units:

$$P(v) = \frac{\sum_h e^{-E(v, h)}}{\sum_{u, g} e^{-E(u, g)}}$$
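For a toy RBM, these probabilities can be computed exactly by enumerating every configuration. This is only feasible for a handful of units, but it makes the definitions concrete. A sketch, assuming the bias-free energy from the previous slide:

```python
import itertools
import numpy as np

def all_states(n):
    """All 2^n binary state vectors of length n."""
    return [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n)]

def joint_prob_table(theta):
    """Exact P(v, h) for every configuration of a tiny RBM."""
    nv, nh = theta.shape
    e = {(tuple(v), tuple(h)): np.exp(v @ theta @ h)  # e^{-E}, E = -v^T Theta h
         for v in all_states(nv) for h in all_states(nh)}
    Z = sum(e.values())                                # the partition function
    return {k: val / Z for k, val in e.items()}

rng = np.random.default_rng(2)
theta = rng.normal(scale=0.5, size=(3, 2))
P = joint_prob_table(theta)

# P(v) marginalizes the joint over all hidden configurations
v = (1, 0, 1)
p_v = sum(p for (vv, hh), p in P.items() if vv == v)
print(f"P(v={v}) = {p_v:.4f}")
```

The denominator Z is exactly the $\sum_{u,g} e^{-E(u,g)}$ in the formulas above; its exponential cost is why exact learning is intractable for realistic sizes.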
A picture of the Boltzmann machine learning algorithm for an RBM

Start with a training vector on the visible units. Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.

[Figure: the alternating Markov chain at t = 0, 1, 2, ..., ∞ over the visible and hidden layers; the correlation $\langle v_i h_j \rangle^0$ is measured at t = 0 with the data clamped, and $\langle v_i h_j \rangle^\infty$ at t = ∞, when the chain produces a "fantasy".]

$$\frac{\partial \log P(v)}{\partial \theta_{ij}} = \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty$$
A very surprising fact
• Everything that one weight needs to know about the other weights and the data in order to do maximum likelihood learning is contained in the difference of two correlations:

$$\frac{\partial \log P(v)}{\partial \theta_{ij}} = \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty$$

Here the left-hand side is the derivative of the log probability of one training vector, $\langle v_i h_j \rangle^0$ is the expected value of the product of states at thermal equilibrium when the training vector is clamped on the visible units, and $\langle v_i h_j \rangle^\infty$ is the expected value when nothing is clamped.
A picture of the Boltzmann machine learning algorithm for an RBM

[Figure: the same alternating Gibbs chain from t = 0 to t = ∞, with $\langle v_i h_j \rangle^0$ measured at t = 0 and $\langle v_i h_j \rangle^\infty$ at equilibrium.]

Problem: this Markov chain may take a very long time to converge!
Solution: Contrastive Divergence
Contrastive Divergence Learning: A quick way to learn an RBM

Start with a training vector on the visible units. Then:
1. Update all the hidden units in parallel.
2. Update all the visible units in parallel to get a "reconstruction".
3. Update the hidden units again.

[Figure: the chain truncated after one step, from the data at t = 0 to the reconstruction at t = 1, measuring $\langle v_i h_j \rangle^0$ and $\langle v_i h_j \rangle^1$.]

$$\Delta \theta_{ij} = \epsilon \left( \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1 \right)$$

This is not following the gradient of the log likelihood. But it works well. It is approximately following the gradient of another objective function (Carreira-Perpiñán & Hinton, 2005).
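A minimal numpy sketch of one CD-1 update for a binary RBM, following the steps above. Variable names are illustrative, and (as is common in practice) the hidden probabilities rather than sampled states are used in the correlations.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    return (rng.random(p.shape) < p).astype(np.float64)

def cd1_update(v0, theta, b_vis, b_hid, lr=0.1):
    """One contrastive divergence (CD-1) step for a binary RBM."""
    # Positive phase: sample the hidden units given the data
    p_h0 = sigmoid(b_hid + v0 @ theta)
    h0 = sample(p_h0)
    # Negative phase: reconstruct the visibles, then re-infer the hiddens
    v1 = sample(sigmoid(b_vis + h0 @ theta.T))
    p_h1 = sigmoid(b_hid + v1 @ theta)
    # Delta theta_ij = eps * (<v_i h_j>^0 - <v_i h_j>^1)
    theta += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b_vis += lr * (v0 - v1)
    b_hid += lr * (p_h0 - p_h1)
    return theta, b_vis, b_hid

# Example: 6 visible units, 4 hidden units, one training vector
theta = rng.normal(scale=0.01, size=(6, 4))
b_vis, b_hid = np.zeros(6), np.zeros(4)
v0 = np.array([1, 1, 0, 0, 1, 0], dtype=float)
theta, b_vis, b_hid = cd1_update(v0, theta, b_vis, b_hid)
```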
How to learn a set of features that are good for reconstructing images of the digit 2

[Figure: 50 binary feature neurons above a 16×16 pixel image. On the data (reality), increment the weights between an active pixel and an active feature; on the reconstruction (better than reality), decrement the weights between an active pixel and an active feature. Each neuron grabs a different feature.]
The final 50 × 256 weights

[Figure: the learned weights, one 16×16 weight image per feature neuron.]
How well can we reconstruct the digit images from the binary feature activations?

[Figure: pairs of data and reconstructions from the activated binary features, for new test images from the digit class that the model was trained on, and for images from an unfamiliar digit class (the network tries to see every image as a 2).]
Using an RBM to learn a model of a digit class

[Figure: an RBM with 256 visible units (pixels) and 100 hidden units (features), trained with CD-1 ($\langle v_i h_j \rangle^0$ on the data, $\langle v_i h_j \rangle^1$ on the reconstruction). Shown: reconstructions of data by a model trained on 2s and by a model trained on 3s.]
Training a Deep Belief Network (the main reason RBMs are interesting)
• First train a layer of features that receive input directly from the pixels.
• Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer. (A compact sketch of this procedure follows below.)
• It can be proved that each time we add another layer of features, we improve a variational lower bound on the log probability of the training data.
  – The proof is slightly complicated.
  – But it is based on a neat equivalence between an RBM and a deep directed model.
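A compact sketch of the greedy layer-wise procedure, with the CD-1 update inlined from the earlier sketch; train_rbm, train_dbn, and the placeholder data are illustrative, not the original implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.1):
    """Train one binary RBM with CD-1 and return its weights and hidden biases."""
    n_visible = data.shape[1]
    theta = rng.normal(scale=0.01, size=(n_visible, n_hidden))
    b_vis, b_hid = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        for v0 in data:
            p_h0 = sigmoid(b_hid + v0 @ theta)
            h0 = (rng.random(n_hidden) < p_h0).astype(float)
            v1 = (rng.random(n_visible) < sigmoid(b_vis + h0 @ theta.T)).astype(float)
            p_h1 = sigmoid(b_hid + v1 @ theta)
            theta += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
            b_vis += lr * (v0 - v1)
            b_hid += lr * (p_h0 - p_h1)
    return theta, b_hid

def train_dbn(data, layer_sizes):
    """Greedy layer-wise training: each RBM's hidden activations
    become the 'pixels' for the next RBM."""
    layers, x = [], data
    for n_hidden in layer_sizes:
        theta, b_hid = train_rbm(x, n_hidden)
        layers.append((theta, b_hid))
        x = sigmoid(b_hid + x @ theta)   # up-pass feeds the next layer
    return layers

# Example (placeholder random data; real use would pass binarized images)
data = (rng.random((50, 784)) < 0.1).astype(float)
dbn = train_dbn(data, [500, 500, 2000])
```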
The Generative Model After Learning 3 Layers

To generate data:
1. Get an equilibrium sample from the top-level RBM by performing alternating Gibbs sampling for a long time.
2. Perform a top-down pass to get states for all the other layers.

So the lower-level bottom-up connections are not part of the generative model. They are just used for inference.

[Figure: data, h1, h2, h3 connected by weights Θ1, Θ2, Θ3; the top two layers (h2, h3) keep undirected RBM connections, while the lower connections are directed downward.]
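A sketch of this two-step generation procedure, assuming layers is a list of (weights, hidden-bias) pairs like the one produced by the hypothetical train_dbn above; visible and penultimate biases are assumed zero for brevity.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate(layers, n_gibbs=1000):
    """Step 1: alternating Gibbs sampling in the top-level RBM.
    Step 2: a top-down pass through the lower, directed layers."""
    theta_top, b_hid_top = layers[-1]
    pen = (rng.random(theta_top.shape[0]) < 0.5).astype(float)  # penultimate layer
    for _ in range(n_gibbs):
        h = (rng.random(b_hid_top.shape)
             < sigmoid(b_hid_top + pen @ theta_top)).astype(float)
        pen = (rng.random(pen.shape) < sigmoid(h @ theta_top.T)).astype(float)
    # Top-down pass: the generative connections of the lower layers are
    # the transposes of the bottom-up weights
    x = pen
    for theta, _ in reversed(layers[:-1]):
        x = sigmoid(x @ theta.T)
    return x   # pixel probabilities

# Example with tiny random stand-in weights
layers = [(rng.normal(size=(16, 8)), np.zeros(8)),
          (rng.normal(size=(8, 4)), np.zeros(4))]
print(generate(layers, n_gibbs=100))
```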
Why does greedy learning work?
• Each RBM converts its data distribution into an aggregated posterior distribution over its hidden units.
• This divides the task of modeling its data into two tasks:
  – Task 1: Learn generative weights that can convert the aggregated posterior distribution over the hidden units back into the data distribution.
  – Task 2: Learn to model the aggregated posterior distribution over the hidden units.
  – The RBM does a good job of Task 1 and a moderately good job of Task 2.
• Task 2 is easier (for the next RBM) than modeling the original data, because the aggregated posterior distribution is closer to a distribution that an RBM can model perfectly.

[Figure: Task 2 is modeling the aggregated posterior distribution on the hidden units, $P(h \mid \Theta)$; Task 1 is mapping it back to the data distribution on the visible units, $P(v \mid h, \Theta)$.]
Why does greedy learning work?
• The weights Θ in the bottom-level RBM define $P(v \mid h)$, and they also, indirectly, define $P(h)$.
• So we can express the RBM model as

$$P(v) = \sum_h P(v \mid h, \Theta) \, P(h \mid \Theta)$$

• If we leave $P(v \mid h, \Theta)$ alone and improve $P(h \mid \Theta)$, we will improve $P(v)$.
• To improve $P(h)$, we need it to be a better model of the aggregated posterior distribution over hidden vectors produced by applying Θ to the data.
  – This is accomplished by the next higher layer.
Why greedy learning works
• Each time we learn a new layer, the inference at the layer below becomes incorrect, but the variational bound on the log prob of the data improves (only true in theory).
• Since the bound starts as an equality, learning a new layer never decreases the log prob of the data, provided we start the learning from the tied weights that implement the complementary prior.
• Now that we have a guarantee, we can loosen the restrictions and still feel confident:
  – Allow layers to vary in size.
  – Do not start the learning at each layer from the weights in the layer below.
A neural network model of digit recognition

[Figure: 28×28 pixel image → 500 units → 500 units → 2000 top-level units, with 10 label units attached to the top level.]

The model learns a joint density for labels and images. To perform recognition, we can start with a neutral state of the label units and do one or two iterations of the top-level RBM. Or we can just compute the free energy of the RBM with each of the 10 labels.

The top two layers form a restricted Boltzmann machine whose free-energy landscape models the low-dimensional manifolds of the digits. The valleys have names (one per digit class).
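The free-energy route can be sketched as follows. For a binary RBM, $F(v) = -v \cdot b_{vis} - \sum_j \log(1 + e^{x_j})$ with $x_j = b_j + \sum_i v_i \theta_{ij}$ is the standard closed form; attaching the label units directly to the visible vector here is a simplification of the slide's architecture, and all names and sizes are illustrative.

```python
import numpy as np

def free_energy(v, theta, b_vis, b_hid):
    """F(v) = -v.b_vis - sum_j log(1 + exp(b_j + sum_i v_i theta_ij));
    P(v) is proportional to exp(-F(v)), so lower F means more probable."""
    x = b_hid + v @ theta
    return -(v @ b_vis) - np.sum(np.logaddexp(0.0, x))

def classify(image, theta, b_vis, b_hid, n_labels=10):
    """Try each one-hot label with the image and pick the lowest free energy."""
    scores = []
    for k in range(n_labels):
        label = np.zeros(n_labels)
        label[k] = 1.0
        v = np.concatenate([image, label])   # visibles = pixels + label units
        scores.append(free_energy(v, theta, b_vis, b_hid))
    return int(np.argmin(scores))

# Example with random stand-in weights: 784 pixels + 10 label units visible
rng = np.random.default_rng(5)
theta = rng.normal(scale=0.01, size=(794, 100))
b_vis, b_hid = np.zeros(794), np.zeros(100)
image = (rng.random(784) < 0.1).astype(float)
print(classify(image, theta, b_vis, b_hid))
```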
Movie of the network generating digits (available at www.cs.toronto/~hinton)
Fine-tuning with a contrastive version of the "wake-sleep" algorithm
After learning many layers of features, we can fine-tune the features to improve generation.
1. Do a stochastic bottom-up pass.
   – Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.
2. Do a few iterations of sampling in the top-level RBM.
   – Adjust the weights in the top-level RBM.
3. Do a stochastic top-down pass.
   – Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.
Not required! But it helps the recognition rate.
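A loose sketch of the wake and sleep updates for a single pair of adjacent layers, using the standard wake-sleep delta rule. It omits the top-level RBM step and simplifies the top-down pass, so treat it as an illustration of the idea rather than the fine-tuning algorithm itself; rec_w, gen_w, and the helpers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    return (rng.random(p.shape) < p).astype(np.float64)

def wake_sleep_pair(v, rec_w, gen_w, lr=0.01):
    """rec_w: bottom-up (recognition) weights; gen_w: top-down (generative)."""
    # Wake phase: stochastic bottom-up pass, then train the generative weights
    # to reconstruct the layer below from the sampled states above.
    h = sample(sigmoid(v @ rec_w))
    v_pred = sigmoid(h @ gen_w)
    gen_w += lr * np.outer(h, v - v_pred)
    # Sleep phase: stochastic top-down pass (here driven by the same h), then
    # train the recognition weights to recover the states above.
    v_dream = sample(sigmoid(h @ gen_w))
    h_pred = sigmoid(v_dream @ rec_w)
    rec_w += lr * np.outer(v_dream, h - h_pred)
    return rec_w, gen_w

# Example with tiny random stand-in weights
v = (rng.random(16) < 0.2).astype(float)
rec_w = rng.normal(scale=0.1, size=(16, 8))
gen_w = rng.normal(scale=0.1, size=(8, 16))
rec_w, gen_w = wake_sleep_pair(v, rec_w, gen_w)
```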
Limits of the Generative Model
1. Designed for images where non-binary values can be treated as probabilities.
2. Top-down feedback only in the highest (associative) layer.
3. No systematic way to deal with invariance.
4. Assumes segmentation has already been performed and does not learn to attend to the most informative parts of objects.
Deep Net Activation Functions
Other Deep Architectures: Convolutional Neural Network
[Image credit: http://timdettmers.com/2015/03/26/convolution-deep-learning/]
Other Deep Architectures: Convolutional Neural Network
[Image credit: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/]
[Image credit: http://rnd.azoft.com/wp-content/uploads_rnd/2016/11/overall-1024x256.png]
Other Deep Architectures: Long Short-Term Memory (LSTM)
[Image credit: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
Deep Learning in the Headlines

Deep Belief Net on Face Images
Based on materials by Andrew Ng

[Figure: the learned feature hierarchy, from pixels to edges to object parts (combinations of edges) to object models.]
Learning of Object Parts
Examples of learned object parts from object categories: faces, cars, elephants, chairs.
Slide credit: Andrew Ng
Training on Multiple Objects
Trained on 4 classes (cars, faces, motorbikes, airplanes). Second layer: shared features and object-specific features. Third layer: more specific features.
Slide credit: Andrew Ng
Scene Labeling via Deep Learning
[Farabet et al., ICML 2012; PAMI 2013]
Inference from Deep Learned Models
Generating posterior samples from faces by "filling in" experiments (cf. Lee and Mumford, 2003). Combine bottom-up and top-down inference.

[Figure rows: input images; samples from feedforward inference (control); samples from full posterior inference.]

Slide credit: Andrew Ng
Machine Learning in Automatic Speech Recognition
A typical speech recognition system: ML is used to predict phone states from the sound spectrogram.

Deep learning has state-of-the-art results:

# Hidden Layers      1     2     4     8     10    12
Word Error Rate %    16.0  12.8  11.4  10.9  11.0  11.1

Baseline GMM performance = 15.4% [Zeiler et al., "On rectified linear units for speech recognition", ICASSP 2013]
Impact of Deep Learning in Speech Technology
Slide credit: Li Deng, MS Research