Deep Learning: Restricted Boltzmann Machines & Deep Belief Nets
Based on slides by Geoffrey Hinton, Sue Becker, Yann LeCun, Yoshua Bengio, Frank Wood
Robot Image Credit: Viktoriya Sukhanova © 123RF.com
These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric.
Neural Networks

[Figure: a feedforward network with inputs, hidden layers, and outputs. Compare the outputs with the correct answer to get an error signal; back-propagate the error signal to get derivatives for learning.]
What is wrong with back-propagation?
• It requires labeled training data
  – Almost all data is unlabeled
• The learning time does not scale well
  – It is very slow in nets with multiple hidden layers
• It can get stuck in poor local optima
  – These are often quite good, but for deep nets they are far from optimal
Motivations
• Supervised training of deep models (e.g., many-layered neural nets) is difficult (an optimization problem)
• Shallow models (SVMs, one-hidden-layer neural nets, boosting, etc.) are unlikely candidates for learning the high-level abstractions needed for AI
• Unsupervised learning could do "local learning" (each module tries its best to model what it sees)
• Inference (+ learning) is intractable in directed graphical models with many hidden variables
• Current unsupervised learning methods don't easily extend to learning multiple levels of representation
Belief Nets
• A belief net is a directed acyclic graph composed of stochastic variables.
• We can observe some of the variables, and we would like to solve two problems:
  – The inference problem: infer the states of the unobserved variables.
  – The learning problem: adjust the interactions between variables to make the network more likely to generate the observed data.

[Figure: stochastic hidden causes with directed connections down to visible effects.]

We use nets composed of layers of stochastic binary variables with weighted connections. Later, we will generalize to other types of variable.
Explaining away (Judea Pearl)
• Even if two hidden causes are independent, they can become dependent when we observe an effect that they can both influence.
  – If we learn that there was an earthquake, it reduces the probability that the house jumped because of a truck.

[Figure: two independent causes, "truck hits house" (T) and "earthquake" (E), both pointing to the effect "house jumps" (J), with $P(T) = e^{-10}$, $P(E) = e^{-10}$, $P(J \mid T) = 0.9$, $P(J \mid E) = 0.9$, and $P(J) \approx e^{-20}$ when neither cause is present.]
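To make the effect concrete, here is a minimal Python sketch that computes the posterior over the two causes by brute-force enumeration. The priors and the 0.9 conditionals come from the figure above; the noisy-OR combination used for $P(J \mid T, E)$ is an assumption added for illustration.

```python
import itertools
import math

p_T = math.exp(-10)   # prior that a truck hits the house
p_E = math.exp(-10)   # prior that an earthquake occurs

def p_J_given(t, e):
    """Noisy-OR (an assumption): each present cause independently makes the
    house jump with probability 0.9; with no cause, P(jump) ~ e^-20."""
    p_not_jump = (0.1 if t else 1.0) * (0.1 if e else 1.0) * (1 - math.exp(-20))
    return 1 - p_not_jump

# Joint P(T=t, E=e, J=1) for all four cause configurations
joint = {}
for t, e in itertools.product([0, 1], repeat=2):
    p = (p_T if t else 1 - p_T) * (p_E if e else 1 - p_E) * p_J_given(t, e)
    joint[(t, e)] = p
z = sum(joint.values())

p_T_given_J = (joint[(1, 0)] + joint[(1, 1)]) / z
p_T_given_J_E = joint[(1, 1)] / (joint[(0, 1)] + joint[(1, 1)])
print(f"P(T=1 | J=1)      = {p_T_given_J:.3f}")    # truck is a plausible cause
print(f"P(T=1 | J=1, E=1) = {p_T_given_J_E:.3e}")  # earthquake explains it away
```

Observing only the jump leaves the truck about 50% likely; also observing the earthquake drives the truck's posterior down by many orders of magnitude, even though the two causes are independent a priori.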
Why multilayer learning is hard in a sigmoid belief net
• To learn Θ, we need the posterior distribution in the first hidden layer.
• Problem 1: The posterior is typically intractable because of "explaining away".
• Problem 2: The posterior depends on the prior created by higher layers as well as on the likelihood.
  – So to learn Θ, we need to know the weights in higher layers, even if we are only approximating the posterior. All the weights interact.
• Problem 3: We need to integrate over all possible configurations of the higher variables to get the prior for the first hidden layer. Yuk!

[Figure: a stack of hidden-variable layers above the data; the weights Θ between the data and the first hidden layer define the likelihood, and the layers above define the prior.]
Stochastic binary neurons
Each neuron has a state of 1 or 0, which is a stochastic function of the neuron's bias $b_i$ and the input states $s_j$ it receives from other neurons:

$$p(a_i = 1) = \frac{1}{1 + \exp\left(-b_i - \sum_j s_j \theta_{ji}\right)}$$

[Figure: the logistic function of the total input $b_i + \sum_j s_j \theta_{ji}$, rising from 0 through 0.5 to 1.]
Stochastic units
Replace the binary threshold units by binary stochastic units that make biased random decisions:

$$P(a_i = 1) = \frac{1}{1 + \exp\left(-\sum_j s_j \theta_{ji} / T\right)} = \frac{1}{1 + \exp(-\Delta E_i / T)}$$

where the energy gap is $\Delta E_i = E(a_i = 0) - E(a_i = 1)$.
– The temperature $T$ controls the amount of noise.
– Decreasing all the energy gaps between configurations is equivalent to raising the noise level.
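A minimal numpy sketch of this sampling rule; the function name and example sizes are illustrative, and the bias is folded into the total input.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_stochastic_units(s, theta, b, T=1.0):
    """Sample binary states: a_i = 1 with probability
    sigmoid((b_i + sum_j s_j * theta_ji) / T)."""
    total_input = b + s @ theta                    # b_i + sum_j s_j theta_ji
    p_on = 1.0 / (1.0 + np.exp(-total_input / T))  # logistic of scaled input
    return (rng.random(p_on.shape) < p_on).astype(np.float64)

# Example: 4 input units driving 3 stochastic units
s = np.array([1.0, 0.0, 1.0, 1.0])
theta = rng.normal(scale=0.5, size=(4, 3))
b = np.zeros(3)
print(sample_stochastic_units(s, theta, b, T=1.0))
```

Raising T flattens p_on toward 0.5 (more noise); lowering T toward 0 recovers a deterministic threshold unit.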
Restricted Boltzmann Machines
• Restrict the connectivity to make learning easier
  – Only one layer of hidden units (we deal with more layers later)
  – No connections between hidden units
• In an RBM, the hidden units are conditionally independent given the visible states
  – So we can quickly get an unbiased sample from the posterior distribution when given a data vector
  – This is a big advantage over directed belief nets

[Figure: a bipartite graph with a layer of hidden units j connected to a layer of visible units i.]
The energy of a joint configuration (ignoring bias terms)

$$E(v, h) = -\sum_{i,j} v_i h_j \theta_{ij}$$

where $v_i$ is the binary state of visible unit $i$, $h_j$ is the binary state of hidden unit $j$, and $\theta_{ij}$ is the weight between units $i$ and $j$. The energy has a very simple derivative:

$$-\frac{\partial E(v, h)}{\partial \theta_{ij}} = v_i h_j$$
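As a quick numeric illustration of the two formulas above (names and sizes are illustrative):

```python
import numpy as np

# Energy of a joint configuration (biases ignored, as on the slide):
# E(v, h) = -sum_ij v_i h_j theta_ij = -v^T Theta h
def energy(v, h, theta):
    return -v @ theta @ h

rng = np.random.default_rng(1)
theta = rng.normal(scale=0.1, size=(4, 3))  # 4 visible, 3 hidden units
v = np.array([1.0, 0.0, 1.0, 0.0])
h = np.array([0.0, 1.0, 1.0])

# -dE/dtheta_ij = v_i h_j: the whole gradient is just an outer product
grad = np.outer(v, h)
print(energy(v, h, theta))
print(grad)
```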
Weights → Energies → Probabilities
• Each possible joint configuration of the visible and hidden units has an energy
  – The energy is determined by the weights and biases
• The energy of a joint configuration of the visible and hidden units determines its probability:

$$P(v, h) \propto e^{-E(v, h)}$$

• The probability of a configuration over the visible units is found by summing the probabilities of all the joint configurations that contain it.
Using energies to define probabilities
• Probability of a joint configuration over both visible and hidden units:

$$P(v, h) = \frac{e^{-E(v, h)}}{\sum_{u, g} e^{-E(u, g)}}$$

• Probability of a particular configuration of the visible units:

$$P(v) = \frac{\sum_h e^{-E(v, h)}}{\sum_{u, g} e^{-E(u, g)}}$$
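For a toy RBM, these probabilities can be computed exactly by enumerating every configuration. This is only feasible for a handful of units, but it makes the definitions concrete. A sketch, assuming the bias-free energy from the previous slide:

```python
import itertools
import numpy as np

def all_states(n):
    """All 2^n binary state vectors of length n."""
    return [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n)]

def joint_prob_table(theta):
    """Exact P(v, h) for every configuration of a tiny RBM."""
    nv, nh = theta.shape
    e = {(tuple(v), tuple(h)): np.exp(v @ theta @ h)  # e^{-E}, E = -v^T Theta h
         for v in all_states(nv) for h in all_states(nh)}
    Z = sum(e.values())                                # the partition function
    return {k: val / Z for k, val in e.items()}

rng = np.random.default_rng(2)
theta = rng.normal(scale=0.5, size=(3, 2))
P = joint_prob_table(theta)

# P(v) marginalizes the joint over all hidden configurations
v = (1, 0, 1)
p_v = sum(p for (vv, hh), p in P.items() if vv == v)
print(f"P(v={v}) = {p_v:.4f}")
```

The denominator Z is exactly the $\sum_{u,g} e^{-E(u,g)}$ in the formulas above; its exponential cost is why exact learning is intractable for realistic sizes.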
A picture of the Boltzmann machine learning algorithm for an RBM

Start with a training vector on the visible units. Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.

[Figure: the alternating Markov chain at t = 0, 1, 2, ..., ∞ over the visible and hidden layers; the correlation $\langle v_i h_j \rangle^0$ is measured at t = 0 with the data clamped, and $\langle v_i h_j \rangle^\infty$ at t = ∞, when the chain produces a "fantasy".]

$$\frac{\partial \log P(v)}{\partial \theta_{ij}} = \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty$$
A very surprising fact
• Everything that one weight needs to know about the other weights and the data in order to do maximum likelihood learning is contained in the difference of two correlations:

$$\frac{\partial \log P(v)}{\partial \theta_{ij}} = \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty$$

Here the left-hand side is the derivative of the log probability of one training vector, $\langle v_i h_j \rangle^0$ is the expected value of the product of states at thermal equilibrium when the training vector is clamped on the visible units, and $\langle v_i h_j \rangle^\infty$ is the expected value when nothing is clamped.
A picture of the Boltzmann machine learning algorithm for an RBM

[Figure: the same alternating Gibbs chain from t = 0 to t = ∞, with $\langle v_i h_j \rangle^0$ measured at t = 0 and $\langle v_i h_j \rangle^\infty$ at equilibrium.]

Problem: this Markov chain may take a very long time to converge!
Solution: Contrastive Divergence
Contrastive Divergence Learning: A quick way to learn an RBM

Start with a training vector on the visible units. Then:
1. Update all the hidden units in parallel.
2. Update all the visible units in parallel to get a "reconstruction".
3. Update the hidden units again.

[Figure: the chain truncated after one step, from the data at t = 0 to the reconstruction at t = 1, measuring $\langle v_i h_j \rangle^0$ and $\langle v_i h_j \rangle^1$.]

$$\Delta \theta_{ij} = \epsilon \left( \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1 \right)$$

This is not following the gradient of the log likelihood. But it works well. It is approximately following the gradient of another objective function (Carreira-Perpiñán & Hinton, 2005).
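A minimal numpy sketch of one CD-1 update for a binary RBM, following the steps above. Variable names are illustrative, and (as is common in practice) the hidden probabilities rather than sampled states are used in the correlations.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    return (rng.random(p.shape) < p).astype(np.float64)

def cd1_update(v0, theta, b_vis, b_hid, lr=0.1):
    """One contrastive divergence (CD-1) step for a binary RBM."""
    # Positive phase: sample the hidden units given the data
    p_h0 = sigmoid(b_hid + v0 @ theta)
    h0 = sample(p_h0)
    # Negative phase: reconstruct the visibles, then re-infer the hiddens
    v1 = sample(sigmoid(b_vis + h0 @ theta.T))
    p_h1 = sigmoid(b_hid + v1 @ theta)
    # Delta theta_ij = eps * (<v_i h_j>^0 - <v_i h_j>^1)
    theta += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b_vis += lr * (v0 - v1)
    b_hid += lr * (p_h0 - p_h1)
    return theta, b_vis, b_hid

# Example: 6 visible units, 4 hidden units, one training vector
theta = rng.normal(scale=0.01, size=(6, 4))
b_vis, b_hid = np.zeros(6), np.zeros(4)
v0 = np.array([1, 1, 0, 0, 1, 0], dtype=float)
theta, b_vis, b_hid = cd1_update(v0, theta, b_vis, b_hid)
```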
How to learn a set of features that are good for reconstructing images of the digit 2

[Figure: 50 binary feature neurons above a 16×16 pixel image. On the data (reality), increment the weights between an active pixel and an active feature; on the reconstruction (better than reality), decrement the weights between an active pixel and an active feature. Each neuron grabs a different feature.]
The final 50 × 256 weights

[Figure: the learned weights, one 16×16 weight image per feature neuron.]
How well can we reconstruct the digit images from the binary feature activations?

[Figure: pairs of data and reconstructions from the activated binary features, for new test images from the digit class that the model was trained on, and for images from an unfamiliar digit class (the network tries to see every image as a 2).]
Using an RBM to learn a model of a digit class

[Figure: an RBM with 256 visible units (pixels) and 100 hidden units (features), trained with CD-1 ($\langle v_i h_j \rangle^0$ on the data, $\langle v_i h_j \rangle^1$ on the reconstruction). Shown: reconstructions of data by a model trained on 2s and by a model trained on 3s.]
Training a Deep Belief Network (the main reason RBMs are interesting)
• First train a layer of features that receive input directly from the pixels.
• Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer. (A compact sketch of this procedure follows below.)
• It can be proved that each time we add another layer of features, we improve a variational lower bound on the log probability of the training data.
  – The proof is slightly complicated.
  – But it is based on a neat equivalence between an RBM and a deep directed model.
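A compact sketch of the greedy layer-wise procedure, with the CD-1 update inlined from the earlier sketch; train_rbm, train_dbn, and the placeholder data are illustrative, not the original implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.1):
    """Train one binary RBM with CD-1 and return its weights and hidden biases."""
    n_visible = data.shape[1]
    theta = rng.normal(scale=0.01, size=(n_visible, n_hidden))
    b_vis, b_hid = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        for v0 in data:
            p_h0 = sigmoid(b_hid + v0 @ theta)
            h0 = (rng.random(n_hidden) < p_h0).astype(float)
            v1 = (rng.random(n_visible) < sigmoid(b_vis + h0 @ theta.T)).astype(float)
            p_h1 = sigmoid(b_hid + v1 @ theta)
            theta += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
            b_vis += lr * (v0 - v1)
            b_hid += lr * (p_h0 - p_h1)
    return theta, b_hid

def train_dbn(data, layer_sizes):
    """Greedy layer-wise training: each RBM's hidden activations
    become the 'pixels' for the next RBM."""
    layers, x = [], data
    for n_hidden in layer_sizes:
        theta, b_hid = train_rbm(x, n_hidden)
        layers.append((theta, b_hid))
        x = sigmoid(b_hid + x @ theta)   # up-pass feeds the next layer
    return layers

# Example (placeholder random data; real use would pass binarized images)
data = (rng.random((50, 784)) < 0.1).astype(float)
dbn = train_dbn(data, [500, 500, 2000])
```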
The Generative Model After Learning 3 Layers

To generate data:
1. Get an equilibrium sample from the top-level RBM by performing alternating Gibbs sampling for a long time.
2. Perform a top-down pass to get states for all the other layers.

So the lower-level bottom-up connections are not part of the generative model. They are just used for inference.

[Figure: data, h1, h2, h3 connected by weights Θ1, Θ2, Θ3; the top two layers (h2, h3) keep undirected RBM connections, while the lower connections are directed downward.]
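A sketch of this two-step generation procedure, assuming layers is a list of (weights, hidden-bias) pairs like the one produced by the hypothetical train_dbn above; visible and penultimate biases are assumed zero for brevity.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate(layers, n_gibbs=1000):
    """Step 1: alternating Gibbs sampling in the top-level RBM.
    Step 2: a top-down pass through the lower, directed layers."""
    theta_top, b_hid_top = layers[-1]
    pen = (rng.random(theta_top.shape[0]) < 0.5).astype(float)  # penultimate layer
    for _ in range(n_gibbs):
        h = (rng.random(b_hid_top.shape)
             < sigmoid(b_hid_top + pen @ theta_top)).astype(float)
        pen = (rng.random(pen.shape) < sigmoid(h @ theta_top.T)).astype(float)
    # Top-down pass: the generative connections of the lower layers are
    # the transposes of the bottom-up weights
    x = pen
    for theta, _ in reversed(layers[:-1]):
        x = sigmoid(x @ theta.T)
    return x   # pixel probabilities

# Example with tiny random stand-in weights
layers = [(rng.normal(size=(16, 8)), np.zeros(8)),
          (rng.normal(size=(8, 4)), np.zeros(4))]
print(generate(layers, n_gibbs=100))
```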
Why does greedy learning work?
• Each RBM converts its data distribution into an aggregated posterior distribution over its hidden units.
• This divides the task of modeling its data into two tasks:
  – Task 1: Learn generative weights that can convert the aggregated posterior distribution over the hidden units back into the data distribution.
  – Task 2: Learn to model the aggregated posterior distribution over the hidden units.
  – The RBM does a good job of Task 1 and a moderately good job of Task 2.
• Task 2 is easier (for the next RBM) than modeling the original data, because the aggregated posterior distribution is closer to a distribution that an RBM can model perfectly.

[Figure: Task 2 is modeling the aggregated posterior distribution on the hidden units, $P(h \mid \Theta)$; Task 1 is mapping it back to the data distribution on the visible units, $P(v \mid h, \Theta)$.]
Why does greedy learning work?
• The weights Θ in the bottom-level RBM define $P(v \mid h)$, and they also, indirectly, define $P(h)$.
• So we can express the RBM model as

$$P(v) = \sum_h P(v \mid h, \Theta) \, P(h \mid \Theta)$$

• If we leave $P(v \mid h, \Theta)$ alone and improve $P(h \mid \Theta)$, we will improve $P(v)$.
• To improve $P(h)$, we need it to be a better model of the aggregated posterior distribution over hidden vectors produced by applying Θ to the data.
  – This is accomplished by the next higher layer.
Why greedy learning works
• Each time we learn a new layer, the inference at the layer below becomes incorrect, but the variational bound on the log prob of the data improves (only true in theory).
• Since the bound starts as an equality, learning a new layer never decreases the log prob of the data, provided we start the learning from the tied weights that implement the complementary prior.
• Now that we have a guarantee, we can loosen the restrictions and still feel confident:
  – Allow layers to vary in size.
  – Do not start the learning at each layer from the weights in the layer below.
A neural network model of digit recognition

[Figure: 28×28 pixel image → 500 units → 500 units → 2000 top-level units, with 10 label units attached to the top level.]

The model learns a joint density for labels and images. To perform recognition, we can start with a neutral state of the label units and do one or two iterations of the top-level RBM. Or we can just compute the free energy of the RBM with each of the 10 labels.

The top two layers form a restricted Boltzmann machine whose free-energy landscape models the low-dimensional manifolds of the digits. The valleys have names (one per digit class).
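The free-energy route can be sketched as follows. For a binary RBM, $F(v) = -v \cdot b_{vis} - \sum_j \log(1 + e^{x_j})$ with $x_j = b_j + \sum_i v_i \theta_{ij}$ is the standard closed form; attaching the label units directly to the visible vector here is a simplification of the slide's architecture, and all names and sizes are illustrative.

```python
import numpy as np

def free_energy(v, theta, b_vis, b_hid):
    """F(v) = -v.b_vis - sum_j log(1 + exp(b_j + sum_i v_i theta_ij));
    P(v) is proportional to exp(-F(v)), so lower F means more probable."""
    x = b_hid + v @ theta
    return -(v @ b_vis) - np.sum(np.logaddexp(0.0, x))

def classify(image, theta, b_vis, b_hid, n_labels=10):
    """Try each one-hot label with the image and pick the lowest free energy."""
    scores = []
    for k in range(n_labels):
        label = np.zeros(n_labels)
        label[k] = 1.0
        v = np.concatenate([image, label])   # visibles = pixels + label units
        scores.append(free_energy(v, theta, b_vis, b_hid))
    return int(np.argmin(scores))

# Example with random stand-in weights: 784 pixels + 10 label units visible
rng = np.random.default_rng(5)
theta = rng.normal(scale=0.01, size=(794, 100))
b_vis, b_hid = np.zeros(794), np.zeros(100)
image = (rng.random(784) < 0.1).astype(float)
print(classify(image, theta, b_vis, b_hid))
```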
Movie of the network generating digits (available at www.cs.toronto/~hinton)
Fine-tuning with a contrastive version of the "wake-sleep" algorithm
After learning many layers of features, we can fine-tune the features to improve generation.
1. Do a stochastic bottom-up pass.
   – Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.
2. Do a few iterations of sampling in the top-level RBM.
   – Adjust the weights in the top-level RBM.
3. Do a stochastic top-down pass.
   – Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.
Not required! But it helps the recognition rate.
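A loose sketch of the wake and sleep updates for a single pair of adjacent layers, using the standard wake-sleep delta rule. It omits the top-level RBM step and simplifies the top-down pass, so treat it as an illustration of the idea rather than the fine-tuning algorithm itself; rec_w, gen_w, and the helpers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    return (rng.random(p.shape) < p).astype(np.float64)

def wake_sleep_pair(v, rec_w, gen_w, lr=0.01):
    """rec_w: bottom-up (recognition) weights; gen_w: top-down (generative)."""
    # Wake phase: stochastic bottom-up pass, then train the generative weights
    # to reconstruct the layer below from the sampled states above.
    h = sample(sigmoid(v @ rec_w))
    v_pred = sigmoid(h @ gen_w)
    gen_w += lr * np.outer(h, v - v_pred)
    # Sleep phase: stochastic top-down pass (here driven by the same h), then
    # train the recognition weights to recover the states above.
    v_dream = sample(sigmoid(h @ gen_w))
    h_pred = sigmoid(v_dream @ rec_w)
    rec_w += lr * np.outer(v_dream, h - h_pred)
    return rec_w, gen_w

# Example with tiny random stand-in weights
v = (rng.random(16) < 0.2).astype(float)
rec_w = rng.normal(scale=0.1, size=(16, 8))
gen_w = rng.normal(scale=0.1, size=(8, 16))
rec_w, gen_w = wake_sleep_pair(v, rec_w, gen_w)
```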
Limits of the Generative Model
1. Designed for images where non-binary values can be treated as probabilities.
2. Top-down feedback only in the highest (associative) layer.
3. No systematic way to deal with invariance.
4. Assumes segmentation has already been performed and does not learn to attend to the most informative parts of objects.
Deep Net Activation Functions
Other Deep Architectures: Convolutional Neural Network
[Image credit: http://timdettmers.com/2015/03/26/convolution-deep-learning/]
Other Deep Architectures: Convolutional Neural Network
[Image credit: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/]
[Image credit: http://rnd.azoft.com/wp-content/uploads_rnd/2016/11/overall-1024x256.png]
Other Deep Architectures: Long Short-Term Memory (LSTM)
[Image credit: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
Deep Learning in the Headlines

Deep Belief Net on Face Images
Based on materials by Andrew Ng

[Figure: the learned feature hierarchy, from pixels to edges to object parts (combinations of edges) to object models.]
Learning of Object Parts
Examples of learned object parts from object categories: faces, cars, elephants, chairs.
Slide credit: Andrew Ng
Training on Multiple Objects
Trained on 4 classes (cars, faces, motorbikes, airplanes). Second layer: shared features and object-specific features. Third layer: more specific features.
Slide credit: Andrew Ng
Scene Labeling via Deep Learning
[Farabet et al., ICML 2012; PAMI 2013]
Inference from Deep Learned Models
Generating posterior samples from faces by "filling in" experiments (cf. Lee and Mumford, 2003). Combine bottom-up and top-down inference.

[Figure rows: input images; samples from feedforward inference (control); samples from full posterior inference.]

Slide credit: Andrew Ng
Machine Learning in Automatic Speech Recognition
A typical speech recognition system: ML is used to predict phone states from the sound spectrogram.

Deep learning has state-of-the-art results:

# Hidden Layers      1     2     4     8     10    12
Word Error Rate %    16.0  12.8  11.4  10.9  11.0  11.1

Baseline GMM performance = 15.4% [Zeiler et al., "On rectified linear units for speech recognition", ICASSP 2013]
Impact of Deep Learning in Speech Technology
Slide credit: Li Deng, MS Research