lecun 20060816-ciar-01-energy based learning

A Tutorial on Energy-Based Learning. Yann LeCun, Sumit Chopra, Raia Hadsell, Marc'Aurelio Ranzato, Fu-Jie Huang. The Courant Institute of Mathematical Sciences, New York University. http://yann.lecun.com http://www.cs.nyu.edu/~yann

TRANSCRIPT

  • 1. A Tutorial on Energy-Based Learning. Yann LeCun, Sumit Chopra, Raia Hadsell, Marc'Aurelio Ranzato, Fu-Jie Huang. The Courant Institute of Mathematical Sciences, New York University. http://yann.lecun.com http://www.cs.nyu.edu/~yann

  • 2. Two Problems in Machine Learning. 1. The Deep Learning Problem: deep architectures are necessary to solve the invariance problem in vision (and perception in general). How do we train deep architectures with lots of non-linear stages? 2. The Partition Function Problem: give high probability (or low energy) to good answers and low probability (or high energy) to bad answers; there are too many bad answers! This tutorial discusses problem #2. The partition function problem arises with probabilistic approaches; non-probabilistic approaches may allow us to get around it. Energy-Based Learning provides a framework in which to describe probabilistic and non-probabilistic approaches to learning.
  • 3. Energy-Based Model for Decision Making. Model: measures the compatibility between an observed variable X and a variable to be predicted Y through an energy function E(Y, X). Inference: search for the Y that minimizes the energy within a set. If the set has low cardinality, we can use exhaustive search (sketched in code after this block).
  • 4. Complex Tasks: Inference is Non-Trivial. When the cardinality or dimension of Y is large, exhaustive search is impractical. We need to use a smart inference procedure: min-sum, Viterbi, ...
  • 5. What Questions Can a Model Answer? 1. Classification and decision making: which value of Y is most compatible with X? Applications: robot navigation, ... Training: give the lowest energy to the correct answer. 2. Ranking: is Y1 or Y2 more compatible with X? Applications: data mining, ... Training: produce energies that rank the answers correctly. 3. Detection: is this value of Y compatible with X? Application: face detection, ... Training: energies that increase as the image looks less like a face. 4. Conditional density estimation: what is the conditional distribution P(Y|X)? Application: feeding a decision-making system. Training: differences of energies must be just so.
  • 6. Decision Making versus Probabilistic Modeling. Energies are uncalibrated: the energies of two separately-trained systems cannot be combined, because they are measured in arbitrary units. How do we calibrate energies? We turn them into probabilities (positive numbers that sum to 1). Simplest way: the Gibbs distribution, with its partition function and inverse temperature. Other ways can be reduced to Gibbs by a suitable redefinition of the energy.
  • 7. Architecture and Loss Function. A family of energy functions; a training set; a loss functional / loss function that measures the quality of an energy function, used for training. Form of the loss functional: invariant under permutations and repetitions of the samples; it combines a per-sample loss and a regularizer (figure labels: desired answer, energy surface for a given Xi as Y varies).
  • 8. Designing a Loss Functional. Correct answer has the lowest energy: LOW LOSS. Lowest energy is not for the correct answer: HIGH LOSS.
  • 9. Designing a Loss Functional. Push down on the energy of the correct answer. Pull up on the energies of the incorrect answers, particularly if they are smaller than that of the correct one.
  • 10. Architecture + Inference Algorithm + Loss Function = Model. E(W, Y, X). 1. Design an architecture: a particular form for E(W, Y, X). 2. Pick an inference algorithm for Y: MAP or conditional distribution, belief propagation, min-cut, variational methods, gradient descent, MCMC, HMC, ... 3. Pick a loss function, in such a way that minimizing it with respect to W over a training set will make the inference algorithm find the correct Y for a given X. 4. Pick an optimization method. PROBLEM: what loss functions will make the machine approach the desired behavior?
  • 11. Several Energy Surfaces Can Give the Same Answers. Both surfaces compute Y = X^2: min_Y E(Y, X) = X^2. Minimum-energy inference gives us the same answer.
  • 12. Simple Architectures: regression, binary classification, multi-class classification.
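To make slides 3 and 6 concrete, here is a minimal sketch (not taken from the slides) of energy-based inference by exhaustive search over a small candidate set, followed by calibration of the energies into probabilities with a Gibbs distribution. The quadratic energy function, the candidate grid, and the beta value are illustrative assumptions.

```python
# Minimal sketch: energy-based inference over a discrete answer set, and
# calibration of energies via a Gibbs distribution. The quadratic energy
# E(Y, X) = (Y - X**2)**2 is a hypothetical example, not the tutorial's model.
import numpy as np

def energy(y, x):
    """Compatibility between observation x and candidate answer y (lower is better)."""
    return (y - x**2) ** 2

def infer(x, candidates):
    """Exhaustive-search inference: pick the Y with minimum energy (slide 3)."""
    energies = np.array([energy(y, x) for y in candidates])
    return candidates[np.argmin(energies)], energies

def gibbs(energies, beta=1.0):
    """Turn uncalibrated energies into probabilities: P(Y|X) proportional to exp(-beta * E) (slide 6)."""
    w = np.exp(-beta * (energies - energies.min()))  # shift for numerical stability
    return w / w.sum()

x = 1.5
candidates = np.linspace(-4.0, 4.0, 17)
y_star, energies = infer(x, candidates)
print("argmin_Y E(Y, X):", y_star)
print("P(Y|X) of the best answer:", gibbs(energies)[np.argmin(energies)])
```

Exhaustive search is only viable for small candidate sets, which is exactly why slide 4 points to min-sum and Viterbi-style inference for large Y.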
  • 13. Simple Architecture: Implicit Regression. The implicit regression architecture allows multiple answers to have low energy. It encodes a constraint between X and Y rather than an explicit functional relationship. This is useful for many applications. Example: sentence completion: "The cat ate the {mouse, bird, homework, ...}" [Bengio et al. 2003]. But inference may be difficult.
  • 14. Examples of Loss Functions: Energy Loss. Simply pushes down on the energy of the correct answer. It works for some architectures but collapses for others (the slide contrasts "WORKS!!!" with "COLLAPSES!!!"). The standard forms of this and the following losses are sketched in code after this block.
  • 15. Examples of Loss Functions: Perceptron Loss [LeCun et al. 1998], [Collins 2002]. Pushes down on the energy of the correct answer. Pulls up on the energy of the machine's answer. Always positive; zero when the answer is correct. No margin: technically it does not prevent the energy surface from being almost flat. Works pretty well in practice, particularly if the energy parameterization does not allow flat surfaces.
  • 16. Perceptron Loss for Binary Classification. Energy, inference, loss, and learning rule (equations on the slide), including the case where Gw(X) is linear in W.
  • 17. Examples of Loss Functions: Generalized Margin Losses. First, we need to define the Most Offending Incorrect Answer; the slide gives the discrete case and the continuous case.
  • 18. Examples of Loss Functions: Generalized Margin Losses. The generalized margin loss Qm increases with the energy of the correct answer and decreases with the energy of the most offending incorrect answer whenever the latter is less than the energy of the correct answer plus a margin m.
  • 19. Examples of Generalized Margin Losses. Hinge loss [Altun et al. 2003], [Taskar et al. 2003]: with the linearly-parameterized binary classifier architecture, we get linear SVMs. Log loss ("soft hinge" loss): with the linearly-parameterized binary classifier architecture, we get linear logistic regression.
  • 20. Examples of Margin Losses: Square-Square Loss [LeCun & Huang 2005]. Appropriate for positive energy functions. Learning Y = X^2 (NO COLLAPSE!!!).
  • 21. Other Margin-Like Losses. LVQ2 loss [Kohonen, Oja], [Driancourt & Bottou 1991]; Minimum Classification Error loss [Juang, Chou, Lee 1997]; square-exponential loss [Osadchy, Miller, LeCun 2004].
  • 22. Negative Log-Likelihood Loss. Conditional probability of the samples (assuming independence); Gibbs distribution. We get the NLL loss by dividing by P and beta. Reduces to the perceptron loss when beta goes to infinity.
  • 23. Negative Log-Likelihood Loss. Pushes down on the energy of the correct answer. Pulls up on the energies of all answers in proportion to their probability.
  • 24. Negative Log-Likelihood Loss: Binary Classification. Binary classifier architecture; linear binary classifier architecture; the learning rule is logistic regression.
  • 25. What Makes a Good Loss Function. Good loss functions make the machine produce the correct answer and avoid collapses and flat energy surfaces.
  • 26. What Makes a Good Loss Function. Good and bad loss functions.
  • 27. Advantages / Disadvantages of Various Losses. Loss functions differ in how they pick the point(s) whose energy is pulled up, and how much they pull them up. Losses with a log partition function in the contrastive term pull up all the bad answers simultaneously. This may be good if the gradient of the contrastive term can be computed efficiently; it may be bad if it cannot, in which case we might as well use a loss with a single point in the contrastive term. Variational methods pull up many points, but not as many as with the full log partition function. Efficiency of a loss/architecture: how many energies are pulled up for a given amount of computation? The theory for this is yet to be developed.
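As a hedged illustration of the loss families in slides 14-22, the sketch below writes each loss as a plain function of the correct answer's energy and the most offending incorrect answer's energy. The functional forms follow the general definitions used in the energy-based learning tutorial; the margin m and inverse temperature beta are illustrative defaults, not values from the slides.

```python
# Sketch of the per-sample losses discussed above, as functions of the energy of
# the correct answer (e_correct), the energy of the most offending incorrect
# answer (e_incorrect), or the full vector of energies over all answers.
import numpy as np

def energy_loss(e_correct):
    return e_correct                                        # slide 14: just push down

def perceptron_loss(e_correct, e_machine_answer):
    return e_correct - e_machine_answer                     # slide 15: no margin

def hinge_loss(e_correct, e_incorrect, m=1.0):
    return np.maximum(0.0, m + e_correct - e_incorrect)     # slide 19

def log_loss(e_correct, e_incorrect):
    return np.log1p(np.exp(e_correct - e_incorrect))        # "soft hinge" (slide 19)

def square_square_loss(e_correct, e_incorrect, m=1.0):
    return e_correct**2 + np.maximum(0.0, m - e_incorrect)**2   # slide 20

def nll_loss(energies, correct_idx, beta=1.0):
    # slide 22: E(correct) + (1/beta) * log sum_y exp(-beta * E(y))
    log_z = np.log(np.sum(np.exp(-beta * energies)))
    return energies[correct_idx] + log_z / beta

e_c, e_i = 0.2, 1.5   # example energies: correct answer below the most offending incorrect one
print(hinge_loss(e_c, e_i), square_square_loss(e_c, e_i), nll_loss(np.array([e_c, e_i, 2.0]), 0))
```

The contrast in slide 27 is visible here: nll_loss touches every entry of the energy vector, while the margin losses only touch the correct answer and one contrastive point.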
  • 28. Latent Variable Models. The energy includes hidden variables Z whose value is never given to us.
  • 29. What can the latent variables represent? Variables that would make the task easier if they were known. Face recognition: the gender of the person, the orientation of the face. Object recognition: the pose parameters of the object (location, orientation, scale), the lighting conditions. Part-of-speech tagging: the segmentation of the sentence into syntactic units, the parse tree. Speech recognition: the segmentation of the sentence into phonemes or phones. Handwriting recognition: the segmentation of the line into characters. In general, we will search for the value of the latent variable that allows us to get an answer (Y) of smallest energy.
  • 30. Probabilistic Latent Variable Models. Marginalizing over latent variables instead of minimizing. Equivalent to traditional energy-based inference with a redefined energy function. Reduces to traditional minimization when beta goes to infinity (see the sketch after this block).
  • 31. Face Detection and Pose Estimation with a Convolutional EBM [Osadchy, Miller, LeCun, NIPS 2004]. Training: 52,850 32x32 grey-level images of faces and 52,850 selected non-faces. Each training image was used 5 times with random variation in scale, in-plane rotation, brightness and contrast. Second phase: half of the initial negative set was replaced by false positives of the initial version of the detector. Architecture: a convolutional network Gw maps the image X onto the face manifold, F(Z) is the analytical mapping of the pose Z onto the manifold, and the energy is E(W, Z, X) = ||Gw(X) - F(Z)|| with parameters W, image X and pose Z. Small E*(W, X): face; large E*(W, X): no face.
  • 32. Face Manifold. A low-dimensional output space: apply the mapping G to the image X; the face manifold F(Z) is parameterized by the pose, and the energy measures the distance ||G(X) - F(Z)|| to the closest point on the manifold.
  • 33. Probabilistic Approach: Density Model of the Joint P(face, pose). The probability that image X is a face with pose Z. Given a training set of faces annotated with pose, find the W that maximizes the likelihood of the data under the model; equivalently, minimize the negative log-likelihood. The resulting normalization term is COMPLICATED.
  • 34. Energy-Based Contrastive Loss Function. Attract the network output Gw(X) to the location of the desired pose F(Z) on the manifold; repel the network output Gw(X) away from the face/pose manifold for non-faces.
  • 35. Convolutional Network Architecture [LeCun et al. 1988, 1989, 1998, 2005]. A hierarchy of local filters (convolution kernels), sigmoid pointwise non-linearities, and spatial subsampling. All the filter coefficients are learned with gradient descent (backprop).
  • 36. Simple cells and complex cells: alternated convolutions and pooling/subsampling. Local features are extracted everywhere; the pooling/subsampling layers build robustness to variations in feature locations. Long history in neuroscience and computer vision: Hubel & Wiesel 1962; Fukushima 1971-82; LeCun 1988-06; Poggio, Riesenhuber, Serre 02-06; Ullman 2002-06; Triggs, Lowe, ...
  • 37. Building a Detector/Recognizer: Replicated Convolutional Nets. Traditional detectors/classifiers must be applied to every location on a large input image, at multiple scales. Convolutional nets can be replicated over large images very cheaply: a 32x32 window over a 40x40 input yields a 3x3 output map. The network is applied at multiple scales spaced by sqrt(2), with non-maximum suppression using an exclusion window.
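Slides 28-30 contrast minimizing over a latent variable Z with marginalizing over it. The sketch below uses a small random energy table (an assumption, unrelated to the face model above) to show that the marginalized "free energy" approaches the minimized energy as beta grows, which is the "reduces to traditional minimization" claim of slide 30.

```python
# Toy illustration of the two ways to handle a latent variable Z: minimize it
# out, or marginalize it out with a free energy that tends to the minimum as
# beta -> infinity. The tabular energy E[y, z] is purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
E = rng.uniform(0.0, 5.0, size=(4, 3))   # 4 candidate answers Y, 3 latent values Z

def energy_min_over_z(E):
    """E*(Y, X) = min_z E(Y, Z, X): latent variable chosen by minimization."""
    return E.min(axis=1)

def free_energy(E, beta=1.0):
    """F_beta(Y, X) = -(1/beta) * log sum_z exp(-beta * E(Y, Z, X)): marginalization."""
    return -np.log(np.sum(np.exp(-beta * E), axis=1)) / beta

for beta in (1.0, 10.0, 100.0):
    gap = np.abs(free_energy(E, beta) - energy_min_over_z(E)).max()
    print(f"beta={beta}: max gap to min-over-Z = {gap:.4f}")
```

The shrinking gap with growing beta is the whole point: probabilistic (marginalizing) and decision-making (minimizing) inference are two ends of the same temperature knob.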
  • 38. Building a Detector/Recognizer: Replicated Convolutional Nets. Computational cost for the replicated convolutional net: 96x96: 4.6 million multiply-accumulate operations; 120x120: 8.3 million; 240x240: 47.5 million; 480x480: 232 million. Computational cost for a non-convolutional detector of the same size, applied every 12 pixels: 96x96: 4.6 million; 120x120: 42.0 million; 240x240: 788.0 million; 480x480: 5,083 million. (96x96 window, 12-pixel shift, 84x84 overlap.)
  • 39. Face Detection: Results.
        Data set:                   TILTED        PROFILE       MIT+CMU
        False positives per image:  4.42 / 26.9   0.47 / 3.36   0.5 / 1.28
        Our detector:               90% / 97%     67% / 83%     83% / 88%
        Jones & Viola (tilted):     90% / 95%     x             x
        Jones & Viola (profile):    x             70% / 83%     x
        Rowley et al.:              89% / 96%     x             x
        Schneiderman & Kanade:      86% / 93%     x             x
  • 40. Face Detection and Pose Estimation: Results.
  • 41. Face Detection with a Convolutional Net.
  • 42. Performance on a standard data set: detection and pose estimation. Pose estimation is performed on faces located automatically by the system; when the faces are localized by hand we get 89% of yaw and 100% of in-plane rotations within 15 degrees.
  • 43. Synergy Between Detection and Pose Estimation. Pose estimation improves detection; detection improves pose estimation.
  • 44. EBM for Face Recognition. X and Y are images; Y is a discrete variable with many possible values (all the people in our gallery). Example of architecture: E(W, X, Y) = ||Gw(X) - Gw(Y)||, where the function Gw maps input images into a low-dimensional space in which the Euclidean distance measures dissemblance. Inference: find the Y in the gallery that minimizes the energy (the Y that is most similar to X), by exhaustive search.
  • 45. Learning an Invariant Dissimilarity Metric with an EBM [Chopra, Hadsell, LeCun, CVPR 2005]. Training a parameterized, invariant dissimilarity metric may be a solution to the many-category problem. Find a mapping Gw(X) such that the Euclidean distance ||Gw(X1) - Gw(X2)|| reflects the semantic distance between X1 and X2. Once trained, a trainable dissimilarity metric can be used to classify new categories using a very small number of training samples (used as prototypes). This is an example where probabilistic models are too constraining, because we would have to limit ourselves to models that can be normalized over the space of input pairs. With EBMs, we can put what we want in the box (e.g., a convolutional net). Siamese architecture: E(W, X1, X2) = ||Gw(X1) - Gw(X2)||. Application: face verification/recognition.
  • 46. Loss Function. Siamese models: the distance between the outputs of two identical copies of a model. Energy function: E(W, X1, X2) = ||Gw(X1) - Gw(X2)||. If X1 and X2 are from the same category (genuine pair), train the two copies of the model to produce similar outputs (low energy). If X1 and X2 are from different categories (impostor pair), train the two copies of the model to produce different outputs (high energy). Loss function: an increasing function of the genuine-pair energy and a decreasing function of the impostor-pair energy.
  • 47. Loss Function. Our loss function for a single training pair (X1, X2): L(W, X1, X2) = (1 - Y) * L_G(E_W(X1, X2)) + Y * L_I(E_W(X1, X2)) = (1 - Y) * (2/R) * E_W(X1, X2)^2 + Y * 2R * exp(-(2.77/R) * E_W(X1, X2)), with E_W(X1, X2) = ||Gw(X1) - Gw(X2)||_L1, where R is the largest possible value of E_W(X1, X2), Y = 0 for a genuine pair and Y = 1 for an impostor pair (a code sketch follows this block).
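The following is a hedged sketch of the siamese energy and the contrastive loss written on slide 47. The embedding Gw is replaced by a stand-in linear map (the slides use the convolutional net of slide 54), and the value of R, the assumed upper bound on the energy, is arbitrary here.

```python
# Sketch of the siamese contrastive loss from slide 47 with a placeholder
# embedding; R is an assumed upper bound on the pair energy, chosen arbitrarily.
import numpy as np

def g_w(x, W):
    """Placeholder embedding; the slides use a convolutional network here."""
    return W @ x

def pair_energy(x1, x2, W):
    """E_W(X1, X2) = ||Gw(X1) - Gw(X2)||_L1."""
    return np.abs(g_w(x1, W) - g_w(x2, W)).sum()

def contrastive_loss(e, y, R=10.0):
    """L = (1 - Y) * (2/R) * E^2 + Y * 2R * exp(-(2.77/R) * E).
    Y = 0 for a genuine pair (pull together), Y = 1 for an impostor pair (push apart)."""
    return (1 - y) * (2.0 / R) * e**2 + y * 2.0 * R * np.exp(-(2.77 / R) * e)

rng = np.random.default_rng(1)
W = rng.normal(size=(2, 5))
x1, x2 = rng.normal(size=5), rng.normal(size=5)
e = pair_energy(x1, x2, W)
print("genuine-pair loss:", contrastive_loss(e, y=0), " impostor-pair loss:", contrastive_loss(e, y=1))
```

Minimizing the genuine term pushes the pair energy toward zero, while the exponential impostor term keeps pulling the energy up; this is exactly the "increasing / decreasing" requirement stated on slide 46.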
  • 48. Face Verification Datasets: AT&T, FERET, and AR/Purdue. The AT&T/ORL dataset: total subjects: 40; images per subject: 10; total images: 400. Images had a moderate degree of variation in pose, lighting, expression and head position. Images from 35 subjects were used for training; images from the 5 remaining subjects for testing. Training set taken from 3,500 genuine and 119,000 impostor pairs. Test set taken from 500 genuine and 2,000 impostor pairs. http://www.uk.research.att.com/facedatabase.html
  • 49. Face Verification Datasets: AT&T, FERET, and AR/Purdue. The FERET dataset: this part of the data was used only for training. Total subjects: 96; images per subject: 6; total images: 1,122. Images had a high degree of variation in pose, lighting, expression and head position. http://www.itl.nist.gov/iad/humanid/feret/
  • 50. Face Verification Datasets: AT&T, FERET, and AR/Purdue. The AR/Purdue dataset: total subjects: 136; images per subject: 26; total images: 3,536. Each subject has 2 sets of 13 images taken 14 days apart. Images had a very high degree of variation in pose, lighting, expression and position. Within each set of 13, there are 4 images with expression variation, 3 with lighting variation, 3 with dark sunglasses and lighting variation, and 3 with face-obscuring scarves and lighting variation. Images from 96 subjects were used for training; the remaining 40 subjects were used for testing. Training set drawn from 64,896 genuine and 6,165,120 impostor pairs. Test set drawn from 27,040 genuine and 1,054,560 impostor pairs. http://rv11.ecn.purdue.edu/aleix/aleix_face_DB.html
  • 51. Face Verification Dataset: AR/Purdue (example images).
  • 52. Preprocessing. The 3 datasets each required a small amount of preprocessing. FERET: cropping, subsampling, and centering (see below). AR/Purdue: cropping and subsampling. AT&T: subsampling only. Pipeline: crop, subsample, center.
  • 53. Centering with a Gaussian-Blurred Face Template. Coarse centering was done on the FERET database images: 1. Construct a template by blurring a well-centered face. 2. Convolve the template with an uncentered image. 3. Choose the peak of the convolution as the center of the image.
  • 54. Architecture for the Mapping Function Gw(X): a convolutional net. Input image: 2@56x46. Layer 1: 15@50x40 (7x7 convolution, 15 kernels). Layer 2: 15@25x20 (2x2 subsampling). Layer 3: 45@20x15 (6x6 convolution, 198 kernels). Layer 4: 45@5x5 (4x3 subsampling). Layer 5: 250 (5x5 convolution, 11,250 kernels). Layer 6: fully connected, 50 units: the low-dimensional invariant representation.
  • 55. Internal state for genuine and impostor pairs.
  • 56. Gaussian Face Model in the Output Space. A Gaussian model constructed from 5 genuine images of the above subject, compared against impostor images.
  • 57. Dataset for Verification and Verification Results, tested on AT&T and AR/Purdue.
        AT&T dataset:      false accept 10.00% at false reject 0.00%; 7.50% at 1.00%; 5.00% at 1.00%.
        AR/Purdue dataset: false accept 10.00% at false reject 11.00%; 7.50% at 14.60%; 5.00% at 19.00%.
        AT&T test set: number of subjects: 5; images/subject: 10; images/model: 5; total test size: 5,000; genuine pairs: 500; impostor pairs: 4,500. Purdue/AR test set: number of subjects: 40; images/subject: 26; images/model: 13; total test size: 5,000; genuine pairs: 500; impostor pairs: 4,500.
  • 58. Classification Examples. Correctly classified genuine pairs: energies 0.3159, 0.0043, 0.0046. Correctly classified impostor pairs: energies 32.7897, 5.7186, 20.1259. Misclassified pairs: energies 10.3209, 2.8243.
  • 59. Internal State.
  • 60. A Similar Idea for Learning a Manifold with Invariance Properties. Loss function: L_similar = (1/2) * D_W^2 and L_dissimilar = (1/2) * (max(0, m - D_W))^2, with D_W = ||Gw(x1) - Gw(x2)||. Pay quadratically for making the outputs of neighbors far apart; pay quadratically for making the distance between the outputs of non-neighbors smaller than a margin m (a code sketch follows this block).
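Slide 60's neighbor/non-neighbor margin loss can be stated in a few lines; the sketch below is illustrative only, with an assumed Euclidean D_W and an arbitrary margin value.

```python
# Sketch of the manifold-learning loss on slide 60: pull neighbors together
# quadratically, push non-neighbors out to at least a margin m.
import numpy as np

def d_w(g1, g2):
    """D_W = ||G_W(x1) - G_W(x2)|| between the outputs of the two copies."""
    return np.linalg.norm(g1 - g2)

def manifold_loss(g1, g2, neighbors, m=1.0):
    d = d_w(g1, g2)
    if neighbors:                        # similar pair: L = 1/2 * D^2
        return 0.5 * d**2
    return 0.5 * max(0.0, m - d)**2      # dissimilar pair: L = 1/2 * max(0, m - D)^2

g_a, g_b = np.array([0.2, 0.1]), np.array([0.9, -0.4])
print(manifold_loss(g_a, g_b, neighbors=True), manifold_loss(g_a, g_b, neighbors=False))
```

Unlike slide 47's loss, the dissimilar term here goes exactly to zero once non-neighbors are farther apart than the margin, which is what lets the mapping spread the manifold out rather than pushing impostors away indefinitely.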
  • 61. A Manifold with Invariance to Shifts. Training set: 3,000 "4"s and 3,000 "9"s from MNIST. Each digit is shifted horizontally by -6, -3, +3, and +6 pixels. Neighborhood graph: 5 nearest neighbors in Euclidean distance, plus shifted versions of self and nearest neighbors. Output dimension: 2. Test set (shown): 1,000 "4"s and 1,000 "9"s.
  • 62. Automatic discovery of the viewpoint manifold with invariance to illumination.
  • 63. Efficient Inference: Energy-Based Factor Graphs. Graphical models have brought us efficient inference algorithms, such as belief propagation and its numerous variations. Traditionally, graphical models are viewed as probabilistic models. At first glance, it seems difficult to dissociate graphical models from the probabilistic view. Energy-Based Factor Graphs are an extension of graphical models to non-probabilistic settings. An EBFG is an energy function that can be written as a sum of factor functions that take different subsets of variables as inputs.
  • 64. Efficient Inference: Energy-Based Factor Graphs. The energy is a sum of factor functions. Example: Z1, Z2, Y1 are binary and Y2 is ternary. A naive exhaustive inference would require 2x2x2x3 energy evaluations (= 96 factor evaluations). BUT: Ea only has 2 possible input configurations, Eb and Ec have 4, and Ed has 6. Hence, we can precompute the 16 factor values and put them on the arcs of a graph. A path in the graph is a configuration of the variables; the cost of the path is the energy of that configuration.
  • 65. Energy-Based Belief Propagation. The previous picture shows a chain graph of factors with 2 inputs. The extension of this procedure to trees, with factors that can have more than 2 inputs, is the min-sum algorithm (a non-probabilistic form of belief propagation). Basically, it is the sum-product algorithm with a different semiring algebra (min instead of sum, sum instead of product) and no normalization step. [Kschischang, Frey, Loeliger, 2001], [MacKay's book]. (A chain-structured min-sum sketch follows this block.)
  • 66. Feed-Forward, Causal, and Bidirectional Models. EBFGs are all undirected, but the architecture determines the complexity of the inference in certain directions (the factors shown are a complicated function and a distance). Feed-forward: predicting Y from X is easy; predicting X from Y is hard. Causal: predicting Y from X is hard; predicting X from Y is easy. Bidirectional: X->Y and Y->X are both hard if the two factors don't agree; they are both easy if the factors agree.
  • 67. Two types of deep architectures: the factors are deep / the graph is deep.
  • 68. Shallow Factors / Deep Graph. Linearly parameterized factors: with the NLL loss, Lafferty's Conditional Random Field; with the hinge loss, Taskar's Max-Margin Markov Nets; with the perceptron loss, Collins's sequence labeling model; with the log loss, the Altun/Hofmann sequence labeling model.
  • 69. Deep Factors / Deep Graph: ASR with TDNN/DTW. A trainable Automatic Speech Recognition system with convolutional nets (TDNN) and dynamic time warping (DTW), training the feature extractor as part of the whole process. With the LVQ2 loss: Driancourt and Bottou's speech recognizer (1991). With NLL: Bengio's speech recognizer (1992), Haffner's speech recognizer (1993).
  • 70. Deep Factors / Deep Graph: ASR with TDNN/HMM. A discriminative Automatic Speech Recognition system with an HMM and various acoustic models, training the acoustic model (feature extractor) and a (normalized) HMM in an integrated fashion. With the Minimum Empirical Error loss: Ljolje and Rabiner (1990). With NLL: Bengio (1992), Haffner (1993), Bourlard (1994). With MCE: Juang et al. (1997). Late normalization scheme (un-normalized HMM): Bottou pointed out the label bias problem (1991); Denker and Burges proposed a solution (1995).
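To illustrate the min-sum inference of slides 64-65 in the simplest case, the sketch below runs dynamic programming over a chain of pairwise factor tables: the total energy is the sum of the factors along a path, and the algorithm returns the minimum-energy configuration without enumerating all paths. The factor tables are random and purely illustrative.

```python
# Min-sum (Viterbi-style) inference on a chain-structured energy-based factor
# graph: factors[t] is a table of energies E_t(z_t, z_{t+1}); the total energy
# of a configuration is the sum of the factors along the corresponding path.
import numpy as np

def min_sum_chain(factors):
    """Return the minimum total energy and the arg-min configuration."""
    cost = np.zeros(factors[0].shape[0])          # best energy to reach each state so far
    back = []
    for f in factors:
        total = cost[:, None] + f                 # energy of extending each partial path
        back.append(total.argmin(axis=0))         # best predecessor for each next state
        cost = total.min(axis=0)
    state = int(cost.argmin())                    # best final state
    path = [state]
    for bp in reversed(back):                     # trace the best path backwards
        state = int(bp[state])
        path.append(state)
    return float(cost.min()), list(reversed(path))

rng = np.random.default_rng(2)
factors = [rng.uniform(0, 5, size=(2, 3)), rng.uniform(0, 5, size=(3, 2))]
print(min_sum_chain(factors))
```

Replacing min with sum and sum with product in the same recursion (and adding normalization) gives the sum-product algorithm, which is the semiring swap slide 65 refers to.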
  • 71. Really Deep Factors / Really Deep Graph. Handwriting recognition with Graph Transformer Networks: un-normalized hierarchical HMMs. Trained with the perceptron loss [LeCun, Bottou, Bengio, Haffner 1998]. Trained with the NLL loss [Bengio, LeCun 1994], [LeCun, Bottou, Bengio, Haffner 1998]. Answer = sequence of symbols; latent variable = segmentation.