
Page 1: Classification using Discriminative Restricted Boltzmann Machines (slides: swoh.web.engr.illinois.edu/courses/IE598/handout/fall2016_slide20.pdf)

Classification using Discriminative Restricted Boltzmann Machines

Hugo Larochelle and Yoshua Bengio

Presented by: Bhargav Mangipudi

December 2, 2016

Hugo Larochelle and Yoshua Bengio (Presented by: Bhargav Mangipudi), IE598 - Inference in Graphical Models, December 2, 2016, 1 / 24


Outline

1 Introduction

2 Modeling Classification using RBMs

3 Discriminative Restricted Boltzmann Machines (DRBM)

4 Hybrid Discriminative Restricted Boltzmann Machines (HDRBM)

5 Semi-Supervised Learning

6 Experiments: Character Recognition, Document Classification

7 Further Work


Restricted Boltzmann Machines

Restricted Boltzmann Machines are generative stochastic models that can model a probability distribution over a set of inputs using a set of hidden (or latent) units.

Figure 1: Restricted Boltzmann Machine

They are represented as a bipartite graphical model where the visible layer is the observed data and the hidden layer models latent features.


RBM as feature extractors

RBMs were predominantly used in an unsupervised setting to extract better features from the input data.

By trying to reconstruct the input vector, we can make the RBM learn a different representation of the input data. This representation is then fed to a downstream classification algorithm.

By extracting features with an RBM and using them in a downstream task, we introduce a large number of hyper-parameters: those of the RBM/DBN plus those of the downstream classification procedure.


Using RBM for NN/DBN initialization

RBMs have been used as a building block in Deep Belief Networks.

Pre-training: The stacked RBMs are trained in a layer-by-layer procedure in a greedy manner, fixing the previously computed layer weights.

Discriminative fine-tuning: When used for a classification task, fine-tuning of the weights is performed by adding a final layer of output nodes and propagating the gradients of the loss.

This was one of the earliest methods to tractably train deep neural network architectures.


Main Ideas

RBMs can be used as a self-contained framework for non-linear classificationtasks.

Introduces an objective function to model discriminative training of RBMs.

Extension to support semi-supervised learning setting.

Experimentation to show competitive results with other techniques.


Modeling Classification using RBMs

Assume that the training set is represented as $D_{train} = \{(x_i, y_i)\}$, where $x_i$ represents the input vector for the $i$-th example and $y_i \in \{1, \ldots, C\}$ is the corresponding target class.

Each label $y_i$ is encoded as a one-hot vector $\vec{y}$ over the set of $C$ classes and treated as observed.

Thus, we have
$$P(y, x, h) \propto e^{-E(y, x, h)}$$
where
$$E(y, x, h) = -h^T W x - b^T x - c^T h - d^T \vec{y} - h^T U \vec{y}$$
with parameters $\Theta = (W, b, c, d, U)$.
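The energy function can be sanity-checked directly in code. A minimal NumPy sketch (all shapes and names here are assumptions, not from the slides: $W$ is $H \times D$, $U$ is $H \times C$, and the one-hot product $d^T \vec{y}$ reduces to indexing $d[y]$):

```python
import numpy as np

def energy(y, x, h, W, b, c, d, U):
    """E(y, x, h) = -h^T W x - b^T x - c^T h - d^T e_y - h^T U e_y,
    with the one-hot class vector e_y reduced to indexing by y."""
    return -(h @ W @ x + b @ x + c @ h + d[y] + h @ U[:, y])

rng = np.random.default_rng(0)
D, H, C = 4, 3, 2                        # visible units, hidden units, classes
W = rng.normal(size=(H, D)); U = rng.normal(size=(H, C))
b = rng.normal(size=D); c = rng.normal(size=H); d = rng.normal(size=C)
x = rng.integers(0, 2, size=D).astype(float)
h = rng.integers(0, 2, size=H).astype(float)
E = energy(1, x, h, W, b, c, d, U)       # a scalar energy value
```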


Modeling Classification using RBMs - Continued

Thus, we see the following properties based on the structure of the model:
$$P(x \mid h) = \prod_i P(x_i \mid h), \qquad P(x_i = 1 \mid h) = \mathrm{sigm}\Big(b_i + \sum_j W_{j,i} h_j\Big)$$
$$P(y \mid h) = \frac{e^{d_y + \sum_j U_{j,y} h_j}}{\sum_{y^*} e^{d_{y^*} + \sum_j U_{j,y^*} h_j}}$$
Similarly for the hidden units, we get:
$$P(h \mid y, x) = \prod_j P(h_j \mid y, x), \qquad P(h_j = 1 \mid y, x) = \mathrm{sigm}\Big(c_j + U_{j,y} + \sum_i W_{j,i} x_i\Big)$$
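These conditionals translate directly into code. A hedged NumPy sketch (the shapes $W$ ($H \times D$), $U$ ($H \times C$) and all function names are assumptions):

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def p_h_given_yx(y, x, W, c, U):
    # P(h_j = 1 | y, x) = sigm(c_j + U_{j,y} + sum_i W_{j,i} x_i)
    return sigm(c + U[:, y] + W @ x)

def p_x_given_h(h, W, b):
    # P(x_i = 1 | h) = sigm(b_i + sum_j W_{j,i} h_j)
    return sigm(b + h @ W)

def p_y_given_h(h, d, U):
    # P(y | h) is proportional to exp(d_y + sum_j U_{j,y} h_j)
    logits = d + h @ U
    logits = logits - logits.max()       # for numerical stability
    p = np.exp(logits)
    return p / p.sum()
```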


Training RBMs

To train a generative model on the training set, we minimize the negative log-likelihood:
$$L_{gen}(D_{train}) = -\sum_{i=1}^{|D_{train}|} \log P(y_i, x_i)$$
Calculating the exact gradient of $\log P(y_i, x_i)$ for any parameter $\theta \in \Theta$, we get:
$$\frac{\partial \log P(y_i, x_i)}{\partial \theta} = -\mathbb{E}_{h \mid y_i, x_i}\left[\frac{\partial}{\partial \theta} E(y_i, x_i, h)\right] + \mathbb{E}_{y, x, h}\left[\frac{\partial}{\partial \theta} E(y, x, h)\right]$$
The gradient can be estimated using Contrastive Divergence.
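As a rough illustration of Contrastive Divergence, here is a hypothetical CD-1 estimate of the gradient with respect to $W$, replacing the intractable model expectation with a single Gibbs reconstruction (all names and shapes are assumptions; a practical trainer updates every parameter, not just $W$):

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_grad_W(y, x, W, b, c, d, U, rng):
    # Positive phase: E[h | y, x] is available in closed form.
    h_pos = sigm(c + U[:, y] + W @ x)
    # Negative phase: one Gibbs step (y, x) -> h -> (y', x') -> h'.
    h_samp = (rng.random(h_pos.shape) < h_pos).astype(float)
    x_neg = sigm(b + h_samp @ W)                 # mean-field reconstruction
    logits = d + h_samp @ U
    p_y = np.exp(logits - logits.max()); p_y /= p_y.sum()
    y_neg = rng.choice(len(p_y), p=p_y)
    h_neg = sigm(c + U[:, y_neg] + W @ x_neg)
    # d log P / dW  ~  <h x^T>_data - <h x^T>_model  (CD-1 approximation)
    return np.outer(h_pos, x) - np.outer(h_neg, x_neg)
```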


Training RBMs - Continued

After learning the model's parameters iteratively by stochastic gradient descent with a suitable learning rate, we can perform classification by computing:
$$P(y \mid x) = \frac{e^{d_y} \prod_{j=1}^{n} \big(1 + e^{c_j + U_{j,y} + \sum_i W_{j,i} x_i}\big)}{\sum_{y^*} e^{d_{y^*}} \prod_{j=1}^{n} \big(1 + e^{c_j + U_{j,y^*} + \sum_i W_{j,i} x_i}\big)}$$
When training in an unsupervised setting, the feature representation learned by the hidden layer is not guaranteed to be useful for the classification task.
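Because $y$ ranges over only $C$ classes, $P(y \mid x)$ can be computed exactly and in a numerically stable way using log-space softplus terms. A minimal NumPy sketch (shapes and names are assumptions, not from the slides):

```python
import numpy as np

def p_y_given_x(x, W, c, d, U):
    # Log of the unnormalized class scores:
    #   d_y + sum_j log(1 + exp(c_j + U_{j,y} + sum_i W_{j,i} x_i))
    pre = (c + W @ x)[:, None] + U               # shape (H, C)
    log_scores = d + np.logaddexp(0.0, pre).sum(axis=0)
    log_scores -= log_scores.max()               # numerical stability
    p = np.exp(log_scores)
    return p / p.sum()
```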


Discriminative Restricted Boltzmann Machines

For the discriminative setting, we directly optimize $P(y \mid x)$ instead of the joint distribution $P(y, x)$. We get the following negative log-likelihood:
$$L_{disc}(D_{train}) = -\sum_{i=1}^{|D_{train}|} \log P(y_i \mid x_i)$$
Training procedure:
$$\frac{\partial \log P(y_i \mid x_i)}{\partial \theta} = \sum_j \mathrm{sigm}(o_{y_i,j}(x_i)) \frac{\partial o_{y_i,j}(x_i)}{\partial \theta} - \sum_{j,y^*} \mathrm{sigm}(o_{y^*,j}(x_i))\, P(y^* \mid x_i)\, \frac{\partial o_{y^*,j}(x_i)}{\partial \theta}$$
where $o_{y,j}(x) = c_j + \sum_k W_{j,k} x_k + U_{j,y}$.

This gradient can be computed exactly and efficiently, so the model can be trained by stochastic gradient descent.
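For a concrete instance, $\partial o_{y,j}(x) / \partial U_{j,y'}$ is 1 when $y' = y$ and 0 otherwise, so the gradient with respect to $U$ has a simple closed form. A hypothetical NumPy sketch (names and shapes are assumptions) that can be checked against finite differences:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_p_y_given_x(y, x, W, c, d, U):
    o = (c + W @ x)[:, None] + U                 # o_{y*,j}(x), shape (H, C)
    log_scores = d + np.logaddexp(0.0, o).sum(axis=0)
    m = log_scores.max()
    return log_scores[y] - m - np.log(np.exp(log_scores - m).sum())

def grad_U(y, x, W, c, d, U):
    o = (c + W @ x)[:, None] + U
    log_scores = d + np.logaddexp(0.0, o).sum(axis=0)
    p = np.exp(log_scores - log_scores.max()); p /= p.sum()
    g = -sigm(o) * p                             # -sigm(o_{y*,j}) P(y*|x) term
    g[:, y] += sigm(o[:, y])                     # +sigm(o_{y,j}) term for y* = y
    return g
```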


Hybrid Discriminative Restricted Boltzmann Machines

Observation: Smaller training sets tend to favor generative learning and bigger training sets favor discriminative learning.

We adopt a hybrid generative/discriminative approach by combining the respective objective terms.

$$L_{hybrid}(D_{train}) = L_{disc}(D_{train}) + \alpha L_{gen}(D_{train})$$

$\alpha$ is a hyper-parameter controlling the weight of the generative criterion.

$L_{disc}(D_{train})$ and $L_{gen}(D_{train})$ are calculated using the techniques described above.


Extension for Semi-Supervised Learning

Motivation: Labeled/supervised data is often not readily available.

Assume Dunlab represents the set of unlabelled data.

$$L_{unsup}(D_{unlab}) = -\sum_{i=1}^{|D_{unlab}|} \log P(x_i)$$

Training:
$$\frac{\partial \log P(x_i)}{\partial \theta} = -\mathbb{E}_{y, h \mid x_i}\left[\frac{\partial}{\partial \theta} E(y, x_i, h)\right] + \mathbb{E}_{y, x, h}\left[\frac{\partial}{\partial \theta} E(y, x, h)\right]$$

The first term is different from the $L_{gen}$ gradient estimation. It can be efficiently calculated by taking an expectation w.r.t. $P(y \mid x)$ of the corresponding term from $L_{gen}$, or by sampling a single $y_i$ and using a point estimate.
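As an illustration of the expectation-over-$P(y \mid x)$ option for the weights $W$: the labelled positive-phase statistic is averaged under the class posterior. A hedged NumPy sketch (names and shapes are assumptions):

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def pos_phase_W_unlabelled(x, W, c, d, U):
    o = (c + W @ x)[:, None] + U                 # shape (H, C)
    log_scores = d + np.logaddexp(0.0, o).sum(axis=0)
    p_y = np.exp(log_scores - log_scores.max()); p_y /= p_y.sum()
    # E_{y|x} E_{h|y,x}[ h x^T ] = (sum_y P(y|x) sigm(o_{.,y})) x^T
    h_mean = sigm(o) @ p_y                       # shape (H,)
    return np.outer(h_mean, x)
```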


Extension for Semi-Supervised Learning - Continued

Semi-supervised setting:

$$L_{semi\text{-}sup}(D_{train}, D_{unlab}) = L_{TYPE}(D_{train}) + \beta L_{unsup}(D_{unlab})$$
where $TYPE \in \{gen, disc, hybrid\}$.

Thus, online training by stochastic gradient descent corresponds to applying two gradient updates: one for the objective $L_{TYPE}$ and one for the unlabelled-data objective $L_{unsup}$.


Character (Digit) Recognition - MNIST Dataset

MNIST Dataset for classifying images of digits.

λ = learning rate and n = number of training iterations.

Likelihood objective for HDRBM (after hyper-parameter tuning):

$$L_{hybrid}(D_{train}) = L_{disc}(D_{train}) + 0.01\, L_{gen}(D_{train})$$

Sparse HDRBM: a small value $\delta$ is subtracted from the hidden-layer biases $c$ after each update.


Document Classification - 20-newsgroup Dataset

Task: Classify newsgroup documents into 20 target classes.

Features used: the 5000 most common words across the dataset, as binary input features.


Document Classification - 20-newsgroup Dataset - HDRBM analysis

Figure 2: Similarity Matrix for weights U·,yj


Semi-supervised learning results

Compare the semi-supervised setting against other non-parametrizedsemi-supervised learning algorithms.

Figure 3: Percentage error across different datasets for semi-supervised learning


Further Work

Louradour, Jerome and Larochelle, Hugo. Classification of Sets using Restricted Boltzmann Machines. UAI 2011.

Larochelle, Hugo; Mandel, Michael; Pascanu, Razvan; Bengio, Yoshua. Learning Algorithms for the Classification Restricted Boltzmann Machine. JMLR 2012.

van der Maaten, Laurens. Discriminative Restricted Boltzmann Machines are Universal Approximators for Discrete Data. Technical Report EWI-PRB TR 2011001, Delft University of Technology.


Classification of Sets using RBMs

In this scenario, each input instance $x_i$ is represented by a set of vectors $x_i^{(0)}, x_i^{(1)}, \ldots, x_i^{(s)}$. For example, each input instance could be different snippets of a document (e.g. an email) or different regions of an image. This is still modelled as a multi-class classification task.

One common approach to this problem is the multiple-instance learning (MIL) model, where the instance segments are classified individually and the instance is assigned either the majority label or a label chosen by a conservative voting mechanism.

The drawback of MIL is that it treats each instance segment equally, and this implicit assumption might not always hold: if each segment provides only partial class information, MIL will not capture that correctly. This paper tackles the problem by adding redundant hidden units (layers) to the RBM architecture.


ClassSetRBM - Mutually exclusive hidden units (XOR)

Figure 4: ClassSetRBMXOR architecture

X represents the set of instance segments for each data instance, i.e. |X | =number of segments per training instance.


ClassSetRBM - Mutually exclusive hidden units (XOR) - Continued

In this scenario, all hidden units are mutually exclusive across different segments, i.e.
$$\sum_{s=1}^{|X|} h_j^{(s)} \in \{0, 1\} \quad \forall j = 1, \ldots, H$$
The other distributions are slightly modified to accommodate the redundant hidden layers:
$$P(y = y_i \mid X) = \frac{\exp(-F_{XOR}(X, y_i))}{\sum_{y^*} \exp(-F_{XOR}(X, y^*))}$$
where the free energy is
$$F_{XOR}(X, y) = -d^T \vec{y} - \sum_{j=1}^{H} \mathrm{softplus}\big(\mathrm{softmax}_j(X) + U_{j,y}\big)$$
and
$$\mathrm{softmax}_j(X) = \log \sum_{s=1}^{|X|} \exp\big(c_j + W_j x^{(s)}\big)$$
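These quantities can be sketched in NumPy (a hypothetical sketch: $W$ is $H \times D$, $U$ is $H \times C$, $X$ is a list of segment vectors, and all names are assumptions):

```python
import numpy as np

def softplus(a):
    return np.logaddexp(0.0, a)

def softmax_j(X, W, c):
    # softmax_j(X) = log sum_s exp(c_j + W_j x^(s)), computed stably per j
    acts = np.stack([c + W @ x for x in X])      # shape (|X|, H)
    m = acts.max(axis=0)
    return m + np.log(np.exp(acts - m).sum(axis=0))

def f_xor(X, y, W, c, d, U):
    # F_XOR(X, y) = -d_y - sum_j softplus(softmax_j(X) + U_{j,y})
    return -d[y] - softplus(softmax_j(X, W, c) + U[:, y]).sum()

def p_y_given_X(X, W, c, d, U):
    neg_f = np.array([-f_xor(X, y, W, c, d, U) for y in range(len(d))])
    p = np.exp(neg_f - neg_f.max())
    return p / p.sum()
```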


ClassSetRBM - Redundant Evidence (OR)

Figure 5: ClassSetRBMOR architecture.

In the paper, the authors argue that the assumption of mutual exclusivity over hidden units may be too strong, and they relax it by adding additional "hidden" layer copies G connected only to y and by removing the direct connection from H to y.


Thank You!

Questions?
