TRANSCRIPT
RESTRICTED BOLTZMANN MACHINES AND
DEEP BELIEF NETWORKS ON MULTI-CORE
PROCESSORS
Noel Lopes, Bernardete Ribeiro, Joao Goncalves
University of Coimbra / Polytechnic Institute of Guarda
June 11, 2012, WCCI–IJCNN
DEEP BELIEF NETWORKS (DBNS)
“Deep belief nets are probabilistic generative models that are composed of multiple layers of stochastic latent variables. The latent variables typically have binary values and are often called hidden units or feature detectors. [...] The lower layers receive top-down, directed connections from the layers above. The states of the units in the lowest layer represent a data vector.”
Geoffrey E. Hinton [Hinton et al., 2006]
OUTLINE
Motivation
Deep Belief Networks
Restricted Boltzmann Machines
GPU implementation
Results on the MNIST Handwritten Digits Database
Conclusions and Future Work
MOTIVATION
The robustness and efficiency with which humans recognize objects has long been an intriguing challenge in computational intelligence.
Theoretical results suggest that deep architectures are fundamental for learning complex functions that can represent high-level abstractions (e.g. vision, language) [Bengio, 2009].
Empirical results show their successful application to classification, regression, dimensionality reduction, object recognition, information retrieval, robotics, collaborative filtering, etc. [Larochelle et al., 2007, Swersky et al., 2010].
DEEP VERSUS SHALLOW ARCHITECTURES
[Figure: a deep architecture maps the model inputs (x) through d levels of non-linear operations, from low-order features at level 1 to high-order features at level d, before producing the model outputs (y); a shallow architecture maps the model inputs (x) to the model outputs (y) through a single layer of non-linear operations.]
DEEP BELIEF NETWORKS
DBNs are composed of several Restricted Boltzmann Machines (RBMs) stacked on top of each other.
[Figure: a DBN with the input layer x and stacked hidden layers h1, h2 and h3.]
RESTRICTED BOLTZMANN MACHINES
An RBM is an energy-based generative model that consists of a layer of binary visible units, v, and a layer of binary hidden units, h.
[Figure: a bipartite graph with visible units v_1, v_2, ..., v_I plus a bias unit, fully connected to hidden units h_1, h_2, ..., h_J plus a bias unit; there are no connections within a layer. The visible-to-hidden direction acts as an encoder and the hidden-to-visible direction as a decoder.]
RESTRICTED BOLTZMANN MACHINES
Given an observed state, the energy of the joint configuration of the visible and hidden units (v, h) is given by (1):

E(v,h) = -\sum_{i=1}^{I} a_i v_i - \sum_{j=1}^{J} b_j h_j - \sum_{j=1}^{J} \sum_{i=1}^{I} W_{ji} v_i h_j ,   (1)

where a_i and b_j are the biases of the visible and hidden units and W_{ji} is the weight of the connection between them.
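Equation (1) is cheap to evaluate directly. As a point of reference, a minimal host-side sketch (not from the slides; rbm_energy is an illustrative name), assuming W is stored row-major with W[j * I + i] holding W_ji, as the later slides suggest:

    // Energy of one (v, h) configuration, equation (1).
    // Assumption: W is row-major, W[j * I + i] = W_ji.
    float rbm_energy(const float *v, const float *h,
                     const float *a, const float *b,
                     const float *W, int I, int J)
    {
        float E = 0.0f;
        for (int i = 0; i < I; i++)        // visible bias term
            E -= a[i] * v[i];
        for (int j = 0; j < J; j++) {
            E -= b[j] * h[j];              // hidden bias term
            for (int i = 0; i < I; i++)    // interaction term
                E -= W[j * I + i] * v[i] * h[j];
        }
        return E;
    }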
RESTRICTED BOLTZMANN MACHINES
The RBM defines a joint probability over (v, h):

p(v,h) = \frac{e^{-E(v,h)}}{Z} ,   (2)

where Z is the partition function, obtained by summing the energy of all possible (v, h) configurations:

Z = \sum_{v,h} e^{-E(v,h)} .   (3)
RESTRICTED BOLTZMANN MACHINES
Given a random input configuration v, the state of the hidden unit j is set to 1 with probability:

p(h_j = 1 \mid v) = \sigma\left(b_j + \sum_{i=1}^{I} v_i W_{ji}\right) ,   (4)

Similarly, given a random hidden vector, h, the state of the visible unit i can be set to 1 with probability:

p(v_i = 1 \mid h) = \sigma\left(a_i + \sum_{j=1}^{J} h_j W_{ji}\right) .   (5)
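These conditionals are what the GPU kernels evaluate later on. A minimal host-side sketch of equation (4), assuming the same row-major weight layout (sigma and sample_hidden_unit are illustrative names; the GPU version draws its random numbers from cuRAND rather than rand()):

    #include <cmath>
    #include <cstdlib>

    // Logistic function used by equations (4) and (5).
    static float sigma(float x) { return 1.0f / (1.0f + expf(-x)); }

    // Sample the binary state of hidden unit j from p(h_j = 1 | v), eq. (4).
    static int sample_hidden_unit(int j, const float *v, const float *b,
                                  const float *W, int I)
    {
        float s = b[j];                      // b_j + sum_i v_i W_ji
        for (int i = 0; i < I; i++)
            s += v[i] * W[j * I + i];
        return (rand() / (float)RAND_MAX) < sigma(s) ? 1 : 0;
    }

Equation (5) is symmetric: fix i, use the bias a_i, and sum h_j W_ji over j.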
TRAINING AN RBM
The following learning rule performs stochastic steepest ascent in the log probability of the training data:

\frac{\partial \log p(v)}{\partial W_{ji}} = \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_\infty ,   (6)

where \langle \cdot \rangle_0 denotes the expectation under the data distribution (p_0) and \langle \cdot \rangle_\infty the expectation under the model (equilibrium) distribution.
ALTERNATING GIBBS SAMPLING
[Figure: alternating Gibbs sampling. The chain is started by clamping a training vector on the visible units, v^{(0)} = x, and sampling the hidden states h^{(0)} from p(h_j = 1 | v) (eq. 4); \langle v_i h_j \rangle_0 is measured at this step. Sampling the visible units from p(v_i = 1 | h) (eq. 5) and the hidden units from (4) then alternates, producing v^{(1)}, h^{(1)}, v^{(2)}, h^{(2)}, ...; running the chain to equilibrium yields v^{(∞)}, h^{(∞)}, where \langle v_i h_j \rangle_\infty is measured.]
CONTRASTIVE DIVERGENCE (CD–k)
Hinton proposed the Contrastive Divergence (CD) algorithm.
CD–k replaces \langle \cdot \rangle_\infty by \langle \cdot \rangle_k for small values of k.
CONTRASTIVE DIVERGENCE (CD–k)
v^{(0)} ← x
Compute the binary (feature) states of the hidden units, h^{(0)}, using v^{(0)}
for n ← 1 to k
    Compute the “reconstruction” states of the visible units, v^{(n)}, using h^{(n−1)}
    Compute the “reconstruction” states of the hidden units, h^{(n)}, using v^{(n)}
end for
Update the weights and biases, according to:

\Delta W_{ji} = \gamma (\langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k)   (7)
\Delta b_j = \gamma (\langle h_j \rangle_0 - \langle h_j \rangle_k)   (8)
\Delta a_i = \gamma (\langle v_i \rangle_0 - \langle v_i \rangle_k)   (9)
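Putting equations (4), (5) and (7)–(9) together, a hedged host-side sketch of one CD–1 update for a single training vector x (an illustration, not the authors' GPU code; it keeps real-valued probabilities for the statistics, a common simplification, where the slide's binary feature states would instead threshold them against random numbers):

    #include <cmath>
    #include <vector>

    static float sigmoid(float x) { return 1.0f / (1.0f + expf(-x)); }

    // One CD-1 step for one sample; gamma is the learning rate.
    // Assumption: W is row-major, W[j * I + i] = W_ji.
    void cd1_update(const float *x, float *W, float *a, float *b,
                    int I, int J, float gamma)
    {
        std::vector<float> h0(J), v1(I), h1(J);

        for (int j = 0; j < J; j++) {            // h(0) from the data, eq. (4)
            float s = b[j];
            for (int i = 0; i < I; i++) s += x[i] * W[j * I + i];
            h0[j] = sigmoid(s);
        }
        for (int i = 0; i < I; i++) {            // v(1) reconstruction, eq. (5)
            float s = a[i];
            for (int j = 0; j < J; j++) s += h0[j] * W[j * I + i];
            v1[i] = sigmoid(s);
        }
        for (int j = 0; j < J; j++) {            // h(1) from v(1), eq. (4)
            float s = b[j];
            for (int i = 0; i < I; i++) s += v1[i] * W[j * I + i];
            h1[j] = sigmoid(s);
        }
        for (int j = 0; j < J; j++) {            // eqs. (7) and (8)
            for (int i = 0; i < I; i++)
                W[j * I + i] += gamma * (x[i] * h0[j] - v1[i] * h1[j]);
            b[j] += gamma * (h0[j] - h1[j]);
        }
        for (int i = 0; i < I; i++)              // eq. (9)
            a[i] += gamma * (x[i] - v1[i]);
    }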
DEEP BELIEF NETWORKS (DBN)
[Figure: greedy layer-wise construction of a DBN. A first RBM trained on the input x yields p(h1|x) and p(x|h1); a second RBM trained on h1 yields p(h2|h1) and p(h1|h2); a third RBM trained on h2 yields p(h3|h2) and p(h2|h3). The lower layers capture low-level features, the upper layers high-level features (concepts).]
GPU IMPLEMENTATION
Training a DBN is a computationally expensive task that involves training several RBMs and may require a considerable amount of time.
Solution? A parallel GPU implementation.
CUDA – DEVICE ARCHITECTURE
[Figure: a CUDA device contains N streaming multiprocessors (SM1, SM2, ..., SMN) that share the device memory; each SM contains M processors, an instruction unit, and a fast on-chip shared memory.]
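The figures that matter for the mapping described on the following slides (number of SMs, shared memory per block) can be queried through the CUDA runtime API; a small sketch:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0
        printf("Streaming multiprocessors: %d\n", prop.multiProcessorCount);
        printf("Shared memory per block:   %zu bytes\n", prop.sharedMemPerBlock);
        printf("Max threads per block:     %d\n", prop.maxThreadsPerBlock);
        return 0;
    }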
CUDA – LAUNCHING A KERNEL GRID
[Figure: a kernel launch creates a grid of thread blocks. In the example, a 4×2 grid of blocks, Block(0,0) ... Block(3,1); each block, e.g. Block(3,0), contains a 4×3 arrangement of threads, Thread(0,0) ... Thread(3,2).]
Threads within a block can share information.
However, blocks are required to run independently of each other.
To achieve scalability, the work should therefore be partitioned into independent blocks, as the launch sketch below illustrates.
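For reference, the 4×2 grid of 4×3-thread blocks in the figure corresponds to the following launch configuration (exampleKernel is a placeholder, not one of the RBM kernels):

    __global__ void exampleKernel(float *data, int width)
    {
        // Each thread derives a unique (x, y) position from its block and
        // thread indices and works on its element independently.
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        data[y * width + x] *= 2.0f;             // any per-element work
    }

    void launchExample(float *d_data)
    {
        dim3 grid(4, 2);                         // 4×2 blocks, as in the figure
        dim3 block(4, 3);                        // 4×3 threads per block
        exampleKernel<<<grid, block>>>(d_data, 16); // width = 4 blocks × 4 threads
    }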
CUDA – SCALABILITY
[Figure: the same 4×2 grid of blocks, Block(0,0) ... Block(3,1), executed on two different devices. On a device with 2 SMs, SM 0 runs Block(0,0), Block(1,0), Block(2,0), Block(3,0) and SM 1 runs Block(0,1), Block(1,1), Block(2,1), Block(3,1); on a device with 4 SMs, each SM runs two of the blocks. The same grid therefore scales transparently with the number of SMs.]
KERNELS
Step 1. ComputeStatusHiddenUnits: from the RBM inputs v_data ∈ ℝ^{N×I} (x), the weights w ∈ ℝ^{J×I} and the hidden-unit biases b ∈ ℝ^{J}, compute h_data ∈ ℝ^{N×J} (the RBM outputs for the data).
Step 2. ComputeStatusVisibleUnits: from h_data, w and the visible-unit biases a ∈ ℝ^{I}, compute the reconstructed inputs v_recon ∈ ℝ^{N×I}.
Step 3. ComputeStatusHiddenUnits: from v_recon, compute the reconstructed outputs h_recon ∈ ℝ^{N×J}.
Step 4. CorrectWeights: use v_data, h_data, v_recon and h_recon to correct the weights and biases.
A host-side sketch of this sequence follows.
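A hedged host-side sketch of the four-step sequence for one mini-batch of N samples; the kernel names come from the slides, but the parameter lists and launch geometry shown here are assumptions:

    // Kernel prototypes (signatures assumed, not the authors' exact interfaces).
    __global__ void ComputeStatusHiddenUnits(const float *v, const float *W,
                                             const float *b, float *h,
                                             int I, int J);
    __global__ void ComputeStatusVisibleUnits(const float *h, const float *W,
                                              const float *a, float *v,
                                              int I, int J);
    __global__ void CorrectWeights(const float *v0, const float *h0,
                                   const float *vk, const float *hk,
                                   float *W, float *a, float *b,
                                   int N, int I, int J, float gamma);

    // One CD-1 mini-batch: steps 1-4 launched in sequence on the device.
    void trainBatch(float *v_data, float *h_data, float *v_recon, float *h_recon,
                    float *W, float *a, float *b,
                    int N, int I, int J, float gamma)
    {
        dim3 hGrid(J, N);  // one block per (hidden unit, sample), I threads each
        dim3 vGrid(I, N);  // one block per (visible unit, sample), J threads each

        ComputeStatusHiddenUnits<<<hGrid, I, I * sizeof(float)>>>
            (v_data, W, b, h_data, I, J);                           // Step 1
        ComputeStatusVisibleUnits<<<vGrid, J, J * sizeof(float)>>>
            (h_data, W, a, v_recon, I, J);                          // Step 2
        ComputeStatusHiddenUnits<<<hGrid, I, I * sizeof(float)>>>
            (v_recon, W, b, h_recon, I, J);                         // Step 3

        dim3 wGrid((I + 15) / 16, (J + 15) / 16), wBlock(16, 16);
        CorrectWeights<<<wGrid, wBlock>>>(v_data, h_data, v_recon, h_recon,
                                          W, a, b, N, I, J, gamma); // Step 4
    }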
COMPUTESTATUSHIDDENUNITS AND COMPUTESTATUSVISIBLEUNITS KERNELS
Each thread represents a connection: it multiplies the clamped input by the corresponding weight and stores the product in shared memory.
Each block represents a neuron: it uses the fast shared memory to sum up the values computed by its threads.
[Figure: one block (a neuron) containing the threads for connection 1, connection 2, connection 3, ..., connection J.]
A sketch of the hidden-unit kernel under this mapping follows.
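A hedged sketch of ComputeStatusHiddenUnits under this mapping: one block per (hidden unit, sample) pair, one thread per connection, with a tree reduction in shared memory. It writes the probability of equation (4); producing the stochastic binary state would additionally compare it against a cuRAND number. blockDim.x is assumed to be a power of two (launching with 1024 threads and the i < I guard handles MNIST's 784 inputs):

    __global__ void ComputeStatusHiddenUnits(const float *v, const float *W,
                                             const float *b, float *h,
                                             int I, int J)
    {
        extern __shared__ float partial[];   // one slot per thread (connection)
        int i = threadIdx.x;                 // connection / visible unit index
        int j = blockIdx.x;                  // hidden unit index
        int n = blockIdx.y;                  // sample index

        // Each thread multiplies the clamped input by its weight and stores
        // the product in shared memory.
        partial[i] = (i < I) ? v[n * I + i] * W[j * I + i] : 0.0f;
        __syncthreads();

        // The block sums the products with a tree reduction in shared memory
        // (blockDim.x assumed to be a power of two).
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (i < stride) partial[i] += partial[i + stride];
            __syncthreads();
        }

        if (i == 0)                          // p(h_j = 1 | v), equation (4)
            h[n * J + j] = 1.0f / (1.0f + expf(-(b[j] + partial[0])));
    }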
STORING THE CONNECTION WEIGHTS
ComputeStatusHiddenUnits – coalesced access
ComputeStatusVisibleUnits – uncoalesced access
[Figure: the weight matrix stored row by row, from w_11 w_12 ... w_1I down to w_J1 w_J2 ... w_JI. ComputeStatusHiddenUnits reads a full row (e.g. w_31, w_32, ..., w_3I), i.e. consecutive addresses; ComputeStatusVisibleUnits reads a full column (e.g. w_13, w_23, ..., w_J3), i.e. addresses I elements apart.]
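The difference comes down to the stride that adjacent threads in a warp see. A sketch of the two per-thread products under the row-major layout (hiddenDotTerm and visibleDotTerm are illustrative names):

    // Hidden-unit kernel: j is fixed per block, so adjacent threads i, i+1
    // read adjacent addresses of row j (coalesced).
    __device__ float hiddenDotTerm(const float *W, const float *v, int j, int I)
    {
        int i = threadIdx.x;
        return v[i] * W[j * I + i];          // stride 1 across the warp
    }

    // Visible-unit kernel: i is fixed per block, so adjacent threads j, j+1
    // read addresses I elements apart down column i (uncoalesced).
    __device__ float visibleDotTerm(const float *W, const float *h, int i, int I)
    {
        int j = threadIdx.x;
        return h[j] * W[j * I + i];          // stride I across the warp
    }

One possible remedy, at the cost of extra memory and bookkeeping, would be to keep a second, transposed copy of W so that both kernels read rows.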
CORRECTWEIGHTS KERNEL
FIRST APPROACH
Each thread gathers and sums up the values for one or more samples.
Each block corrects the weight of one connection.
[Figure: one block (a connection) containing the threads for sample 1, sample 2, sample 3, ..., sample N.]
PROBLEMS?
The weight update \Delta W_{ji} = \gamma(\langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k) needs v_i and h_j, and the bias updates \Delta b_j = \gamma(\langle h_j \rangle_0 - \langle h_j \rangle_k) and \Delta a_i = \gamma(\langle v_i \rangle_0 - \langle v_i \rangle_k) need them again: with one block per connection, the same v_i and h_j values are fetched from global memory over and over by the many blocks that share them.
CORRECTWEIGHTS KERNEL
IMPROVED APPROACH
Each block has 16 × 16 threads.
Each thread within a block must now process all the samples.
[Figure: block 0 covering a 16×16 tile of connections, from connection (0,0) to connection (15,15).]
CORRECTWEIGHTS KERNEL
IMPROVED APPROACH
But we can now access the v_i and h_j variables in a coalesced way and store them in shared memory for faster access.
Although this new approach uses a smaller number of blocks, it performs much better than our first approach (≈ 15× faster); a sketch follows.
[Figure: the same 16×16 tile of connections as on the previous slide.]
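A hedged sketch of the improved CorrectWeights kernel: each 16×16 block owns a 16×16 tile of connections, every thread accumulates the statistics of equation (7) for its connection over all N samples, and the v and h values are staged in shared memory so that the global reads are coalesced. The bias updates (8)–(9) and edge tiles are omitted for brevity, and I and J are assumed to be multiples of 16:

    #define TILE 16

    __global__ void CorrectWeights(const float *v0, const float *h0,
                                   const float *vk, const float *hk,
                                   float *W, int N, int I, int J, float gamma)
    {
        __shared__ float v0s[TILE], vks[TILE], h0s[TILE], hks[TILE];

        int i = blockIdx.x * TILE + threadIdx.x;  // visible index of connection
        int j = blockIdx.y * TILE + threadIdx.y;  // hidden index of connection
        float delta = 0.0f;

        for (int n = 0; n < N; n++) {             // every thread sees all samples
            if (threadIdx.y == 0) {               // stage this sample's v tile
                v0s[threadIdx.x] = v0[n * I + i];
                vks[threadIdx.x] = vk[n * I + i];
            }
            if (threadIdx.y == 1) {               // stage this sample's h tile
                h0s[threadIdx.x] = h0[n * J + blockIdx.y * TILE + threadIdx.x];
                hks[threadIdx.x] = hk[n * J + blockIdx.y * TILE + threadIdx.x];
            }
            __syncthreads();
            delta += v0s[threadIdx.x] * h0s[threadIdx.y]   // <v_i h_j>_0 part
                   - vks[threadIdx.x] * hks[threadIdx.y];  // <v_i h_j>_k part
            __syncthreads();
        }
        // Equation (7): gamma times the difference of the batch means.
        W[j * I + i] += gamma * delta / N;
    }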
TIME SPENT IN EACH TASK
FIRST AND SECOND APPROACHES
[Figure: share of training time per task, first vs. improved approach.]

Task                                First approach   Improved approach
Generate random numbers (cuRAND)          5.53%           14.86%
ComputeStatusHiddenUnits kernel          10.24%           27.50%
ComputeStatusVisibleUnits kernel         17.09%           45.91%
CorrectWeights kernel                    67.14%           11.73%
EXPERIMENTAL SETUP
We tested our approach on the MNIST database.
Each sample is a 28 × 28 pixel image of a handwritten digit (784 inputs).
Hardware:
CPU: Intel dual-core i5-2410M (8 GB memory)
GPU: NVIDIA GeForce GTX 460

NVIDIA GEFORCE GTX 460
Number of streaming multiprocessors        7
Number of cores                          336
Peak performance (GFLOPS)              940.8
Device memory (GB)                         1
Memory bandwidth (GB/s)                112.5
Shader clock speed (GHz)                 1.4
RESULTS (1,000 SAMPLES)
[Figure: time per epoch in seconds (log scale, 0.01 s to 100 s) versus the number of hidden units (0 to 900) for the GTX 460 (GPU) and the dual-core i5 (CPU), with 1,000 training samples. Measured GPU speedups: 23.26×, 23.13×, 21.86×, 24.46×, 29.79×.]
RESULTS (10,000 SAMPLES)
[Figure: time per epoch in seconds (log scale, 0.1 s to 1000 s) versus the number of hidden units (0 to 900) for the GTX 460 (GPU) and the dual-core i5 (CPU), with 10,000 training samples. Measured GPU speedups: 32.83×, 30.29×, 28.59×, 29.47×, 38.16×.]
RESULTS (60,000 SAMPLES)
[Figure: time per epoch in seconds (log scale, 1 s to 10,000 s) versus the number of hidden units (0 to 900) for the GTX 460 (GPU) and the dual-core i5 (CPU), with 60,000 training samples. Measured GPU speedups: 42.73×, 43.46×, 38.64×, 41.83×, 46.07×.]
ADAPTIVE STEP SIZES
[Figure: training images alongside their reconstructions after 10, 100, 250, 500, 750 and 1000 epochs, comparing adaptive step sizes with a fixed (optimized) learning rate of 0.1.]
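The slides show the effect of adaptive step sizes but not the update rule itself. A common per-connection scheme, given here purely as an illustrative assumption (in the spirit of delta-bar-delta; u and d are made-up factors), grows a weight's individual step size while successive gradient estimates agree in sign and shrinks it when they flip:

    // Per-connection step-size adaptation (illustrative, not the authors' rule).
    void adaptStep(float grad, float *lastGrad, float *step)
    {
        const float u = 1.2f, d = 0.8f;     // assumed growth/decay factors
        *step *= (grad * *lastGrad > 0.0f) ? u : d;
        *lastGrad = grad;
    }

    // Usage, replacing the fixed gamma of equation (7) for each weight:
    //   adaptStep(g, &lastGrad[j * I + i], &step[j * I + i]);
    //   W[j * I + i] += step[j * I + i] * g;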
CONCLUSIONS AND FUTURE WORK
Creating a Deep Belief Network (DBN) model is a time-consuming and computationally expensive task that involves training several Restricted Boltzmann Machines (RBMs) and demands considerable effort.
Our work has demonstrated that a non-trivial GPU implementation attains significant speedups.
The adaptive step-size procedure for tuning the learning rate has been incorporated into the learning model with excellent results.
Future work will test this approach on other databases, in particular on real-world problems.
REFERENCES

Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127.

Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554.

Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning (ICML 2007), pages 473–480. ACM.

Swersky, K., Chen, B., Marlin, B., and de Freitas, N. (2010). A tutorial on stochastic approximation algorithms for training Restricted Boltzmann Machines and deep belief nets. In Information Theory and Applications Workshop.