TRANSCRIPT
RESTRICTED BOLTZMANN MACHINES AND
DEEP BELIEF NETWORKS ON MULTI-CORE
PROCESSORS
Noel Lopes, Bernardete Ribeiro, Joao Goncalves
University of Coimbra / Polytechnic Institute of Guarda
June 11, 2012, WCCI–IJCNN
DEEP BELIEF NETWORKS (DBNS)
“Deep belief nets are probabilistic generative models that are composed of multiple layers of stochastic latent variables. The latent variables typically have binary values and are often called hidden units or feature detectors. [...] The lower layers receive top-down, directed connections from the layers above. The states of the units in the lowest layer represent a data vector.”
Geoffrey E. Hinton [Hinton et al., 2006]
OUTLINE
Motivation
Deep Belief Networks
Restricted Boltzmann Machines
GPU implementation
Results on the MNIST Handwritten Digits Database
Conclusions and Future Work
MOTIVATION
The robustness and efficiency with which humans recognize objects has long been an intriguing challenge in computational intelligence.
Theoretical results suggest that deep architectures are fundamental for learning complex functions that can represent high-level abstractions (e.g. vision, language) [Bengio, 2009].
Empirical results show their successful application to classification, regression, dimensionality reduction, object recognition, information retrieval, robotics, collaborative filtering, etc. [Larochelle et al., 2007, Swersky et al., 2010].
DEEP VERSUS SHALLOW ARCHITECTURES
[Figure: a deep architecture maps the model inputs (x) through d levels of non-linear operations, from low-order features at level 1 to high-order features at level d, before producing the model outputs (y); a shallow architecture maps the model inputs (x) to the model outputs (y) through a single layer of non-linear operations.]
DEEP BELIEF NETWORKS
DBNs are composed of several Restricted Boltzmann Machines (RBMs) stacked on top of each other.
[Figure: a DBN with the input layer x and stacked hidden layers h1, h2 and h3.]
RESTRICTED BOLTZMANN MACHINES
An RBM is an energy-based generative model that consists of a layer of binary visible units, v, and a layer of binary hidden units, h.
[Figure: a bipartite graph with visible units v_1, v_2, ..., v_I plus a bias unit, fully connected to hidden units h_1, h_2, ..., h_J plus a bias unit; there are no connections within a layer. The visible-to-hidden direction acts as an encoder and the hidden-to-visible direction as a decoder.]
RESTRICTED BOLTZMANN MACHINES
Given an observed state, the energy of the joint configuration of the visible and hidden units (v, h) is given by (1):

E(v,h) = -\sum_{i=1}^{I} a_i v_i - \sum_{j=1}^{J} b_j h_j - \sum_{j=1}^{J} \sum_{i=1}^{I} W_{ji} v_i h_j ,   (1)

where a_i and b_j are the biases of the visible and hidden units and W_{ji} is the weight of the connection between them.
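Equation (1) is cheap to evaluate directly. As a point of reference, a minimal host-side sketch (not from the slides; rbm_energy is an illustrative name), assuming W is stored row-major with W[j * I + i] holding W_ji, as the later slides suggest:

    // Energy of one (v, h) configuration, equation (1).
    // Assumption: W is row-major, W[j * I + i] = W_ji.
    float rbm_energy(const float *v, const float *h,
                     const float *a, const float *b,
                     const float *W, int I, int J)
    {
        float E = 0.0f;
        for (int i = 0; i < I; i++)        // visible bias term
            E -= a[i] * v[i];
        for (int j = 0; j < J; j++) {
            E -= b[j] * h[j];              // hidden bias term
            for (int i = 0; i < I; i++)    // interaction term
                E -= W[j * I + i] * v[i] * h[j];
        }
        return E;
    }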
RESTRICTED BOLTZMANN MACHINES
The RBM defines a joint probability over (v, h):

p(v,h) = \frac{e^{-E(v,h)}}{Z} ,   (2)

where Z is the partition function, obtained by summing the energy of all possible (v, h) configurations:

Z = \sum_{v,h} e^{-E(v,h)} .   (3)
RESTRICTED BOLTZMANN MACHINES
Given a random input configuration v, the state of the hidden unit j is set to 1 with probability:

p(h_j = 1 \mid v) = \sigma\left(b_j + \sum_{i=1}^{I} v_i W_{ji}\right) ,   (4)

Similarly, given a random hidden vector, h, the state of the visible unit i can be set to 1 with probability:

p(v_i = 1 \mid h) = \sigma\left(a_i + \sum_{j=1}^{J} h_j W_{ji}\right) .   (5)
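These conditionals are what the GPU kernels evaluate later on. A minimal host-side sketch of equation (4), assuming the same row-major weight layout (sigma and sample_hidden_unit are illustrative names; the GPU version draws its random numbers from cuRAND rather than rand()):

    #include <cmath>
    #include <cstdlib>

    // Logistic function used by equations (4) and (5).
    static float sigma(float x) { return 1.0f / (1.0f + expf(-x)); }

    // Sample the binary state of hidden unit j from p(h_j = 1 | v), eq. (4).
    static int sample_hidden_unit(int j, const float *v, const float *b,
                                  const float *W, int I)
    {
        float s = b[j];                      // b_j + sum_i v_i W_ji
        for (int i = 0; i < I; i++)
            s += v[i] * W[j * I + i];
        return (rand() / (float)RAND_MAX) < sigma(s) ? 1 : 0;
    }

Equation (5) is symmetric: fix i, use the bias a_i, and sum h_j W_ji over j.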
TRAINING AN RBM
The following learning rule performs stochastic steepest ascent in the log probability of the training data:

\frac{\partial \log p(v)}{\partial W_{ji}} = \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_\infty ,   (6)

where \langle \cdot \rangle_0 denotes the expectation under the data distribution (p_0) and \langle \cdot \rangle_\infty the expectation under the model (equilibrium) distribution.
ALTERNATING GIBBS SAMPLING
[Figure: alternating Gibbs sampling. The chain is started by clamping a training vector on the visible units, v^{(0)} = x, and sampling the hidden states h^{(0)} from p(h_j = 1 | v) (eq. 4); \langle v_i h_j \rangle_0 is measured at this step. Sampling the visible units from p(v_i = 1 | h) (eq. 5) and the hidden units from (4) then alternates, producing v^{(1)}, h^{(1)}, v^{(2)}, h^{(2)}, ...; running the chain to equilibrium yields v^{(∞)}, h^{(∞)}, where \langle v_i h_j \rangle_\infty is measured.]
CONTRASTIVE DIVERGENCE (CD–k)
Hinton proposed the Contrastive Divergence (CD) algorithm.
CD–k replaces \langle \cdot \rangle_\infty by \langle \cdot \rangle_k for small values of k.
CONTRASTIVE DIVERGENCE (CD–k)
v^{(0)} ← x
Compute the binary (feature) states of the hidden units, h^{(0)}, using v^{(0)}
for n ← 1 to k
    Compute the “reconstruction” states of the visible units, v^{(n)}, using h^{(n−1)}
    Compute the “reconstruction” states of the hidden units, h^{(n)}, using v^{(n)}
end for
Update the weights and biases, according to:

\Delta W_{ji} = \gamma (\langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k)   (7)
\Delta b_j = \gamma (\langle h_j \rangle_0 - \langle h_j \rangle_k)   (8)
\Delta a_i = \gamma (\langle v_i \rangle_0 - \langle v_i \rangle_k)   (9)
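Putting equations (4), (5) and (7)–(9) together, a hedged host-side sketch of one CD–1 update for a single training vector x (an illustration, not the authors' GPU code; it keeps real-valued probabilities for the statistics, a common simplification, where the slide's binary feature states would instead threshold them against random numbers):

    #include <cmath>
    #include <vector>

    static float sigmoid(float x) { return 1.0f / (1.0f + expf(-x)); }

    // One CD-1 step for one sample; gamma is the learning rate.
    // Assumption: W is row-major, W[j * I + i] = W_ji.
    void cd1_update(const float *x, float *W, float *a, float *b,
                    int I, int J, float gamma)
    {
        std::vector<float> h0(J), v1(I), h1(J);

        for (int j = 0; j < J; j++) {            // h(0) from the data, eq. (4)
            float s = b[j];
            for (int i = 0; i < I; i++) s += x[i] * W[j * I + i];
            h0[j] = sigmoid(s);
        }
        for (int i = 0; i < I; i++) {            // v(1) reconstruction, eq. (5)
            float s = a[i];
            for (int j = 0; j < J; j++) s += h0[j] * W[j * I + i];
            v1[i] = sigmoid(s);
        }
        for (int j = 0; j < J; j++) {            // h(1) from v(1), eq. (4)
            float s = b[j];
            for (int i = 0; i < I; i++) s += v1[i] * W[j * I + i];
            h1[j] = sigmoid(s);
        }
        for (int j = 0; j < J; j++) {            // eqs. (7) and (8)
            for (int i = 0; i < I; i++)
                W[j * I + i] += gamma * (x[i] * h0[j] - v1[i] * h1[j]);
            b[j] += gamma * (h0[j] - h1[j]);
        }
        for (int i = 0; i < I; i++)              // eq. (9)
            a[i] += gamma * (x[i] - v1[i]);
    }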
DEEP BELIEF NETWORKS (DBN)
[Figure: greedy layer-wise construction of a DBN. A first RBM trained on the input x yields p(h1|x) and p(x|h1); a second RBM trained on h1 yields p(h2|h1) and p(h1|h2); a third RBM trained on h2 yields p(h3|h2) and p(h2|h3). The lower layers capture low-level features, the upper layers high-level features (concepts).]
GPU IMPLEMENTATION
Training a DBN is a computationally expensive task that involves training several RBMs and may require a considerable amount of time.
Solution? A parallel GPU implementation.
CUDA – DEVICE ARCHITECTURE
[Figure: a CUDA device contains N streaming multiprocessors (SM1, SM2, ..., SMN) that share the device memory; each SM contains M processors, an instruction unit, and a fast on-chip shared memory.]
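The figures that matter for the mapping described on the following slides (number of SMs, shared memory per block) can be queried through the CUDA runtime API; a small sketch:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0
        printf("Streaming multiprocessors: %d\n", prop.multiProcessorCount);
        printf("Shared memory per block:   %zu bytes\n", prop.sharedMemPerBlock);
        printf("Max threads per block:     %d\n", prop.maxThreadsPerBlock);
        return 0;
    }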
CUDA – LAUNCHING A KERNEL GRID
[Figure: a kernel launch creates a grid of thread blocks. In the example, a 4×2 grid of blocks, Block(0,0) ... Block(3,1); each block, e.g. Block(3,0), contains a 4×3 arrangement of threads, Thread(0,0) ... Thread(3,2).]
Threads within a block can share information.
However, blocks are required to run independently of each other.
To achieve scalability, the work should therefore be partitioned into independent blocks, as the launch sketch below illustrates.
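For reference, the 4×2 grid of 4×3-thread blocks in the figure corresponds to the following launch configuration (exampleKernel is a placeholder, not one of the RBM kernels):

    __global__ void exampleKernel(float *data, int width)
    {
        // Each thread derives a unique (x, y) position from its block and
        // thread indices and works on its element independently.
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        data[y * width + x] *= 2.0f;             // any per-element work
    }

    void launchExample(float *d_data)
    {
        dim3 grid(4, 2);                         // 4×2 blocks, as in the figure
        dim3 block(4, 3);                        // 4×3 threads per block
        exampleKernel<<<grid, block>>>(d_data, 16); // width = 4 blocks × 4 threads
    }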
CUDA – SCALABILITY
[Figure: the same 4×2 grid of blocks, Block(0,0) ... Block(3,1), executed on two different devices. On a device with 2 SMs, SM 0 runs Block(0,0), Block(1,0), Block(2,0), Block(3,0) and SM 1 runs Block(0,1), Block(1,1), Block(2,1), Block(3,1); on a device with 4 SMs, each SM runs two of the blocks. The same grid therefore scales transparently with the number of SMs.]
KERNELS
Step 1. ComputeStatusHiddenUnits: from the RBM inputs v_data ∈ ℝ^{N×I} (x), the weights w ∈ ℝ^{J×I} and the hidden-unit biases b ∈ ℝ^{J}, compute h_data ∈ ℝ^{N×J} (the RBM outputs for the data).
Step 2. ComputeStatusVisibleUnits: from h_data, w and the visible-unit biases a ∈ ℝ^{I}, compute the reconstructed inputs v_recon ∈ ℝ^{N×I}.
Step 3. ComputeStatusHiddenUnits: from v_recon, compute the reconstructed outputs h_recon ∈ ℝ^{N×J}.
Step 4. CorrectWeights: use v_data, h_data, v_recon and h_recon to correct the weights and biases.
A host-side sketch of this sequence follows.
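A hedged host-side sketch of the four-step sequence for one mini-batch of N samples; the kernel names come from the slides, but the parameter lists and launch geometry shown here are assumptions:

    // Kernel prototypes (signatures assumed, not the authors' exact interfaces).
    __global__ void ComputeStatusHiddenUnits(const float *v, const float *W,
                                             const float *b, float *h,
                                             int I, int J);
    __global__ void ComputeStatusVisibleUnits(const float *h, const float *W,
                                              const float *a, float *v,
                                              int I, int J);
    __global__ void CorrectWeights(const float *v0, const float *h0,
                                   const float *vk, const float *hk,
                                   float *W, float *a, float *b,
                                   int N, int I, int J, float gamma);

    // One CD-1 mini-batch: steps 1-4 launched in sequence on the device.
    void trainBatch(float *v_data, float *h_data, float *v_recon, float *h_recon,
                    float *W, float *a, float *b,
                    int N, int I, int J, float gamma)
    {
        dim3 hGrid(J, N);  // one block per (hidden unit, sample), I threads each
        dim3 vGrid(I, N);  // one block per (visible unit, sample), J threads each

        ComputeStatusHiddenUnits<<<hGrid, I, I * sizeof(float)>>>
            (v_data, W, b, h_data, I, J);                           // Step 1
        ComputeStatusVisibleUnits<<<vGrid, J, J * sizeof(float)>>>
            (h_data, W, a, v_recon, I, J);                          // Step 2
        ComputeStatusHiddenUnits<<<hGrid, I, I * sizeof(float)>>>
            (v_recon, W, b, h_recon, I, J);                         // Step 3

        dim3 wGrid((I + 15) / 16, (J + 15) / 16), wBlock(16, 16);
        CorrectWeights<<<wGrid, wBlock>>>(v_data, h_data, v_recon, h_recon,
                                          W, a, b, N, I, J, gamma); // Step 4
    }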
COMPUTESTATUSHIDDENUNITS AND COMPUTESTATUSVISIBLEUNITS KERNELS
Each thread represents a connection: it multiplies the clamped input by the corresponding weight and stores the product in shared memory.
Each block represents a neuron: it uses the fast shared memory to sum up the values computed by its threads.
[Figure: one block (a neuron) containing the threads for connection 1, connection 2, connection 3, ..., connection J.]
A sketch of the hidden-unit kernel under this mapping follows.
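A hedged sketch of ComputeStatusHiddenUnits under this mapping: one block per (hidden unit, sample) pair, one thread per connection, with a tree reduction in shared memory. It writes the probability of equation (4); producing the stochastic binary state would additionally compare it against a cuRAND number. blockDim.x is assumed to be a power of two (launching with 1024 threads and the i < I guard handles MNIST's 784 inputs):

    __global__ void ComputeStatusHiddenUnits(const float *v, const float *W,
                                             const float *b, float *h,
                                             int I, int J)
    {
        extern __shared__ float partial[];   // one slot per thread (connection)
        int i = threadIdx.x;                 // connection / visible unit index
        int j = blockIdx.x;                  // hidden unit index
        int n = blockIdx.y;                  // sample index

        // Each thread multiplies the clamped input by its weight and stores
        // the product in shared memory.
        partial[i] = (i < I) ? v[n * I + i] * W[j * I + i] : 0.0f;
        __syncthreads();

        // The block sums the products with a tree reduction in shared memory
        // (blockDim.x assumed to be a power of two).
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (i < stride) partial[i] += partial[i + stride];
            __syncthreads();
        }

        if (i == 0)                          // p(h_j = 1 | v), equation (4)
            h[n * J + j] = 1.0f / (1.0f + expf(-(b[j] + partial[0])));
    }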
STORING THE CONNECTION WEIGHTS
ComputeStatusHiddenUnits – coalesced access
ComputeStatusVisibleUnits – uncoalesced access
[Figure: the weight matrix stored row by row, from w_11 w_12 ... w_1I down to w_J1 w_J2 ... w_JI. ComputeStatusHiddenUnits reads a full row (e.g. w_31, w_32, ..., w_3I), i.e. consecutive addresses; ComputeStatusVisibleUnits reads a full column (e.g. w_13, w_23, ..., w_J3), i.e. addresses I elements apart.]
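The difference comes down to the stride that adjacent threads in a warp see. A sketch of the two per-thread products under the row-major layout (hiddenDotTerm and visibleDotTerm are illustrative names):

    // Hidden-unit kernel: j is fixed per block, so adjacent threads i, i+1
    // read adjacent addresses of row j (coalesced).
    __device__ float hiddenDotTerm(const float *W, const float *v, int j, int I)
    {
        int i = threadIdx.x;
        return v[i] * W[j * I + i];          // stride 1 across the warp
    }

    // Visible-unit kernel: i is fixed per block, so adjacent threads j, j+1
    // read addresses I elements apart down column i (uncoalesced).
    __device__ float visibleDotTerm(const float *W, const float *h, int i, int I)
    {
        int j = threadIdx.x;
        return h[j] * W[j * I + i];          // stride I across the warp
    }

One possible remedy, at the cost of extra memory and bookkeeping, would be to keep a second, transposed copy of W so that both kernels read rows.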
CORRECTWEIGHTS KERNEL
FIRST APPROACH
Each thread gathers and sums up the values for one or more samples.
Each block corrects the weight of one connection.
[Figure: one block (a connection) containing the threads for sample 1, sample 2, sample 3, ..., sample N.]
PROBLEMS?
The weight update \Delta W_{ji} = \gamma(\langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k) needs v_i and h_j, and the bias updates \Delta b_j = \gamma(\langle h_j \rangle_0 - \langle h_j \rangle_k) and \Delta a_i = \gamma(\langle v_i \rangle_0 - \langle v_i \rangle_k) need them again: with one block per connection, the same v_i and h_j values are fetched from global memory over and over by the many blocks that share them.
CORRECTWEIGHTS KERNEL
IMPROVED APPROACH
Each block has 16 × 16 threads.
Each thread within a block must now process all the samples.
[Figure: block 0 covering a 16×16 tile of connections, from connection (0,0) to connection (15,15).]
CORRECTWEIGHTS KERNEL
IMPROVED APPROACH
But we can now access the v_i and h_j variables in a coalesced way and store them in shared memory for faster access.
Although this new approach uses a smaller number of blocks, it performs much better than our first approach (≈ 15× faster); a sketch follows.
[Figure: the same 16×16 tile of connections as on the previous slide.]
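A hedged sketch of the improved CorrectWeights kernel: each 16×16 block owns a 16×16 tile of connections, every thread accumulates the statistics of equation (7) for its connection over all N samples, and the v and h values are staged in shared memory so that the global reads are coalesced. The bias updates (8)–(9) and edge tiles are omitted for brevity, and I and J are assumed to be multiples of 16:

    #define TILE 16

    __global__ void CorrectWeights(const float *v0, const float *h0,
                                   const float *vk, const float *hk,
                                   float *W, int N, int I, int J, float gamma)
    {
        __shared__ float v0s[TILE], vks[TILE], h0s[TILE], hks[TILE];

        int i = blockIdx.x * TILE + threadIdx.x;  // visible index of connection
        int j = blockIdx.y * TILE + threadIdx.y;  // hidden index of connection
        float delta = 0.0f;

        for (int n = 0; n < N; n++) {             // every thread sees all samples
            if (threadIdx.y == 0) {               // stage this sample's v tile
                v0s[threadIdx.x] = v0[n * I + i];
                vks[threadIdx.x] = vk[n * I + i];
            }
            if (threadIdx.y == 1) {               // stage this sample's h tile
                h0s[threadIdx.x] = h0[n * J + blockIdx.y * TILE + threadIdx.x];
                hks[threadIdx.x] = hk[n * J + blockIdx.y * TILE + threadIdx.x];
            }
            __syncthreads();
            delta += v0s[threadIdx.x] * h0s[threadIdx.y]   // <v_i h_j>_0 part
                   - vks[threadIdx.x] * hks[threadIdx.y];  // <v_i h_j>_k part
            __syncthreads();
        }
        // Equation (7): gamma times the difference of the batch means.
        W[j * I + i] += gamma * delta / N;
    }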
TIME SPENT IN EACH TASK
FIRST AND SECOND APPROACHES
[Figure: share of training time per task, first vs. improved approach.]

Task                                First approach   Improved approach
Generate random numbers (cuRAND)          5.53%           14.86%
ComputeStatusHiddenUnits kernel          10.24%           27.50%
ComputeStatusVisibleUnits kernel         17.09%           45.91%
CorrectWeights kernel                    67.14%           11.73%
EXPERIMENTAL SETUP
We tested our approach on the MNIST database.
Each sample is a 28 × 28 pixel image of a handwritten digit (784 inputs).
Hardware:
CPU: Intel dual-core i5-2410M (8 GB memory)
GPU: NVIDIA GeForce GTX 460

NVIDIA GEFORCE GTX 460
Number of streaming multiprocessors        7
Number of cores                          336
Peak performance (GFLOPS)              940.8
Device memory (GB)                         1
Memory bandwidth (GB/s)                112.5
Shader clock speed (GHz)                 1.4
RESULTS (1,000 SAMPLES)
[Figure: time per epoch in seconds (log scale, 0.01 s to 100 s) versus the number of hidden units (0 to 900) for the GTX 460 (GPU) and the dual-core i5 (CPU), with 1,000 training samples. Measured GPU speedups: 23.26×, 23.13×, 21.86×, 24.46×, 29.79×.]
RESULTS (10,000 SAMPLES)
[Figure: time per epoch in seconds (log scale, 0.1 s to 1000 s) versus the number of hidden units (0 to 900) for the GTX 460 (GPU) and the dual-core i5 (CPU), with 10,000 training samples. Measured GPU speedups: 32.83×, 30.29×, 28.59×, 29.47×, 38.16×.]
RESULTS (60,000 SAMPLES)
[Figure: time per epoch in seconds (log scale, 1 s to 10,000 s) versus the number of hidden units (0 to 900) for the GTX 460 (GPU) and the dual-core i5 (CPU), with 60,000 training samples. Measured GPU speedups: 42.73×, 43.46×, 38.64×, 41.83×, 46.07×.]
ADAPTIVE STEP SIZES
[Figure: training images alongside their reconstructions after 10, 100, 250, 500, 750 and 1000 epochs, comparing adaptive step sizes with a fixed (optimized) learning rate of 0.1.]
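The slides show the effect of adaptive step sizes but not the update rule itself. A common per-connection scheme, given here purely as an illustrative assumption (in the spirit of delta-bar-delta; u and d are made-up factors), grows a weight's individual step size while successive gradient estimates agree in sign and shrinks it when they flip:

    // Per-connection step-size adaptation (illustrative, not the authors' rule).
    void adaptStep(float grad, float *lastGrad, float *step)
    {
        const float u = 1.2f, d = 0.8f;     // assumed growth/decay factors
        *step *= (grad * *lastGrad > 0.0f) ? u : d;
        *lastGrad = grad;
    }

    // Usage, replacing the fixed gamma of equation (7) for each weight:
    //   adaptStep(g, &lastGrad[j * I + i], &step[j * I + i]);
    //   W[j * I + i] += step[j * I + i] * g;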
CONCLUSIONS AND FUTURE WORK
Creating a Deep Belief Network (DBN) model is a time-consuming and computationally expensive task that involves training several Restricted Boltzmann Machines (RBMs) and demands considerable effort.
Our work has demonstrated that a non-trivial GPU implementation attains significant speedups.
The adaptive step-size procedure for tuning the learning rate has been incorporated into the learning model with excellent results.
Future work will test this approach on other databases, in particular on real-world problems.
REFERENCES

Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127.

Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554.

Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning (ICML 2007), pages 473–480. ACM.

Swersky, K., Chen, B., Marlin, B., and de Freitas, N. (2010). A tutorial on stochastic approximation algorithms for training Restricted Boltzmann Machines and deep belief nets. In Information Theory and Applications Workshop.