
Large-scale Deep Unsupervised Learning using Graphics Processors

Rajat Raina, Anand Madhavan, Andrew Y. Ng
Stanford University

Learning from unlabeled data

Classify: car vs. motorcycle.
Learn a higher-level representation from unlabeled examples (input space → higher-level representation), for example with Deep Belief Networks or Sparse Coding.

The promise of unsupervised learning

Use large amounts of unlabeled data to learn complex/deep models, possibly with many parameters.

Some recent work on DBNs

Published source          | Domain                | Number of free parameters
Hinton et al.             | Handwritten digits    | 1.6 million
Hinton & Salakhutdinov    | Face images           | 3 million
Salakhutdinov & Hinton    | Information retrieval | 2.6 million
Ranzato & Szummer         | Text documents        | 3.6 million
Our DBN model             | Images                | 100 million

(Similar situation for sparse coding.)

Large-scale learning [Banko & Brill, 2001]

Large-scale unsupervised learning

Current models: 1000s of input dimensions, 1000s of hidden units; about 10^6 parameters.
Our desired model: 10^8 parameters.

Graphics Processors

[Diagram: a motherboard with RAM, CPU, and a graphics card (GPU).]

Why graphics processors?

[Chart: peak Gflops (billion operations/sec), 2003-2008. NVIDIA GPUs climb toward 1000 Gflops while Intel CPUs remain far lower. Source: NVIDIA CUDA Programming Guide.]

Why graphics processors?

IBM ASCI White supercomputer
Cost: $110 million
Space: 2 basketball courts

13 graphics cards give comparable peak compute.

GPU Schematic

(Note: some additional features not displayed.)
[Diagram: 30 multiprocessors (MPs), each with 8 stream processors (SPs), registers, and 16 KB of shared memory (~1000 GB/s); global memory (~1 GB) at ~100 GB/s for coalesced accesses; transfer from host RAM is slow.]

GPU Programming

Two-level parallelism: split the task into blocks, and blocks into threads (see the A = A + B example at the end of the deck).
Threads have access to global memory (not RAM), with restrictions on memory access patterns.

Main bottleneck: getting data into GPU memory, and accessing it in efficient ways.

NVIDIA CUDA provides high-level routines to allocate and copy GPU memory, plus GPU matrix libraries that suffice for many machine learning tasks (see the sketch below).
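As a rough illustration of that pattern, here is a minimal sketch (not from the slides) of copying parameters and a batch of unlabeled examples into GPU memory and computing hidden-unit pre-activations with a single cuBLAS matrix multiply. The function name, matrix sizes, and layout are illustrative assumptions; compile with nvcc and link against -lcublas.

    // Hypothetical sketch: compute pre-activations H = W * X for a mini-batch
    // on the GPU using cuBLAS (column-major storage, as cuBLAS expects).
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    void gpu_preactivations(const float* W,  // n_hidden x n_visible, on the host
                            const float* X,  // n_visible x n_examples, on the host
                            float* H,        // n_hidden x n_examples, host output
                            int n_hidden, int n_visible, int n_examples)
    {
        float *d_W, *d_X, *d_H;
        cudaMalloc((void**)&d_W, sizeof(float) * n_hidden * n_visible);
        cudaMalloc((void**)&d_X, sizeof(float) * n_visible * n_examples);
        cudaMalloc((void**)&d_H, sizeof(float) * n_hidden * n_examples);

        // Copy parameters and a batch of unlabeled examples into GPU global memory.
        cudaMemcpy(d_W, W, sizeof(float) * n_hidden * n_visible, cudaMemcpyHostToDevice);
        cudaMemcpy(d_X, X, sizeof(float) * n_visible * n_examples, cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        const float one = 1.0f, zero = 0.0f;
        // H = 1.0 * W * X + 0.0 * H
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n_hidden, n_examples, n_visible,
                    &one, d_W, n_hidden, d_X, n_visible, &zero, d_H, n_hidden);

        cudaMemcpy(H, d_H, sizeof(float) * n_hidden * n_examples, cudaMemcpyDeviceToHost);

        cublasDestroy(handle);
        cudaFree(d_W); cudaFree(d_X); cudaFree(d_H);
    }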

Unsupervised learning on GPUs

Initialize parameters in global memory.
while convergence criterion is not satisfied
    Periodically transfer a large number of unlabeled examples into global memory.
    Pick a few of the unlabeled examples at a time, and compute the updates in parallel using the GPU's two-level parallelism (blocks and threads) or GPU matrix libraries.
end
Transfer learnt parameters from global memory.

Deep Belief Networks

Restricted Boltzmann Machine (RBM)
[Diagram: an RBM with visible units v = (v1, v2, v3, ...) and hidden units h = (h1, h2, ...).]
Contrastive divergence learning via the conditional distributions (see below).
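The conditional-distribution equations on this slide were images and did not survive extraction. As a hedged reconstruction in standard notation (weights W, visible biases b, hidden biases c), the binary RBM conditionals and the one-step contrastive divergence (CD-1) weight update are:

    % Binary RBM conditionals (standard form, reconstructed rather than copied from the slide)
    P(h_j = 1 \mid v) = \sigma\Big(c_j + \sum_i W_{ij} v_i\Big), \qquad
    P(v_i = 1 \mid h) = \sigma\Big(b_i + \sum_j W_{ij} h_j\Big), \qquad
    \sigma(z) = \frac{1}{1 + e^{-z}}

    % One step of contrastive divergence (CD-1)
    \Delta W_{ij} \;\propto\; \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{reconstruction}}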

Experimental setup

Single graphics card: NVIDIA GTX 280 with 1 GB on-board memory and 240 cores. Current price: US $250.

CPU: two cores, each at 3.16 GHz.

Learning Large RBMs

[Chart: learning time for 10 million examples (log scale) at 1, 18, 36, and 45 million parameters, GPU vs. dual-core CPU. Reported times range from about an hour on the GPU to weeks on the CPU; up to 72x faster.]

Overlapping patches DBN

[Diagram: the input image is divided into overlapping patches (Patch A, Patch B, ...); each patch has its own hidden units and parameters (W_A, b_A, c_A), (W_B, b_B, c_B), ...]
110 million parameters.

Overlapping patches DBN example

Layer sizes: 20736 units (144x144 input image) → 32768 units (128 units per 24x24 patch) → 15680 units → 8192 units → 2048 units.
All layers can be learnt in about 1 day on a GPU.

Sparse Coding

Given unlabeled data x(i), obtain b by solving:
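(The objective itself appeared as an image on the slide; the following is a reconstruction of the standard sparse coding formulation, with beta denoting the sparsity penalty, rather than text copied from the slide.)

    % Sparse coding: learn basis vectors b_j and sparse activations a^{(i)}
    \min_{b,\,a} \;\; \sum_i \Big\| x^{(i)} - \sum_j a^{(i)}_j\, b_j \Big\|_2^2
    \;+\; \beta \sum_i \big\| a^{(i)} \big\|_1
    \qquad \text{subject to} \quad \|b_j\|_2 \le 1 \;\;\text{for all } j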

Alternating minimization:
Keep a fixed, find optimal b.
Keep b fixed, find optimal a.



Example: x = 0.8 * b87 + 0.3 * b376 + 0.5 * b411   (input x, basis vectors b, activations a).

Parallel Sparse Coding

Alternating minimization:
Keep a fixed, find optimal b: easy on the GPU (projected gradient descent).
Keep b fixed, find optimal a: not as straightforward.

Need to parallelize:
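(The equation to parallelize was also an image; reconstructing it from the objective above, with the basis vectors b fixed each example's activations solve an L1-regularized least-squares problem:)

    % For each example x, with the basis vectors b_j held fixed:
    \min_{a} \;\; \Big\| x - \sum_j a_j\, b_j \Big\|_2^2 \;+\; \beta\, \|a\|_1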


Parallel Sparse Coding

Easy to optimize for one coordinate, keeping the others fixed (Friedman et al., 2007).

One iteration of our algorithm: optimize each coordinate in parallel (with the others held at their current values), and combine the per-coordinate solutions into a descent direction (a sketch follows below).
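To make the parallel per-coordinate step concrete, here is a hedged CUDA sketch, not the authors' implementation: one thread per coordinate computes its one-dimensional optimum by soft-thresholding, assuming the Gram matrix G = B^T B and the correlations c = B^T x for the current example have been precomputed on the GPU. The kernel name and memory layout are illustrative, and each thread does a simple O(n) loop rather than anything memory-optimal.

    // Hypothetical kernel: each thread solves  min over a_j of
    //   ||x - sum_k a_k b_k||^2 + beta*|a_j|   with the other coordinates fixed,
    // using precomputed G = B^T B (n x n, row-major) and c = B^T x (length n).
    __global__ void coordinateMinimize(const float* G, const float* c,
                                       const float* a, float* a_star,
                                       float beta, int n)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (j >= n) return;

        // z = b_j^T x - sum_{k != j} G[j][k] * a_k
        float z = c[j];
        for (int k = 0; k < n; ++k)
            if (k != j) z -= G[j * n + k] * a[k];

        // Soft-thresholding: the closed-form minimizer for a single coordinate.
        float t = beta * 0.5f;
        float s = fmaxf(fabsf(z) - t, 0.0f);
        a_star[j] = copysignf(s, z) / G[j * n + j];
    }

A launch such as coordinateMinimize<<<(n + 127) / 128, 128>>>(d_G, d_c, d_a, d_astar, beta, n) evaluates every coordinate at once; a_star - a then serves as the descent direction for the update.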

Sparse coding with 10^6 parameters

[Chart: learning time with 10 million examples at 3% and 10% nonzero activations, GPU vs. dual-core CPU. Reported times range from about 6 hours to 1 day on the GPU versus up to 19 days on the CPU; about 15x faster.]

Summary

Large-scale unsupervised learning: ten times more data might transform an OK algorithm into a good algorithm; working at smaller scale risks confounding the effects of the model itself with the effects of scale.
GPUs are a powerful tool for machine learning: easy to program (no low-level programming), and especially useful for stochastic learning methods.

Learning algorithms for DBNs and sparse coding can be an order of magnitude faster.

THE END

Why graphics processors?

[Chart: bandwidth from memory to processor (GB/s), 2003-2007. NVIDIA GPUs reach on the order of 100 GB/s while Intel CPUs remain far lower. Source: NVIDIA CUDA Programming Guide.]
In modern computing, sometimes the bottleneck is getting data into the processor from the memory.

GPU Programming: A = A + B

(Adapted from http://www.cs.technion.ac.il/~marks/docs/LinuxClubGPGPU.pdf)

#include <cuda_runtime.h>

#define SIZE       (128 * 256)             // number of elements (illustrative value)
#define SIZE_BYTES (SIZE * sizeof(float))  // size of each array in bytes

// GPU: each thread adds one element of B into A.
__global__ void vecAdd(float* A, float* B)
{
    int my = threadIdx.x + blockIdx.x * 128;  // global index of this thread
    A[my] = A[my] + B[my];
}

// CPU: allocate GPU memory, copy data in, launch the kernel, copy the result back.
int main(int argc, char** argv)
{
    float A[SIZE], B[SIZE];
    float *d_A, *d_B;

    cudaMalloc((void**)&d_A, SIZE_BYTES);
    cudaMalloc((void**)&d_B, SIZE_BYTES);
    cudaMemcpy(d_A, A, SIZE_BYTES, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, SIZE_BYTES, cudaMemcpyHostToDevice);

    vecAdd<<<SIZE / 128, 128>>>(d_A, d_B);  // SIZE/128 blocks of 128 threads: one thread per element
    cudaThreadSynchronize();

    cudaMemcpy(A, d_A, SIZE_BYTES, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_B);
    return 0;
}