GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server

Henggang Cui, Hao Zhang, Gregory R. Ganger, Phillip B. Gibbons, and Eric P. Xing

Parallel Data Laboratory, Carnegie Mellon University


Page 1:

GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server

Henggang Cui

Hao Zhang, Gregory R. Ganger, Phillip B. Gibbons, and Eric P. Xing

Parallel Data Laboratory, Carnegie Mellon University

Page 2:

Image classification w/ deep learning

[Figure: a machine learning program reads training data (images with labels such as Eagle, Vulture, Accipiter, Osprey) and repeatedly reads/updates the model parameters. The deep neural network is a set of interconnected neurons; the model parameters are the connection weights (the solution).]

Page 3:

Distributed deep learning

[Figure: distributed ML workers, each holding a partition of the training data, read/update shared model parameters through a parameter server.]
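To make the read/update flow concrete, here is a minimal runnable sketch of the data-parallel worker loop, assuming a toy in-process ParamServer class (not the GeePS API): each worker reads the shared parameters, computes a gradient on its own data partition, and pushes a delta back.

#include <cstdio>
#include <vector>
// Hypothetical in-process stand-in for one parameter server shard.
struct ParamServer {
  std::vector<float> params;
  explicit ParamServer(size_t n) : params(n, 0.0f) {}
  std::vector<float> Read() const { return params; }        // read params
  void Update(const std::vector<float>& delta) {            // update params
    for (size_t i = 0; i < params.size(); ++i) params[i] += delta[i];
  }
};
int main() {
  ParamServer ps(4);
  std::vector<float> shard = {1.0f, 2.0f, 3.0f, 4.0f};  // this worker's data
  const float lr = 0.1f;
  for (int iter = 0; iter < 50; ++iter) {
    std::vector<float> w = ps.Read();                   // read shared params
    std::vector<float> delta(w.size());
    // Toy gradient of 0.5*||w - shard||^2, i.e. (w - shard).
    for (size_t i = 0; i < w.size(); ++i) delta[i] = -lr * (w[i] - shard[i]);
    ps.Update(delta);                                   // push the update
  }
  for (float w : ps.params) std::printf("%.2f ", w);    // converges to shard
  std::printf("\n");
}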

Page 4:

Distributed deep learning

[Figure: the same setup with distributed GPU ML workers: partitioned training data, shared model parameters, and a parameter server for GPUs handling the read/update traffic.]

Page 5: GeePS: Scalable deep learning on distributed GPUs with a GPU ...hzhang2/projects/GeePS/slides.pdf · GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter

Outline• Background

• Deep learning with GPUs• Parallel ML using parameter servers

• GeePS: GPU-specialized parameter server

• Experiment results

Henggang Cui © June 17http://www.pdl.cmu.edu/ 5

Page 6:

A machine with no GPU

[Figure: CPU cores and DRAM (CPU memory), with a NIC to the network and local storage.]

Page 7:

A machine with a GPU device

[Figure: the same machine plus a GPU device, with GPU cores and a few GB of GPU memory.]

• Small GPU memory
• Expensive to copy between GPU/CPU memory

Page 8:

Single GPU machine learning

[Figure: the input data file (training data) is read into staging memory in CPU memory, one mini-batch at a time; the input data, intermediate data, and parameter data all live in GPU memory.]
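The staging path can be sketched with the CUDA runtime API; the batch size and contents below are made up, and the point is only the pattern: read a mini-batch into pinned CPU staging memory, then copy it asynchronously into GPU memory.

#include <cstdio>
#include <cuda_runtime.h>
int main() {
  const size_t batch_bytes = 256 << 10;   // hypothetical mini-batch size
  const size_t n = batch_bytes / sizeof(float);
  float *staging = nullptr, *gpu_input = nullptr;
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  // Pinned (page-locked) staging memory allows async DMA into the GPU.
  cudaHostAlloc((void**)&staging, batch_bytes, cudaHostAllocDefault);
  cudaMalloc((void**)&gpu_input, batch_bytes);
  // Stand-in for reading one mini-batch from the input data file.
  for (size_t i = 0; i < n; ++i) staging[i] = 0.5f;
  // Copy the staged batch into GPU memory without blocking the CPU thread.
  cudaMemcpyAsync(gpu_input, staging, batch_bytes,
                  cudaMemcpyHostToDevice, stream);
  cudaStreamSynchronize(stream);  // must finish before GPU kernels consume it
  std::printf("mini-batch staged and copied to GPU memory\n");
  cudaFree(gpu_input);
  cudaFreeHost(staging);
  cudaStreamDestroy(stream);
}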

Page 9:

Multi-GPU ML via CPU parameter server

[Figure: the parameter server shard and parameter cache live in CPU memory; a parameter working copy, the input data, and the intermediate data live in GPU memory, fed through staging memory for each input data batch.]

1. Expensive CPU/GPU data transfer
2. Only works when data fits in GPU memory

Page 10: GeePS: Scalable deep learning on distributed GPUs with a GPU ...hzhang2/projects/GeePS/slides.pdf · GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter

Outline• Background

• Deep learning with GPUs• Parallel ML using parameter servers

• GeePS: GPU-specialized parameter server• Maintaining the parameter cache in GPU memory• Batch access with GPU cores for higher throughput• Managing limited GPU device memory

• Experiment results

Henggang Cui © June 17http://www.pdl.cmu.edu/ 10

Page 11:

Multi-GPU ML via GeePS

[Figure: the parameter cache now lives in GPU memory, alongside the input data, intermediate data, and parameter working copy; staging memory in CPU memory feeds the parameter cache and the input data batches, and the CPU/GPU transfer happens in the background.]

• PS access through GPU memory
• Higher PS throughput
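A minimal sketch of background CPU/GPU transfer, using plain CUDA streams and a placeholder kernel (not GeePS code): while the GPU computes on one buffer, the next chunk of data is copied in on a second stream, hiding the transfer behind compute.

#include <cstdio>
#include <cuda_runtime.h>
__global__ void Compute(float* data, int n) {   // stand-in for layer compute
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}
int main() {
  const int n = 1 << 20;
  const size_t bytes = n * sizeof(float);
  float* host = nullptr;
  cudaHostAlloc((void**)&host, 2 * bytes, cudaHostAllocDefault);
  for (int i = 0; i < 2 * n; ++i) host[i] = 1.0f;
  float* buf[2];
  cudaMalloc((void**)&buf[0], bytes);
  cudaMalloc((void**)&buf[1], bytes);
  cudaStream_t compute, xfer;
  cudaStreamCreate(&compute);
  cudaStreamCreate(&xfer);
  cudaMemcpy(buf[0], host, bytes, cudaMemcpyHostToDevice);  // prime buffer 0
  for (int step = 0; step < 4; ++step) {
    int cur = step % 2, nxt = 1 - cur;
    // The transfer of the next chunk overlaps with compute on this one.
    cudaMemcpyAsync(buf[nxt], host + (size_t)nxt * n, bytes,
                    cudaMemcpyHostToDevice, xfer);
    Compute<<<(n + 255) / 256, 256, 0, compute>>>(buf[cur], n);
    cudaStreamSynchronize(xfer);     // next chunk is in GPU memory
    cudaStreamSynchronize(compute);  // compute on the current chunk is done
  }
  std::printf("4 steps with transfers hidden behind compute\n");
}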

Page 12: GeePS: Scalable deep learning on distributed GPUs with a GPU ...hzhang2/projects/GeePS/slides.pdf · GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter

Outline• Background

• GeePS: GPU-specialized parameter server• Maintaining the parameter cache in GPU memory• Batch access with GPU cores for higher throughput• Managing limited GPU device memory

• Experiment results

Henggang Cui © June 17http://www.pdl.cmu.edu/ 12

Page 13:

Layer-by-layer computation for DNN

[Figure: a DNN mapping training images to class probabilities.]

• For each iteration (mini-batch):
  • a forward pass
  • then a backward pass
• At each step, only the data of two layers is used (see the sketch below)
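A minimal sketch of this access pattern with a made-up five-layer model: each forward step reads layer i and writes layer i+1, and the backward pass mirrors it, so only two adjacent layers are live at any step.

#include <cstdio>
#include <vector>
int main() {
  const int L = 5;  // hypothetical layer count
  std::vector<std::vector<float>> act(L, std::vector<float>(4, 1.0f));
  // Forward pass: step i reads layer i and writes layer i+1.
  for (int i = 0; i + 1 < L; ++i)
    for (size_t j = 0; j < act[i].size(); ++j)
      act[i + 1][j] = act[i][j] * 0.9f;  // stand-in for the layer function
  // Backward pass: step i reads layer i+1's gradient and writes layer i's.
  std::vector<std::vector<float>> grad(L, std::vector<float>(4, 1.0f));
  for (int i = L - 2; i >= 0; --i)
    for (size_t j = 0; j < grad[i].size(); ++j)
      grad[i][j] = grad[i + 1][j] * 0.9f;
  std::printf("output=%.3f input_grad=%.3f\n", act[L - 1][0], grad[0][0]);
}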


Page 18:

Layer-by-layer computation for DNN

[Figure: the same DNN, with the inactive layers' data held in CPU memory.]

• For each iteration (mini-batch):
  • a forward pass
  • then a backward pass
• At each step, only the data of two layers is used
• Use GPU memory as a cache to keep the actively used data (see the sketch below)
• Store the remaining data in CPU memory
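A toy sketch of the cache idea, assuming an LRU eviction policy purely for illustration (GeePS's actual placement policy is described on the later slides): a fixed-size GPU pool holds the layers in use and evicts to CPU memory when full.

#include <cstdio>
#include <list>
#include <unordered_map>
struct GpuCache {
  size_t capacity;                       // number of layer buffers that fit
  std::list<int> lru;                    // most recently used at the front
  std::unordered_map<int, std::list<int>::iterator> where;
  explicit GpuCache(size_t cap) : capacity(cap) {}
  void Access(int layer) {
    auto it = where.find(layer);
    if (it != where.end()) {             // hit: already in GPU memory
      lru.erase(it->second);
    } else if (lru.size() == capacity) { // miss with a full pool: evict
      std::printf("evict layer %d to CPU memory\n", lru.back());
      where.erase(lru.back());
      lru.pop_back();
      std::printf("fetch layer %d into GPU memory\n", layer);
    } else {                             // miss with room to spare
      std::printf("fetch layer %d into GPU memory\n", layer);
    }
    lru.push_front(layer);
    where[layer] = lru.begin();
  }
};
int main() {
  GpuCache cache(2);                            // GPU holds two layers' data
  for (int i = 0; i < 4; ++i) cache.Access(i);  // forward pass
  for (int i = 3; i >= 0; --i) cache.Access(i); // backward pass
}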

Page 19:

GPU memory management

[Figure: the same GeePS setup, highlighting that the GPU memory holding the input data, intermediate data, parameter cache, and parameter working copy is owned by the application.]

Page 20:

GeePS-managed buffers

[Figure: the parameter working copy is replaced by an access buffer pool of GeePS-managed buffers in GPU memory.]

Page 21:

GeePS manages local data also

[Figure: the input data and intermediate data become GeePS-managed local data in GPU memory, alongside the access buffer pool and parameter cache.]

Page 22:

Use CPU memory when data does not fit

[Figure: when GPU memory is insufficient, the CPU parts of the local data and the parameter cache live in CPU memory; the access buffer pool and the remaining local data and parameter cache stay in GPU memory, which is owned by GeePS.]

Page 23:

Use CPU memory when data does not fit

[Figure: in the extreme, only the access buffer pool stays in GPU memory; it is sized at 2x the size of the largest layer.]
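That sizing rule is simple arithmetic, computed below for made-up layer footprints: the pool must hold the layer being computed plus the one being prefetched, so twice the largest per-layer footprint suffices.

#include <algorithm>
#include <cstdio>
#include <vector>
int main() {
  // Made-up per-layer footprints in bytes.
  std::vector<size_t> layer_bytes = {96u << 20, 200u << 20, 64u << 20};
  size_t largest = *std::max_element(layer_bytes.begin(), layer_bytes.end());
  size_t pool = 2 * largest;  // double buffering: compute + prefetch
  std::printf("access buffer pool = %zu MB\n", pool >> 20);
}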

Page 24:

Use CPU memory when data does not fit

[Figure: with more GPU memory available, GeePS additionally keeps pinned local data and a pinned parameter cache in GPU memory, with only the overflow in the CPU parts.]

Page 25: GeePS: Scalable deep learning on distributed GPUs with a GPU ...hzhang2/projects/GeePS/slides.pdf · GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter

Outline• Background

• GeePS: GPU-specialized parameter server• Maintaining the parameter cache in GPU memory• Batch access with GPU cores for higher

throughput• Managing limited GPU device memory

• Experiment results

Henggang Cui © June 17http://www.pdl.cmu.edu/ 25

Page 26: GeePS: Scalable deep learning on distributed GPUs with a GPU ...hzhang2/projects/GeePS/slides.pdf · GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter

Experimental setups• Cluster information

• Tesla K20C GPUs with 5 GB GPU memory

• Dataset and model• ImageNet: 7 million training images in 22,000 classes• Model: AlexNet

– 25 layers, 2.4 billion conns– total memory consumption 4.5 GB

Henggang Cui © June 17http://www.pdl.cmu.edu/ 26

Page 27: GeePS: Scalable deep learning on distributed GPUs with a GPU ...hzhang2/projects/GeePS/slides.pdf · GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter

System setups• GeePS-Caffe setups

• Caffe: single-machine GPU deep learning system• GeePS-Caffe: Caffe linked with GeePS

• Baselines• The original unmodified Caffe• Caffe linked with CPU-based PS (IterStore [Cui SoCC’14])

Henggang Cui © June 17http://www.pdl.cmu.edu/ 27

Page 28:

Training throughput

Page 29:

Training throughput

• GeePS scales close to linearly with more machines
  • with 16 machines, it runs 13x faster than Caffe
  • only 8% GPU stall time

Page 30:

Training throughput

• GeePS is much faster than the CPU-based PS
  • 2.6x higher throughput
  • reduces GPU stall time from 65% to 8%

Page 31:

More results in the paper
• Good scalability and convergence speed for
  • the GoogLeNet network
  • an RNN network for video classification
• Handles problems larger than GPU memory
  • only a 27% reduction in throughput with a 35% GPU memory budget
    – 3x bigger problems with little overhead
  • handles models as large as 20 GB
  • supports 4x longer videos for video classification

Page 32:

Conclusion
• GPU-specialized parameter server for GPU ML
  • 13x throughput speedup using 16 machines
  • 2x faster compared to a CPU-based PS
• Managing limited GPU memory
  • by managing GPU memory inside GeePS as a cache
  • efficiently handles problems larger than GPU memory

Enables use of the data-parallel PS model

Page 33: GeePS: Scalable deep learning on distributed GPUs with a GPU ...hzhang2/projects/GeePS/slides.pdf · GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter

References• [IterStore] H. Cui, A. Tumanov, J. Wei, L. Xu, W. Dai, J. Haber-

Kucharsky, Q. Ho, G. R. Ganger, P. B. Gibbons, G. A. Gibson, and E. P. Xing. Exploiting iterative-ness for parallel ML computations. In ACM SoCC, 2014.

• [Caffe] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutionalarchitecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

• [ImageNet] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE CVPR, 2009.

• [ProjectAdam] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In USENIX OSDI, 2014.

Henggang Cui © June 17http://www.pdl.cmu.edu/ 33

Page 34:

Additional related work
• T. Chen, et al. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
• H. Zhang, et al. Poseidon: A system architecture for efficient GPU-based deep learning on multiple machines. arXiv preprint arXiv:1512.06216, 2015.
• A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
• J. Dean, et al. Large scale distributed deep networks. In NIPS, 2012.
• C. Szegedy, et al. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
• R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun. Deep image: Scaling up image recognition. arXiv preprint arXiv:1501.02876, 2015.
• A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, and N. Andrew. Deep learning with COTS HPC systems. In ICML, 2013.

Page 35:

Backup Slides

Page 36:

Interface to GeePS-managed buffers
• Read
  – buffer "allocated" by GeePS
  – data copied into the buffer
• PostRead
  – buffer reclaimed
• PreUpdate
  – buffer "allocated" by GeePS
• Update
  – updates applied to the data
  – buffer reclaimed

A sketch of this access pattern follows.
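This is a minimal sketch of how an application might drive the Read/PostRead/PreUpdate/Update sequence; the GeePsLike class is a hypothetical in-process stand-in, not the actual GeePS interface or its signatures.

#include <cstdio>
#include <vector>
struct GeePsLike {
  std::vector<float> store{1.0f, 2.0f, 3.0f};  // the shared parameter data
  std::vector<float> buf;                      // the "allocated" access buffer
  float* Read() { buf = store; return buf.data(); }   // buffer filled
  void PostRead() { buf.clear(); }                    // buffer reclaimed
  float* PreUpdate() { buf.assign(store.size(), 0.0f); return buf.data(); }
  void Update() {                                     // apply, then reclaim
    for (size_t i = 0; i < store.size(); ++i) store[i] += buf[i];
    buf.clear();
  }
};
int main() {
  GeePsLike ps;
  const float* params = ps.Read();   // read: buffer holds current values
  float activation = params[0] + params[1];
  ps.PostRead();                     // done reading; buffer returns to pool
  float* update = ps.PreUpdate();    // get a buffer to write updates into
  update[0] = -0.1f * activation;    // e.g., a scaled gradient
  ps.Update();                       // updates applied, buffer reclaimed
  std::printf("param[0]=%.3f\n", ps.store[0]);
}

Reclaiming the buffer after each access is what lets a small fixed-size pool serve the whole pass, layer by layer.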

Page 37:

Data placement policy
– Pin as much local data as fits in GPU memory
– Select the local data that causes the peak usage

[Figure: GPU memory usage over an iteration, with the access buffer sized to cover the peak.]

A sketch of the greedy selection follows.
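A sketch of the greedy selection under stated assumptions: the sizes are made up, and "peak-contributors first" is a simple reading of the policy above, since pinning data that is live at the peak lowers the peak directly and lets the access buffer shrink.

#include <cstdio>
#include <vector>
struct Datum { const char* name; size_t bytes; bool live_at_peak; };
int main() {
  std::vector<Datum> local = {
      {"conv1 activations", 300, true},
      {"fc6 activations",   500, true},
      {"conv1 params",      100, false},
  };
  size_t budget = 700;  // GPU memory left after the access buffer pool
  // Pin peak-contributing data first, then anything else that still fits.
  for (bool want_peak : {true, false})
    for (const auto& d : local)
      if (d.live_at_peak == want_peak && d.bytes <= budget) {
        budget -= d.bytes;
        std::printf("pin %s (%zu) in GPU memory\n", d.name, d.bytes);
      }
  std::printf("remaining budget: %zu\n", budget);
}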

Page 38:

Data placement policy

[Figure: pinning the intermediate states in GPU memory shrinks the peak, so a smaller access buffer suffices.]

Pin as much local data as fits in GPU memory; select the local data that causes the peak usage.

Page 39:

Image classification accuracy

• To reach 10% classification accuracy:
  • 6x faster with 8 machines
  • 8x faster with 16 machines

Page 40:

Training throughput (more)

Page 41:

Per-layer memory usage

Page 42:

Throughput vs. memory budget

[Figure: throughput as the GPU memory budget shrinks, from all data in GPU memory down to only the buffer pool in GPU memory (twice the peak size, for double buffering).]

• Only a 27% reduction in throughput with a 35% memory budget
• Can do 3x bigger problems with little overhead

Page 43:

Larger models

• Models up to 20 GB


Page 45:

Computation vs. stall times

• Even with slack 0, the updates of one layer can be sent to other machines before the updates of the other layers finish
• CPU-PS incurs substantial overhead from transferring data between GPU/CPU memory in the foreground

Page 46:

Computation vs. stall times (more)

• GeePS and CPU-PS: the updates of each layer are sent in distinct batches (see the sketch below)
• Single-table: the updates of all layers are sent in a single batch
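A sketch contrasting the two choices, with hypothetical BackwardStep/SendUpdate stand-ins for compute and the network push: per-layer tables let each layer's update go out as soon as its backward step finishes, overlapping communication with the rest of the pass, while a single table delays everything to the end.

#include <cstdio>
void BackwardStep(int layer) { std::printf("backward layer %d\n", layer); }
void SendUpdate(int layer) { std::printf("push update for layer %d\n", layer); }
int main() {
  const int L = 4;
  std::printf("-- per-layer tables --\n");
  for (int i = L - 1; i >= 0; --i) {
    BackwardStep(i);
    SendUpdate(i);  // overlaps communication with the remaining layers
  }
  std::printf("-- single table --\n");
  for (int i = L - 1; i >= 0; --i) BackwardStep(i);
  for (int i = L - 1; i >= 0; --i) SendUpdate(i);  // all at the end
}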

Page 47:

Convergence with data staleness

• The data-staleness sweet spot is slack 0
  • because the GPUs perform a huge amount of computation every clock