
Page 1: PRACTICAL SCALING TECHNIQUES - GitHub Pages › Presentations › practical... · PRACTICAL SCALING TECHNIQUES. 2. DNN TRAINING ON MULTIPLE GPUS. Making DL training times shorter

Ujval Kapasi | Dec 9, 2017

PRACTICAL SCALING TECHNIQUES

Page 2:

DNN TRAINING ON MULTIPLE GPUS
Making DL training times shorter

[Diagram: single-GPU training loop: mini-batch → gradients → parameter update]

Page 3:

DNN TRAINING ON MULTIPLE GPUS
Making DL training times shorter

Data parallelism: split the batch across multiple GPUs

Allreduce: sum gradients across GPUs

[Diagram: each GPU holds a copy of the parameters, processes its own mini-batch shard, and computes local gradients; an allreduce sums the local gradients so every GPU applies the same update]
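The slide's data-parallel scheme maps directly onto a few lines of framework code. Below is a minimal sketch using torch.distributed as one possible backend; the deck itself is framework-agnostic, so the function and variable names here are illustrative assumptions, and process-group setup plus per-rank data sharding are omitted.

```python
# One data-parallel training step with an explicit gradient allreduce.
import torch
import torch.distributed as dist

def train_step(model, loss_fn, optimizer, inputs, targets):
    # Each GPU computes local gradients on its own mini-batch shard.
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Allreduce: sum the local gradients across GPUs, then average so the
    # update matches single-GPU training on the combined batch.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

    # Every rank now applies the same update, keeping parameters in sync.
    optimizer.step()
    return loss
```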

Pages 4-6: (build sequence repeating the Page 3 data-parallelism diagram)

Page 7:

OPTIMIZE THE INPUT PIPELINE

Page 8:

TYPICAL TRAINING PIPELINE

• Device/Training limited

• Host/IO limited

[Diagram: pipeline stages read & decode → augment → train; read & decode and augment run on the Host, train runs on the Device]
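One common way to keep the device busy is to run the host-side stages (read & decode, augment) in background workers that prefetch batches. The sketch below uses PyTorch's DataLoader only as an illustration; the dummy in-memory dataset stands in for real file reading, JPEG decoding, and augmentation, and the training call is left as a placeholder.

```python
# Overlap host-side input work with device-side training via worker
# processes, pinned memory, and prefetching.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(4096, 3, 224, 224),
                        torch.randint(0, 1000, (4096,)))

loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=8,        # host processes run read & decode + augment
    pin_memory=True,      # page-locked staging buffers for faster H2D copies
    prefetch_factor=2,    # batches queued ahead of the GPU (PyTorch >= 1.7)
)

for images, labels in loader:
    # Asynchronous copy to the device while workers prepare the next batches.
    images = images.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)
    # train_step(images, labels)   # device/training-limited work goes here
```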

Page 9:

EXAMPLE: CNTK ON DGX-1

• Device/Training limited (ResNet-50)

• Host/IO limited (AlexNet)

[Charts: training timelines for both cases, with sync overhead annotated]

Page 10:

THROUGHPUT COMPARISON (images/second)

File I/O:        ~10,000 (290x550 images; ~1600 MB/s with LMDB on DGX-1 V100)

Image decoding:  ~10,000 – 15,000 (290x550 images; libjpeg-turbo, OMP_PROC_BIND=true, DGX-1 V100)

Training:        >6,000 (ResNet-50), >14,000 (ResNet-18) (synthetic dataset, DGX-1 V100)

Page 11:

ROOFLINE ANALYSIS

[Plot: achieved images/s (y-axis) vs. training-only images/s (x-axis), both 0 to 15,000, for ResNet-18/34/50/101/152, Inception v3, and VGG; points fall into Compute Bound, 480 I/O, and 352 I/O regions]

Horizontal lines show I/O pipeline throughputs:

352: 9800 images/s

480: 4900 images/s

Measurements collected using the Caffe2 framework

Page 12:

RECOMMENDATIONS

Optimize image I/O:

Use fast data and file loading mechanisms such as LMDB or RecordIO

When loading from files, consider mmap instead of fopen/fread (see the sketch after the chart below)

Use fast image decoding libraries such as libjpeg-turbo

Using OpenCV's imread function relinquishes control over these optimizations and sacrifices performance

Optimize augmentation:

Allow augmentation on the GPU for I/O-limited networks

[Chart: JPEG read and decoding time (seconds) for 1024 files (290x550) vs. number of threads (1, 4, 16), comparing decoding time only, mmap+decode, and OpenCV imread]
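As a concrete illustration of the mmap recommendation, the sketch below maps a JPEG file read-only and hands the raw bytes to a decoder. The libjpeg-turbo wrapper shown in the comment (PyTurboJPEG) is an assumption for illustration, not necessarily the setup benchmarked in the chart above.

```python
# mmap-based loading path: map the file, then decode from the raw bytes.
import mmap

def read_jpeg_bytes(path):
    with open(path, "rb") as f:
        # The kernel pages the file in on demand; no fread into an
        # intermediate user-space buffer.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return bytes(m)

# Decoding with libjpeg-turbo via PyTurboJPEG (assumed installed):
# from turbojpeg import TurboJPEG
# img = TurboJPEG().decode(read_jpeg_bytes("sample.jpg"))
```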

Page 13:

OPTIMIZE COMMUNICATION

Page 14:

DATA PARALLEL AND NCCL

NCCL uses rings to move data across all GPUs and perform reductions.Ring Allreduce is bandwidth optimal and adapts to many topologies

DGX-1 : 4 unidirectional rings
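To make the ring idea concrete, here is a toy, CPU-only simulation of ring allreduce (reduce-scatter followed by all-gather) written with NumPy. It is not NCCL code, just an illustration of why each rank only ever moves 1/N of the buffer per step, which is what makes the ring bandwidth-optimal.

```python
import numpy as np

def ring_allreduce(buffers):
    """Toy simulation of ring allreduce over len(buffers) 'ranks'.

    Each rank contributes one equally sized 1-D array; every returned array
    is the elementwise sum of all inputs.
    """
    n = len(buffers)
    # Work on copies so the caller's arrays are not modified in place.
    buffers = [np.array(b, dtype=float) for b in buffers]
    chunks = [np.array_split(b, n) for b in buffers]   # n chunks per rank

    # Reduce-scatter: after n-1 steps, rank r holds the fully summed
    # chunk (r + 1) % n.
    for step in range(n - 1):
        for r in range(n):
            src = (r - step) % n                 # chunk rank r forwards now
            chunks[(r + 1) % n][src] += chunks[r][src]

    # All-gather: circulate the reduced chunks until every rank has all of them.
    for step in range(n - 1):
        for r in range(n):
            src = (r + 1 - step) % n             # chunk rank r forwards now
            chunks[(r + 1) % n][src] = chunks[r][src]

    return [np.concatenate(c) for c in chunks]

# Example: three ranks holding vectors of 1s, 2s, and 3s; every returned
# buffer is all 6s.
out = ring_allreduce([np.ones(8), 2 * np.ones(8), 3 * np.ones(8)])
```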

Page 15:

NCCL 2 MULTI-NODE SCALING

https://github.com/uber/horovod
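For multi-node data parallelism over NCCL 2, Horovod wraps the allreduce behind a distributed optimizer. A minimal sketch of its documented PyTorch usage follows; the tiny model and the base learning rate are placeholders rather than values from the slides.

```python
# Multi-node data parallelism with Horovod (NCCL 2 under the hood).
import torch
import horovod.torch as hvd

hvd.init()                                    # one process per GPU
torch.cuda.set_device(hvd.local_rank())       # bind this process to its GPU

model = torch.nn.Linear(1024, 1000).cuda()    # placeholder model
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1 * hvd.size())   # linear LR scaling (later slides)

# Wrap the optimizer so gradients are averaged across workers with allreduce
# (NCCL is used for GPU tensors).
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Start all workers from identical parameters and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```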

Page 16:

Page 17:

TRAIN WITH LARGE BATCHES

Page 18:

DIFFICULTIES OF LARGE-BATCH TRAINING

It is difficult to maintain test accuracy while increasing the batch size.

Recipe from [Goyal, 2017]:

linear scaling of the learning rate γ as a function of batch size B

a learning rate "warm-up" to prevent divergence during the initial training phase

[Figure: ResNet-50]

Optimization is not a problem if you get the hyper-parameters right.

Priya Goyal et al., "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour," 2017
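A sketch of that recipe as a learning-rate schedule is shown below. The reference values (base LR 0.1 at batch size 256, 5-epoch warm-up) are the ResNet-50 settings from [Goyal, 2017], used here as assumptions.

```python
# Linear LR scaling with the global batch size plus a warm-up ramp.
def learning_rate(epoch, global_batch_size,
                  base_lr=0.1, base_batch=256, warmup_epochs=5):
    # Linear scaling rule: LR grows proportionally with the global batch size.
    scaled_lr = base_lr * global_batch_size / base_batch
    if epoch < warmup_epochs:
        # Gradual warm-up from the unscaled LR to the scaled LR; epoch may be
        # fractional (iteration / iterations_per_epoch) for a per-step ramp.
        return base_lr + (scaled_lr - base_lr) * epoch / warmup_epochs
    return scaled_lr

# Example: global batch 8192 -> LR ramps from 0.1 to 3.2 over 5 epochs.
print(learning_rate(0, 8192), learning_rate(5, 8192))   # 0.1, 3.2
```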

Page 19:

LARGE-BATCH TRAINING

Sam Smith, et al. "Don't Decay the Learning Rate, Increase the Batch Size"

Keskar, et al. “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima”

Akiba, et al. “Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes”


Page 20:

LAYER-WISE ADAPTIVE RATE SCALING (LARS)


Use a local LR λ_l for each layer l:

Δw_t^l = γ · λ_l · ∇L(w_t^l)

where γ is the global LR, ∇L(w_t^l) is the stochastic gradient of the loss function L with respect to w_t^l, and λ_l is the local LR for layer l.
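A sketch of the resulting per-layer update is below. The slide gives only the update rule; the trust-ratio definition of the local LR, λ_l = η · ‖w_l‖ / ‖∇L(w_l)‖, is taken from the LARS paper linked on the next slide (arXiv:1708.03888), and η and the epsilon guard are illustrative assumptions.

```python
# LARS-style update for a single layer.
import torch

def lars_step(w, grad, global_lr, eta=0.001, eps=1e-9):
    # Local LR for this layer: lambda_l = eta * ||w_l|| / ||grad_l||
    local_lr = eta * w.norm() / (grad.norm() + eps)
    # Update: delta w_t^l = gamma * lambda_l * grad L(w_t^l)
    w -= global_lr * local_lr * grad
    return w

# Example with random tensors standing in for one layer's weights/gradients.
w = torch.randn(256, 256)
g = torch.randn(256, 256)
w = lars_step(w, g, global_lr=1.0)
```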

Page 21:

RESNET-50 WITH LARS: B = 32K

More details on LARS: https://arxiv.org/abs/1708.03888

Page 22:

SUMMARY

1) Larger batches allow scaling to a larger number of nodes while maintaining high utilization of each GPU

2) The key difficulty in large-batch training is numerical optimization

3) The existing approach, based on large learning rates, can lead to divergence, especially during the initial phase, even with warm-up

4) With Layer-wise Adaptive Rate Scaling (LARS) we scaled ResNet-50 up to B = 16K

Page 23:

FUTURE CHALLENGES

Hybrid model-parallel and data-parallel training

Disk I/O for large datasets that can't fit in system memory or on-node SSDs


Page 24: