
Page 1

Nikolay Sakharnykh, Chirayu Garg, and Dmitri Vainbrand, Thu Mar 19, 3:00 PM

UNIFIED MEMORY FOR DATA ANALYTICS AND DEEP LEARNING

Page 2

RAPIDS: CUDA-accelerated Data Science Libraries

[Stack diagram: CUDA at the base; Apache Arrow on GPU memory; RAPIDS libraries (cuDF, cuML, cuGraph) and DL frameworks (cuDNN); Python on top, with Dask/Spark for scale-out.]

Page 3

MORTGAGE PIPELINE: ETL
https://github.com/rapidsai/notebooks/blob/master/mortgage/E2E.ipynb

[Pipeline diagram: read CSV → DataFrame → filter, join, groupby → Arrow]

Page 4

MORTGAGE PIPELINE: PREP + ML
https://github.com/rapidsai/notebooks/blob/master/mortgage/E2E.ipynb

[Pipeline diagram: Arrow DataFrames → concat → DataFrame → convert → DMatrix → XGBoost]

Page 5

GTC EU KEYNOTE RESULTS ON DGX-1

[Bar chart: mortgage workflow time breakdown on DGX-1 (s), split into ETL, PREP, and ML stages.]

Page 6

MAXIMUM MEMORY USAGE ON DGX-1

[Bar chart: maximum memory usage (GB) for GPU IDs 1–8, each close to the Tesla V100 limit of 32 GB.]

Page 7

ETL INPUT

[Diagram: the original input set of 112 quarters (~2–3 GB CSVs) is pre-split into 240 smaller CSVs (1 GB each).]

https://rapidsai.github.io/demos/datasets/mortgage-data

Page 8

CAN WE AVOID INPUT SPLITTING?

[Line charts: GPU memory usage (GB) over time during ETL. With the 112-part split input, usage stays below the Tesla V100 limit of 32 GB; with the original unsplit dataset, usage hits the limit and the run ends in an OOM crash.]

Page 9

ML INPUT

[Pipeline diagram: Arrow DataFrames → concat → DataFrame → convert → DMatrix → XGBoost]

Some number of quarters are used for ML training.

Page 10

CAN WE TRAIN ON MORE DATA?

[Line charts: GPU memory usage (GB) during PREP (112→20 parts vs 112→28 parts). With 20 parts usage stays below the Tesla V100 limit of 32 GB; with 28 parts it hits the limit and ends in an OOM crash.]

Page 11

HOW MEMORY IS MANAGED IN RAPIDS

Page 12

RAPIDS MEMORY MANAGER

RAPIDS Memory Manager (RMM) is:

• A replacement allocator for CUDA Device Memory

• A pool allocator to make CUDA device memory allocation faster & asynchronous

• A central place for all device memory allocations in cuDF and other RAPIDS libraries

https://github.com/rapidsai/rmm

Page 13

WHY DO WE NEED MEMORY POOLS?

cudaMalloc/cudaFree are synchronous

• block the device

cudaMalloc/cudaFree are expensive

• cudaFree must zero memory for security

• cudaMalloc creates peer mappings for all GPUs

Using cnmem memory pool improves RAPIDS ETL time by 10x

cudaMalloc(&buffer, size_in_bytes);

cudaFree(buffer);
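To make the cost concrete, here is a minimal sketch (not RMM's implementation) contrasting per-iteration cudaMalloc/cudaFree with sub-allocating from one pre-allocated slab, which is the pattern a pool allocator like cnmem uses:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void work(float* p, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = 2.0f * i;
}

int main() {
    const size_t n = 1 << 20, bytes = n * sizeof(float);

    // Naive: a synchronous cudaMalloc/cudaFree pair on every iteration.
    // Each call blocks the device; cudaFree also zeroes memory for security.
    for (int iter = 0; iter < 100; ++iter) {
        float* buf;
        cudaMalloc(&buf, bytes);
        work<<<(unsigned)((n + 255) / 256), 256>>>(buf, n);
        cudaFree(buf);
    }

    // Pool-style: allocate one slab up front and hand out sub-allocations;
    // "freeing" is just reusing the slab, with no device-wide sync per op.
    float* slab;
    cudaMalloc(&slab, bytes);
    for (int iter = 0; iter < 100; ++iter) {
        work<<<(unsigned)((n + 255) / 256), 256>>>(slab, n);
    }
    cudaFree(slab);
    cudaDeviceSynchronize();
    printf("done\n");
    return 0;
}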

Page 14

RAPIDS MEMORY MANAGER (RMM)

C/C++

Fast, Asynchronous Device Memory Management

RMM_ALLOC(&buffer, size_in_bytes, stream_id);

RMM_FREE(buffer, stream_id);

dev_ones = rmm.device_array(np.ones(count))
dev_twos = rmm.device_array_like(dev_ones)
# also rmm.to_device(), rmm.auto_device(), etc.

#include <rmm_thrust_allocator.h>
rmm::device_vector<int> dvec(size);

thrust::sort(rmm::exec_policy(stream)->on(stream), …);

Python: drop-in replacement for Numba API

Thrust: device vector and execution policies

Page 15

MANAGING MEMORY IN THE E2E PIPELINE

[Annotated pipeline diagram: some steps are a perf optimization, others are required to avoid OOM; at the marked point all ETL processing is done and the memory is stored in Arrow.]

Page 16

KEY MEMORY MANAGEMENT QUESTIONS

• Can we make memory management easier?

• Can we avoid artificial pre-processing of input data?

• Can we train on larger datasets?

Page 17

SOLUTION: UNIFIED MEMORY

[Diagram: successive cudaMallocManaged calls take GPU memory from empty to partially to fully occupied; once oversubscribed, pages are evicted to CPU memory and cudaMallocManaged continues to succeed.]
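A minimal sketch of what the diagram shows, assuming a Pascal-or-later GPU on Linux: cudaMallocManaged lets you allocate more than the physical GPU memory, and the driver migrates pages in and evicts them to CPU memory on demand:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void touch(char* p, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = 1;  // first-touch faults migrate pages to the GPU
}

int main() {
    size_t free_b, total_b;
    cudaMemGetInfo(&free_b, &total_b);

    // Ask for 1.5x the physical GPU memory: cudaMalloc would fail here,
    // cudaMallocManaged succeeds and evicts pages to CPU as needed.
    size_t bytes = total_b + total_b / 2;
    char* p = nullptr;
    if (cudaMallocManaged(&p, bytes) != cudaSuccess) { printf("alloc failed\n"); return 1; }

    const size_t chunk = size_t(1) << 28;  // touch in 256 MB chunks
    for (size_t off = 0; off < bytes; off += chunk) {
        size_t n = (off + chunk < bytes) ? chunk : bytes - off;
        touch<<<(unsigned)((n + 255) / 256), 256>>>(p + off, n);
    }
    cudaDeviceSynchronize();
    printf("touched %zu bytes with %zu bytes of GPU memory\n", bytes, total_b);
    cudaFree(p);
    return 0;
}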

Page 18

HOW TO USE UNIFIED MEMORY IN CUDF

from librmm_cffi import librmm_config as rmm_cfg

rmm_cfg.use_pool_allocator = True # default is False
rmm_cfg.use_managed_memory = True # default is False

Python

Page 19

IMPLEMENTATION DETAILS

Pool allocator (CNMEM):

if (mFlags & CNMEM_FLAGS_MANAGED) {
    CNMEM_DEBUG_INFO("cudaMallocManaged(%lu)\n", size);
    CNMEM_CHECK_CUDA(cudaMallocManaged(&data, size));
    CNMEM_CHECK_CUDA(cudaMemPrefetchAsync(data, size, mDevice));
} else {
    CNMEM_DEBUG_INFO("cudaMalloc(%lu)\n", size);
    CNMEM_CHECK_CUDA(cudaMalloc(&data, size));
}

Regular RMM allocation:

if (rmm::Manager::usePoolAllocator()) {
    RMM_CHECK(rmm::Manager::getInstance().registerStream(stream));
    RMM_CHECK_CNMEM(cnmemMalloc(reinterpret_cast<void**>(ptr), size, stream));
} else if (rmm::Manager::useManagedMemory()) {
    RMM_CHECK_CUDA(cudaMallocManaged(reinterpret_cast<void**>(ptr), size));
} else {
    RMM_CHECK_CUDA(cudaMalloc(reinterpret_cast<void**>(ptr), size));
}

Page 20

1. UNSPLIT DATASET “JUST WORKS”

[Line charts: GPU memory usage (GB) during ETL on the original unsplit dataset. With cudaMalloc the run crashes OOM at the Tesla V100 limit of 32 GB; with cudaMallocManaged, memory used and pool size grow past 32 GB and the run completes.]

Page 21

2. SPEED-UP ON CONVERSION

[Bar chart: DGX-1 time (s) for ETL, PREP, and ML on 20 quarters, cudaMalloc vs cudaMallocManaged.]

25% speed-up on PREP!

Page 22

3. LARGER ML TRAINING SET

[Bar chart: DGX-1 time (s) for ETL, PREP, and ML. With 20 quarters both cudaMalloc and cudaMallocManaged complete; with 28 quarters cudaMalloc hits OOM while cudaMallocManaged completes.]

Page 23

UNIFIED MEMORY GOTCHAS

1. UVM doesn’t work with CUDA IPC – be careful when sharing data between processes (see the sketch below)

Workaround: a separate (small) cudaMalloc pool for communication buffers

In the future it will work transparently with Linux HMM

2. Yes, you can oversubscribe, but there is a danger that it will just run very slowly

Capture Nsight or nvprof profiles to check eviction traffic

In the future RMM may show some warnings about this
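A minimal sketch of gotcha 1 and its workaround, assuming a producer process that wants to export a buffer: IPC handles can only be created for cudaMalloc memory, so communication buffers come from a small dedicated cudaMalloc pool while the bulk data stays managed:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Bulk data: managed, can oversubscribe, but NOT shareable via CUDA IPC.
    float* bulk = nullptr;
    cudaMallocManaged(&bulk, size_t(1) << 30);

    // Communication buffer: a small plain cudaMalloc allocation, which CAN
    // be exported to another process as an IPC handle.
    float* comm = nullptr;
    cudaMalloc(&comm, size_t(1) << 20);

    cudaIpcMemHandle_t handle;
    cudaError_t err = cudaIpcGetMemHandle(&handle, comm);
    printf("IPC handle for cudaMalloc buffer: %s\n", cudaGetErrorString(err));  // success

    err = cudaIpcGetMemHandle(&handle, bulk);  // expected to fail on managed memory
    printf("IPC handle for managed buffer:    %s\n", cudaGetErrorString(err));

    cudaFree(comm);
    cudaFree(bulk);
    return 0;
}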

Page 24

RECAP

Just to run the full pipeline on the GPU you need to:

• carefully partition input data

• adjust memory pool options throughout the pipeline

• limit training size to fit in memory

Unified Memory:

• makes life easier for data scientists – less tweaking!

• improves performance – sometimes it’s faster to allocate less often & oversubscribe

• enables easy experiments with larger datasets

Page 25

MEMORY MANAGEMENT IN THE FUTURE

Contribute to RAPIDS: https://github.com/rapidsai/cudf

Contribute to RMM: https://github.com/rapidsai/rmm

[Diagram: RMM adopters – cuDF, cuML, XGBoost, cuDNN, databases (BlazingDB, OmniSci), and the NEXT BIG THING.]

Page 26

UNIFIED MEMORY FOR DEEP LEARNING

Page 27

FROM ANALYTICS TO DEEP LEARNING

Data Preparation → Machine Learning → Deep Learning

Page 28

PYTORCH INTEGRATION

PyTorch uses a caching allocator to manage GPU memory:

• Small allocations are served from a fixed buffer (e.g., 1 MB)

• Large allocations are dedicated cudaMalloc calls

Trivial change (sketched below):

• Replace cudaMalloc with cudaMallocManaged

• Immediately call cudaMemPrefetchAsync to allocate the pages on the GPU – otherwise cuDNN may select sub-optimal kernels
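A minimal sketch of that change (not PyTorch's actual allocator code), assuming the target device and stream are known at allocation time:

#include <cuda_runtime.h>

// Where the caching allocator used to call cudaMalloc for a large block,
// allocate managed memory instead and immediately prefetch it so the pages
// are resident on the GPU before cuDNN picks its kernels.
cudaError_t alloc_block(void** ptr, size_t size, int device, cudaStream_t stream) {
    cudaError_t err = cudaMallocManaged(ptr, size);  // was: cudaMalloc(ptr, size)
    if (err != cudaSuccess) return err;
    return cudaMemPrefetchAsync(*ptr, size, device, stream);
}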

Page 29

PYTORCH ALLOCATOR VS RMM

PyTorch Caching Allocator:

• Memory pool to avoid synchronization on malloc/free

• Directly uses CUDA APIs for memory allocations

• Pool size not fixed

• Specific to the PyTorch C++ library

RMM:

• Memory pool to avoid synchronization on malloc/free

• Uses cnmem for memory allocation and management

• Reserves half the available GPU memory for the pool

• Re-usable across projects, with interfaces for various languages

Page 30

WORKLOADS

Image models: ResNet-1001, DenseNet-264, VNet

[Block diagram: BN–ReLU–Conv 1x1 → BN–ReLU–Conv 3x3 → BN–ReLU–Conv 1x1, with a skip connection (+).]

Page 31

WORKLOADS

Language models: word language modelling

• Dictionary size = 33278

• Embedding size = 256

• LSTM units = 256

• Backpropagation through time = 1408 and 2800

[Diagram: Embedding → LSTM → FC → Softmax → Loss]

Page 32

WORKLOADS: Baseline Training Performance on V100-32GB

Model            FP16 Batch Size   FP16 Samples/sec   FP32 Batch Size   FP32 Samples/sec
ResNet-1001      98                98.7               48                44.3
DenseNet-264     218               255.8              109               143.1
VNet             30                3.56               15                3.4
Lang_Model-1408  32                94.9               40                77.9
Lang_Model-2800  16                46.5               18                35.7

Optimal Batch Size Selected for High Throughput

All results in this presentation are using PyTorch 1.0rc1, R418 driver, Tesla V100-32GB

Page 33

GPU OVERSUBSCRIPTION: Up to 3x Optimal Batch Size

[Line charts: samples/sec vs batch size for ResNet-1001 (batch sizes 2–290) and DenseNet-264 (batch sizes 2–650), FP16 and FP32.]

Page 34

GPU OVERSUBSCRIPTION: Fill

[Diagram: GPU and CPU memory; pages fill GPU memory.]

Page 35

GPU OVERSUBSCRIPTION: Evict

[Diagram: GPU memory is full; pages are evicted to CPU memory.]

Page 36

GPU OVERSUBSCRIPTION: Page Fault–Evict–Fetch

[Diagram: a GPU page fault evicts a resident page to CPU memory and fetches the faulting page onto the GPU.]

Page 37

GPU OVERSUBSCRIPTION: Results

Model            FP16 Batch Size   FP16 Samples/sec   FP32 Batch Size   FP32 Samples/sec
ResNet-1001      202               10.1               98                5
DenseNet-264     430               22.3               218               12.1
VNet             32                3                  32                1.1
Lang_Model-1408  44                8.4                44                10
Lang_Model-2800  22                4.1                22                4.9

Page 38

GPU OVERSUBSCRIPTION: Page Faults – ResNet-1001 Training Iteration

[Line chart: page fault count (up to ~1,200,000) vs oversubscription ratio (batch size / optimal batch size, 1–3) for ResNet-1001.]

Page 39

GPU OVERSUBSCRIPTION

Add cudaMemPrefetchAsync before kernels are called

Manual API Prefetch

cudaMemPrefetchAsync(…) // input, output, wparam

cudnnConvolutionForward(…)

---------------------------------------------------

cudaMemPrefetchAsync(…) // A, B, C

kernelPointWiseApply3(…)
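A compilable sketch of the manual-prefetch pattern above, with a hypothetical saxpy kernel standing in for cudnnConvolutionForward / kernelPointWiseApply3 and managed buffers assumed:

#include <cuda_runtime.h>

__global__ void saxpy(const float* x, const float* y, float* z, float a, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = a * x[i] + y[i];
}

void run(const float* x, const float* y, float* z, float a, size_t n,
         int device, cudaStream_t s) {
    size_t bytes = n * sizeof(float);
    // Prefetch every buffer the kernel touches, on the same stream, so the
    // pages are resident before the kernel starts instead of faulting.
    cudaMemPrefetchAsync(x, bytes, device, s);  // input
    cudaMemPrefetchAsync(y, bytes, device, s);  // input
    cudaMemPrefetchAsync(z, bytes, device, s);  // output
    saxpy<<<(unsigned)((n + 255) / 256), 256, 0, s>>>(x, y, z, a, n);
}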

Page 40

GPU OVERSUBSCRIPTION: No Prefetch vs Manual API Prefetch

Page 41

GPU OVERSUBSCRIPTION: Speed-up from Manual API Prefetch

[Bar chart: speed-up (x) from cudaMemPrefetchAsync vs no prefetch for ResNet-1001, DenseNet-264, VNet, Lang_Model-1408, and Lang_Model-2800, FP16 and FP32.]

Observe up to 1.6x speed-up

Page 42

GPU OVERSUBSCRIPTION

Prefetch memory before kernel to improve performance

cudaMemPrefetchAsync takes CPU cycles – degrades performance when not required

Automatic prefetching needed to achieve high performance

Prefetch Only When Needed

[Line chart: ResNet-1001 samples/sec vs batch size (2–290), FP16 vs FP16_prefetch.]

Page 43

DRIVER PREFETCH

Driver-initiated (density-based) prefetching from CPU to GPU (sketched below):

• Each GPU page is tracked as a chunk of smaller sysmem pages

• Driver logic: prefetch the rest of the GPU page when 51% of it has migrated to the GPU

• Changing the threshold to 5% (aggressive driver prefetching) gives up to a 20% performance gain vs the default settings
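An illustrative sketch of the density heuristic just described – not actual driver code, and the page/chunk sizes are assumptions:

#include <cstdio>

// A large GPU page is tracked as N smaller sysmem-page chunks. Once the
// migrated fraction crosses the density threshold, the driver prefetches
// the rest of the GPU page in one go.
static bool should_prefetch_rest(int migrated, int total, double threshold) {
    return double(migrated) / double(total) >= threshold;
}

int main() {
    const int total = 32;  // e.g., a 2 MB GPU page tracked as 32 x 64 KB chunks (assumed)
    for (int migrated = 1; migrated <= total; ++migrated) {
        bool dflt = should_prefetch_rest(migrated, total, 0.51);  // default: 51%
        bool aggr = should_prefetch_rest(migrated, total, 0.05);  // aggressive: 5%
        if (aggr && !dflt)
            printf("chunks %2d/%d migrated: aggressive prefetches the rest, default waits\n",
                   migrated, total);
    }
    return 0;
}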

Page 44

FRAMEWORK FUTURE

Frameworks can develop the intelligence to insert prefetches before calling GPU kernels (see the sketch below):

• Smart evictions: activations only

• Lazy prefetch: catch kernel calls right before execution and add prefetch calls

• Eager prefetch: identify and add prefetch calls before the kernels are called

Replace:
    nn.Conv2d(…) (hook)
with:
    nn.Prefetch(…)
    nn.Conv2d(…)

[Diagram: x · W1 → y, y · W2 → z; prefetches execute in parallel with compute.]
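A minimal sketch of the eager-prefetch idea, assuming managed weight buffers: while layer i runs on the compute stream, layer i+1's weights are prefetched on a side stream so the migration overlaps compute:

#include <cuda_runtime.h>

__global__ void layer_kernel(const float* in, const float* w, float* out, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * w[i];  // stand-in for a real layer
}

// Eager prefetch: start migrating the next layer's weights on a separate
// stream before launching the current layer, so copy overlaps compute.
// acts must hold layers + 1 activation buffers of n floats each.
void forward(float** acts, float** weights, size_t n, int layers, int device,
             cudaStream_t compute, cudaStream_t prefetch) {
    size_t bytes = n * sizeof(float);
    cudaMemPrefetchAsync(weights[0], bytes, device, prefetch);  // warm the first layer
    for (int i = 0; i < layers; ++i) {
        if (i + 1 < layers)
            cudaMemPrefetchAsync(weights[i + 1], bytes, device, prefetch);
        layer_kernel<<<(unsigned)((n + 255) / 256), 256, 0, compute>>>(
            acts[i], weights[i], acts[i + 1], n);
    }
}

Using two streams is what lets the prefetch and the kernel run in parallel, as in the diagram; a lazy-prefetch variant would instead issue the cudaMemPrefetchAsync inside a hook immediately before each launch.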

Page 45

TAKEAWAY

Unified Memory oversubscription solves the memory pool fragmentation issue

Simple way to train bigger models and on larger input data

Minimal user effort, no change in framework programming

Frameworks can get better performance by adding prefetches

Try it out and contribute:

https://github.com/rapidsai/cudf

https://github.com/rapidsai/rmm
