TRANSCRIPT
Nikolay Sakharnykh, Chirayu Garg, and Dmitri Vainbrand, Thu Mar 19, 3:00 PM
UNIFIED MEMORY FOR DATA ANALYTICS AND DEEP LEARNING
RAPIDS: CUDA-Accelerated Data Science Libraries
[Stack diagram: Python on top; Dask / Spark; DL frameworks (cuDNN) alongside the RAPIDS libraries (cuDF, cuML, cuGraph), all built on Apache Arrow in GPU memory; CUDA at the base]
MORTGAGE PIPELINE: ETL
https://github.com/rapidsai/notebooks/blob/master/mortgage/E2E.ipynb
[Pipeline diagram: CSV files are read into a dataframe (read CSV), then filter, join, and groupby steps produce Arrow output]
MORTGAGE PIPELINE: PREP + ML
https://github.com/rapidsai/notebooks/blob/master/mortgage/E2E.ipynb
[Pipeline diagram: Arrow tables are concatenated into a dataframe, converted to a DMatrix, and passed to XGBoost]
GTC EU KEYNOTE RESULTS ON DGX-1
[Bar chart: mortgage workflow time breakdown on DGX-1 in seconds (0-140), split into ETL, PREP, and ML]
MAXIMUM MEMORY USAGE ON DGX-1
[Bar chart: peak memory usage in GB for GPU IDs 1-8, all under the Tesla V100 limit of 32 GB]
ETL INPUT
https://rapidsai.github.io/demos/datasets/mortgage-data
[Diagram: the original input set of 112 quarters (~2-3 GB CSV files) is pre-split into 240 CSV parts (1 GB each)]
CAN WE AVOID INPUT SPLITTING?
[Two charts of GPU memory usage (GB) over time during ETL: with the 112-part split input, usage stays under the Tesla V100 limit of 32 GB; with the original unsplit dataset, usage reaches the limit and the run crashes with OOM]
ML INPUT
[Pipeline diagram: Arrow tables → concat → dataframe → convert → DMatrix → XGBoost]
Some number of quarters is used for ML training.
CAN WE TRAIN ON MORE DATA?
[Two charts of GPU memory usage (GB) over time during PREP: training on 20 of the 112 parts stays under the Tesla V100 limit of 32 GB; training on 28 parts reaches the limit and crashes with OOM]
HOW MEMORY IS MANAGED IN RAPIDS
RAPIDS MEMORY MANAGER
RAPIDS Memory Manager (RMM) is:
• A replacement allocator for CUDA Device Memory
• A pool allocator to make CUDA device memory allocation faster & asynchronous
• A central place for all device memory allocations in cuDF and other RAPIDS libraries
https://github.com/rapidsai/rmm
WHY DO WE NEED MEMORY POOLS?
cudaMalloc/cudaFree are synchronous
• block the device
cudaMalloc/cudaFree are expensive
• cudaFree must zero memory for security
• cudaMalloc creates peer mappings for all GPUs
Using cnmem memory pool improves RAPIDS ETL time by 10x
cudaMalloc(&buffer, size_in_bytes);
cudaFree(buffer);
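As a rough illustration (a hypothetical micro-benchmark, not from the talk), the sketch below shows the cost a pool avoids: calling cudaMalloc/cudaFree on every iteration pays the driver overhead each time, while handing out memory from one up-front allocation does not touch the CUDA allocator at all.

// Hypothetical micro-benchmark (sketch): the cost a memory pool avoids.
// Loop 1 pays a device-synchronizing cudaMalloc/cudaFree on every iteration;
// loop 2 reuses one up-front allocation, which is how a pool amortizes the cost.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t size = 64 << 20;  // 64 MB per allocation
    const int iters = 100;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) {
        void* p = nullptr;
        cudaMalloc(&p, size);   // creates peer mappings, synchronizes
        cudaFree(p);            // zeroes memory for security, blocks the device
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float msDriver = 0;
    cudaEventElapsedTime(&msDriver, start, stop);

    void* pool = nullptr;
    cudaMalloc(&pool, size);    // one up-front allocation, reused below
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) {
        void* p = pool;         // a real pool (e.g. CNMEM) adds only bookkeeping
        (void)p;
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float msPool = 0;
    cudaEventElapsedTime(&msPool, start, stop);
    cudaFree(pool);

    printf("cudaMalloc/cudaFree loop: %.2f ms, pooled reuse loop: %.2f ms\n",
           msDriver, msPool);
    return 0;
}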
RAPIDS MEMORY MANAGER (RMM)
Fast, Asynchronous Device Memory Management

C/C++:
RMM_ALLOC(&buffer, size_in_bytes, stream_id);
RMM_FREE(buffer, stream_id);

Python: drop-in replacement for Numba API
dev_ones = rmm.device_array(np.ones(count))
dev_twos = rmm.device_array_like(dev_ones)
# also rmm.to_device(), rmm.auto_device(), etc.

Thrust: device vector and execution policies
#include <rmm_thrust_allocator.h>
rmm::device_vector<int> dvec(size);
thrust::sort(rmm::exec_policy(stream)->on(stream), …);
MANAGING MEMORY IN THE E2E PIPELINE
[Annotated pipeline diagram: performance optimization is required to avoid OOM; at the point where all ETL processing is done, the memory is stored in Arrow]
KEY MEMORY MANAGEMENT QUESTIONS
• Can we make memory management easier?
• Can we avoid artificial pre-processing of input data?
• Can we train on larger datasets?
SOLUTION: UNIFIED MEMORY
[Diagram: successive cudaMallocManaged allocations take GPU memory from empty to partially to fully occupied; on oversubscription, pages are evicted to CPU memory so allocations can exceed physical GPU memory]
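A minimal sketch of what the diagram describes (hypothetical sizes and kernel, not from the slides): with cudaMallocManaged you can allocate and touch more memory than the GPU physically has, and the driver migrates pages in and evicts cold pages to CPU memory on demand (this requires Linux and a Pascal or newer GPU).

// Oversubscription sketch: allocate 1.5x the physical GPU memory as managed
// memory and write to all of it from the device; the driver pages data in
// and evicts older pages to system memory as needed.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(char* buf, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) buf[i] = 1;
}

int main() {
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);

    // This size would fail with cudaMalloc but works with cudaMallocManaged.
    size_t n = totalB + totalB / 2;
    char* buf = nullptr;
    cudaError_t err = cudaMallocManaged(&buf, n);
    if (err != cudaSuccess) {
        printf("alloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    const size_t chunk = 256 << 20;  // touch memory in 256 MB chunks
    for (size_t off = 0; off < n; off += chunk) {
        size_t len = (off + chunk <= n) ? chunk : (n - off);
        touch<<<(unsigned)((len + 255) / 256), 256>>>(buf + off, len);
    }
    cudaDeviceSynchronize();
    printf("touched %zu bytes on a GPU with %zu bytes of memory\n", n, totalB);
    cudaFree(buf);
    return 0;
}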
HOW TO USE UNIFIED MEMORY IN CUDF
Python:
from librmm_cffi import librmm_config as rmm_cfg
rmm_cfg.use_pool_allocator = True   # default is False
rmm_cfg.use_managed_memory = True   # default is False
IMPLEMENTATION DETAILS
Pool allocator (CNMEM):
if (mFlags & CNMEM_FLAGS_MANAGED) {
    CNMEM_DEBUG_INFO("cudaMallocManaged(%lu)\n", size);
    CNMEM_CHECK_CUDA(cudaMallocManaged(&data, size));
    CNMEM_CHECK_CUDA(cudaMemPrefetchAsync(data, size, mDevice));
} else {
    CNMEM_DEBUG_INFO("cudaMalloc(%lu)\n", size);
    CNMEM_CHECK_CUDA(cudaMalloc(&data, size));
}

Regular RMM allocation:
if (rmm::Manager::usePoolAllocator()) {
    RMM_CHECK(rmm::Manager::getInstance().registerStream(stream));
    RMM_CHECK_CNMEM(cnmemMalloc(reinterpret_cast<void**>(ptr), size, stream));
} else if (rmm::Manager::useManagedMemory()) {
    RMM_CHECK_CUDA(cudaMallocManaged(reinterpret_cast<void**>(ptr), size));
} else {
    RMM_CHECK_CUDA(cudaMalloc(reinterpret_cast<void**>(ptr), size));
}
1. UNSPLIT DATASET “JUST WORKS”
[Two charts of GPU memory usage (GB) during ETL on the original dataset: with cudaMallocManaged, memory used and pool size grow past the Tesla V100 limit of 32 GB and the run completes; with cudaMalloc, the run crashes with OOM at the limit]
2. SPEED-UP ON CONVERSION
[Stacked bar chart: DGX-1 time (s) for ETL, PREP, and ML on 20 quarters, cudaMalloc vs cudaMallocManaged]
25% speed-up on PREP!
3. LARGER ML TRAINING SET
[Stacked bar chart: DGX-1 time (s) for ETL, PREP, and ML on 20 and 28 quarters; 28 quarters with cudaMalloc hits OOM, while with cudaMallocManaged it completes]
UNIFIED MEMORY GOTCHAS
1. UVM doesn’t work with CUDA IPC - be careful when sharing data between processes
Workaround - keep a separate (small) cudaMalloc pool for communication buffers
In the future this will work transparently with Linux HMM
2. Yes, you can oversubscribe, but there is a danger that it will just run very slowly
Capture Nsight or nvprof profiles to check eviction traffic
In the future RMM may show warnings about this
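A minimal sketch of the workaround for gotcha 1 (hypothetical helper functions; transporting the handle between processes is elided): cudaIpcGetMemHandle only accepts cudaMalloc'd memory, so managed data is staged through a small cudaMalloc buffer before being shared.

// Sketch of the IPC workaround: cudaIpcGetMemHandle requires memory from
// cudaMalloc, so keep a small cudaMalloc'd staging buffer for cross-process
// sharing and copy managed data into it before exporting.
#include <cuda_runtime.h>

cudaIpcMemHandle_t exportBuffer(const void* managedData, size_t size, void** staging) {
    // The staging buffer must come from cudaMalloc; managed pointers are not IPC-able.
    cudaMalloc(staging, size);
    cudaMemcpy(*staging, managedData, size, cudaMemcpyDefault);
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, *staging);
    return handle;  // send this handle to the peer process (e.g. over a pipe)
}

void importBuffer(const cudaIpcMemHandle_t& handle, void** peerPtr) {
    // In the receiving process: map the exporter's buffer into our address space.
    cudaIpcOpenMemHandle(peerPtr, handle, cudaIpcMemLazyEnablePeerAccess);
}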
RECAP
Just to run the full pipeline on the GPU you need to:
• carefully partition input data
• adjust memory pool options throughout the pipeline
• limit training size to fit in memory
Unified Memory:
• makes life easier for data scientists – less tweaking!
• improves performance – sometimes it’s faster to allocate less often & oversubscribe
• enables easy experiments with larger datasets
MEMORY MANAGEMENT IN THE FUTURE
[Ecosystem diagram: cuDF, cuML, cuDNN, XGBoost, databases (BlazingDB, OmniSci), and the next big thing]
Contribute to RAPIDS: https://github.com/rapidsai/cudf
Contribute to RMM: https://github.com/rapidsai/rmm
UNIFIED MEMORY FOR DEEP LEARNING
FROM ANALYTICS TO DEEP LEARNING
Data Preparation → Machine Learning → Deep Learning
PYTORCH INTEGRATION
PyTorch uses a caching allocator to manage GPU memory
Small allocations are served from fixed-size blocks (for example, 1 MB)
Large allocations are dedicated cudaMalloc calls
Trivial change:
Replace cudaMalloc with cudaMallocManaged
Immediately call cudaMemPrefetchAsync to allocate pages on the GPU
Otherwise cuDNN may select sub-optimal kernels
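A minimal sketch of the described change (hypothetical function name; the real edit lives inside PyTorch's caching allocator): swap cudaMalloc for cudaMallocManaged and prefetch right away so the pages start out resident on the GPU.

// Sketch of the "trivial change" in an allocator's raw-allocation path:
// managed allocation plus an immediate prefetch so pages are GPU-resident
// (otherwise cuDNN heuristics may see non-resident memory and pick slower kernels).
#include <cuda_runtime.h>

cudaError_t rawAlloc(void** ptr, size_t size, cudaStream_t stream) {
    int device = 0;
    cudaGetDevice(&device);

    // Before: cudaError_t err = cudaMalloc(ptr, size);
    cudaError_t err = cudaMallocManaged(ptr, size);
    if (err != cudaSuccess) return err;

    // Populate the pages on the current GPU right away.
    return cudaMemPrefetchAsync(*ptr, size, device, stream);
}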
PYTORCH ALLOCATOR VS RMM
PyTorch Caching Allocator:
• Memory pool to avoid synchronization on malloc/free
• Directly uses CUDA APIs for memory allocations
• Pool size not fixed
• Specific to the PyTorch C++ library
RMM:
• Memory pool to avoid synchronization on malloc/free
• Uses CNMEM for memory allocation and management
• Reserves half the available GPU memory for the pool
• Reusable across projects, with interfaces for various languages
WORKLOADS
Image Models: ResNet-1001, DenseNet-264, VNet
[Block diagram: BN-ReLU-Conv 1x1 → BN-ReLU-Conv 3x3 → BN-ReLU-Conv 1x1, with a residual (+) connection]
WORKLOADS
Language Models: Word Language Modeling
Dictionary size = 33278
Embedding size = 256
LSTM units = 256
Backpropagation through time = 1408 and 2800
[Diagram: Embedding → LSTM → FC → Softmax → Loss]
WORKLOADS
Baseline Training Performance on V100-32GB
Optimal batch size selected for high throughput

Model            | FP16 Batch Size | FP16 Samples/sec | FP32 Batch Size | FP32 Samples/sec
ResNet-1001      | 98              | 98.7             | 48              | 44.3
DenseNet-264     | 218             | 255.8            | 109             | 143.1
VNet             | 30              | 3.56             | 15              | 3.4
Lang_Model-1408  | 32              | 94.9             | 40              | 77.9
Lang_Model-2800  | 16              | 46.5             | 18              | 35.7

All results in this presentation use PyTorch 1.0rc1, the R418 driver, and Tesla V100-32GB
GPU OVERSUBSCRIPTION
Up to 3x Optimal Batch Size
[Two charts: training throughput (samples/sec) vs batch size, FP16 and FP32, for ResNet-1001 (batch sizes 2-290) and DenseNet-264 (batch sizes 2-650)]
GPU OVERSUBSCRIPTION
[Three diagrams of GPU and CPU memory during oversubscription: Fill - managed pages are allocated on the GPU until memory is full; Evict - resident pages are moved out to CPU memory to make room; Page Fault-Evict-Fetch - touching a non-resident page triggers a fault, an eviction, and a fetch from CPU memory]
GPU OVERSUBSCRIPTION
Results

Model            | FP16 Batch Size | FP16 Samples/sec | FP32 Batch Size | FP32 Samples/sec
ResNet-1001      | 202             | 10.1             | 98              | 5
DenseNet-264     | 430             | 22.3             | 218             | 12.1
VNet             | 32              | 3                | 32              | 1.1
Lang_Model-1408  | 44              | 8.4              | 44              | 10
Lang_Model-2800  | 22              | 4.1              | 22              | 4.9
GPU OVERSUBSCRIPTION
Page Faults - ResNet-1001 Training Iteration
[Chart: page fault count (up to ~1.2M) vs oversubscription factor (batch size / optimal batch size), from 1x to 3x]
GPU OVERSUBSCRIPTION
Manual API Prefetch
Add cudaMemPrefetchAsync before kernels are called:
cudaMemPrefetchAsync(…)  // input, output, wparam
cudnnConvolutionForward(…)
---------------------------------------------------
cudaMemPrefetchAsync(…)  // A, B, C
kernelPointWiseApply3(…)
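The same pattern as a runnable sketch (hypothetical kernel, standing in for the cuDNN and pointwise kernels above): prefetch every managed buffer the kernel touches onto the GPU, on the kernel's stream, before launching it.

// Manual prefetch pattern: move managed buffers to the GPU on the same stream
// the kernel will run on, so migration happens up front instead of via faults.
#include <cuda_runtime.h>

__global__ void axpy(const float* x, const float* y, float* out, float a, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * x[i] + y[i];
}

void launchWithPrefetch(const float* x, const float* y, float* out,
                        float a, size_t n, cudaStream_t stream) {
    int device = 0;
    cudaGetDevice(&device);

    // Prefetch every buffer the kernel reads or writes (inputs and output).
    cudaMemPrefetchAsync(x, n * sizeof(float), device, stream);
    cudaMemPrefetchAsync(y, n * sizeof(float), device, stream);
    cudaMemPrefetchAsync(out, n * sizeof(float), device, stream);

    axpy<<<(unsigned)((n + 255) / 256), 256, 0, stream>>>(x, y, out, a, n);
}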
GPU OVERSUBSCRIPTION
No Prefetch vs Manual API Prefetch
[Figure comparing a training iteration with no prefetch and with manual API prefetch]
GPU OVERSUBSCRIPTION
Speed-up from Manual API Prefetch
[Bar chart: speed-up (x) with cudaMemPrefetchAsync vs no prefetch, FP16 and FP32, for ResNet-1001, DenseNet-264, VNet, Lang_Model-1408, and Lang_Model-2800]
Observe up to 1.6x speed-up
GPU OVERSUBSCRIPTION
Prefetch Only When Needed
Prefetching memory before a kernel improves performance
cudaMemPrefetchAsync takes CPU cycles – it degrades performance when not required
Automatic prefetching is needed to achieve high performance
[Chart: ResNet-1001 samples/sec vs batch size (2-290), FP16 vs FP16 with prefetch]
DRIVER PREFETCH
Aggressive driver prefetching
Driver-initiated (density-based) prefetching from CPU to GPU
GPU pages are tracked as chunks of smaller sysmem pages
Driver logic: prefetch the rest of the GPU page once 51% of it has migrated to the GPU
Changing the threshold to 5% gives up to 20% performance gain vs the default settings
FRAMEWORK FUTURE
Frameworks can develop intelligence to insert prefetches before calling GPU kernels
Smart evictions: evict activations only
Lazy prefetch: catch kernel calls right before execution and add prefetch calls
Eager prefetch: identify and add prefetch calls before the kernels are called, e.g. hook nn.Conv2d(…) and replace it with nn.Prefetch(…) followed by nn.Conv2d(…)
[Diagram: x * W1 → y, then y * W2 → z; the prefetch for W2 executes in parallel with the first kernel, as sketched below]
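A minimal sketch of the eager-prefetch idea from the diagram (hypothetical toy kernel and stream setup, not framework code): while the first layer's kernel runs on the compute stream, the next layer's weights are prefetched on a separate stream so the migration overlaps with compute.

// Eager prefetch sketch: overlap prefetching the next layer's weights (W2)
// with the current layer's compute, using a dedicated prefetch stream.
#include <cuda_runtime.h>

// Toy "layer": out[i] = W[i] * in[i]
__global__ void layer(const float* W, const float* in, float* out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = W[i] * in[i];
}

void forward(const float* W1, const float* W2, const float* x,
             float* y, float* z, size_t n,
             cudaStream_t compute, cudaStream_t prefetch) {
    int device = 0;
    cudaGetDevice(&device);

    // Launch layer 1 on the compute stream...
    layer<<<(unsigned)((n + 255) / 256), 256, 0, compute>>>(W1, x, y, n);

    // ...and, in parallel, eagerly prefetch layer 2's weights on another stream.
    cudaMemPrefetchAsync(W2, n * sizeof(float), device, prefetch);

    // Layer 2 then finds W2 already resident (or in flight) when it starts.
    layer<<<(unsigned)((n + 255) / 256), 256, 0, compute>>>(W2, y, z, n);
}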
TAKEAWAY
Unified Memory oversubscription solves the memory pool fragmentation issue
It is a simple way to train bigger models and on larger input data
Minimal user effort, no change in framework programming
Frameworks can get better performance by adding prefetches
Try it out and contribute:
https://github.com/rapidsai/cudf
https://github.com/rapidsai/rmm