TRANSCRIPT
Nikolay Sakharnykh, Chirayu Garg, and Dmitri Vainbrand, Thu Mar 19, 3:00 PM
UNIFIED MEMORY FOR DATA ANALYTICS AND DEEP LEARNING
RAPIDS: CUDA-Accelerated Data Science Libraries
[Stack diagram: Python on top; Dask / Spark; DL frameworks (cuDNN) alongside the RAPIDS libraries (cuDF, cuML, cuGraph), all built on Apache Arrow in GPU memory; CUDA at the base]
MORTGAGE PIPELINE: ETL
https://github.com/rapidsai/notebooks/blob/master/mortgage/E2E.ipynb
[Pipeline diagram: CSV files are read into a dataframe (read CSV), then filter, join, and groupby steps produce Arrow output]
MORTGAGE PIPELINE: PREP + ML
https://github.com/rapidsai/notebooks/blob/master/mortgage/E2E.ipynb
[Pipeline diagram: Arrow tables are concatenated into a dataframe, converted to a DMatrix, and passed to XGBoost]
GTC EU KEYNOTE RESULTS ON DGX-1
[Bar chart: mortgage workflow time breakdown on DGX-1 in seconds (0-140), split into ETL, PREP, and ML]
MAXIMUM MEMORY USAGE ON DGX-1
[Bar chart: peak memory usage in GB for GPU IDs 1-8, all under the Tesla V100 limit of 32 GB]
ETL INPUT
https://rapidsai.github.io/demos/datasets/mortgage-data
[Diagram: the original input set of 112 quarters (~2-3 GB CSV files) is pre-split into 240 CSV parts (1 GB each)]
CAN WE AVOID INPUT SPLITTING?
[Two charts of GPU memory usage (GB) over time during ETL: with the 112-part split input, usage stays under the Tesla V100 limit of 32 GB; with the original unsplit dataset, usage reaches the limit and the run crashes with OOM]
ML INPUT
[Pipeline diagram: Arrow tables → concat → dataframe → convert → DMatrix → XGBoost]
Some number of quarters is used for ML training.
CAN WE TRAIN ON MORE DATA?
[Two charts of GPU memory usage (GB) over time during PREP: training on 20 of the 112 parts stays under the Tesla V100 limit of 32 GB; training on 28 parts reaches the limit and crashes with OOM]
HOW MEMORY IS MANAGED IN RAPIDS
RAPIDS MEMORY MANAGER
RAPIDS Memory Manager (RMM) is:
• A replacement allocator for CUDA Device Memory
• A pool allocator to make CUDA device memory allocation faster & asynchronous
• A central place for all device memory allocations in cuDF and other RAPIDS libraries
https://github.com/rapidsai/rmm
WHY DO WE NEED MEMORY POOLS?
cudaMalloc/cudaFree are synchronous
• block the device
cudaMalloc/cudaFree are expensive
• cudaFree must zero memory for security
• cudaMalloc creates peer mappings for all GPUs
Using cnmem memory pool improves RAPIDS ETL time by 10x
cudaMalloc(&buffer, size_in_bytes);
cudaFree(buffer);
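As a rough illustration (a hypothetical micro-benchmark, not from the talk), the sketch below shows the cost a pool avoids: calling cudaMalloc/cudaFree on every iteration pays the driver overhead each time, while handing out memory from one up-front allocation does not touch the CUDA allocator at all.

// Hypothetical micro-benchmark (sketch): the cost a memory pool avoids.
// Loop 1 pays a device-synchronizing cudaMalloc/cudaFree on every iteration;
// loop 2 reuses one up-front allocation, which is how a pool amortizes the cost.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t size = 64 << 20;  // 64 MB per allocation
    const int iters = 100;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) {
        void* p = nullptr;
        cudaMalloc(&p, size);   // creates peer mappings, synchronizes
        cudaFree(p);            // zeroes memory for security, blocks the device
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float msDriver = 0;
    cudaEventElapsedTime(&msDriver, start, stop);

    void* pool = nullptr;
    cudaMalloc(&pool, size);    // one up-front allocation, reused below
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) {
        void* p = pool;         // a real pool (e.g. CNMEM) adds only bookkeeping
        (void)p;
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float msPool = 0;
    cudaEventElapsedTime(&msPool, start, stop);
    cudaFree(pool);

    printf("cudaMalloc/cudaFree loop: %.2f ms, pooled reuse loop: %.2f ms\n",
           msDriver, msPool);
    return 0;
}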
RAPIDS MEMORY MANAGER (RMM)
Fast, Asynchronous Device Memory Management

C/C++:
RMM_ALLOC(&buffer, size_in_bytes, stream_id);
RMM_FREE(buffer, stream_id);

Python: drop-in replacement for Numba API
dev_ones = rmm.device_array(np.ones(count))
dev_twos = rmm.device_array_like(dev_ones)
# also rmm.to_device(), rmm.auto_device(), etc.

Thrust: device vector and execution policies
#include <rmm_thrust_allocator.h>
rmm::device_vector<int> dvec(size);
thrust::sort(rmm::exec_policy(stream)->on(stream), …);
MANAGING MEMORY IN THE E2E PIPELINE
[Annotated pipeline diagram: performance optimization is required to avoid OOM; at the point where all ETL processing is done, the memory is stored in Arrow]
KEY MEMORY MANAGEMENT QUESTIONS
• Can we make memory management easier?
• Can we avoid artificial pre-processing of input data?
• Can we train on larger datasets?
SOLUTION: UNIFIED MEMORY
[Diagram: successive cudaMallocManaged allocations take GPU memory from empty to partially to fully occupied; on oversubscription, pages are evicted to CPU memory so allocations can exceed physical GPU memory]
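A minimal sketch of what the diagram describes (hypothetical sizes and kernel, not from the slides): with cudaMallocManaged you can allocate and touch more memory than the GPU physically has, and the driver migrates pages in and evicts cold pages to CPU memory on demand (this requires Linux and a Pascal or newer GPU).

// Oversubscription sketch: allocate 1.5x the physical GPU memory as managed
// memory and write to all of it from the device; the driver pages data in
// and evicts older pages to system memory as needed.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(char* buf, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) buf[i] = 1;
}

int main() {
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);

    // This size would fail with cudaMalloc but works with cudaMallocManaged.
    size_t n = totalB + totalB / 2;
    char* buf = nullptr;
    cudaError_t err = cudaMallocManaged(&buf, n);
    if (err != cudaSuccess) {
        printf("alloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    const size_t chunk = 256 << 20;  // touch memory in 256 MB chunks
    for (size_t off = 0; off < n; off += chunk) {
        size_t len = (off + chunk <= n) ? chunk : (n - off);
        touch<<<(unsigned)((len + 255) / 256), 256>>>(buf + off, len);
    }
    cudaDeviceSynchronize();
    printf("touched %zu bytes on a GPU with %zu bytes of memory\n", n, totalB);
    cudaFree(buf);
    return 0;
}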
HOW TO USE UNIFIED MEMORY IN CUDF
Python:
from librmm_cffi import librmm_config as rmm_cfg
rmm_cfg.use_pool_allocator = True   # default is False
rmm_cfg.use_managed_memory = True   # default is False
IMPLEMENTATION DETAILS
Pool allocator (CNMEM):
if (mFlags & CNMEM_FLAGS_MANAGED) {
    CNMEM_DEBUG_INFO("cudaMallocManaged(%lu)\n", size);
    CNMEM_CHECK_CUDA(cudaMallocManaged(&data, size));
    CNMEM_CHECK_CUDA(cudaMemPrefetchAsync(data, size, mDevice));
} else {
    CNMEM_DEBUG_INFO("cudaMalloc(%lu)\n", size);
    CNMEM_CHECK_CUDA(cudaMalloc(&data, size));
}

Regular RMM allocation:
if (rmm::Manager::usePoolAllocator()) {
    RMM_CHECK(rmm::Manager::getInstance().registerStream(stream));
    RMM_CHECK_CNMEM(cnmemMalloc(reinterpret_cast<void**>(ptr), size, stream));
} else if (rmm::Manager::useManagedMemory()) {
    RMM_CHECK_CUDA(cudaMallocManaged(reinterpret_cast<void**>(ptr), size));
} else {
    RMM_CHECK_CUDA(cudaMalloc(reinterpret_cast<void**>(ptr), size));
}
1. UNSPLIT DATASET “JUST WORKS”
[Two charts of GPU memory usage (GB) during ETL on the original dataset: with cudaMallocManaged, memory used and pool size grow past the Tesla V100 limit of 32 GB and the run completes; with cudaMalloc, the run crashes with OOM at the limit]
2. SPEED-UP ON CONVERSION
[Stacked bar chart: DGX-1 time (s) for ETL, PREP, and ML on 20 quarters, cudaMalloc vs cudaMallocManaged]
25% speed-up on PREP!
3. LARGER ML TRAINING SET
[Stacked bar chart: DGX-1 time (s) for ETL, PREP, and ML on 20 and 28 quarters; 28 quarters with cudaMalloc hits OOM, while with cudaMallocManaged it completes]
UNIFIED MEMORY GOTCHAS
1. UVM doesn’t work with CUDA IPC - be careful when sharing data between processes
Workaround - keep a separate (small) cudaMalloc pool for communication buffers
In the future this will work transparently with Linux HMM
2. Yes, you can oversubscribe, but there is a danger that it will just run very slowly
Capture Nsight or nvprof profiles to check eviction traffic
In the future RMM may show warnings about this
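A minimal sketch of the workaround for gotcha 1 (hypothetical helper functions; transporting the handle between processes is elided): cudaIpcGetMemHandle only accepts cudaMalloc'd memory, so managed data is staged through a small cudaMalloc buffer before being shared.

// Sketch of the IPC workaround: cudaIpcGetMemHandle requires memory from
// cudaMalloc, so keep a small cudaMalloc'd staging buffer for cross-process
// sharing and copy managed data into it before exporting.
#include <cuda_runtime.h>

cudaIpcMemHandle_t exportBuffer(const void* managedData, size_t size, void** staging) {
    // The staging buffer must come from cudaMalloc; managed pointers are not IPC-able.
    cudaMalloc(staging, size);
    cudaMemcpy(*staging, managedData, size, cudaMemcpyDefault);
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, *staging);
    return handle;  // send this handle to the peer process (e.g. over a pipe)
}

void importBuffer(const cudaIpcMemHandle_t& handle, void** peerPtr) {
    // In the receiving process: map the exporter's buffer into our address space.
    cudaIpcOpenMemHandle(peerPtr, handle, cudaIpcMemLazyEnablePeerAccess);
}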
RECAP
Just to run the full pipeline on the GPU you need to:
• carefully partition input data
• adjust memory pool options throughout the pipeline
• limit training size to fit in memory
Unified Memory:
• makes life easier for data scientists – less tweaking!
• improves performance – sometimes it’s faster to allocate less often & oversubscribe
• enables easy experiments with larger datasets
MEMORY MANAGEMENT IN THE FUTURE
[Ecosystem diagram: cuDF, cuML, cuDNN, XGBoost, databases (BlazingDB, OmniSci), and the next big thing]
Contribute to RAPIDS: https://github.com/rapidsai/cudf
Contribute to RMM: https://github.com/rapidsai/rmm
UNIFIED MEMORY FOR DEEP LEARNING
FROM ANALYTICS TO DEEP LEARNING
Data Preparation → Machine Learning → Deep Learning
PYTORCH INTEGRATION
PyTorch uses a caching allocator to manage GPU memory
Small allocations are served from fixed-size blocks (for example, 1 MB)
Large allocations are dedicated cudaMalloc calls
Trivial change:
Replace cudaMalloc with cudaMallocManaged
Immediately call cudaMemPrefetchAsync to allocate pages on the GPU
Otherwise cuDNN may select sub-optimal kernels
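A minimal sketch of the described change (hypothetical function name; the real edit lives inside PyTorch's caching allocator): swap cudaMalloc for cudaMallocManaged and prefetch right away so the pages start out resident on the GPU.

// Sketch of the "trivial change" in an allocator's raw-allocation path:
// managed allocation plus an immediate prefetch so pages are GPU-resident
// (otherwise cuDNN heuristics may see non-resident memory and pick slower kernels).
#include <cuda_runtime.h>

cudaError_t rawAlloc(void** ptr, size_t size, cudaStream_t stream) {
    int device = 0;
    cudaGetDevice(&device);

    // Before: cudaError_t err = cudaMalloc(ptr, size);
    cudaError_t err = cudaMallocManaged(ptr, size);
    if (err != cudaSuccess) return err;

    // Populate the pages on the current GPU right away.
    return cudaMemPrefetchAsync(*ptr, size, device, stream);
}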
PYTORCH ALLOCATOR VS RMM
PyTorch Caching Allocator:
• Memory pool to avoid synchronization on malloc/free
• Directly uses CUDA APIs for memory allocations
• Pool size not fixed
• Specific to the PyTorch C++ library
RMM:
• Memory pool to avoid synchronization on malloc/free
• Uses CNMEM for memory allocation and management
• Reserves half the available GPU memory for the pool
• Reusable across projects, with interfaces for various languages
WORKLOADS
Image Models: ResNet-1001, DenseNet-264, VNet
[Block diagram: BN-ReLU-Conv 1x1 → BN-ReLU-Conv 3x3 → BN-ReLU-Conv 1x1, with a residual (+) connection]
WORKLOADS
Language Models: Word Language Modeling
Dictionary size = 33278
Embedding size = 256
LSTM units = 256
Backpropagation through time = 1408 and 2800
[Diagram: Embedding → LSTM → FC → Softmax → Loss]
WORKLOADS
Baseline Training Performance on V100-32GB
Optimal batch size selected for high throughput

Model            | FP16 Batch Size | FP16 Samples/sec | FP32 Batch Size | FP32 Samples/sec
ResNet-1001      | 98              | 98.7             | 48              | 44.3
DenseNet-264     | 218             | 255.8            | 109             | 143.1
VNet             | 30              | 3.56             | 15              | 3.4
Lang_Model-1408  | 32              | 94.9             | 40              | 77.9
Lang_Model-2800  | 16              | 46.5             | 18              | 35.7

All results in this presentation use PyTorch 1.0rc1, the R418 driver, and Tesla V100-32GB
GPU OVERSUBSCRIPTION
Up to 3x Optimal Batch Size
[Two charts: training throughput (samples/sec) vs batch size, FP16 and FP32, for ResNet-1001 (batch sizes 2-290) and DenseNet-264 (batch sizes 2-650)]
GPU OVERSUBSCRIPTION
[Three diagrams of GPU and CPU memory during oversubscription: Fill - managed pages are allocated on the GPU until memory is full; Evict - resident pages are moved out to CPU memory to make room; Page Fault-Evict-Fetch - touching a non-resident page triggers a fault, an eviction, and a fetch from CPU memory]
GPU OVERSUBSCRIPTION
Results

Model            | FP16 Batch Size | FP16 Samples/sec | FP32 Batch Size | FP32 Samples/sec
ResNet-1001      | 202             | 10.1             | 98              | 5
DenseNet-264     | 430             | 22.3             | 218             | 12.1
VNet             | 32              | 3                | 32              | 1.1
Lang_Model-1408  | 44              | 8.4              | 44              | 10
Lang_Model-2800  | 22              | 4.1              | 22              | 4.9
GPU OVERSUBSCRIPTION
Page Faults - ResNet-1001 Training Iteration
[Chart: page fault count (up to ~1.2M) vs oversubscription factor (batch size / optimal batch size), from 1x to 3x]
GPU OVERSUBSCRIPTION
Manual API Prefetch
Add cudaMemPrefetchAsync before kernels are called:
cudaMemPrefetchAsync(…)  // input, output, wparam
cudnnConvolutionForward(…)
---------------------------------------------------
cudaMemPrefetchAsync(…)  // A, B, C
kernelPointWiseApply3(…)
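The same pattern as a runnable sketch (hypothetical kernel, standing in for the cuDNN and pointwise kernels above): prefetch every managed buffer the kernel touches onto the GPU, on the kernel's stream, before launching it.

// Manual prefetch pattern: move managed buffers to the GPU on the same stream
// the kernel will run on, so migration happens up front instead of via faults.
#include <cuda_runtime.h>

__global__ void axpy(const float* x, const float* y, float* out, float a, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * x[i] + y[i];
}

void launchWithPrefetch(const float* x, const float* y, float* out,
                        float a, size_t n, cudaStream_t stream) {
    int device = 0;
    cudaGetDevice(&device);

    // Prefetch every buffer the kernel reads or writes (inputs and output).
    cudaMemPrefetchAsync(x, n * sizeof(float), device, stream);
    cudaMemPrefetchAsync(y, n * sizeof(float), device, stream);
    cudaMemPrefetchAsync(out, n * sizeof(float), device, stream);

    axpy<<<(unsigned)((n + 255) / 256), 256, 0, stream>>>(x, y, out, a, n);
}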
GPU OVERSUBSCRIPTION
No Prefetch vs Manual API Prefetch
[Figure comparing a training iteration with no prefetch and with manual API prefetch]
GPU OVERSUBSCRIPTION
Speed-up from Manual API Prefetch
[Bar chart: speed-up (x) with cudaMemPrefetchAsync vs no prefetch, FP16 and FP32, for ResNet-1001, DenseNet-264, VNet, Lang_Model-1408, and Lang_Model-2800]
Observe up to 1.6x speed-up
GPU OVERSUBSCRIPTION
Prefetch Only When Needed
Prefetching memory before a kernel improves performance
cudaMemPrefetchAsync takes CPU cycles – it degrades performance when not required
Automatic prefetching is needed to achieve high performance
[Chart: ResNet-1001 samples/sec vs batch size (2-290), FP16 vs FP16 with prefetch]
DRIVER PREFETCH
Aggressive driver prefetching
Driver-initiated (density-based) prefetching from CPU to GPU
GPU pages are tracked as chunks of smaller sysmem pages
Driver logic: prefetch the rest of the GPU page once 51% of it has migrated to the GPU
Changing the threshold to 5% gives up to 20% performance gain vs the default settings
FRAMEWORK FUTURE
Frameworks can develop intelligence to insert prefetches before calling GPU kernels
Smart evictions: evict activations only
Lazy prefetch: catch kernel calls right before execution and add prefetch calls
Eager prefetch: identify and add prefetch calls before the kernels are called, e.g. hook nn.Conv2d(…) and replace it with nn.Prefetch(…) followed by nn.Conv2d(…)
[Diagram: x * W1 → y, then y * W2 → z; the prefetch for W2 executes in parallel with the first kernel, as sketched below]
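A minimal sketch of the eager-prefetch idea from the diagram (hypothetical toy kernel and stream setup, not framework code): while the first layer's kernel runs on the compute stream, the next layer's weights are prefetched on a separate stream so the migration overlaps with compute.

// Eager prefetch sketch: overlap prefetching the next layer's weights (W2)
// with the current layer's compute, using a dedicated prefetch stream.
#include <cuda_runtime.h>

// Toy "layer": out[i] = W[i] * in[i]
__global__ void layer(const float* W, const float* in, float* out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = W[i] * in[i];
}

void forward(const float* W1, const float* W2, const float* x,
             float* y, float* z, size_t n,
             cudaStream_t compute, cudaStream_t prefetch) {
    int device = 0;
    cudaGetDevice(&device);

    // Launch layer 1 on the compute stream...
    layer<<<(unsigned)((n + 255) / 256), 256, 0, compute>>>(W1, x, y, n);

    // ...and, in parallel, eagerly prefetch layer 2's weights on another stream.
    cudaMemPrefetchAsync(W2, n * sizeof(float), device, prefetch);

    // Layer 2 then finds W2 already resident (or in flight) when it starts.
    layer<<<(unsigned)((n + 255) / 256), 256, 0, compute>>>(W2, y, z, n);
}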
TAKEAWAY
Unified Memory oversubscription solves the memory pool fragmentation issue
It is a simple way to train bigger models and on larger input data
Minimal user effort, no change in framework programming
Frameworks can get better performance by adding prefetches
Try it out and contribute:
https://github.com/rapidsai/cudf
https://github.com/rapidsai/rmm