Running TensorFlow at scale on GPUs
AGENDA
● Introduction
● Why do we need to scale training
● How to achieve scaling
DL Training: from single GPU to multi-node
ResNet50 v1.5 training time:
● 2015: 36,000 minutes (25 days) on 1x K80 (CUDA)
● 2016: 1,200 minutes (20 hours) on DGX-1P (NVLink)
● 2017: 480 minutes (8 hours) on DGX-1V (Tensor Cores)
● 2018: 70 minutes on MLPerf with DGX-2H (NVSwitch); 6.3 minutes on MLPerf at scale with a DGX cluster
● 2019: 52.7 minutes on MLPerf with DGX-2H (NVSwitch); 1.33 minutes on MLPerf at scale with DGX SuperPOD
The whole stack must be considered
● Compute
● Network
● Storage
● Frameworks & Libraries
● Numerical methods
● Training recipes
MLPerf: NVIDIA advancing AI training
Time to train: from 8 hours to 80 seconds
2019 MLPerf ID (in order from top to bottom of chart): ResNet-50: 0.6-30 | Transformer: 0.6-28 | GNMT: 0.6-14 | SSD: 0.6-27 | Mini-Go: 0.6-11 | Mask R-CNN: 0.6-23
Largest TensorFlow model at scale
Oak Ridge National Lab scales a TensorFlow climate analytics model up to 27,360 V100 GPUs
Source: https://arxiv.org/pdf/1810.01993.pdf
2018 Gordon Bell Prize Winner
AGENDA
● Introduction
● Why do we need to scale training
● How to achieve scaling
Datasets getting larger
● Unlabeled data:
○ Language models: BooksCorpus (800M words), English Wikipedia (2.5B words), WebText (8M documents, 40 GB), C4 (Common Crawl, 745 GB)
○ GANs: unlabeled images and videos
○ Reinforcement learning: unsupervised self-play generates unlimited data
● Labeled data:
○ ImageNet (2012): 1.3M images, 1,000 categories; Open Images (2019): 9M images, 6,000 categories
○ Semi-autonomous vehicles: 0.5-1.1 TB of data for every 8 hours of driving
DL models increasing in complexity
Next-level use-cases require gigantic models:
● Image recognition (autonomous vehicles, social tagging, visual search): ~26M parameters
● NLP (Q&A, sentiment, translation): ~340M parameters
● NLP generative tasks (chatbots, e-mail auto-completion, document summarization): ~1.5B parameters
Project Megatron (https://github.com/NVIDIA/Megatron-LM):
● 8.3B parameters, 24x larger than BERT
● 8-way model parallel, 64-way data parallel
Other growing domains: speech recognition, translation, object detection
AGENDA
● Introduction
● Why do we need to scale training
● How to achieve scaling
Scaling == whack-a-mole?
Solve one bottleneck and another one pops up
Multi-node infrastructure requirements
Multi-node success requires all three: system design, data center management, and the software stack.
Challenges of multi-node DL training
● Hardware GPU cluster design:
○ Compute: significant CPU-to-GPU ratio, interconnect with GPU
○ Storage: high-speed NFS, multi-tier caching
○ Networking: topology and bandwidth, NVLink, GPUDirect RDMA
● GPU cluster management:
○ Scheduler: Slurm vs. Kubernetes
○ Container technologies: Docker, Enroot, Singularity, etc.
● Integrated software stack:
○ NVIDIA libraries: CUDA, cuDNN, NCCL
○ DL framework scale-out optimization
○ Model scale-out implementation & optimization
A basic recipe for deep learning scaling
Step 1: Optimize your single GPU model
Step 2: Scale to multiple GPUs on one node
Step 3: Scale to multiple nodes
Case study: BERT
Bidirectional Encoder Representations from Transformers; super-human question answering.
• BERT model scripts:
  • https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/
  • https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT
  • Configurations for convergence, from 8 to 1,500 GPUs, multi-node ready
• Clone and train your own BERT model on multi-node, or download a pre-trained BERT model from NGC and fine-tune it for your NLP task
• NVIDIA Deep Learning Examples have many model scripts with best practices for accuracy and performance
Why multi-node BERT training
• Pre-training on unlabeled data opens up opportunities to use massive amounts of data:
  • BooksCorpus (800 million words)
  • English Wikipedia (2.5 billion words), multi-language Wikipedia
  • WebText (OpenAI, 8M documents, 40 GB of text)
• More data tends to lead to better accuracy
• BERT pre-training is computationally intensive and takes days even on the most powerful single node: BERT-Large (330M parameters) takes ~2.5 days to train on a single DGX-2 server with 16 V100 GPUs
BERT multi-node pre-training performance
Metric: time to train

DGX-1 (16 GB):
  Nodes   GPUs   Time to train (hrs)
  1       8      153.6 (6.3 days)
  4       32     39.3
  16      128    10.4

DGX-2H (32 GB):
  Nodes   GPUs   Time to train (hrs)
  1       16     58.4 (2.4 days)
  4       64     15.4
  16      256    3.9
  64      1024   1.2

Source: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT#pre-training-loss-results
* Time to train is measured for mixed precision, training loss 1.3, in PyTorch, with the LAMB optimizer
** Gradient accumulation is applied to the DGX-2H 1-, 4-, and 16-node configurations
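To read the table as a scaling result, time to train can be converted into speedup and parallel efficiency; a minimal illustrative calculation (not from the slides), using the DGX-1 numbers above:

```python
# Illustrative only: scaling efficiency from the DGX-1 column above.
t_1_node, t_16_nodes = 153.6, 10.4       # hours to reach the target loss
speedup = t_1_node / t_16_nodes          # ~14.8x on 16 nodes
efficiency = speedup / 16                # ~92% scaling efficiency
print(f"speedup {speedup:.1f}x, efficiency {efficiency:.0%}")
```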
Step 1: Optimize model
• Create an efficient data pipeline
• Enable mixed precision training
• Enable XLA
• Ensure the latest GPU libraries
• Develop the model in a container to facilitate scaling out
Step 1: Optimize model
Data pipeline
• Use tf.data to create performant input pipelines
• Test I/O bottlenecks with a trivial model (see the sketch below)
• NVIDIA DALI accelerates image-based input pipelines
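A quick way to check whether input I/O is the bottleneck is to time the pipeline with no real model attached. This is a minimal sketch in TF 1.x style to match the BERT scripts; `build_dataset` is a placeholder name, not code from the slides:

```python
import time
import tensorflow as tf

# Hedged sketch (TF 1.x): iterate the dataset with no model attached to measure
# raw input-pipeline throughput. `build_dataset()` stands in for the real
# tf.data pipeline (e.g. the BERT TFRecord pipeline shown on the next slide).
def benchmark_input_pipeline(build_dataset, num_batches=200):
    batch = build_dataset().make_one_shot_iterator().get_next()
    with tf.Session() as sess:
        start = time.time()
        for _ in range(num_batches):
            sess.run(batch)                      # fetch only; no training compute
        elapsed = time.time() - start
    print("%.1f batches/sec from the input pipeline" % (num_batches / elapsed))
```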
Data pipeline: BERT (TFRecord is a fast binary format; parallel read, map & batch; fused map & batch op)

d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files))
d = d.repeat()
d = d.shuffle(buffer_size=len(input_files))

# `cycle_length` is the number of parallel files that get read.
cycle_length = min(num_cpu_threads, len(input_files))
d = d.apply(
    tf.contrib.data.parallel_interleave(
        tf.data.TFRecordDataset,
        cycle_length=cycle_length))
d = d.shuffle(buffer_size=100)

d = d.apply(
    tf.contrib.data.map_and_batch(
        lambda record: _decode_record(record, name_to_features),
        batch_size=batch_size,
        num_parallel_batches=num_cpu_threads,
        drop_remainder=True if is_training else False))

https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/run_pretraining.py
Step 1: Optimize model
Automatic Mixed Precision (AMP)
• 1-line optimizer wrapper:
  opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
• Up to 3x speed-up in training on Tensor Cores, with:
  • Same accuracy
  • No change in hyperparameters
  • ½ memory bandwidth & footprint
• Optimal on Volta and Turing GPUs
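For context, the wrapper slots into an existing TF 1.x training script like this; a minimal sketch with a toy loss (assumes TF 1.14/1.15, matching the API named on the slide, and is not code from the slides):

```python
import tensorflow as tf

# Hedged sketch (TF 1.14/1.15 assumed): the AMP graph rewrite casts eligible ops
# to FP16 and adds automatic loss scaling around an existing optimizer.
x = tf.placeholder(tf.float32, [None, 128])
w = tf.Variable(tf.random_normal([128, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))   # toy loss, just for illustration

opt = tf.train.AdamOptimizer(learning_rate=1e-4)
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)  # 1-line AMP
train_op = opt.minimize(loss)
```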
Step 1: Optimize model
Automatic Mixed Precision (AMP)
• Robust speedup across different TensorFlow workloads
• https://arxiv.org/abs/1710.03740
Step 1: Optimize model
XLA (Accelerated Linear Algebra)
• TensorFlow XLA can accelerate models with minimal code changes
• XLA optimizes the graph, mostly by fusing compatible kernels
• Set the XLA optimization level:
  config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
  https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/run_pretraining.py#L531

System config: Xeon E5-2698 v4 CPU with 256 GB system RAM, single V100 Tensor Core GPU 32 GB. Tests run using the NVIDIA 18.11 TensorFlow container.
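In a TF 1.x session-based script, that JIT level is set on the session config; a minimal sketch mirroring the line above (an assumption about placement, not the BERT script itself):

```python
import tensorflow as tf

# Hedged sketch (TF 1.x): enable XLA auto-clustering for the whole session so
# compatible ops get fused into larger compiled kernels.
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

with tf.Session(config=config) as sess:
    # Build and run the model as usual; XLA compiles eligible subgraphs.
    pass
```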
Step 1: Optimize model
Latest GPU optimizations
• Latest compatible features and tuning from the CUDA toolkit and deep learning libraries (cuDNN, cuBLAS, NCCL)
Step 1: Optimize model
Latest GPU optimizations
• NGC containers: fully featured DL containers
• DL frameworks compiled with the latest GPU libraries
• Portability of application libraries facilitates multi-node scale-out
Step 2: Scale to multiple GPUs
• Understand data parallel training concepts
• Ensure optimal inter-GPU communication
• Apply a high-level API for multi-GPU training
Step 2: Scale to multiple GPUs
Under the hood: single GPU
Step 2: Scale to multiple GPUs
Under the hood: multiple GPUs
• Data parallel training (see the conceptual sketch below)
• Allreduce algorithm
• NCCL: NVIDIA Collective Communication Library
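To make the data-parallel picture concrete: every GPU holds a full copy of the model, computes gradients on its own slice of the global batch, and an allreduce averages those gradients so all replicas apply the same update. A purely conceptual sketch with hypothetical helper functions (not from the slides):

```python
# Conceptual sketch of synchronous data-parallel training; `compute_gradients`
# and `apply_update` are hypothetical placeholders for framework calls.
def data_parallel_step(replicas, batch_shards, lr):
    # 1. Each replica back-propagates on its own shard (no communication).
    local_grads = [compute_gradients(r, shard) for r, shard in zip(replicas, batch_shards)]
    # 2. Allreduce: average gradients element-wise across replicas (NCCL does this on GPUs).
    avg_grads = [sum(gs) / len(replicas) for gs in zip(*local_grads)]
    # 3. Every replica applies the same averaged update, keeping weights in sync.
    for r in replicas:
        apply_update(r, avg_grads, lr)
```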
Step 2: Scale to multiple GPUs
Under the hood: inter-GPU communication
[Chart: effective inter-GPU bandwidth in GB/s]
Step 2: Scale to multiple GPUs
Under the hood: full non-blocking bandwidth
Step 2: Scale to multiple GPUs
Approach 1: Horovod
• Popular approach to enable multi-GPU/multi-node in TensorFlow/Keras
• Strong NCCL integration
• Sample commands:
  • Single node (4 GPUs):
    horovodrun -np 4 -H localhost:4 python train.py
  • Multi-node (4 nodes with 4 GPUs each):
    horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
Step 2: Scale to multiple GPUs
Approach 1: Horovod

import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
loss = ...
opt = tf.train.AdamOptimizer(learning_rate=0.01 * hvd.size())

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to synchronize initial state
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# Only checkpoint on rank 0
ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None

# Session
with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir,
                                       config=config, hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)
Step 2: Scale to multiple GPUs
Approach 2: tf.distribute.Strategy
• Recently released native API that also supports allreduce with NCCL
• Multi-GPU: tf.distribute.MirroredStrategy
• Multi-node: tf.distribute.experimental.MultiWorkerMirroredStrategy
Source: https://www.tensorflow.org/guide/distributed_training
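A minimal Keras-style sketch of MirroredStrategy, following the TensorFlow guide linked above; the toy model and random data are placeholders, not part of the slides:

```python
import numpy as np
import tensorflow as tf

# Hedged sketch: one replica per visible GPU; gradients are all-reduced (NCCL on GPUs).
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():                      # variables created here are mirrored
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(16,))])
    model.compile(optimizer="adam", loss="mse")

x = np.random.rand(1024, 16).astype("float32")
y = np.random.rand(1024, 1).astype("float32")
model.fit(x, y, batch_size=64, epochs=1)    # per-step gradients averaged across replicas
```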
Step 3: Scale to multiple nodes
• Adopt an optimizer designed for large batch sizes
• Ensure effective inter-node communication
• Move data close to compute
• Consider the full application & system software stack
Step 3: Scale to multiple nodes
LAMB optimizer
• Optimizer inspired by LARS: layer-wise adaptive learning rates (You et al.)
• Allows training at huge global batch sizes:
  • Originally, BERT + Adam (Devlin et al.): global batch 256
  • BERT + LAMB (You et al.): global batch 64k
• Massive data parallelism
• Lower interconnect pressure with gradient accumulation (see the sketch below)
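Gradient accumulation trades communication for local compute: each worker sums gradients over K micro-batches and only then takes (and all-reduces) one optimizer step, which also multiplies the effective global batch size by K. A conceptual sketch with hypothetical helpers (not the BERT scripts' implementation):

```python
# Conceptual sketch of gradient accumulation over K micro-batches.
# `compute_gradients` and `apply_gradients` are hypothetical placeholders.
K = 16                                   # accumulation steps per optimizer update
accum = None
for step, micro_batch in enumerate(micro_batches, start=1):
    grads = compute_gradients(model, micro_batch)           # local only, no allreduce
    accum = grads if accum is None else [a + g for a, g in zip(accum, grads)]
    if step % K == 0:
        apply_gradients(model, [a / K for a in accum])       # allreduce + update, once per K
        accum = None
```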
Step 3: Scale to multiple nodes
LAMB optimizer
BERT + LAMB: robustly scales to large batch sizes
https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/optimization.py

class LAMBOptimizer(tf.train.Optimizer):
  """A LAMB optimizer that includes "correct" L2 weight decay."""

  def __init__(self,
               learning_rate,
               weight_decay_rate=0.0,
               beta_1=0.9,
               beta_2=0.999,
               epsilon=1e-6,
               exclude_from_weight_decay=None,
               name="LAMBOptimizer"):
    """Constructs a LAMBOptimizer."""
    super(LAMBOptimizer, self).__init__(False, name)
    .
    .
    .
Step 3: Scale to multiple nodes
Under the hood: inter-GPU communication (bigger picture)
[Chart: effective bandwidth in GB/s across nodes]
Step 3: Scale to multiple nodes
Further Horovod optimizations
• Tensor Fusion: batch tensors together during allreduce
  HOROVOD_FUSION_THRESHOLD=<bytes> HOROVOD_CYCLE_TIME=<ms> horovodrun ...
• Gradient compression (FP16 allreduce): reduces network utilization
  hvd.DistributedOptimizer(..., compression=hvd.Compression.fp16)
Step 3: Scale to multiple nodes
Storage
• DNN datasets are large
• Access is read-dominated at the beginning of each epoch
• Keep data as close to compute as possible: RAM disk, SSDs in RAID 0, fast network-attached storage
Step 3: Scale to multiple nodes
Reference architecture: DGX SuperPOD
• Integrated software and hardware system for multi-node scaling
• State-of-the-art compute, GPU interconnect, node interconnect, and storage
NVIDIA DGX SuperPOD
• Mellanox EDR 100G InfiniBand network with Mellanox Smart Director Switches
• In-network computing acceleration engines
• Fast and efficient storage access with RDMA
• Up to 130 Tb/s switching capacity per switch
• Ultra-low latency of 300 ns
• Integrated network manager
• Terabit-speed InfiniBand networking per node
• 64 DGX-2 systems across 16 racks
• Compute backplane switch: 800 Gb/s per node
• Storage backplane switch: 200 Gb/s per node, GPFS storage
White paper: https://www.nvidia.com/en-us/data-center/resources/nvidia-dgx-superpod-reference-architecture/
Step 3: Scale to multiple nodes
Software stack: application
• Deep learning model:
  • Hyperparameters tuned for multi-node scaling
  • Multi-node launcher scripts
• Deep learning container:
  • Optimized DL frameworks, GPU libraries, and multi-node software
• Host:
  • Host OS, GPU driver, IB driver, container runtime engine (Docker, Enroot)
Step 3: Scale to multiple nodes
Software stack: system
• Slurm: user job scheduling & management
• Enroot: NVIDIA open-source tool to convert traditional container/OS images into unprivileged sandboxes
• Pyxis: NVIDIA open-source plugin integrating Enroot with Slurm
• DeepOps: NVIDIA open-source toolbox for GPU cluster management with Ansible playbooks
Cluster layout: login nodes with the Slurm controller; DGX POD of DGX servers running the DGX base OS with Enroot | Docker, Pyxis, DCGM, and NGC model containers (PyTorch, TensorFlow from 19.09)
Step 3: Scale to multiple nodes
Deployment with DeepOps
DeepOps leverages Ansible for automated large-scale cluster deployment (see the deployment doc):
• Bootstrap all nodes
• Prepare the provisioning node
• Provision all node(s)
• Deploy Slurm on Slurm nodes
• Deploy DL/ML development tools
• Deploy production AI applications
• Deploy management services
Getting started:
• Build your own GPU cluster following the DGX POD and DGX SuperPOD reference architectures.
• Clone the DeepOps repo and follow the cluster setup guide. Open a GitHub issue if you hit any problem.
Summary
Scaling is important and we are here to help
• Scaling requires careful consideration of algorithms and infrastructure at each step:
  • An optimized single-GPU model
  • An efficient & scalable allreduce library
  • GPU interconnect, networking, storage
  • ...
• The NVIDIA platform makes scaling DL training easier and more efficient:
  • Deep Learning Examples with state-of-the-art accuracy and performance
  • NVIDIA NGC containers with an optimized multi-GPU/multi-node software stack
  • An accelerated compute platform designed for performance and scaling