Deep Learning on the SATURNV Cluster
The World’s Greenest Supercomputer
HPC Advisory Council Lugano 2017
Gunter Roeth, SA
DL ON SATURNV CLUSTER
A DECADE OF SCIENTIFIC COMPUTING WITH GPUS
2006 2008 2010 2012 2014 2016
Fermi: World’s First HPC GPU
Oak Ridge Deploys World’s Fastest Supercomputer w/ GPUs
World’s First Atomic Model of HIV Capsid
GPU-Trained AI Machine Beats World Champion in Go
Stanford Builds AI Machine using GPUs
World’s First 3-D Mapping of Human Genome
CUDA Launched
World’s First GPU Top500 System
Google Outperforms Humans in ImageNet
Discovered How H1N1 Mutates to Resist Drugs
AlexNet beats expert code by huge margin using GPUs
“…around 2008 my group at Stanford started advocating shifting deep learning to GPUs (this was really controversial at that time; but now everyone does it); and I'm now advocating shifting to HPC (High Performance Computing/Supercomputing) tactics for scaling up deep learning. Machine learning should embrace HPC. These methods will make researchers more efficient and help accelerate the progress of our whole field.” –Andrew Ng, Feb 2016
DEEP LEARNING is the Next Frontier for HPC
Neural Network Design Expertise + HPC Design Expertise + GPU Programming Expertise + Image & Video Recognition Expertise
MCBDA2016 – Louis Capps, [email protected]
BIG DATA – FROM HPC TO HYPERSCALE TO ...
HPC: Parallel simulations produce huge data
Hyperscale: Collects and analyzes data
Diagram labels: Insight, Autonomy, Discovery, Prediction; Create data, Analyze data; Huge data out, Big Data in, Small Data in, Small Data out; Similar engine
Two different workloads
- HPC is parallel simulation that produces huge datasets
- Hyperscale collects data and produces analytics
Can they converge?
- One possible future
- HPC feeds Hyperscale
- Hyperscale analysis affects HPC model/data
- Move workloads onto same system (stacks side by side)
- Eventually merge stack
Diagram labels: Big Compute, Huge Compute, Huge Storage, Big Storage
http://nvidianews.nvidia.com/news/nvidia-and-microsoft-boost-ai-cloud-computing-with-launch-of-industry-standard-hyperscale-gpu-accelerator
MICROSOFT AND NVIDIA HGX-1 ANNOUNCEMENT
CLOUD CHALLENGES
1 SKU, Multiple Instances
Integration into Existing Datacenter
INSTANCES
Granular, Latency Sensitive
High Throughput Batch
HPC: different CPU:GPU ratios
DevOps / Development
Production Deployment
PROJECT OLYMPUS HGX-1 HYPERSCALE GPU ACCELERATOR PARTNERSHIP + INTEROPERABILITY
Project Olympus HGX-1 Hyperscale GPU Accelerator
Configurable PCIe Cable to host + Expansion slots
NVIDIA P100 GPU
NVLink Hybrid Cube Mesh Fabric
20 GB/s per link, duplex
Adapters for other GPUs
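As a rough illustration of the NVLink hybrid cube mesh named above, the sketch below builds one commonly described 8-GPU layout: two fully connected groups of four GPUs, with each GPU also linked to its counterpart in the other group. The GPU numbering and the helper names `build_hybrid_cube_mesh` and `max_hops` are assumptions made for illustration and do not come from the slide. Only the 20 GB/s per-link figure is taken from the slide, read here as 20 GB/s in each direction.

```python
from collections import deque
from itertools import combinations

LINK_GB_PER_S = 20  # per link, per direction (figure from the slide)


def build_hybrid_cube_mesh():
    """Return {gpu: set(neighbours)} for an assumed 8-GPU hybrid cube mesh."""
    links = {gpu: set() for gpu in range(8)}
    for quad in ((0, 1, 2, 3), (4, 5, 6, 7)):
        # Every pair of GPUs inside a quad is directly connected.
        for a, b in combinations(quad, 2):
            links[a].add(b)
            links[b].add(a)
    for gpu in range(4):
        # One extra link to the matching GPU in the other quad.
        links[gpu].add(gpu + 4)
        links[gpu + 4].add(gpu)
    return links


def max_hops(links):
    """Longest shortest path (in NVLink hops) between any two GPUs."""
    worst = 0
    for start in links:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in links[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst


mesh = build_hybrid_cube_mesh()
links_per_gpu = {gpu: len(peers) for gpu, peers in mesh.items()}
total_links = sum(links_per_gpu.values()) // 2
per_gpu_bandwidth = max(links_per_gpu.values()) * LINK_GB_PER_S  # GB/s, one direction

print(links_per_gpu, total_links, max_hops(mesh), per_gpu_bandwidth)
```

Under these assumptions every GPU uses four links, and any GPU can reach any other in at most two hops.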
DEEP LEARNING
2 CPU : 8 GPU (8x P100 SXM2 | 4x x16 PCIe)
2 CPU : 16 GPU (16x P100 SXM2 | 4x x16 PCIe)
HPC
4 CPU : 8 GPU (8x P100 SXM2 | 4x x16 PCIe)
8 CPU : 8 GPU (8x P100 SXM2 | 8x x16 PCIe)
INFERENCE
2 CPU : 8 GPU (8x P100 SXM2 | 4x x16 PCIe)
8 CPU : 32 GPU (32x P4 PCIe | 8x x16 PCIe)
WHY THE EXCITEMENT?
GPUs as Enablers of Breakthrough Results
We can generate photorealistic images from textual descriptions now!
Paper: H. Zhang et al., StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks, arXiv:1612.03242
Achieve super-human accuracy in classification
And we are getting faster fast
[Chart: AlexNet Training Performance, 65x in 3 Years. Bars: K40, K80 + cuDNN1, M40 + cuDNN4, P100 + cuDNN5; x-axis: 2013, 2014, 2015, 2016; y-axis: 0x to 70x]
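To put the "65x in 3 Years" figure in perspective, the short calculation below converts it into a compound per-year factor. The comparison against a doubling every two years is an illustrative baseline chosen here, not a number from the slide.

```python
total_speedup = 65.0  # "65x in 3 Years" from the chart
years = 3             # 2013 to 2016

# The compound annual factor f satisfies f ** years == total_speedup.
annual_factor = total_speedup ** (1.0 / years)

# Illustrative baseline (an assumption, not from the slide):
# performance doubling every two years over the same period.
baseline = 2.0 ** (years / 2.0)

print(f"annual speedup factor: {annual_factor:.2f}x")
print(f"2x-every-2-years baseline over {years} years: {baseline:.2f}x")
```

In other words, the chart corresponds to roughly a fourfold improvement per year across hardware and cuDNN generations combined.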
“SUPERHUMAN” RESULTS SPARK HYPERSCALE ADOPTION
[Chart: ImageNet accuracy %, 2010 to 2015: 72%, 74%, 84%, 88%, 93%, 96%; series labels: Hand-coded CV, Deep Learning, Human; further data labels: 74%, 76%]
Cloud Services with AI Powered by NVIDIA
Alibaba/Aliyun, Amazon, Baidu, eBay, Facebook, Flickr, Google, iFLYTEK, iQIYI, JD.com, Orange, Periscope, Pinterest, Qihoo 360, Shazam, Skype, Sogou, Twitter, Yahoo Supermarket, Yandex, Yelp
DEEP LEARNING EVERYWHERE
INTERNET & CLOUD: Image Classification, Speech Recognition, Language Translation, Language Processing, Sentiment Analysis, Recommendation
MEDIA & ENTERTAINMENT: Video Captioning, Video Search, Real Time Translation
AUTONOMOUS MACHINES: Pedestrian Detection, Lane Tracking, Recognize Traffic Sign
SECURITY & DEFENSE: Face Detection, Video Surveillance, Satellite Imagery
MEDICINE & BIOLOGY: Cancer Cell Detection, Diabetic Grading, Drug Discovery
GPUS IN ARTIFICIAL INTELLIGENCE
Deep learning replaces the hand-tuned parameters of the feature extraction steps (e.g. in voice and image recognition).
Deep learning is a subset of machine learning that refers to artificial neural networks composed of many layers.
Artificial neural networks are inspired by the human brain and need lots of training data (ideal for Big Data).
NVIDIA GPUs and cuDNN software are broadly adopted for machine learning.
Diagram (nested): Machine Learning > Neural Networks > Deep Learning
NVIDIA DEEP LEARNING SDK
High Performance GPU-Acceleration for Deep Learning
APPLICATIONS
COMPUTER VISION: Object Detection, Image Classification
SPEECH AND AUDIO: Voice Recognition, Translation
BEHAVIOR: Recommendation Engines, Sentiment Analysis
FRAMEWORKS
Mocha.jl
DEEP LEARNING SDK
DEEP LEARNING: cuDNN
MATH LIBRARIES: cuBLAS, cuSPARSE, cuFFT
MULTI-GPU: NCCL
Platform | TensorFlow | CNTK | MXNet | Caffe | Theano | Torch
Release Date | 2016 | 2016 | 2015 | 2014 | 2010 | 2011
Core Language | C++ | C++ | C++ | C++ | C++ | C
API | C++, Python | NDL | C++, Python, R, Scala, Matlab, Javascript, Go, Julia | Python, Matlab | Python | Lua
Synchronisation Model | Sync or async | Sync | Sync or async | Sync | Async | Sync
Communication Mode | Parameter server | MPI | Parameter server | N/A (Spark/Custom) | N/A | N/A
Multi-GPU | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Multi-node | ✓ | ✓ | ✓ | ✗ | ✗ | ✗
Data Parallelism | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Model Parallelism | ✓ | N/A | ✓ | ✗ | ✓ | ✓
Fault Tolerance | Checkpoint and recovery | Checkpoint and resume | Checkpoint and resume | N/A | Checkpoint and resume | Checkpoint and resume
Visualisation | Graph (interactive), training monitor, debugging tools | Graph (static) | None | Summary Statistics (custom) | Graph (static) | Plots
Fox, James, Yiming Zou, and Judy Qiu. "Software Frameworks for Deep Learning at Scale."
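The "Sync" synchronisation model and "Data Parallelism" rows of the comparison can be sketched in a few lines of plain Python. This is a toy, single-process illustration under stated assumptions: the four "workers", the one-parameter model and the function names `gradient` and `sync_data_parallel_step` are hypothetical, and none of the listed frameworks is used.

```python
def gradient(w, shard):
    """Gradient d/dw of the mean squared error mean((w*x - y)**2) on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def sync_data_parallel_step(w, shards, lr):
    """One synchronous step: each worker computes a gradient on its own shard,
    the gradients are averaged (the role played by an all-reduce or a
    parameter server), and every worker applies the same update."""
    grads = [gradient(w, shard) for shard in shards]  # would run in parallel
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad


# Toy data for y = 3x, split evenly across 4 hypothetical workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]

w = 0.0
for _ in range(200):
    w = sync_data_parallel_step(w, shards, lr=0.01)

print(round(w, 4))
```

Because the shards have equal size, the averaged gradient equals the full-batch gradient, so the synchronous scheme follows the same trajectory as single-worker training. An asynchronous scheme would instead let each worker apply its update without waiting for the others.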
NVIDIA DEEP LEARNING PERFORMANCE GUIDE
TensorFlow: Deep Learning Training
An open-source software library for numerical computation using data flow graphs.
VERSION: 1.0
ACCELERATED FEATURES: Full framework accelerated
SCALABILITY: Multi-GPU and multi-node
More Information: https://www.tensorflow.org/
[Chart: TensorFlow Deep Learning Framework, Training on 8x P100 GPU Server vs 8x K80 GPU Server. y-axis: Speedup vs. Server with 8x K80 (0 to 5.0); networks: AlexNet, GoogleNet, ResNet-50, ResNet-152, VGG16; series: Server with 8x P100 16GB NVLink, Server with 8x P100 PCIe 16GB; annotations: 2.5x Avg. Speedup, 3x Avg. Speedup]
GPU Servers: Single Xeon E5-2690 [email protected] with GPU configs as shown
Ubuntu 14.04.5, CUDA 8.0.42, cuDNN 6.0.5; NCCL 1.6.1, data set: ImageNet; batch sizes: AlexNet (128), GoogleNet (256), ResNet-50 (64), ResNet-152 (32), VGG-16 (32)
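The phrase "numerical computation using data flow graphs" can be made concrete with a toy evaluator. The sketch below is plain Python and deliberately does not use or imitate TensorFlow's API: the `Node` class and the `constant`, `add`, `mul` and `run` helpers are hypothetical names chosen for illustration. It shows only the general idea that a graph is first described and then executed.

```python
class Node:
    """A graph node: an operation plus the nodes that feed it."""

    def __init__(self, op, *inputs):
        self.op = op
        self.inputs = inputs


def constant(value):
    return Node(lambda: value)


def add(a, b):
    return Node(lambda x, y: x + y, a, b)


def mul(a, b):
    return Node(lambda x, y: x * y, a, b)


def run(node, cache=None):
    """Evaluate a node by first evaluating everything it depends on."""
    cache = {} if cache is None else cache
    if node not in cache:
        args = [run(inp, cache) for inp in node.inputs]
        cache[node] = node.op(*args)
    return cache[node]


# Building the graph performs no arithmetic yet.
x = constant(3.0)
w = constant(2.0)
b = constant(1.0)
y = add(mul(w, x), b)

# Arithmetic happens only when the graph is run.
result = run(y)
print(result)
```

Separating graph construction from execution is what lets a framework decide where each operation runs, for example placing nodes on different GPUs.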
CNTK: Deep Learning Training
A free, easy-to-use, open-source, commercial-grade toolkit that trains deep learning algorithms to learn like the human brain.
VERSION: 1.0
ACCELERATED FEATURES: Full framework accelerated
SCALABILITY: Multi-GPU and multi-node
More Information: www.microsoft.com/en-us/research/product/cognitive-toolkit/
[Chart: CNTK Deep Learning Framework, Training on 8x P100 GPU Server vs 8x K80 GPU Server. y-axis: Speedup vs. Server with 8x K80 (0 to 4.0); networks: AlexNet, ResNet-50; series: Server with 8x P100 16GB NVLink, Server with 8x P100 PCIe 16GB; annotations: 1.6x Avg. Speedup, 2.7x Avg. Speedup]
GPU Servers: Single Xeon E5-2690 [email protected] with GPU configs as shown
Ubuntu 14.04.5, CUDA 8.0.42, cuDNN 6.0.5; NCCL 1.6.1, data set: ImageNet; batch sizes: AlexNet (128), ResNet-50 (64)
Deep Learning Performance
INFERENCE
Deep Learning Inference using TensorRT on CAFFE
Popular GPU-accelerated framework using NVIDIA’s TensorRT 1.0 Inference Engine
VERSION: CAFFE 1.0, TensorRT 1.0
ACCELERATED FEATURES: CPU & GPU versions available
SCALABILITY: Multi-GPU
More Information: http://caffe.berkeleyvision.org/ and https://developer.nvidia.com/tensorrt
[Chart: Deep Learning Inference, 1x GPU Server Throughput Performance vs. Single-Socket CPU Server. y-axis: Speedup vs. Single-Socket CPU Server (0 to 40.0); networks: AlexNet, GoogleNet, ResNet-152, VGG-19; series: 1xP100 (FP32), 1xP100 (FP16), 1xP4 (INT8), 1xP4 (FP32); annotations: 6x, 20x, 13x and 25x Avg. Speedup]
CPU Server: Single Xeon E5-2690 [email protected]
GPU Servers: Single Xeon E5-2690 [email protected] with 1x P100 16GB or 1x P4 GPU
Ubuntu 14.04.5, TensorRT 2.0, CUDA 8.0.42, cuDNN 6.0.5; NCCL 1.6.1, batch size 128, precision as indicated.
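The INT8 series in the chart relies on representing values with 8-bit integers instead of 32-bit floats. As a minimal sketch of the general idea, the code below applies simple symmetric linear quantization with one shared scale. This is a generic textbook scheme assumed here for illustration, not a description of how TensorRT calibrates or quantizes a network; the weight values are made up.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


weights = [0.023, -0.754, 0.318, 1.27, -1.0]  # made-up FP32 values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)
print(f"scale={scale:.4f}, max rounding error={max_error:.4f}")
```

Each value is stored in one byte instead of four, at the cost of a rounding error of at most half a quantization step.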
GPU DEEP LEARNING IS A NEW COMPUTING MODEL
TRAINING
Billions of Trillions of Operations
GPUs train larger models, accelerate time to market
DATACENTER INFERENCING
10s of billions of image, voice, video queries per day
GPU inference for fast response, maximize datacenter throughput
Diagram labels: Training, Datacenter, Device
WHAT IS SATURNV?
NVIDIA DGX SATURNV
Giant Leap Towards Exascale AI
Fastest AI Supercomputer in TOP500
- 4.9 Petaflops Peak FP64
- 19.6 Petaflops Peak FP16
- 13 DGX-1 to get into Top500
Most Energy Efficient Supercomputer
- #1 Green500
- 9.5 GFLOPS per Watt
Rocket for Cancer Moonshot
- CANDLE Development Platform
- Common platform with DOE labs: ANL, LLNL, ORNL, LANL
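The peak figures on this slide can be related to each other with simple arithmetic. The sketch below uses the slide's numbers together with the 124-node count given elsewhere in this deck; the per-node values are plain averages of the cluster peak, which is an assumption about how the total divides up and not a published per-node specification.

```python
peak_fp64_pflops = 4.9   # from the slide
peak_fp16_pflops = 19.6  # from the slide
nodes = 124              # node count stated elsewhere in this deck
top500_entry_nodes = 13  # "13 DGX-1 to get into Top500"

# FP16 peak relative to FP64 peak.
fp16_to_fp64_ratio = peak_fp16_pflops / peak_fp64_pflops

# Average share of the FP64 peak per DGX-1 node, in teraflops.
fp64_tflops_per_node = peak_fp64_pflops * 1000 / nodes

# Peak FP64 of the 13 nodes quoted as enough to enter the Top500.
fp64_tflops_for_entry = fp64_tflops_per_node * top500_entry_nodes

print(f"FP16:FP64 peak ratio: {fp16_to_fp64_ratio:.1f}")
print(f"FP64 peak per node: {fp64_tflops_per_node:.1f} TFLOPS")
print(f"FP64 peak of 13 nodes: {fp64_tflops_for_entry:.0f} TFLOPS")
```

Note that these are peak values; Top500 rankings are based on measured HPL performance, which is lower than peak.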
HOW DID WE BUILD SATURNV?
WHY NVIDIA DGX SATURNV
124 Node Supercomputing Cluster
Innovation needs a deep learning supercomputer!
Deep Learning scalability; move outside the box
Focus on research
Used internally for Deep Learning applied research
Multiple users, algorithms, networks, new approaches
Embedded, robotic, auto, hyperscale, HPC
Partner with university research, government and industry collaborations
Study convergence of data science and HPC
NVIDIA DGX-1 DEEP LEARNING SYSTEM
ONE ARCHITECTURE BUILT FOR BOTH DATA SCIENCE & COMPUTATIONAL SCIENCE
NVIDIA DGX-1
GPU-Accelerated Server AlexNet Training: DGX-1 Faster than 128 Knights Landing Servers
[Chart: Speed-up vs 1x KNL Server (0x to 40x); x-axis: 1, 4, 8, 16, 32, 64, 128 Knights Landing Servers, compared with 1x DGX-1]
Based on AlexNet, batch size 256, weak scaling up to 32 KNL servers; 64 & 128 estimated based on ideal scaling; Xeon Phi 7250 nodes
GTC-P: Plasma Turbulence: DGX-1 Faster than 64 Knights Landing Servers
[Chart: Speed-up vs 1x KNL Server (0x to 9x); x-axis: 1, 4, 8, 16, 32, 64 Knights Landing Servers, compared with 1x DGX-1]
GTC-P, Grid Size A. Systems: NVIDIA DGX-1, 8x P100; Intel KNL 7250, 68 cores, Flat-Quadrant mode, Omni-Path
NVIDIA DGX SATURNV
124 node Cluster
124 NVIDIA DGX-1 Nodes – 992 P100 GPUs
8x NVIDIA Tesla P100 SXM GPUs – NVLINK CubeMesh
2x Intel Xeon 20 core CPUs
512 GB DDR4 System Memory
SSD – 7 TB scratch + 0.5 TB OS
Mellanox 36 port EDR L1 and L2 switches
4 ports per system
Partial Fat tree topology
Ubuntu 14.04, CUDA 8, OpenMPI 1.10.3
NVIDIA GPU BLAS + Intel MKL (NVIDIA GPU HPL)
Deep Learning applied research
Many users, frameworks, algorithms, networks, new approaches
Embedded, robotic, auto, hyperscale, HPC
nvidia.com/dgx1
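The cluster totals follow directly from the per-node figures on this slide. The sketch below checks the GPU count and estimates how many first-level (L1) switches the node-facing InfiniBand ports would need. The half-down, half-up port split is an assumption borrowed from a non-blocking fat tree and is used only for illustration; the slide describes a partial fat tree, so the real layout may differ.

```python
import math

nodes = 124            # from the slide
gpus_per_node = 8      # from the slide
ib_ports_per_node = 4  # "4 ports per system"
switch_ports = 36      # Mellanox 36-port EDR switches

total_gpus = nodes * gpus_per_node
total_node_ports = nodes * ib_ports_per_node

# Assumption for illustration only: each L1 switch uses half of its ports
# for nodes and the other half for uplinks to the L2 switches.
down_ports_per_l1 = switch_ports // 2
min_l1_switches = math.ceil(total_node_ports / down_ports_per_l1)

print(f"GPUs: {total_gpus}")
print(f"node-facing EDR ports: {total_node_ports}")
print(f"L1 switches under the half-down assumption: {min_l1_switches}")
```

The GPU total matches the 992 P100 GPUs quoted on the slide.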
DEEP LEARNING CLUSTER REFERENCE ARCHITECTURE
NOV 2016 TOP GREEN500 SYSTEM
Green500.org Top500.org
SATURNV produced groundbreaking 9.4 GF/W at full scale
--> Sets the stage for future Exascale class computing
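To see what 9.4 GF/W means for exascale, the calculation below estimates the power an exaflop system would draw at that efficiency. The 20 MW power budget used for comparison is a commonly quoted exascale target and is an assumption added here, not a figure from the slide.

```python
efficiency_gflops_per_watt = 9.4  # from the slide
exaflop_in_gflops = 1e9           # 1 EFLOPS = 10**9 GFLOPS

# Power needed to sustain one exaflop at the slide's efficiency.
power_watts = exaflop_in_gflops / efficiency_gflops_per_watt
power_megawatts = power_watts / 1e6

# Illustrative comparison against an assumed 20 MW power budget.
target_megawatts = 20.0
required_gflops_per_watt = exaflop_in_gflops / (target_megawatts * 1e6)
improvement_needed = required_gflops_per_watt / efficiency_gflops_per_watt

print(f"power for 1 EFLOPS at 9.4 GF/W: {power_megawatts:.1f} MW")
print(f"efficiency needed for {target_megawatts:.0f} MW: {required_gflops_per_watt:.0f} GF/W")
print(f"improvement still needed: {improvement_needed:.1f}x")
```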
DEEP LEARNING IS VITAL TO HPC
Monitoring Effects of Carbon and Greenhouse Gas Emissions
Minute-by-minute AI Weather Forecasting
insideHPC.com Survey, November 2016
92% believe AI will impact their work
93% of those using deep learning are seeing positive results
RESOURCES
For Executives, Developers and Data Scientists
CASE STUDIES
PARTNER COURSES
INTRO MATERIALS
ON-SITE WORKSHOPS
SELF-PACED LABS
TECHNICAL BLOGS
NVIDIA DEEP LEARNING INSTITUTE
Training organizations and individuals to solve challenging problems using Deep Learning
On-site workshops and online courses presented by certified experts
Covering complete workflows for proven application use cases
Self-driving cars, recommendation engines, medical image classification, intelligent video analytics and more
www.nvidia.com/dli
https://www.nvidia.com/en-us/deep-learning-ai/education/
Hands-on Training for Data Scientists and Software Engineers
QUESTIONS?